Every enterprise database is relational. Customers link to orders. Orders link to products. Products link to reviews. Reviews link back to other customers. The data lives in 5, 10, sometimes 50 interconnected tables with foreign keys, timestamps, and hierarchical relationships.
And every ML tool - XGBoost, LightGBM, random forests, neural networks, AutoML platforms, even the newest tabular foundation models - requires you to collapse all of that structure into a single flat table before it can make a prediction.
This flattening step is so ubiquitous that most data scientists do not question it. It is just "how ML works." You write SQL joins, compute aggregations, build a feature table with one row per entity, and feed it to a model. The entire field of feature engineering exists to make this flattening step less lossy.
But flattening is inherently lossy. And the signal it destroys is precisely the signal that separates good predictions from great ones.
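The flatten step itself can be sketched in a few lines of plain Python. The toy tables and column names below are invented for illustration; real pipelines do the same thing with SQL joins and GROUP BY:

```python
# Two toy tables linked by a foreign key, as in any relational schema.
customers = [{"customer_id": 1, "region": "EU"}, {"customer_id": 2, "region": "US"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 50.0},
    {"order_id": 11, "customer_id": 1, "amount": 20.0},
    {"order_id": 12, "customer_id": 2, "amount": 99.0},
]

# The flattening step: join on the foreign key, then aggregate
# down to exactly one row per entity.
flat = []
for c in customers:
    amounts = [o["amount"] for o in orders if o["customer_id"] == c["customer_id"]]
    flat.append({
        "customer_id": c["customer_id"],
        "region": c["region"],
        "num_orders": len(amounts),   # cardinality survives only as a count
        "total_spend": sum(amounts),  # ordering, timing, and product links are gone
    })

print(flat[0])  # {'customer_id': 1, 'region': 'EU', 'num_orders': 2, 'total_spend': 70.0}
```

Everything downstream of this loop, no matter how powerful the model, only ever sees `flat`.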
The headline result: SAP SALT benchmark
The SAP SALT benchmark is an enterprise-grade evaluation built on real SAP business data: analysts and data scientists attempt prediction tasks on production-quality databases with multiple related tables, and the benchmark measures how accurately each approach predicts real business outcomes.
sap_salt_enterprise_benchmark
| approach | accuracy | what_it_means |
|---|---|---|
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |
SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points on real enterprise prediction tasks.
KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.
What flattening destroys
When you collapse a relational database into a flat table, you lose at least eight categories of predictive signal that cannot be recovered by any downstream model, no matter how sophisticated.
signal_types_destroyed_by_flattening
| signal_type | what_it_captures | example | flat_table_substitute |
|---|---|---|---|
| Multi-hop relationships | Patterns across 3+ connected tables | customer → orders → products → reviews → similar customers | None. Joins typically stop at 1-2 hops. |
| Temporal sequences across tables | Activity progression patterns over time | Login → Browse → Add to cart → Abandon → Support ticket (in order) | Scalar aggregates: pages_viewed=22, cart_abandons=3 |
| Graph topology | Structural patterns like rings, clusters, hubs | A → B → C → D → A (fraud ring), social clusters | Invisible. Single-row features cannot represent cycles. |
| Entity-level aggregation context | How an entity relates to its full neighborhood | A customer’s merchant diversity (50 unique merchants vs. 3) | A single count: num_merchants=50. Context lost. |
| Cross-table interaction effects | Correlations between events in different tables | Product returns × support tickets × review sentiment | Requires pre-computed interaction features. Rarely built. |
| Cardinality information | How many related entities exist and their distribution | Lead has 4 contacts from 3 departments (multi-threaded) | contact_count=4. Department spread gone. |
| Temporal decay patterns | Recency-weighted importance of related events | Recent orders matter more than old ones for churn | avg_order_value (treats all orders equally) |
| Heterogeneous relationship types | Different edge types carry different meaning | purchased vs. returned vs. reviewed vs. wishlisted | All collapsed into generic aggregates |
Highlighted: the top three signal types - multi-hop relationships, temporal sequences, and graph topology - are the most common sources of large accuracy gaps between flat and relational approaches.
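The temporal-sequence loss is easy to demonstrate: two hypothetical visitors with the same events in a different order become indistinguishable after flattening, because order-free aggregates are all a flat table keeps.

```python
from collections import Counter

# Two hypothetical visitors: same events, opposite ordering.
progression = ["blog", "case_study", "api_docs", "demo_request"]  # textbook buying journey
scattered = ["demo_request", "api_docs", "case_study", "blog"]    # no progression at all

# Order-free aggregates (what flattening keeps) are identical...
assert Counter(progression) == Counter(scattered)

# ...so after flattening, a model literally cannot tell the visitors apart,
# even though the ordered sequence is the conversion signal.
flat_features = {"pages_viewed": len(progression)}
print(flat_features)  # {'pages_viewed': 4}
```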
Concrete example: Lead scoring
Consider Lead L-302 in a B2B CRM. The relational database contains rich, multi-table context about this lead:
- 4 contacts from 3 departments are active on the account - a multi-threaded buying committee, which is the strongest predictor of enterprise deal closure.
- Content progression: Blog → Case study → API docs → Demo request. This is a textbook buying journey from awareness to evaluation.
- Similar account closed $210K last quarter. The account-similarity signal comes from matching company attributes and engagement patterns across the opportunities table.
- Company raised Series B 30 days ago. Firmographic momentum from the accounts table indicates budget availability.
lead_L-302_relational_vs_flat
| data_source | relational_signal | flat_table_value |
|---|---|---|
| contacts table | 4 contacts from 3 departments (multi-threaded) | emails_opened=4 |
| activities table | Blog → Case study → API docs → Demo (buying progression) | pages_viewed=22 |
| opportunities table | Similar account closed $210K last quarter | Not captured |
| accounts table | Company raised Series B 30 days ago | company_size=200 |
Every relational signal that makes L-302 a strong lead is destroyed outright or reduced to a context-free scalar in the flat table.
A flat-table model sees: emails_opened=4, pages_viewed=22, company_size=200. It has no way to know that those 4 emails came from 3 different departments, that the 22 page views followed a specific buying-stage progression, or that a similar account just closed a $210K deal. All of the signal that makes this a high-value lead is invisible.
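The cardinality loss in particular fits in a few lines. The contacts table below is invented for illustration:

```python
# Hypothetical contacts table: four active contacts on the L-302 account.
contacts = [
    {"lead_id": "L-302", "dept": "Engineering"},
    {"lead_id": "L-302", "dept": "Engineering"},
    {"lead_id": "L-302", "dept": "Finance"},
    {"lead_id": "L-302", "dept": "Procurement"},
]

contact_count = len(contacts)                     # the flat feature: 4
dept_spread = len({c["dept"] for c in contacts})  # the lost signal: 3 departments

# A count of 4 could mean one very chatty contact or a cross-functional
# buying committee; only the relational view distinguishes the two.
print(contact_count, dept_spread)  # 4 3
```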
Concrete example: Fraud detection
Account A sends $500 to Account B. Account B sends $480 to Account C. Account C sends $460 to Account D. Account D sends $440 back to Account A. Each individual transaction looks perfectly normal - a modest transfer between two accounts.
But the pattern is a fraud ring: A → B → C → D → A. Money is cycling through four accounts, with small amounts skimmed at each step. This circular flow is a classic money laundering pattern, and it is only visible when you can see the graph structure of transactions.
When you flatten the transaction data into a single row per transaction, each row contains: sender_id, receiver_id, amount, timestamp. There is no column for "this transaction is part of a four-node cycle." The ring is invisible. No amount of feature engineering on a single transaction row can recover the circular topology - because the pattern exists across four rows, not within one.
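A toy illustration of why the ring is only visible at the graph level - a hypothetical transfer list, plus a short depth-first walk over the sender-receiver graph that single rows can never support:

```python
# Per-row view: each transfer looks like an ordinary payment.
transfers = [("A", "B", 500), ("B", "C", 480), ("C", "D", 460), ("D", "A", 440)]

# Graph view: sender -> set of receivers.
graph = {}
for sender, receiver, _amount in transfers:
    graph.setdefault(sender, set()).add(receiver)

def in_cycle(start, graph, max_hops=4):
    """Depth-first walk: does any path of at most max_hops edges return to start?"""
    stack = [(start, 0)]
    while stack:
        node, depth = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt == start:
                return True
            if depth + 1 < max_hops:
                stack.append((nxt, depth + 1))
    return False

print(in_cycle("A", graph))  # True: A -> B -> C -> D -> A
```

The detector needs all four rows at once; no feature computed from one row of `transfers` can produce the same answer.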
Concrete example: Churn prediction
Member Bob visits his gym 4 times per week on average. Over the last 6 weeks, his visit frequency dropped 68% - from 4 visits to 1.3. That alone is a churn signal. But the relational data reveals more:
- Bob's workout buddies are also churning. Two of his three regular workout partners have cancelled in the last month. Social churn is contagious - when your peers leave, you are far more likely to leave.
- Bob downgraded his plan from Premium to Basic last billing cycle, reducing his monthly payment from $79 to $29. This is a leading indicator of full cancellation.
- Bob stopped attending group classes. His class attendance went from 3 per week to 0. Group class members have 2.3x higher retention, so losing this engagement channel is significant.
When flattened: visit_frequency=1.3, plan_type=Basic, monthly_spend=$29. The social churn signal - the fact that Bob's friends are leaving - disappears entirely. The peer behavior pattern exists in the relationships between members, not in any single member's row.
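The peer signal is a property of edges between members, not of Bob's own row. A minimal sketch with invented member IDs and an assumed buddy-edge table:

```python
# Hypothetical tables: cancelled members, and buddy edges between members.
churned = {"M-1002", "M-1003"}
buddies = {"M-4401": ["M-1001", "M-1002", "M-1003"]}  # Bob's workout partners

def peer_churn_rate(member_id):
    """Fraction of a member's workout partners who have cancelled."""
    peers = buddies.get(member_id, [])
    if not peers:
        return 0.0
    return sum(p in churned for p in peers) / len(peers)

print(round(peer_churn_rate("M-4401"), 2))  # 0.67: two of three partners gone
```

Computing this requires the members-to-members relationship table; once the data is one row per member, the input to this function no longer exists.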
The accuracy gap: RelBench results
The destruction is not theoretical. The RelBench benchmark (7 databases, 30 tasks, 103 million rows) measures exactly how much signal is lost when you flatten relational data versus processing it in its native structure.
AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing, 100 means perfect prediction. Moving from 65 to 77 AUROC means the model correctly ranks a true positive above a true negative 77% of the time instead of 65%.
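That pairwise-ranking reading of AUROC translates directly into code. This is a naive O(P x N) sketch returning a fraction (the tables below report it x100), not a production implementation:

```python
def auroc(pos_scores, neg_scores):
    """P(a random positive scores above a random negative); ties count half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Sanity checks against the scale described above.
assert auroc([0.9, 0.8], [0.1, 0.2]) == 1.0  # perfect separation (100 on the 0-100 scale)
assert auroc([0.5, 0.5], [0.5, 0.5]) == 0.5  # indistinguishable: coin flip (50)
```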
relbench_accuracy_comparison
| approach | AUROC (classification) | gap_vs_flat | what_it_processes |
|---|---|---|---|
| LightGBM on flattened features | 62.44 | Baseline | Flat table (manual features) |
| XGBoost on flattened features | ~63-64 | +1-2 pts | Flat table (manual features) |
| KumoRFM zero-shot | 76.71 | +14.27 pts | Full relational structure |
| KumoRFM fine-tuned | 81.14 | +18.70 pts | Full relational structure + task adaptation |
Highlighted: the 14-19 AUROC point gap between flat-table and relational approaches. This gap represents the predictive signal destroyed by flattening.
tasks_with_largest_accuracy_gaps
| task | LightGBM_flat | KumoRFM_finetuned | absolute_gap | relative_improvement |
|---|---|---|---|---|
| rel-stack user-engagement | 63.39 | 90.59 | +27.20 pts | 43% |
| rel-hm item-sales | 57.12 | 78.84 | +21.72 pts | 38% |
| rel-avito ad-click | 59.21 | 77.93 | +18.72 pts | 32% |
| rel-f1 driver-position | 64.88 | 81.02 | +16.14 pts | 25% |
| rel-event user-attendance | 61.45 | 76.71 | +15.26 pts | 25% |
Highlighted: rel-stack user-engagement shows a 27+ AUROC point gap - the largest in the benchmark. User engagement patterns are deeply relational (users → posts → comments → votes → tags), and flattening destroys the interaction graph.
The gap is not uniform. Tasks that depend heavily on multi-hop relationships, temporal sequences, and graph structure show the largest gaps. Tasks that are well-served by simple aggregations (count, sum, mean) show smaller gaps. But in no case does the flat approach match the relational approach.
The feature space math
The accuracy gap has a mathematical explanation. Consider a modest enterprise database with 5 tables and 50 columns per table.
feature_space_coverage
| feature_type | count | human_engineers_build | coverage |
|---|---|---|---|
| First-order features (single column aggregations) | 1,200+ | 40-80 | 3-7% |
| Pairwise interaction features | 719,400+ | 10-50 | ~0.01% |
| Multi-hop features (2+ relationship hops) | ~8,000+ | 0-20 | 0-0.25% |
| Temporal window variants (7d, 30d, 90d) | 3x multiplier on all above | 20-50 windows | ~1% |
| Total explorable feature space | ~2.2 million+ | 50-200 features | 4-17% |
Highlighted: human data scientists explore 4-17% of the possible feature space. The remaining 83-96% is unexamined signal that a relational foundation model can access automatically.
This is not a criticism of data scientists. No human can enumerate 2.2 million features and test them for predictive value. The combinatorial space is too large. Data scientists use domain knowledge to build the 50-200 features they believe matter most. But domain knowledge is biased toward obvious signals and misses subtle multi-hop interactions.
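The pairwise count in the table falls out of simple combinatorics. The first-order count below is illustrative (assuming roughly five aggregation functions per column); the pairwise figure matches the table exactly:

```python
import math

columns = 5 * 50               # 5 tables x 50 columns each
first_order = columns * 5      # ~5 aggregations per column (illustrative) -> 1,250
pairwise = math.comb(1200, 2)  # interactions among ~1,200 first-order features

print(first_order, pairwise)  # 1250 719400
```

No team hand-builds 719,400 interaction features, which is why coverage of that tier rounds to zero.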
A foundation model that reads the relational structure directly does not enumerate features at all. It learns a continuous representation of the entire relational neighborhood around each entity, implicitly capturing all of the patterns that exist in the data - including the 83-96% of the feature space that human engineers never explore.
Why TabPFN and Fundamental do not solve this
TabPFN (from the University of Freiburg) and Fundamental are tabular foundation models - pre-trained models designed for tabular data. They represent genuine advances in model architecture. On single-table benchmarks, they often match or beat well-tuned XGBoost and LightGBM ensembles.
But they are still tabular models. Their input is a flat table with one row per entity and a fixed number of columns. The flattening step - the SQL joins, the aggregations, the lossy collapse of relational structure into scalar features - happens before TabPFN or Fundamental ever sees the data.
Think of it this way: TabPFN is a better lens for looking at a photograph. KumoRFM is a better camera that captures more of the scene. No amount of lens improvement can recover detail that was never captured in the photograph.
tabular_fm_vs_relational_fm
| capability | TabPFN / Fundamental | KumoRFM |
|---|---|---|
| Input format | Single flat table | Multiple relational tables |
| Handles multi-table joins | No (requires pre-flattening) | Yes (reads foreign keys directly) |
| Multi-hop pattern discovery | No | Yes (graph message passing) |
| Temporal sequence preservation | No (static features only) | Yes (timestamps on nodes and edges) |
| Graph topology awareness | No | Yes (heterogeneous graph transformer) |
| Pre-training data | Single tables from OpenML | Thousands of relational databases |
| The flattening problem | Not addressed | Eliminated entirely |
Highlighted: tabular foundation models do not address the flattening problem. They improve what happens after flattening. Relational foundation models eliminate the need to flatten in the first place.
How KumoRFM avoids flattening
KumoRFM takes a fundamentally different approach. Instead of requiring a flat table, it represents the entire relational database as a temporal heterogeneous graph:
- Each row in each table becomes a node. A customer row is a customer node. An order row is an order node. A product row is a product node.
- Each foreign key becomes an edge. The customer_id foreign key in the orders table creates an edge from each order node to its customer node. The product_id foreign key creates an edge from each order to its product.
- Timestamps are preserved. Every node and edge carries its temporal information. The model can distinguish between a customer who placed 10 orders last week and a customer who placed 10 orders over 3 years.
- A graph transformer processes the full structure. Information propagates through the graph via message passing. After 3 layers, each node has aggregated context from entities up to 3 hops away - capturing exactly the multi-hop patterns that flattening destroys.
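The first three steps can be sketched directly. The toy schema below (customers and orders linked by a `customer_id` foreign key) is invented, and a real system would also attach column values as node features:

```python
# Toy tables: every row becomes a node, every foreign key an edge.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "ts": "2024-05-01"},
    {"order_id": 11, "customer_id": 2, "ts": "2024-05-03"},
]

# Typed nodes: one per row, keyed by (table, primary_key).
nodes = [("customer", c["customer_id"]) for c in customers] \
      + [("order", o["order_id"]) for o in orders]

# Typed edges: one per foreign-key reference, carrying the row's timestamp
# so temporal ordering is preserved rather than aggregated away.
edges = [(("order", o["order_id"]), ("customer", o["customer_id"]), o["ts"])
         for o in orders]

print(len(nodes), len(edges))  # 4 2
```

Nothing is aggregated at construction time; the graph keeps every row, every link, and every timestamp for the model to read.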
PQL Query
PREDICT churn_90d FOR EACH members.member_id WHERE members.status = 'active'
One query replaces the entire flatten-and-model pipeline. KumoRFM reads the members, visits, classes, payments, and social connections tables directly. It discovers that Bob's workout buddies are churning, his visit frequency is declining, and he downgraded his plan - without any feature engineering.
Output
| member_id | churn_prob (relational) | churn_prob (flat) | key_signal |
|---|---|---|---|
| M-4401 (Bob) | 0.89 | 0.54 | Peer churn + frequency drop |
| M-4402 | 0.12 | 0.18 | Stable peers, increasing visits |
| M-4403 | 0.71 | 0.41 | Class dropout + plan downgrade |
| M-4404 | 0.06 | 0.09 | High engagement, no risk signals |
The bottom line: flattening is the bottleneck
The ML industry has spent a decade building better models for flat tables: XGBoost, LightGBM, CatBoost, TabPFN, Fundamental, AutoML ensembles. Each new model squeezes another 1-3 AUROC points out of the same flat feature table. Meanwhile, the gap between flat features and full relational data is 14-19 points.
The bottleneck was never the model. It was always the data representation. Flattening a relational database into a single table destroys the very patterns that differentiate accurate predictions from mediocre ones: multi-hop relationships, temporal sequences, graph topology, and cross-table interactions.
The solution is not better feature engineering. The solution is not trying harder to flatten without losing signal. The solution is eliminating the flattening step entirely - reading relational data in its native structure, the way it was designed to be stored.
That is what relational foundation models do. And the 14-19 AUROC point improvement is the signal that flattening was destroying all along.