Every enterprise database is relational. Customers link to orders. Orders link to products. Products link to reviews. Reviews link back to other customers. The data lives in 5, 10, sometimes 50 interconnected tables with foreign keys, timestamps, and hierarchical relationships.
And every ML tool - XGBoost, LightGBM, random forests, neural networks, AutoML platforms, even the newest tabular foundation models - requires you to collapse all of that structure into a single flat table before it can make a prediction.
This flattening step is so ubiquitous that most data scientists do not question it. It is just "how ML works." You write SQL joins, compute aggregations, build a feature table with one row per entity, and feed it to a model. The entire field of feature engineering exists to make this flattening step less lossy.
But flattening is inherently lossy. And the signal it destroys is precisely the signal that separates good predictions from great ones.
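The flatten step itself can be sketched in a few lines of plain Python. The toy tables and column names below are invented for illustration; real pipelines do the same thing with SQL joins and GROUP BY:

```python
# Two toy tables linked by a foreign key, as in any relational schema.
customers = [{"customer_id": 1, "region": "EU"}, {"customer_id": 2, "region": "US"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 50.0},
    {"order_id": 11, "customer_id": 1, "amount": 20.0},
    {"order_id": 12, "customer_id": 2, "amount": 99.0},
]

# The flattening step: join on the foreign key, then aggregate
# down to exactly one row per entity.
flat = []
for c in customers:
    amounts = [o["amount"] for o in orders if o["customer_id"] == c["customer_id"]]
    flat.append({
        "customer_id": c["customer_id"],
        "region": c["region"],
        "num_orders": len(amounts),   # cardinality survives only as a count
        "total_spend": sum(amounts),  # ordering, timing, and product links are gone
    })

print(flat[0])  # {'customer_id': 1, 'region': 'EU', 'num_orders': 2, 'total_spend': 70.0}
```

Everything downstream of this loop, no matter how powerful the model, only ever sees `flat`.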
The headline result: SAP SALT benchmark
The SAP SALT benchmark is an enterprise-grade evaluation built on real SAP business data: analysts and data scientists attempt prediction tasks on production-quality databases with multiple related tables, and the benchmark measures how accurately each approach predicts real business outcomes.
sap_salt_enterprise_benchmark
| approach | accuracy | what_it_means |
|---|---|---|
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |
SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points on real enterprise prediction tasks.
KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.
What flattening destroys
When you collapse a relational database into a flat table, you lose at least eight categories of predictive signal that cannot be recovered by any downstream model, no matter how sophisticated.
signal_types_destroyed_by_flattening
| signal_type | what_it_captures | example | flat_table_substitute |
|---|---|---|---|
| Multi-hop relationships | Patterns across 3+ connected tables | customer → orders → products → reviews → similar customers | None. Joins typically stop at 1-2 hops. |
| Temporal sequences across tables | Activity progression patterns over time | Login → Browse → Add to cart → Abandon → Support ticket (in order) | Scalar aggregates: pages_viewed=22, cart_abandons=3 |
| Graph topology | Structural patterns like rings, clusters, hubs | A → B → C → D → A (fraud ring), social clusters | Invisible. Single-row features cannot represent cycles. |
| Entity-level aggregation context | How an entity relates to its full neighborhood | A customer’s merchant diversity (50 unique merchants vs. 3) | A single count: num_merchants=50. Context lost. |
| Cross-table interaction effects | Correlations between events in different tables | Product returns × support tickets × review sentiment | Requires pre-computed interaction features. Rarely built. |
| Cardinality information | How many related entities exist and their distribution | Lead has 4 contacts from 3 departments (multi-threaded) | contact_count=4. Department spread gone. |
| Temporal decay patterns | Recency-weighted importance of related events | Recent orders matter more than old ones for churn | avg_order_value (treats all orders equally) |
| Heterogeneous relationship types | Different edge types carry different meaning | purchased vs. returned vs. reviewed vs. wishlisted | All collapsed into generic aggregates |
Highlighted: the top three signal types - multi-hop relationships, temporal sequences, and graph topology - are the most common sources of large accuracy gaps between flat and relational approaches.
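The temporal-sequence loss is easy to demonstrate: two hypothetical visitors with the same events in a different order become indistinguishable after flattening, because order-free aggregates are all a flat table keeps.

```python
from collections import Counter

# Two hypothetical visitors: same events, opposite ordering.
progression = ["blog", "case_study", "api_docs", "demo_request"]  # textbook buying journey
scattered = ["demo_request", "api_docs", "case_study", "blog"]    # no progression at all

# Order-free aggregates (what flattening keeps) are identical...
assert Counter(progression) == Counter(scattered)

# ...so after flattening, a model literally cannot tell the visitors apart,
# even though the ordered sequence is the conversion signal.
flat_features = {"pages_viewed": len(progression)}
print(flat_features)  # {'pages_viewed': 4}
```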
Concrete example: Lead scoring
Consider Lead L-302 in a B2B CRM. The relational database contains rich, multi-table context about this lead:
- 4 contacts from 3 departments are active on the account - a multi-threaded buying committee, which is the strongest predictor of enterprise deal closure.
- Content progression: Blog → Case study → API docs → Demo request. This is a textbook buying journey from awareness to evaluation.
- Similar account closed $210K last quarter. The account-similarity signal comes from matching company attributes and engagement patterns across the opportunities table.
- Company raised Series B 30 days ago. Firmographic momentum from the accounts table indicates budget availability.
lead_L-302_relational_vs_flat
| data_source | relational_signal | flat_table_value |
|---|---|---|
| contacts table | 4 contacts from 3 departments (multi-threaded) | emails_opened=4 |
| activities table | Blog → Case study → API docs → Demo (buying progression) | pages_viewed=22 |
| opportunities table | Similar account closed $210K last quarter | Not captured |
| accounts table | Company raised Series B 30 days ago | company_size=200 |
Every relational signal that makes L-302 a strong lead is destroyed outright or reduced to a context-free scalar in the flat table.
A flat-table model sees: emails_opened=4, pages_viewed=22, company_size=200. It has no way to know that those 4 emails came from 3 different departments, that the 22 page views followed a specific buying-stage progression, or that a similar account just closed a $210K deal. All of the signal that makes this a high-value lead is invisible.
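The cardinality loss in particular fits in a few lines. The contacts table below is invented for illustration:

```python
# Hypothetical contacts table: four active contacts on the L-302 account.
contacts = [
    {"lead_id": "L-302", "dept": "Engineering"},
    {"lead_id": "L-302", "dept": "Engineering"},
    {"lead_id": "L-302", "dept": "Finance"},
    {"lead_id": "L-302", "dept": "Procurement"},
]

contact_count = len(contacts)                     # the flat feature: 4
dept_spread = len({c["dept"] for c in contacts})  # the lost signal: 3 departments

# A count of 4 could mean one very chatty contact or a cross-functional
# buying committee; only the relational view distinguishes the two.
print(contact_count, dept_spread)  # 4 3
```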
Concrete example: Fraud detection
Account A sends $500 to Account B. Account B sends $480 to Account C. Account C sends $460 to Account D. Account D sends $440 back to Account A. Each individual transaction looks perfectly normal - a modest transfer between two accounts.
But the pattern is a fraud ring: A → B → C → D → A. Money is cycling through four accounts, with small amounts skimmed at each step. This circular flow is a classic money laundering pattern, and it is only visible when you can see the graph structure of transactions.
When you flatten the transaction data into a single row per transaction, each row contains: sender_id, receiver_id, amount, timestamp. There is no column for "this transaction is part of a four-node cycle." The ring is invisible. No amount of feature engineering on a single transaction row can recover the circular topology - because the pattern exists across four rows, not within one.
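A toy illustration of why the ring is only visible at the graph level - a hypothetical transfer list, plus a short depth-first walk over the sender-receiver graph that single rows can never support:

```python
# Per-row view: each transfer looks like an ordinary payment.
transfers = [("A", "B", 500), ("B", "C", 480), ("C", "D", 460), ("D", "A", 440)]

# Graph view: sender -> set of receivers.
graph = {}
for sender, receiver, _amount in transfers:
    graph.setdefault(sender, set()).add(receiver)

def in_cycle(start, graph, max_hops=4):
    """Depth-first walk: does any path of at most max_hops edges return to start?"""
    stack = [(start, 0)]
    while stack:
        node, depth = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt == start:
                return True
            if depth + 1 < max_hops:
                stack.append((nxt, depth + 1))
    return False

print(in_cycle("A", graph))  # True: A -> B -> C -> D -> A
```

The detector needs all four rows at once; no feature computed from one row of `transfers` can produce the same answer.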
Concrete example: Churn prediction
Member Bob visits his gym 4 times per week on average. Over the last 6 weeks, his visit frequency dropped 68% - from 4 visits to 1.3. That alone is a churn signal. But the relational data reveals more:
- Bob's workout buddies are also churning. Two of his three regular workout partners have cancelled in the last month. Social churn is contagious - when your peers leave, you are far more likely to leave.
- Bob downgraded his plan from Premium to Basic last billing cycle, reducing his monthly payment from $79 to $29. This is a leading indicator of full cancellation.
- Bob stopped attending group classes. His class attendance went from 3 per week to 0. Group class members have 2.3x higher retention, so losing this engagement channel is significant.
When flattened: visit_frequency=1.3, plan_type=Basic, monthly_spend=$29. The social churn signal - the fact that Bob's friends are leaving - disappears entirely. The peer behavior pattern exists in the relationships between members, not in any single member's row.
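The peer signal is a property of edges between members, not of Bob's own row. A minimal sketch with invented member IDs and an assumed buddy-edge table:

```python
# Hypothetical tables: cancelled members, and buddy edges between members.
churned = {"M-1002", "M-1003"}
buddies = {"M-4401": ["M-1001", "M-1002", "M-1003"]}  # Bob's workout partners

def peer_churn_rate(member_id):
    """Fraction of a member's workout partners who have cancelled."""
    peers = buddies.get(member_id, [])
    if not peers:
        return 0.0
    return sum(p in churned for p in peers) / len(peers)

print(round(peer_churn_rate("M-4401"), 2))  # 0.67: two of three partners gone
```

Computing this requires the members-to-members relationship table; once the data is one row per member, the input to this function no longer exists.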
The accuracy gap: RelBench results
The destruction is not theoretical. The RelBench benchmark (7 databases, 30 tasks, 103 million rows) measures exactly how much signal is lost when you flatten relational data versus processing it in its native structure.
AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing, 100 means perfect prediction. Moving from 65 to 77 AUROC means the model correctly ranks a true positive above a true negative 77% of the time instead of 65%.
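That pairwise-ranking reading of AUROC translates directly into code. This is a naive O(P x N) sketch returning a fraction (the tables below report it x100), not a production implementation:

```python
def auroc(pos_scores, neg_scores):
    """P(a random positive scores above a random negative); ties count half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Sanity checks against the scale described above.
assert auroc([0.9, 0.8], [0.1, 0.2]) == 1.0  # perfect separation (100 on the 0-100 scale)
assert auroc([0.5, 0.5], [0.5, 0.5]) == 0.5  # indistinguishable: coin flip (50)
```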
relbench_accuracy_comparison
| approach | AUROC (classification) | gap_vs_flat | what_it_processes |
|---|---|---|---|
| LightGBM on flattened features | 62.44 | Baseline | Flat table (manual features) |
| XGBoost on flattened features | ~63-64 | +1-2 pts | Flat table (manual features) |
| KumoRFM zero-shot | 76.71 | +14.27 pts | Full relational structure |
| KumoRFM fine-tuned | 81.14 | +18.70 pts | Full relational structure + task adaptation |
Highlighted: the 14-19 AUROC point gap between flat-table and relational approaches. This gap represents the predictive signal destroyed by flattening.
tasks_with_largest_accuracy_gaps
| task | LightGBM_flat | KumoRFM_finetuned | absolute_gap | relative_improvement |
|---|---|---|---|---|
| rel-stack user-engagement | 63.39 | 90.59 | +27.20 pts | 43% |
| rel-hm item-sales | 57.12 | 78.84 | +21.72 pts | 38% |
| rel-avito ad-click | 59.21 | 77.93 | +18.72 pts | 32% |
| rel-f1 driver-position | 64.88 | 81.02 | +16.14 pts | 25% |
| rel-event user-attendance | 61.45 | 76.71 | +15.26 pts | 25% |
Highlighted: rel-stack user-engagement shows a 27+ AUROC point gap - the largest in the benchmark. User engagement patterns are deeply relational (users → posts → comments → votes → tags), and flattening destroys the interaction graph.
The gap is not uniform. Tasks that depend heavily on multi-hop relationships, temporal sequences, and graph structure show the largest gaps. Tasks that are well-served by simple aggregations (count, sum, mean) show smaller gaps. But in no case does the flat approach match the relational approach.
The feature space math
The accuracy gap has a mathematical explanation. Consider a modest enterprise database with 5 tables and 50 columns per table.
feature_space_coverage
| feature_type | count | human_engineers_build | coverage |
|---|---|---|---|
| First-order features (single column aggregations) | 1,200+ | 40-80 | 3-7% |
| Pairwise interaction features | 719,400+ | 10-50 | ~0.01% |
| Multi-hop features (2+ relationship hops) | ~8,000+ | 0-20 | 0-0.25% |
| Temporal window variants (7d, 30d, 90d) | 3x multiplier on all above | 20-50 windows | ~1% |
| Total explorable feature space | ~2.2 million+ | 50-200 features | 4-17% |
Highlighted: human data scientists explore 4-17% of the possible feature space. The remaining 83-96% is unexamined signal that a relational foundation model can access automatically.
This is not a criticism of data scientists. No human can enumerate 2.2 million features and test them for predictive value. The combinatorial space is too large. Data scientists use domain knowledge to build the 50-200 features they believe matter most. But domain knowledge is biased toward obvious signals and misses subtle multi-hop interactions.
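The pairwise count in the table falls out of simple combinatorics. The first-order count below is illustrative (assuming roughly five aggregation functions per column); the pairwise figure matches the table exactly:

```python
import math

columns = 5 * 50               # 5 tables x 50 columns each
first_order = columns * 5      # ~5 aggregations per column (illustrative) -> 1,250
pairwise = math.comb(1200, 2)  # interactions among ~1,200 first-order features

print(first_order, pairwise)  # 1250 719400
```

No team hand-builds 719,400 interaction features, which is why coverage of that tier rounds to zero.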
A foundation model that reads the relational structure directly does not enumerate features at all. It learns a continuous representation of the entire relational neighborhood around each entity, implicitly capturing all of the patterns that exist in the data - including the 83-96% of the feature space that human engineers never explore.
Why TabPFN and Fundamental do not solve this
TabPFN (from the University of Freiburg) and Fundamental are tabular foundation models - pre-trained models designed for tabular data. They represent genuine advances in model architecture. On single-table benchmarks, they often match or beat well-tuned XGBoost and LightGBM ensembles.
But they are still tabular models. Their input is a flat table with one row per entity and a fixed number of columns. The flattening step - the SQL joins, the aggregations, the lossy collapse of relational structure into scalar features - happens before TabPFN or Fundamental ever sees the data.
Think of it this way: TabPFN is a better lens for looking at a photograph. KumoRFM is a better camera that captures more of the scene. No amount of lens improvement can recover detail that was never captured in the photograph.
tabular_fm_vs_relational_fm
| capability | TabPFN / Fundamental | KumoRFM |
|---|---|---|
| Input format | Single flat table | Multiple relational tables |
| Handles multi-table joins | No (requires pre-flattening) | Yes (reads foreign keys directly) |
| Multi-hop pattern discovery | No | Yes (graph message passing) |
| Temporal sequence preservation | No (static features only) | Yes (timestamps on nodes and edges) |
| Graph topology awareness | No | Yes (heterogeneous graph transformer) |
| Pre-training data | Single tables from OpenML | Thousands of relational databases |
| The flattening problem | Not addressed | Eliminated entirely |
Highlighted: tabular foundation models do not address the flattening problem. They improve what happens after flattening. Relational foundation models eliminate the need to flatten in the first place.
How KumoRFM avoids flattening
KumoRFM takes a fundamentally different approach. Instead of requiring a flat table, it represents the entire relational database as a temporal heterogeneous graph:
- Each row in each table becomes a node. A customer row is a customer node. An order row is an order node. A product row is a product node.
- Each foreign key becomes an edge. The customer_id foreign key in the orders table creates an edge from each order node to its customer node. The product_id foreign key creates an edge from each order to its product.
- Timestamps are preserved. Every node and edge carries its temporal information. The model can distinguish between a customer who placed 10 orders last week and a customer who placed 10 orders over 3 years.
- A graph transformer processes the full structure. Information propagates through the graph via message passing. After 3 layers, each node has aggregated context from entities up to 3 hops away - capturing exactly the multi-hop patterns that flattening destroys.
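The first three steps can be sketched directly. The toy schema below (customers and orders linked by a `customer_id` foreign key) is invented, and a real system would also attach column values as node features:

```python
# Toy tables: every row becomes a node, every foreign key an edge.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "ts": "2024-05-01"},
    {"order_id": 11, "customer_id": 2, "ts": "2024-05-03"},
]

# Typed nodes: one per row, keyed by (table, primary_key).
nodes = [("customer", c["customer_id"]) for c in customers] \
      + [("order", o["order_id"]) for o in orders]

# Typed edges: one per foreign-key reference, carrying the row's timestamp
# so temporal ordering is preserved rather than aggregated away.
edges = [(("order", o["order_id"]), ("customer", o["customer_id"]), o["ts"])
         for o in orders]

print(len(nodes), len(edges))  # 4 2
```

Nothing is aggregated at construction time; the graph keeps every row, every link, and every timestamp for the model to read.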
PQL Query
PREDICT churn_90d FOR EACH members.member_id WHERE members.status = 'active'
One query replaces the entire flatten-and-model pipeline. KumoRFM reads the members, visits, classes, payments, and social connections tables directly. It discovers that Bob's workout buddies are churning, his visit frequency is declining, and he downgraded his plan - without any feature engineering.
Output
| member_id | churn_prob (relational) | churn_prob (flat) | key_signal |
|---|---|---|---|
| M-4401 (Bob) | 0.89 | 0.54 | Peer churn + frequency drop |
| M-4402 | 0.12 | 0.18 | Stable peers, increasing visits |
| M-4403 | 0.71 | 0.41 | Class dropout + plan downgrade |
| M-4404 | 0.06 | 0.09 | High engagement, no risk signals |
The bottom line: flattening is the bottleneck
The ML industry has spent a decade building better models for flat tables: XGBoost, LightGBM, CatBoost, TabPFN, Fundamental, AutoML ensembles. Each new model squeezes another 1-3 AUROC points out of the same flat feature table. Meanwhile, the gap between flat features and full relational data is 14-19 points.
The bottleneck was never the model. It was always the data representation. Flattening a relational database into a single table destroys the very patterns that differentiate accurate predictions from mediocre ones: multi-hop relationships, temporal sequences, graph topology, and cross-table interactions.
The solution is not better feature engineering. The solution is not trying harder to flatten without losing signal. The solution is eliminating the flattening step entirely - reading relational data in its native structure, the way it was designed to be stored.
That is what relational foundation models do. And the 14-19 AUROC point improvement is the signal that flattening was destroying all along.