The idea is natural. LLMs can do everything else. They write code, summarize documents, answer questions, generate images. Why not point them at a database and ask "which customers will churn?"
People have tried. Research teams at Google, Meta, and dozens of startups have spent the last two years exploring LLMs on tabular data. The results are consistent: LLMs underperform purpose-built approaches, often by a wide margin.
On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), Llama 3.2 3B scored 68.06 AUROC on classification tasks. KumoRFM, a foundation model designed for relational data, scored 76.71 AUROC zero-shot. A supervised graph neural network scored 75.83. LightGBM with manually engineered features trailed at 62.44, limited by the quality of its hand-built features.
This is not a scaling problem. It is an architectural mismatch. Here is why.
Serialization example
| customer_id | revenue | plan | created_at | is_active |
|---|---|---|---|---|
| 48291 | $15,847.32 | Enterprise | 2024-03-15 | true |
| 72104 | $2,340.00 | Basic | 2025-01-08 | true |
| 55893 | $891.50 | Pro | 2024-11-22 | false |
An LLM sees this as: "48291 | $15,847.32 | Enterprise | 2024-03-15 | true". The number 15847.32 becomes tokens ["15", "847", ".", "32"]. The model cannot natively understand that $15,847 is close to $16,000 but far from $1.58.
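The fragmentation is easy to see with a toy tokenizer. The splitting rule below is a hypothetical stand-in (real BPE vocabularies merge differently), but real tokenizers likewise break long numbers into short chunks whose overlap has nothing to do with numeric distance:

```python
import re

def naive_subword_tokens(text: str) -> list[str]:
    # Hypothetical stand-in for a BPE tokenizer: real vocabularies merge
    # context-dependently, but they also split long numbers into chunks.
    return re.findall(r"\d{1,3}|\.|[^\s\d.]+", text)

for value in ["15847.32", "15900.00", "1.58"]:
    print(value, "->", naive_subword_tokens(value))
# 15847.32 -> ['158', '47', '.', '32']
# 15900.00 -> ['159', '00', '.', '00']
# 1.58 -> ['1', '.', '58']
```

15847.32 and 15900.00 are numerically close but share no leading tokens, so numeric similarity has to be inferred from unrelated token sequences. That is why arithmetic over serialized values is brittle.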
Architecture mismatch
| property | text (LLM native) | tabular data (actual) | consequence |
|---|---|---|---|
| Order | Sequential (word order matters) | Unordered (row shuffle = same data) | LLM assigns meaning to row position |
| Types | Uniform tokens from vocabulary | Mixed: int, float, categorical, datetime | Numerical reasoning brittle on tokens |
| Scale | Context window: 128K-1M tokens | Enterprise table: 10M+ rows | LLM sees 0.005% of data |
| Patterns | Sequential dependencies | Statistical relationships across rows | Wrong optimization objective |
| Structure | Flat sequence | Multi-table with foreign keys | No native relational representation |
Five architectural mismatches between LLMs and tabular data. Each one independently degrades performance. Together, they explain the 8.65 AUROC gap between Llama 3.2 3B (68.06) and KumoRFM (76.71).
The training objective mismatch
LLMs are trained to predict the next token in a sequence. This objective is brilliant for language. Language is sequential, each word depends on the words before it, and the training signal is rich (every token provides a gradient).
Tabular data is not sequential. It has three properties that make next-token prediction fundamentally wrong:
1. Row order is meaningless
In text, the sentence "the cat sat on the mat" means something different from "mat the on sat cat the." Word order carries semantic information, and LLMs learn to exploit it.
In a table, rows are unordered. Shuffling the rows of a customer table does not change any prediction. The model needs to be permutation-invariant with respect to rows. LLMs are not. They have positional embeddings that assign meaning to position, which means the same data presented in a different row order can produce different predictions.
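A minimal sketch of the difference: a set-style aggregation over rows (as in graph models) is unchanged by shuffling, while any position-weighted computation, used here as a crude stand-in for positional embeddings, is not.

```python
import math

rows = [
    {"revenue": 15847.32},
    {"revenue": 2340.00},
    {"revenue": 891.50},
]
shuffled = rows[::-1]  # same data, different row order

def set_aggregate(rows):
    # Permutation-invariant: a mean over rows ignores row order.
    return sum(r["revenue"] for r in rows) / len(rows)

def positional_score(rows):
    # Order-sensitive stand-in for positional embeddings:
    # each row's contribution is weighted by its sequence position.
    return sum((i + 1) * r["revenue"] for i, r in enumerate(rows))

print(math.isclose(set_aggregate(rows), set_aggregate(shuffled)))  # True
print(positional_score(rows) == positional_score(shuffled))        # False
```

Any architecture whose output depends on row position, as the second function does, is answering a question the data never asked.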
2. Column types are heterogeneous
Text is a sequence of tokens from a fixed vocabulary. A table row might contain an integer (customer_id: 48291), a float (revenue: 15,847.32), a categorical (plan: enterprise), a timestamp (created_at: 2024-03-15 09:42:11), and a boolean (is_active: true).
Serializing these into text tokens destroys the type information. The number 15847.32 becomes the token sequence ["15", "847", ".", "32"], which the LLM treats as four separate pieces. It cannot natively understand that 15847.32 is close to 15900 but far from 1.58. Numerical reasoning on serialized text is brittle and unreliable.
3. Predictive patterns are statistical, not sequential
In text, the pattern is "given this sequence of words, what comes next?" In tabular data, the pattern is "given the statistical relationships across rows and columns, what is the value of this target variable?"
Predicting churn requires understanding that customers with declining login frequency AND increasing support tickets AND approaching contract renewal have a high churn probability. This is a multivariate statistical pattern across multiple tables. It is not a sequential completion task.
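To make the shape of that pattern concrete, here is a hypothetical scoring rule (the thresholds and feature names are invented for illustration). The point is that the signal is a conjunction of conditions across columns, not a sequential completion:

```python
def churn_risk(login_trend: float, ticket_trend: float,
               days_to_renewal: int) -> str:
    # Hypothetical thresholds for illustration only.
    declining_logins = login_trend < -0.2   # logins trending down
    rising_tickets = ticket_trend > 0.3     # support load trending up
    renewal_soon = days_to_renewal < 45     # contract decision approaching
    if declining_logins and rising_tickets and renewal_soon:
        return "high"
    if declining_logins or rising_tickets:
        return "medium"
    return "low"

print(churn_risk(-0.4, 0.5, 30))   # high: all three conditions hold
print(churn_risk(-0.4, 0.0, 200))  # medium: only one condition holds
```

A real model learns these interactions statistically rather than from hand-set thresholds, but the target is the same kind of multivariate conjunction.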
What the benchmarks show
The RelBench benchmark provides the most rigorous comparison available. Here is how different approaches perform on the same classification tasks:
| Approach | AUROC | Architecture |
|---|---|---|
| LightGBM + manual features | 62.44 | Gradient boosted trees on flat table |
| Llama 3.2 3B (text serialization) | 68.06 | LLM with tables serialized as text |
| Supervised GNN (RDL) | 75.83 | Graph neural network on relational graph |
| KumoRFM zero-shot | 76.71 | Pre-trained graph transformer |
| KumoRFM fine-tuned | 81.14 | Fine-tuned graph transformer |
The LLM (68.06) outperforms LightGBM with manual features (62.44), but this says more about the limitations of manual feature engineering than about the strength of LLMs. The LLM sees the raw data and captures some cross-column patterns that manual features miss. But it still falls 8.65 points short of KumoRFM zero-shot, which sees the same raw data through a graph-native architecture.
The serialization problem
To feed tabular data to an LLM, you have to serialize it as text. There are several approaches, and none of them work well.
Row-by-row serialization
Convert each row to a text string: "Customer 48291 has plan enterprise, revenue 15847.32, created on 2024-03-15, is active." The LLM processes each row as a text passage.
Problem: the model sees one row at a time. It cannot compare across rows (is 15847.32 high or low for this segment?) without seeing all rows simultaneously. Context windows cap at 128K to 1M tokens, but enterprise tables have millions of rows. You physically cannot fit the data into the context.
Table-as-markdown
Format the table as markdown with headers and pipes. This preserves column alignment and lets the model see multiple rows.
Problem: you can fit maybe 200-500 rows in a context window. For a table with 10 million rows, you are showing the model 0.005% of the data. Any statistical pattern that requires seeing the full distribution is invisible.
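The arithmetic behind that fraction is simple; the per-row token cost is an assumed input:

```python
def context_coverage(table_rows: int, tokens_per_row: float,
                     context_tokens: int = 128_000) -> tuple[int, float]:
    # Rows that fit in one context window, and the fraction of the
    # table that represents.
    rows_fit = int(context_tokens // tokens_per_row)
    return rows_fit, rows_fit / table_rows

rows_fit, fraction = context_coverage(10_000_000, 50)  # ~50 tokens/markdown row
print(rows_fit)           # 2560
print(f"{fraction:.3%}")  # 0.026%
```

Even a 1M-token window only shifts the fraction by one order of magnitude; the table is still effectively invisible.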
JSON serialization
Represent each row as a JSON object. This preserves column names and types better than markdown.
Problem: even more verbose than markdown. Fewer rows fit in the context window. And the fundamental issues (row order sensitivity, numerical reasoning brittleness) remain.
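A quick size comparison of three serializations of the same row, using character counts as a crude proxy for token counts:

```python
import json

row = {"customer_id": 48291, "revenue": 15847.32, "plan": "Enterprise",
       "created_at": "2024-03-15", "is_active": True}

csv_line = ",".join(str(v) for v in row.values())
md_line = "| " + " | ".join(str(v) for v in row.values()) + " |"
json_line = json.dumps(row)  # repeats every column name in every row

print(len(csv_line), len(md_line), len(json_line))
```

Character counts are not token counts, but the ordering (CSV < markdown < JSON) holds either way: the more type and schema information a format preserves, the fewer rows fit in the window.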
The multi-table problem
Everything above applies to a single table. Enterprise databases have 10 to 50 tables connected by foreign keys. The relational structure (customers → orders → products → reviews) carries critical predictive information.
Multi-hop pattern (3 tables, invisible to an LLM)
| hop | table | data | signal |
|---|---|---|---|
| 1 | customers | Customer 48291 placed 8 orders | Moderate activity |
| 2 | order_items | 6 of 8 orders included Product P-442 | Strong product affinity |
| 3 | reviews (by other customers) | P-442 avg rating dropped from 4.5 to 2.1 in 60 days | Product quality collapse |
| Signal | — | Customer 48291 is loyal to a product that is failing | Churn risk: HIGH |
The churn signal sits at hop 3, where OTHER customers' reviews of the SAME product reveal a quality collapse. Reaching it requires traversing customers to orders to products to reviews. An LLM processing serialized rows from the customers table cannot reach this signal.
What the LLM actually sees (serialized text)
| input_format | text_sent_to_LLM | can_it_detect_the_signal? |
|---|---|---|
| Customer row | "48291 \| Enterprise \| $15,847.32 \| 2024-03-15" | No (no order or product data) |
| + Order rows | "O-7823 \| 48291 \| P-442 \| $89.00 \| 2025-01-15" | Partial (sees product ID, not reviews) |
| + Review rows (all 4M) | Cannot fit: 4M reviews × ~80 tokens = 320M tokens | Impossible (exceeds any context window) |
To detect the multi-hop churn signal, the LLM would need the customer row, their 8 order rows, and the review history of product P-442 from OTHER customers. The 4M-row review table alone, at roughly 320M tokens, exceeds even a 1M-token context window by more than 300x.
LLMs have no mechanism for representing relational structure. You can serialize multiple tables as text, but the foreign key relationships become implicit references ("order 7823 references customer 48291") rather than structural connections. The model has to parse text to reconstruct relationships that a graph model represents natively.
Multi-hop patterns (a customer's churn depends on the return rates of products they bought, which depends on the manufacturer's quality metrics) require traversing 3-4 tables through foreign keys. In a graph, this is 3-4 hops of message passing. In serialized text, this requires the LLM to piece together scattered text references across multiple serialized tables. In practice, LLMs fail at this.
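The traversal itself is ordinary relational logic. Here is a toy pure-Python version of the three-hop pattern, with invented data mirroring the example above (a real system would run this as joins or message passing over the full tables):

```python
from statistics import mean

# Toy relational data for the three-hop example.
orders = [  # (order_id, customer_id, product_id)
    ("O-7823", 48291, "P-442"),
    ("O-7911", 48291, "P-442"),
    ("O-8002", 48291, "P-109"),
]
reviews = [  # (product_id, rating, days_ago) -- from OTHER customers
    ("P-442", 4.5, 90), ("P-442", 4.4, 75),
    ("P-442", 2.3, 40), ("P-442", 1.9, 10),
]

def churn_signal(customer_id: int) -> str:
    # Hops 1-2: find the customer's most-ordered product.
    ordered = [p for (_, c, p) in orders if c == customer_id]
    top_product = max(set(ordered), key=ordered.count)
    # Hop 3: compare recent vs older ratings of that product.
    recent = mean(r for (p, r, d) in reviews if p == top_product and d <= 60)
    older = mean(r for (p, r, d) in reviews if p == top_product and d > 60)
    return "HIGH" if older - recent > 1.0 else "LOW"

print(churn_signal(48291))  # HIGH: the product's rating collapsed
```

Three dictionary-free joins and one aggregation recover the signal. A graph model performs the equivalent traversal as hops of message passing; a serialized-text LLM has to reconstruct the same joins from scattered string references.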
LLM on tabular data
- Next-token prediction objective
- Row-order dependent (positional embeddings)
- Serializes numbers as token sequences
- Cannot fit large tables in context window
- No native multi-table representation
Graph transformer (KumoRFM)
- Relational pattern learning objective
- Permutation-invariant over rows
- Native numerical and categorical encoding
- Processes millions of rows as graph structure
- Multi-table relationships as edges in the graph
Serialization methods compared
| method | tokens_per_row | rows_in_128K_context | % of 10M table | preserves_types |
|---|---|---|---|---|
| Row-by-row text | ~80 | ~1,600 | 0.016% | No |
| Markdown table | ~50 | ~2,500 | 0.025% | Partial |
| JSON objects | ~120 | ~1,000 | 0.010% | Better |
| CSV format | ~30 | ~4,200 | 0.042% | No |
| Graph representation | N/A | All 10M rows | 100% | Yes (native encoding) |
Graph-based representations process all rows because they do not serialize data into text tokens. They encode numerical values, categorical types, and relationships natively.
PQL Query
PREDICT churn FOR EACH customers.customer_id WHERE customers.is_active = true
Compare what an LLM and a graph transformer see for the same prediction. The LLM serializes 0.025% of rows as text. The graph transformer sees all rows, all tables, all foreign key relationships, with native type encoding.
Output
| customer_id | LLM_prediction | KumoRFM_prediction | actual | delta |
|---|---|---|---|---|
| 48291 | 0.35 | 0.12 | No churn | FM correct |
| 72104 | 0.51 | 0.84 | Churned | FM correct |
| 55893 | 0.44 | 0.91 | Churned | FM correct, LLM missed |
| 63017 | 0.62 | 0.07 | No churn | FM correct, LLM false alarm |
What works instead
The approaches that work well on structured data share a common property: they match the model architecture to the data structure.
Gradient boosted trees (single-table)
For flat, single-table data with pre-engineered features, XGBoost and LightGBM remain strong. They handle heterogeneous column types natively, are invariant to feature scaling, and learn non-linear relationships through decision splits. On Kaggle tabular benchmarks, they consistently outperform LLMs.
Limitation: they require a flat feature table, so the feature engineering bottleneck remains for multi-table data.
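Why trees handle raw numeric columns so well is visible in a single decision stump. This sketch is not LightGBM's actual algorithm (which optimizes gradient-based gain over feature histograms), but it shows the core property: a split depends only on the ordering of values, so rescaling a feature leaves the learned threshold in the same place.

```python
def best_split(values: list[float], labels: list[int]) -> float:
    # Exhaustively pick the threshold that minimizes misclassification,
    # predicting label 1 (churn) when the value is at or below it.
    best_t, best_err = values[0], float("inf")
    for t in sorted(set(values)):
        preds = [1 if v <= t else 0 for v in values]
        err = sum(p != y for p, y in zip(preds, labels))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

revenue = [891.5, 2340.0, 15847.32, 22010.0]
churned = [1, 1, 0, 0]  # low-revenue customers churned
t1 = best_split(revenue, churned)
t2 = best_split([v * 1000 for v in revenue], churned)
print(t1, t2 / 1000)  # same split point regardless of scaling
```

No tokenization, no normalization: the raw heterogeneous column is the model's native input, which is exactly the property serialized-text LLMs lose.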
Graph neural networks (multi-table)
GNNs represent the relational database as a graph and learn patterns through message passing. This is architecturally correct: the model structure matches the data structure. On RelBench, supervised GNNs score 75.83 AUROC, outperforming both LightGBM (62.44) and Llama 3.2 3B (68.06).
Limitation: you need to train a new GNN for each prediction task.
Relational foundation models
KumoRFM combines the graph-native architecture of GNNs with the pre-training paradigm of foundation models. It is trained on thousands of diverse relational databases, learning universal patterns that transfer to new data. At inference time, it delivers predictions from raw relational data without task-specific training.
This achieves 76.71 AUROC zero-shot, outperforming both LLMs and supervised GNNs. Fine-tuning pushes accuracy to 81.14.
When LLMs do help with data
LLMs are not useless in the data ecosystem. They are just wrong for the specific task of making predictions on structured data. They excel at adjacent tasks:
- Natural language interfaces. Translating business questions ("which customers are at risk of churning?") into structured queries or PQL statements.
- Data documentation. Generating descriptions of tables, columns, and data dictionaries from schema inspection.
- Result interpretation. Explaining predictions in natural language for business stakeholders who do not read AUROC scores.
- Code generation. Writing SQL, Python, or PQL queries from natural language descriptions.
The right architecture for predictions on structured data is one that was designed for structured data. LLMs were designed for language. Use each where it fits.
The takeaway
The instinct to throw LLMs at every problem is understandable. They are the most capable general-purpose AI tools ever built. But "general-purpose" does not mean "optimal for every purpose."
Structured relational data has specific properties (unordered rows, heterogeneous types, multi-table relationships, temporal dynamics) that require specific architectural choices. Graph transformers, pre-trained on relational data, match these properties. LLMs do not.
The 8.65 AUROC gap between Llama 3.2 3B (68.06) and KumoRFM (76.71) is not a tuning problem. It is a structural mismatch between the training objective and the task. Scaling the LLM larger will narrow the gap, but it will not close it, because the architecture is solving the wrong problem.
If your data lives in relational tables and you want accurate predictions, use a model that was built to read relational tables.