TabPFN and XGBoost are both excellent models for structured data - but they are excellent in different regimes. The peer-reviewed evidence is clear on where each one wins.
TabPFN vs XGBoost on single flat tables: what the benchmarks show
TabPFN (PriorLabs, published in Nature in 2025) is a transformer-based foundation model pre-trained on millions of synthetic tabular datasets. It uses in-context learning: you pass your data as context, and the model makes predictions immediately - no training, no hyperparameter tuning. On single-table benchmarks with up to 10,000 rows, TabPFN consistently matches or beats carefully tuned XGBoost, delivering predictions in roughly 2.8 seconds.
XGBoost (and LightGBM) is the reigning champion for larger structured datasets. It scales efficiently to millions of rows, offers fast inference, and with expert hyperparameter tuning and feature engineering, it remains the top performer on large tabular datasets and Kaggle competitions.
TabPFN vs XGBoost: single-table comparison
| dimension | TabPFN | XGBoost |
|---|---|---|
| Best dataset size | Up to ~10,000 rows | 10,000 to millions of rows |
| Hyperparameter tuning | None required (zero-shot) | Extensive tuning needed for peak performance |
| Training time | ~2.8 seconds (single forward pass) | Minutes to hours (depends on data size and tuning) |
| Inference speed | Moderate (GPU-dependent) | Very fast (optimized for CPU and GPU) |
| Accuracy on small data (<10K rows) | Matches or beats tuned XGBoost | Strong, but typically requires tuning to match TabPFN |
| Accuracy on large data (>50K rows) | Degrades - memory and scaling limitations | Strong - scales efficiently with expert features |
| Feature engineering | None needed on single tables | Benefits significantly from expert feature engineering |
| Uncertainty quantification | Built-in (Bayesian posterior) | Requires additional implementation (SHAP, conformal) |
| Model size | Large (gigabytes) | Compact (megabytes) |
| Interpretability | Supports SHAP, but more opaque | Native feature importance, highly interpretable |
| Open-source | Yes (Hugging Face, PriorLabs) | Yes (mature ecosystem, 10+ years) |
| Data input | Single flat table only | Single flat table only |
On single-table benchmarks, the choice is straightforward: TabPFN for small data with zero tuning, XGBoost for large data with expert engineering. Both are strong models in their respective regimes. But notice the last row - both require a single flat table.
When to choose TabPFN
TabPFN is the right choice when your data fits in a single table and is relatively small. Specifically:
- Small datasets (under 10,000 rows). TabPFN frequently matches or outperforms tuned XGBoost, with no effort spent on hyperparameter optimization. For rapid prototyping and quick baselines, it is unmatched.
- Zero-tuning scenarios. If you want a strong prediction in seconds without touching a single hyperparameter, TabPFN delivers. This is genuinely useful for exploratory analysis, initial feasibility checks, and quick comparisons across datasets.
- Uncertainty quantification. TabPFN natively provides prediction intervals via its Bayesian posterior approximation. For applications where calibrated uncertainty matters (medical diagnosis, risk assessment), this is a meaningful advantage over XGBoost, which requires separate implementations.
- Noisy or messy data. TabPFN handles missing values, outliers, and uninformative features robustly, often degrading more gracefully than tree-based models when data quality is poor.
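The uncertainty point is worth making concrete. TabPFN returns calibrated probabilities out of the box; for a point-prediction model such as XGBoost you would typically bolt on a technique like split conformal prediction. A minimal stdlib sketch, where the trivial predictor stands in for any trained regressor (the function name and values are illustrative, not a library API):

```python
# Split conformal prediction: wraps ANY point-prediction model with
# distribution-free prediction intervals. Residuals here would come
# from a held-out calibration set scored by the trained model.
import math

def split_conformal_interval(cal_residuals, y_pred, alpha=0.1):
    """Return a (lo, hi) interval with ~(1 - alpha) coverage."""
    n = len(cal_residuals)
    scores = sorted(abs(r) for r in cal_residuals)
    # Conformal quantile index: ceil((n + 1) * (1 - alpha)), clamped to n.
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    q = scores[k - 1]
    return y_pred - q, y_pred + q

# Hypothetical calibration residuals (prediction minus true value).
residuals = [0.2, -0.5, 0.1, 0.8, -0.3, 0.4, -0.1, 0.6, -0.7, 0.05]
lo, hi = split_conformal_interval(residuals, y_pred=10.0, alpha=0.2)
print(lo, hi)  # interval centered on the point prediction
```

This is the "additional implementation" the table above refers to: a second data split, residual bookkeeping, and a quantile rule - all of which TabPFN's posterior makes unnecessary.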
When to choose XGBoost
XGBoost is the right choice when your data is large, when you need production-grade inference speed, or when expert tuning is feasible:
- Large datasets (50,000+ rows). XGBoost scales efficiently to millions of rows. TabPFN hits memory limitations and performance degradation beyond approximately 10,000-50,000 samples, depending on the version.
- Production latency requirements. XGBoost offers extremely fast inference (sub-millisecond per sample) with compact model files (megabytes, not gigabytes). For real-time serving at scale, it remains the practical standard.
- Expert feature engineering available. When you have domain experts who know which features matter and can invest weeks in crafting them, XGBoost on well-engineered features is highly competitive. The combination of domain knowledge and gradient boosting is powerful.
- Interpretability is critical. XGBoost's native feature importance scores and SHAP integration provide clear explanations of model decisions, which matters in regulated industries like banking and insurance.
The question both models cannot answer: what about multi-table data?
Every TabPFN-vs-XGBoost comparison assumes the same thing: your data is a single flat table. One row per entity, one column per feature. For Kaggle competitions and academic benchmarks, this assumption holds. For enterprise prediction tasks, it almost never does.
A typical enterprise prediction task - customer churn, fraud detection, lead scoring, demand forecasting, recommendation - requires data from 5 to 50 connected tables. Customers, orders, products, reviews, support tickets, interactions, payments, sessions. These tables are connected by foreign keys in a relational database. The predictive signal lives not just in individual tables, but in the relationships between them.
To use TabPFN or XGBoost on this data, someone must first flatten it: write SQL joins to combine tables, compute aggregations (average order value, support ticket count, days since last purchase), and collapse everything into one row per entity. This flattening step is not just tedious - it creates a hard accuracy ceiling that no amount of model sophistication can overcome.
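The flattening step can be sketched in a few lines. The tables, column names, and values below are hypothetical, chosen only to show how a one-to-many orders table collapses into per-customer aggregates:

```python
# Flattening a relational schema: collapse a one-to-many orders table
# into one row per customer with aggregate columns. After this step,
# the per-order timeline ("day") no longer exists in any row.
from statistics import mean

customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"customer_id": 1, "amount": 120.0, "day": 3},
    {"customer_id": 1, "amount": 80.0,  "day": 17},
    {"customer_id": 1, "amount": 200.0, "day": 18},
    {"customer_id": 2, "amount": 50.0,  "day": 9},
]

def flatten(customers, orders):
    """One row per customer: the event sequence becomes two aggregates."""
    flat = []
    for c in customers:
        amts = [o["amount"] for o in orders
                if o["customer_id"] == c["customer_id"]]
        flat.append({
            "customer_id": c["customer_id"],
            "order_count": len(amts),
            "avg_order_value": mean(amts) if amts else 0.0,
        })
    return flat

flat = flatten(customers, orders)
```

Note what the output rows can never express: customer 1's two back-to-back purchases on days 17 and 18 - an acceleration signal - are now indistinguishable from three evenly spaced orders.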
Think about what happens when you flatten a relational database into one table. A customer who is connected to orders, which are connected to products, which are connected to reviews from other customers, who have their own purchase and churn patterns - that entire web of 3rd-degree and 4th-degree connections collapses into a single row with a few aggregate columns: order_count = 23, avg_order_value = $142. The rich graph of relationships - the connections that actually predict whether this customer will churn, commit fraud, or convert - is gone. Permanently. No model trained on that flattened row can recover what the data no longer contains. This is not a modeling limitation. It is an information limitation. Flattening does not create a penalty that a better algorithm can overcome - it creates a ceiling.
What flattening destroys
| signal type | available in flat table | available in relational graph |
|---|---|---|
| Direct customer attributes | Yes | Yes |
| Simple aggregates (order count, avg value) | Yes - if manually engineered | Yes - discovered automatically |
| Temporal sequences (purchase acceleration/deceleration) | No - collapsed to static aggregates | Yes - full event timeline preserved |
| Multi-hop patterns (Customer > Orders > Products > Reviews > Similar customers' outcomes) | No - requires 3-4 table traversal | Yes - discovered automatically across tables |
| Graph neighborhood (what other entities share connections) | No - flat table has no graph structure | Yes - community detection and structural similarity |
| Cross-table temporal patterns (support ticket spike followed by order decline) | No - requires cross-table temporal join | Yes - temporal correlation across table boundaries |
The first two rows are the only signals that survive flattening. The remaining four - temporal sequences, multi-hop patterns, graph structure, and cross-table dynamics - are destroyed before TabPFN or XGBoost ever sees the data. On relational benchmarks with 5+ tables, these hidden signals account for 15-20+ AUROC points of accuracy.
This is not a theoretical concern. On the RelBench benchmark (7 databases, 30 prediction tasks, 103 million rows across 51 tables), models that read flattened single tables score 62.44 AUROC. Models that read the relational structure natively score 76.71 AUROC zero-shot. The 14+ point gap is the cost of flattening - the information that TabPFN and XGBoost never see because it was destroyed before they received the data.
KumoRFM 2.0: the model that does both
KumoRFM 2.0 is a relational foundation model built by Kumo.ai. Unlike TabPFN and XGBoost, which both read a single flat table, KumoRFM reads multiple relational tables connected by foreign keys and discovers predictive patterns across the full relational graph using a graph transformer architecture.
But KumoRFM 2.0 is not just a relational model. It is a superset. It supports both single-table and multi-table prediction tasks:
- On single-table tasks: KumoRFM 2.0 is competitive with TabPFN and XGBoost. It provides zero-shot predictions on flat tables with no feature engineering, similar to TabPFN's approach.
- On multi-table relational tasks: KumoRFM 2.0 dramatically outperforms both. It reads 5-50 connected tables directly, discovers multi-hop predictive patterns (2-hop, 3-hop, 4+ hop signals across table boundaries), captures temporal sequences across tables, and detects graph-structural patterns - all automatically, with zero feature engineering.
This means enterprise teams do not need to choose between a small-data model (TabPFN) and a large-data model (XGBoost). They get one foundation model that covers both regimes - plus the relational dimension that neither TabPFN nor XGBoost can touch.
Why pre-training matters: the second gap flat-table models cannot close
Even if you could somehow flatten your relational data perfectly - capturing every aggregation, every temporal window, every cross-table metric - you would still face a second, fundamental disadvantage. The model itself has never seen relational patterns before.
TabPFN is pre-trained on millions of synthetic single-table datasets. It has learned the statistical patterns common to flat tabular data: feature correlations, nonlinear decision boundaries, class distributions. This is why it excels on single-table benchmarks. But it has never encountered a multi-hop relationship, a cross-table temporal sequence, or a graph-structural pattern - because those do not exist in single-table data.
XGBoost is not pre-trained at all. It learns from scratch on whatever features a data scientist provides. It is powerful on the features it receives, but it has no prior knowledge of any data structure.
KumoRFM is pre-trained on tens of thousands of diverse real relational datasets spanning different industries, schemas, and entity types. It has already learned what relational patterns look like: how multi-hop connections carry predictive signal, how temporal patterns propagate across table boundaries, how graph-structural properties predict entity behavior. When KumoRFM encounters your enterprise database, it recognizes relational patterns it has seen across thousands of prior databases. It does not search for patterns from scratch - it recognizes them.
Three-way comparison: TabPFN vs XGBoost vs KumoRFM 2.0
| dimension | TabPFN | XGBoost / LightGBM | KumoRFM 2.0 |
|---|---|---|---|
| Data input | Single flat table | Single flat table | Single table OR multiple relational tables |
| Architecture | Transformer (in-context learning) | Gradient-boosted trees | Graph transformer over relational structure |
| Best single-table size | Up to ~10K rows | 10K to millions of rows | Any size - single or multi-table |
| Multi-table support | None - requires flattening | None - requires flattening | Native - reads 5-50 connected tables directly |
| Multi-hop pattern discovery | Not possible | Not possible | Native - 2-hop, 3-hop, 4+ hop signals |
| Feature engineering | None (single table) | Extensive (12.3 hrs, 878 lines of code per task) | None - automatic across all tables |
| Training required | None (zero-shot) | Hours to days of training + tuning | None for zero-shot; optional fine-tuning |
| Inference speed | ~2.8 seconds | Sub-millisecond per sample | ~1 second (zero-shot) |
| Enterprise scale | ~50K rows (open-source), 10M (enterprise) | Millions of rows (single table) | Hundreds of millions of rows across dozens of tables |
| Data warehouse integration | None - export to Python | None - export to Python | Native Snowflake/Databricks - no data movement |
| Open-source | Yes (Hugging Face) | Yes (mature ecosystem) | No (enterprise SaaS) |
KumoRFM 2.0 is competitive on single-table tasks and dominant on multi-table relational tasks. It is the only model in this comparison that does not require flattening relational data into a single table.
Enterprise benchmarks: SAP SALT
The SAP SALT benchmark is an enterprise-grade evaluation where real business analysts and data scientists attempt prediction tasks on SAP enterprise data - production-quality databases with multiple related tables. Unlike academic single-table benchmarks, SAP SALT reflects how enterprise data actually looks: multiple connected tables, complex relationships, and real business outcomes to predict.
SAP SALT benchmark results
| approach | accuracy | setup effort |
|---|---|---|
| LLM + AutoML | 63% | LLM generates features, AutoML selects model - automated but limited |
| PhD Data Scientist + XGBoost | 75% | Weeks of expert feature engineering, hand-tuned gradient boosting |
| KumoRFM 2.0 (zero-shot) | 91% | Zero feature engineering, zero training - reads relational tables directly |
SAP SALT enterprise benchmark: KumoRFM 2.0 outperforms PhD-level data scientists with hand-tuned XGBoost by 16 percentage points. The gap is not about a better algorithm - it is about seeing data that flat-table models cannot access. Note: TabPFN is not included in SAP SALT because the benchmark uses multi-table enterprise data that exceeds TabPFN's single-table input requirement.
The 16 percentage point gap between KumoRFM (91%) and PhD+XGBoost (75%) is not marginal. In enterprise terms, that difference translates to millions of dollars in caught fraud, retained customers, or converted leads. And KumoRFM achieves it with zero feature engineering - no SQL joins, no aggregation pipelines, no weeks of data scientist labor.
Research benchmarks: RelBench
RelBench is an academic benchmark designed specifically to test models on relational data: 7 real-world databases, 30 prediction tasks, 103 million rows across 51 tables. It is the only major benchmark that preserves the relational structure of the data instead of pre-flattening it.
AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing. An AUROC of 100 means perfect prediction. Moving from 62 to 77 AUROC is a major improvement - it means the model correctly ranks a true positive above a true negative 77% of the time instead of 62%. For fraud detection, that difference means catching significantly more fraud with fewer false alarms.
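AUROC can be computed directly from this probabilistic definition. A pure-stdlib sketch (fine for small lists; production code would use a ranking-based implementation such as scikit-learn's):

```python
# AUROC as a probability: the chance that a randomly chosen positive
# example is scored above a randomly chosen negative one. Pairwise
# O(P*N) version - exactly the definition in the text.
def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking of positives above negatives -> 1.0
print(auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

On this scale, the 0.62 vs 0.77 gap reported below means roughly 15 more correct positive-over-negative rankings out of every 100 pairs.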
RelBench benchmark: relational data
| approach | AUROC | feature engineering | data input |
|---|---|---|---|
| LightGBM + expert features (flattened) | 62.44 | 12.3 hours per task, 878 lines of code | Single flat table (after manual joins) |
| XGBoost + expert features (flattened) | ~63-65 | 12.3 hours per task, 878 lines of code | Single flat table (after manual joins) |
| Graph Neural Networks | 75.83 | Moderate - still requires schema definition | Relational (but needs custom architecture) |
| KumoRFM 2.0 (zero-shot) | 76.71 | Zero - automatic | Relational tables directly (no flattening) |
| KumoRFM 2.0 (fine-tuned) | 81.14 | Minutes of fine-tuning | Relational tables directly (no flattening) |
RelBench results: KumoRFM zero-shot outperforms LightGBM with expert-engineered features by 14+ AUROC points. Fine-tuned KumoRFM extends this lead to nearly 19 points. The gap comes from multi-hop relational signals that flat-table models cannot access.
The key result: even the best flat-table approaches (XGBoost/LightGBM with weeks of expert feature engineering) score in the low 60s on relational data. KumoRFM zero-shot, with no feature engineering at all, scores 76.71. Fine-tuned, it reaches 81.14. The gap is not about better gradient boosting or a better transformer - it is about what data the model can see.
TabPFN or XGBoost workflow (relational data)
- Export data from your relational database
- Write SQL joins to combine 5-50 tables (hours to weeks of engineering)
- Manually compute aggregations, temporal features, cross-table metrics
- Lose multi-hop relationships, temporal sequences, graph structure in the process
- Feed the flattened single table to TabPFN (~2.8s) or train XGBoost (hours)
- Get predictions limited by what survived the flattening step
KumoRFM 2.0 workflow
- Connect to your data warehouse (Snowflake, Databricks - one-time setup)
- Write a PQL query defining what you want to predict
- KumoRFM reads all relational tables, discovers multi-hop patterns automatically
- Zero flattening, zero feature engineering, zero information loss
- Zero-shot prediction in ~1 second, or fine-tune in minutes for maximum accuracy
- Get predictions powered by the full relational structure - every signal preserved
A concrete example: churn prediction with multi-table data
Consider a SaaS company predicting 90-day customer churn. The data lives across 6 tables: customers, subscriptions, product_usage, support_tickets, invoices, and feature_requests.
The strongest churn predictor in this data is a 4-hop pattern: a customer's churn risk spikes when other customers who use the same product features, filed similar support tickets, and have similar usage trajectories started churning recently. This signal requires traversing Customer → Product_usage → Features → Other_customers_using_same_features → Their churn outcomes.
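The 4-hop traversal above can be sketched over toy dict-based tables. The schema and values are hypothetical, and the point is the contrast: a relational foundation model discovers such traversals automatically, whereas reproducing even one of them by hand means code like this per signal:

```python
# Hand-coded version of one 4-hop signal:
# customer -> features used -> other customers sharing those features
# -> their churn outcomes. Toy data, illustrative schema only.
usage = {                      # customer_id -> set of product features used
    "C-1": {"f_a", "f_b"},
    "C-2": {"f_a"},
    "C-3": {"f_c"},
}
churned = {"C-2": True, "C-3": False}   # observed outcomes for peers

def peer_churn_rate(customer_id):
    """Churn rate among customers sharing at least one feature."""
    feats = usage[customer_id]
    peers = [c for c, f in usage.items() if c != customer_id and f & feats]
    if not peers:
        return 0.0
    return sum(churned.get(c, False) for c in peers) / len(peers)

print(peer_churn_rate("C-1"))  # C-2 shares f_a and churned -> 1.0
```

Multiply this by every hop depth, table pair, and time window worth testing, and the feature-engineering cost of the flat-table workflow becomes clear.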
PQL Query
```
PREDICT churn_90d FOR EACH customers.customer_id WHERE customers.contract_type = 'enterprise'
```
One PQL query replaces the entire flattening pipeline. KumoRFM 2.0 reads the raw customers, subscriptions, product_usage, support_tickets, invoices, and feature_requests tables directly. The 4-hop churn signal is discovered automatically.
Output
| customer_id | churn prob (KumoRFM) | churn prob (flat model) | why KumoRFM differs |
|---|---|---|---|
| C-4201 | 0.92 | 0.58 | Kumo detects: similar-usage customers churned after same support pattern |
| C-4202 | 0.11 | 0.35 | Kumo correctly lower: expanding usage across 3 product modules |
| C-4203 | 0.87 | 0.49 | Kumo detects: feature request stall + invoice dispute + usage decline |
| C-4204 | 0.05 | 0.08 | Both correctly low: healthy account with strong engagement signals |
The decision framework: which model should you use?
The right model depends on two questions: what is your data structure, and what is your dataset size?
Decision framework
| your data structure | your dataset size | recommended model | why |
|---|---|---|---|
| Single flat table | Under 10,000 rows | TabPFN or KumoRFM 2.0 | Both deliver strong accuracy with zero tuning. TabPFN is open-source and free. |
| Single flat table | 10,000-50,000 rows | XGBoost/LightGBM or KumoRFM 2.0 | XGBoost scales well here with tuning. KumoRFM provides comparable zero-shot accuracy. |
| Single flat table | 50,000+ rows | XGBoost/LightGBM or KumoRFM 2.0 | XGBoost is the proven workhorse. KumoRFM handles this scale natively in your warehouse. |
| Multiple relational tables (2-4) | Any size | KumoRFM 2.0 | Multi-hop signals start appearing. Flattening tax: 5-10 AUROC points. |
| Multiple relational tables (5+) | Any size | KumoRFM 2.0 | Flattening tax reaches 15-20+ AUROC points. No flat-table model can compete. |
| Enterprise relational database | Millions of rows, dozens of tables | KumoRFM 2.0 | Purpose-built for this operating range. Runs natively in Snowflake/Databricks. |
For single flat tables, all three models are viable. For relational data - which describes most enterprise prediction tasks - KumoRFM 2.0 is the only model that reads the full relational structure without information-destroying flattening.
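The decision table reduces to two checks - table count first, then row count. A sketch of it as a function (thresholds mirror the table; the function name and return strings are illustrative, not an API):

```python
# The decision framework above as code: relational structure dominates
# dataset size, because the flattening tax applies regardless of scale.
def recommend(n_tables: int, n_rows: int) -> str:
    if n_tables >= 2:
        return "KumoRFM 2.0"                      # relational data
    if n_rows < 10_000:
        return "TabPFN or KumoRFM 2.0"            # small flat table
    return "XGBoost/LightGBM or KumoRFM 2.0"      # larger flat table

print(recommend(n_tables=6, n_rows=2_000_000))    # "KumoRFM 2.0"
```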
Why this comparison is usually incomplete
Most TabPFN-vs-XGBoost comparisons focus exclusively on single-table benchmarks: Kaggle datasets, UCI datasets, synthetic classification tasks. On these benchmarks, the conclusion is clear and correct: TabPFN wins on small data, XGBoost wins on large data.
But these benchmarks share a structural assumption that does not hold in enterprise settings: the data has already been flattened into a single table. Someone has already done the SQL joins, the feature engineering, the aggregation. The benchmark measures model performance after the hardest and most lossy step has already happened.
For enterprise ML teams, the real question is not “which model is best on a flat table?” It is “which approach gives the most accurate predictions on my actual data - which lives in a relational database with multiple connected tables?” When you ask that question, the answer changes fundamentally. The choice is no longer between TabPFN and XGBoost. It is between flattening your data (and losing 15-20+ AUROC points of signal) or using a model that reads the relational structure directly.