TabPFN and XGBoost are both excellent models for structured data - but they are excellent in different regimes. The peer-reviewed evidence is clear on where each one wins.
TabPFN vs XGBoost on single flat tables: what the benchmarks show
TabPFN (PriorLabs, published in Nature in 2025) is a transformer-based foundation model pre-trained on millions of synthetic tabular datasets. It uses in-context learning: you pass your data as context, and the model makes predictions immediately - no training, no hyperparameter tuning. On single-table benchmarks with up to 10,000 rows, TabPFN consistently matches or beats carefully tuned XGBoost, delivering predictions in roughly 2.8 seconds.
XGBoost (and LightGBM) is the reigning champion for larger structured datasets. It scales efficiently to millions of rows, offers fast inference, and with expert hyperparameter tuning and feature engineering, it remains the top performer on large tabular datasets and Kaggle competitions.
TabPFN vs XGBoost: single-table comparison
| dimension | TabPFN | XGBoost |
|---|---|---|
| Best dataset size | Up to ~10,000 rows | 10,000 to millions of rows |
| Hyperparameter tuning | None required (zero-shot) | Extensive tuning needed for peak performance |
| Training time | ~2.8 seconds (single forward pass) | Minutes to hours (depends on data size and tuning) |
| Inference speed | Moderate (GPU-dependent) | Very fast (optimized for CPU and GPU) |
| Accuracy on small data (<10K rows) | Matches or beats tuned XGBoost | Strong, but typically requires tuning to match TabPFN |
| Accuracy on large data (>50K rows) | Degrades - memory and scaling limitations | Strong - scales efficiently with expert features |
| Feature engineering | None needed on single tables | Benefits significantly from expert feature engineering |
| Uncertainty quantification | Built-in (Bayesian posterior) | Requires additional implementation (SHAP, conformal) |
| Model size | Large (gigabytes) | Compact (megabytes) |
| Interpretability | Supports SHAP, but more opaque | Native feature importance, highly interpretable |
| Open-source | Yes (Hugging Face, PriorLabs) | Yes (mature ecosystem, 10+ years) |
| Data input | Single flat table only | Single flat table only |
On single-table benchmarks, the choice is straightforward: TabPFN for small data with zero tuning, XGBoost for large data with expert engineering. Both are strong models in their respective regimes. But notice the last row - both require a single flat table.
When to choose TabPFN
TabPFN is the right choice when your data fits in a single table and is relatively small. Specifically:
- Small datasets (under 10,000 rows). TabPFN frequently matches or outperforms tuned XGBoost, with no effort spent on hyperparameter optimization. For rapid prototyping and quick baselines, it is unmatched.
- Zero-tuning scenarios. If you want a strong prediction in seconds without touching a single hyperparameter, TabPFN delivers. This is genuinely useful for exploratory analysis, initial feasibility checks, and quick comparisons across datasets.
- Uncertainty quantification. TabPFN natively provides prediction intervals via its Bayesian posterior approximation. For applications where calibrated uncertainty matters (medical diagnosis, risk assessment), this is a meaningful advantage over XGBoost, which requires separate implementations.
- Noisy or messy data. TabPFN handles missing values, outliers, and uninformative features robustly, often degrading more gracefully than tree-based models when data quality is poor.
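The uncertainty point is worth making concrete. TabPFN returns calibrated probabilities out of the box; for a point-prediction model such as XGBoost you would typically bolt on a technique like split conformal prediction. A minimal stdlib sketch, where the trivial predictor stands in for any trained regressor (the function name and values are illustrative, not a library API):

```python
# Split conformal prediction: wraps ANY point-prediction model with
# distribution-free prediction intervals. Residuals here would come
# from a held-out calibration set scored by the trained model.
import math

def split_conformal_interval(cal_residuals, y_pred, alpha=0.1):
    """Return a (lo, hi) interval with ~(1 - alpha) coverage."""
    n = len(cal_residuals)
    scores = sorted(abs(r) for r in cal_residuals)
    # Conformal quantile index: ceil((n + 1) * (1 - alpha)), clamped to n.
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    q = scores[k - 1]
    return y_pred - q, y_pred + q

# Hypothetical calibration residuals (prediction minus true value).
residuals = [0.2, -0.5, 0.1, 0.8, -0.3, 0.4, -0.1, 0.6, -0.7, 0.05]
lo, hi = split_conformal_interval(residuals, y_pred=10.0, alpha=0.2)
print(lo, hi)  # interval centered on the point prediction
```

This is the "additional implementation" the table above refers to: a second data split, residual bookkeeping, and a quantile rule - all of which TabPFN's posterior makes unnecessary.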
When to choose XGBoost
XGBoost is the right choice when your data is large, when you need production-grade inference speed, or when expert tuning is feasible:
- Large datasets (50,000+ rows). XGBoost scales efficiently to millions of rows. TabPFN hits memory limitations and performance degradation beyond approximately 10,000-50,000 samples, depending on the version.
- Production latency requirements. XGBoost offers extremely fast inference (sub-millisecond per sample) with compact model files (megabytes, not gigabytes). For real-time serving at scale, it remains the practical standard.
- Expert feature engineering available. When you have domain experts who know which features matter and can invest weeks in crafting them, XGBoost on well-engineered features is highly competitive. The combination of domain knowledge and gradient boosting is powerful.
- Interpretability is critical. XGBoost's native feature importance scores and SHAP integration provide clear explanations of model decisions, which matters in regulated industries like banking and insurance.
The question both models cannot answer: what about multi-table data?
Every TabPFN-vs-XGBoost comparison assumes the same thing: your data is a single flat table. One row per entity, one column per feature. For Kaggle competitions and academic benchmarks, this assumption holds. For enterprise prediction tasks, it almost never does.
A typical enterprise prediction task - customer churn, fraud detection, lead scoring, demand forecasting, recommendation - requires data from 5 to 50 connected tables. Customers, orders, products, reviews, support tickets, interactions, payments, sessions. These tables are connected by foreign keys in a relational database. The predictive signal lives not just in individual tables, but in the relationships between them.
To use TabPFN or XGBoost on this data, someone must first flatten it: write SQL joins to combine tables, compute aggregations (average order value, support ticket count, days since last purchase), and collapse everything into one row per entity. This flattening step is not just tedious - it creates a hard accuracy ceiling that no amount of model sophistication can overcome.
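The flattening step can be sketched in a few lines. The tables, column names, and values below are hypothetical, chosen only to show how a one-to-many orders table collapses into per-customer aggregates:

```python
# Flattening a relational schema: collapse a one-to-many orders table
# into one row per customer with aggregate columns. After this step,
# the per-order timeline ("day") no longer exists in any row.
from statistics import mean

customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"customer_id": 1, "amount": 120.0, "day": 3},
    {"customer_id": 1, "amount": 80.0,  "day": 17},
    {"customer_id": 1, "amount": 200.0, "day": 18},
    {"customer_id": 2, "amount": 50.0,  "day": 9},
]

def flatten(customers, orders):
    """One row per customer: the event sequence becomes two aggregates."""
    flat = []
    for c in customers:
        amts = [o["amount"] for o in orders
                if o["customer_id"] == c["customer_id"]]
        flat.append({
            "customer_id": c["customer_id"],
            "order_count": len(amts),
            "avg_order_value": mean(amts) if amts else 0.0,
        })
    return flat

flat = flatten(customers, orders)
```

Note what the output rows can never express: customer 1's two back-to-back purchases on days 17 and 18 - an acceleration signal - are now indistinguishable from three evenly spaced orders.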
Think about what happens when you flatten a relational database into one table. A customer who is connected to orders, which are connected to products, which are connected to reviews from other customers, who have their own purchase and churn patterns - that entire web of 3rd-degree and 4th-degree connections collapses into a single row with a few aggregate columns: order_count = 23, avg_order_value = $142. The rich graph of relationships - the connections that actually predict whether this customer will churn, commit fraud, or convert - is gone. Permanently. No model trained on that flattened row can recover what the data no longer contains. This is not a modeling limitation. It is an information limitation. Flattening does not create a penalty that a better algorithm can overcome - it creates a ceiling.
What flattening destroys
| signal type | available in flat table | available in relational graph |
|---|---|---|
| Direct customer attributes | Yes | Yes |
| Simple aggregates (order count, avg value) | Yes - if manually engineered | Yes - discovered automatically |
| Temporal sequences (purchase acceleration/deceleration) | No - collapsed to static aggregates | Yes - full event timeline preserved |
| Multi-hop patterns (Customer > Orders > Products > Reviews > Similar customers' outcomes) | No - requires 3-4 table traversal | Yes - discovered automatically across tables |
| Graph neighborhood (what other entities share connections) | No - flat table has no graph structure | Yes - community detection and structural similarity |
| Cross-table temporal patterns (support ticket spike followed by order decline) | No - requires cross-table temporal join | Yes - temporal correlation across table boundaries |
The first two rows are the only signals that survive flattening. The remaining four - temporal sequences, multi-hop patterns, graph structure, and cross-table dynamics - are destroyed before TabPFN or XGBoost ever sees the data. On relational benchmarks with 5+ tables, these hidden signals account for 15-20+ AUROC points of accuracy.
This is not a theoretical concern. On the RelBench benchmark (7 databases, 30 prediction tasks, 103 million rows across 51 tables), models that read flattened single tables score 62.44 AUROC. Models that read the relational structure natively score 76.71 AUROC zero-shot. The 14+ point gap is the cost of flattening - the information that TabPFN and XGBoost never see because it was destroyed before they received the data.
KumoRFM 2.0: the model that does both
KumoRFM 2.0 is a relational foundation model built by Kumo.ai. Unlike TabPFN and XGBoost, which both read a single flat table, KumoRFM reads multiple relational tables connected by foreign keys and discovers predictive patterns across the full relational graph using a graph transformer architecture.
But KumoRFM 2.0 is not just a relational model. It is a superset. It supports both single-table and multi-table prediction tasks:
- On single-table tasks: KumoRFM 2.0 is competitive with TabPFN and XGBoost. It provides zero-shot predictions on flat tables with no feature engineering, similar to TabPFN's approach.
- On multi-table relational tasks: KumoRFM 2.0 dramatically outperforms both. It reads 5-50 connected tables directly, discovers multi-hop predictive patterns (2-hop, 3-hop, 4+ hop signals across table boundaries), captures temporal sequences across tables, and detects graph-structural patterns - all automatically, with zero feature engineering.
This means enterprise teams do not need to choose between a small-data model (TabPFN) and a large-data model (XGBoost). They get one foundation model that covers both regimes - plus the relational dimension that neither TabPFN nor XGBoost can touch.
Why pre-training matters: the second gap flat-table models cannot close
Even if you could somehow flatten your relational data perfectly - capturing every aggregation, every temporal window, every cross-table metric - you would still face a second, fundamental disadvantage. The model itself has never seen relational patterns before.
TabPFN is pre-trained on millions of synthetic single-table datasets. It has learned the statistical patterns common to flat tabular data: feature correlations, nonlinear decision boundaries, class distributions. This is why it excels on single-table benchmarks. But it has never encountered a multi-hop relationship, a cross-table temporal sequence, or a graph-structural pattern - because those do not exist in single-table data.
XGBoost is not pre-trained at all. It learns from scratch on whatever features a data scientist provides. It is powerful on the features it receives, but it has no prior knowledge of any data structure.
KumoRFM is pre-trained on tens of thousands of diverse real relational datasets spanning different industries, schemas, and entity types. It has already learned what relational patterns look like: how multi-hop connections carry predictive signal, how temporal patterns propagate across table boundaries, how graph-structural properties predict entity behavior. When KumoRFM encounters your enterprise database, it recognizes relational patterns it has seen across thousands of prior databases. It does not search for patterns from scratch - it recognizes them.
Three-way comparison: TabPFN vs XGBoost vs KumoRFM 2.0
| dimension | TabPFN | XGBoost / LightGBM | KumoRFM 2.0 |
|---|---|---|---|
| Data input | Single flat table | Single flat table | Single table OR multiple relational tables |
| Architecture | Transformer (in-context learning) | Gradient-boosted trees | Graph transformer over relational structure |
| Best single-table size | Up to ~10K rows | 10K to millions of rows | Any size - single or multi-table |
| Multi-table support | None - requires flattening | None - requires flattening | Native - reads 5-50 connected tables directly |
| Multi-hop pattern discovery | Not possible | Not possible | Native - 2-hop, 3-hop, 4+ hop signals |
| Feature engineering | None (single table) | Extensive (12.3 hrs, 878 lines of code per task) | None - automatic across all tables |
| Training required | None (zero-shot) | Hours to days of training + tuning | None for zero-shot; optional fine-tuning |
| Inference speed | ~2.8 seconds | Sub-millisecond per sample | ~1 second (zero-shot) |
| Enterprise scale | ~50K rows (open-source), 10M (enterprise) | Millions of rows (single table) | Hundreds of millions of rows across dozens of tables |
| Data warehouse integration | None - export to Python | None - export to Python | Native Snowflake/Databricks - no data movement |
| Open-source | Yes (Hugging Face) | Yes (mature ecosystem) | No (enterprise SaaS) |
KumoRFM 2.0 is competitive on single-table tasks and dominant on multi-table relational tasks. It is the only model in this comparison that does not require flattening relational data into a single table.
Enterprise benchmarks: SAP SALT
The SAP SALT benchmark is an enterprise-grade evaluation where real business analysts and data scientists attempt prediction tasks on SAP enterprise data - production-quality databases with multiple related tables. Unlike academic single-table benchmarks, SAP SALT reflects how enterprise data actually looks: multiple connected tables, complex relationships, and real business outcomes to predict.
SAP SALT benchmark results
| approach | accuracy | setup effort |
|---|---|---|
| LLM + AutoML | 63% | LLM generates features, AutoML selects model - automated but limited |
| PhD Data Scientist + XGBoost | 75% | Weeks of expert feature engineering, hand-tuned gradient boosting |
| KumoRFM 2.0 (zero-shot) | 91% | Zero feature engineering, zero training - reads relational tables directly |
SAP SALT enterprise benchmark: KumoRFM 2.0 outperforms PhD-level data scientists with hand-tuned XGBoost by 16 percentage points. The gap is not about a better algorithm - it is about seeing data that flat-table models cannot access. Note: TabPFN is not included in SAP SALT because the benchmark uses multi-table enterprise data that exceeds TabPFN's single-table input requirement.
The 16 percentage point gap between KumoRFM (91%) and PhD+XGBoost (75%) is not marginal. In enterprise terms, that difference translates to millions of dollars in caught fraud, retained customers, or converted leads. And KumoRFM achieves it with zero feature engineering - no SQL joins, no aggregation pipelines, no weeks of data scientist labor.
Research benchmarks: RelBench
RelBench is an academic benchmark designed specifically to test models on relational data: 7 real-world databases, 30 prediction tasks, 103 million rows across 51 tables. It is the only major benchmark that preserves the relational structure of the data instead of pre-flattening it.
AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing. An AUROC of 100 means perfect prediction. Moving from 62 to 77 AUROC is a major improvement - it means the model correctly ranks a true positive above a true negative 77% of the time instead of 62%. For fraud detection, that difference means catching significantly more fraud with fewer false alarms.
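AUROC can be computed directly from this probabilistic definition. A pure-stdlib sketch (fine for small lists; production code would use a ranking-based implementation such as scikit-learn's):

```python
# AUROC as a probability: the chance that a randomly chosen positive
# example is scored above a randomly chosen negative one. Pairwise
# O(P*N) version - exactly the definition in the text.
def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking of positives above negatives -> 1.0
print(auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

On this scale, the 0.62 vs 0.77 gap reported below means roughly 15 more correct positive-over-negative rankings out of every 100 pairs.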
RelBench benchmark: relational data
| approach | AUROC | feature engineering | data input |
|---|---|---|---|
| LightGBM + expert features (flattened) | 62.44 | 12.3 hours per task, 878 lines of code | Single flat table (after manual joins) |
| XGBoost + expert features (flattened) | ~63-65 | 12.3 hours per task, 878 lines of code | Single flat table (after manual joins) |
| Graph Neural Networks | 75.83 | Moderate - still requires schema definition | Relational (but needs custom architecture) |
| KumoRFM 2.0 (zero-shot) | 76.71 | Zero - automatic | Relational tables directly (no flattening) |
| KumoRFM 2.0 (fine-tuned) | 81.14 | Minutes of fine-tuning | Relational tables directly (no flattening) |
RelBench results: KumoRFM zero-shot outperforms LightGBM with expert-engineered features by 14+ AUROC points. Fine-tuned KumoRFM extends this lead to nearly 19 points. The gap comes from multi-hop relational signals that flat-table models cannot access.
The key result: even the best flat-table approaches (XGBoost/LightGBM with weeks of expert feature engineering) score in the low 60s on relational data. KumoRFM zero-shot, with no feature engineering at all, scores 76.71. Fine-tuned, it reaches 81.14. The gap is not about better gradient boosting or a better transformer - it is about what data the model can see.
TabPFN or XGBoost workflow (relational data)
- Export data from your relational database
- Write SQL joins to combine 5-50 tables (hours to weeks of engineering)
- Manually compute aggregations, temporal features, cross-table metrics
- Lose multi-hop relationships, temporal sequences, graph structure in the process
- Feed the flattened single table to TabPFN (~2.8s) or train XGBoost (hours)
- Get predictions limited by what survived the flattening step
KumoRFM 2.0 workflow
- Connect to your data warehouse (Snowflake, Databricks - one-time setup)
- Write a PQL query defining what you want to predict
- KumoRFM reads all relational tables, discovers multi-hop patterns automatically
- Zero flattening, zero feature engineering, zero information loss
- Zero-shot prediction in ~1 second, or fine-tune in minutes for maximum accuracy
- Get predictions powered by the full relational structure - every signal preserved
A concrete example: churn prediction with multi-table data
Consider a SaaS company predicting 90-day customer churn. The data lives across 6 tables: customers, subscriptions, product_usage, support_tickets, invoices, and feature_requests.
The strongest churn predictor in this data is a 4-hop pattern: a customer's churn risk spikes when other customers who use the same product features, filed similar support tickets, and have similar usage trajectories started churning recently. This signal requires traversing Customer → Product_usage → Features → Other_customers_using_same_features → Their churn outcomes.
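The 4-hop traversal above can be sketched over toy dict-based tables. The schema and values are hypothetical, and the point is the contrast: a relational foundation model discovers such traversals automatically, whereas reproducing even one of them by hand means code like this per signal:

```python
# Hand-coded version of one 4-hop signal:
# customer -> features used -> other customers sharing those features
# -> their churn outcomes. Toy data, illustrative schema only.
usage = {                      # customer_id -> set of product features used
    "C-1": {"f_a", "f_b"},
    "C-2": {"f_a"},
    "C-3": {"f_c"},
}
churned = {"C-2": True, "C-3": False}   # observed outcomes for peers

def peer_churn_rate(customer_id):
    """Churn rate among customers sharing at least one feature."""
    feats = usage[customer_id]
    peers = [c for c, f in usage.items() if c != customer_id and f & feats]
    if not peers:
        return 0.0
    return sum(churned.get(c, False) for c in peers) / len(peers)

print(peer_churn_rate("C-1"))  # C-2 shares f_a and churned -> 1.0
```

Multiply this by every hop depth, table pair, and time window worth testing, and the feature-engineering cost of the flat-table workflow becomes clear.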
PQL Query
```
PREDICT churn_90d FOR EACH customers.customer_id WHERE customers.contract_type = 'enterprise'
```
One PQL query replaces the entire flattening pipeline. KumoRFM 2.0 reads the raw customers, subscriptions, product_usage, support_tickets, invoices, and feature_requests tables directly. The 4-hop churn signal is discovered automatically.
Output
| customer_id | churn prob (KumoRFM) | churn prob (flat model) | why KumoRFM differs |
|---|---|---|---|
| C-4201 | 0.92 | 0.58 | Kumo detects: similar-usage customers churned after same support pattern |
| C-4202 | 0.11 | 0.35 | Kumo correctly lower: expanding usage across 3 product modules |
| C-4203 | 0.87 | 0.49 | Kumo detects: feature request stall + invoice dispute + usage decline |
| C-4204 | 0.05 | 0.08 | Both correctly low: healthy account with strong engagement signals |
The decision framework: which model should you use?
The right model depends on two questions: what is your data structure, and what is your dataset size?
Decision framework
| your data structure | your dataset size | recommended model | why |
|---|---|---|---|
| Single flat table | Under 10,000 rows | TabPFN or KumoRFM 2.0 | Both deliver strong accuracy with zero tuning. TabPFN is open-source and free. |
| Single flat table | 10,000-50,000 rows | XGBoost/LightGBM or KumoRFM 2.0 | XGBoost scales well here with tuning. KumoRFM provides comparable zero-shot accuracy. |
| Single flat table | 50,000+ rows | XGBoost/LightGBM or KumoRFM 2.0 | XGBoost is the proven workhorse. KumoRFM handles this scale natively in your warehouse. |
| Multiple relational tables (2-4) | Any size | KumoRFM 2.0 | Multi-hop signals start appearing. Flattening tax: 5-10 AUROC points. |
| Multiple relational tables (5+) | Any size | KumoRFM 2.0 | Flattening tax reaches 15-20+ AUROC points. No flat-table model can compete. |
| Enterprise relational database | Millions of rows, dozens of tables | KumoRFM 2.0 | Purpose-built for this operating range. Runs natively in Snowflake/Databricks. |
For single flat tables, all three models are viable. For relational data - which describes most enterprise prediction tasks - KumoRFM 2.0 is the only model that reads the full relational structure without information-destroying flattening.
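The decision table reduces to two checks - table count first, then row count. A sketch of it as a function (thresholds mirror the table; the function name and return strings are illustrative, not an API):

```python
# The decision framework above as code: relational structure dominates
# dataset size, because the flattening tax applies regardless of scale.
def recommend(n_tables: int, n_rows: int) -> str:
    if n_tables >= 2:
        return "KumoRFM 2.0"                      # relational data
    if n_rows < 10_000:
        return "TabPFN or KumoRFM 2.0"            # small flat table
    return "XGBoost/LightGBM or KumoRFM 2.0"      # larger flat table

print(recommend(n_tables=6, n_rows=2_000_000))    # "KumoRFM 2.0"
```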
Why this comparison is usually incomplete
Most TabPFN-vs-XGBoost comparisons focus exclusively on single-table benchmarks: Kaggle datasets, UCI datasets, synthetic classification tasks. On these benchmarks, the conclusion is clear and correct: TabPFN wins on small data, XGBoost wins on large data.
But these benchmarks share a structural assumption that does not hold in enterprise settings: the data has already been flattened into a single table. Someone has already done the SQL joins, the feature engineering, the aggregation. The benchmark measures model performance after the hardest and most lossy step has already happened.
For enterprise ML teams, the real question is not “which model is best on a flat table?” It is “which approach gives the most accurate predictions on my actual data - which lives in a relational database with multiple connected tables?” When you ask that question, the answer changes fundamentally. The choice is no longer between TabPFN and XGBoost. It is between flattening your data (and losing 15-20+ AUROC points of signal) or using a model that reads the relational structure directly.