A model with 95% accuracy sounds impressive. Until you learn that 95% of the data belongs to one class, and the model just predicted that class every time. A model with 0.82 AUROC sounds mediocre. Until you learn it was evaluated with temporal splits on a 30-task benchmark covering 103 million rows of real-world relational data.
Model evaluation is where most ML projects go wrong. Not because teams pick the wrong metric, but because they pick the right metric and measure it the wrong way. Random splits inflate accuracy. Single-task evaluations hide generalization failures. Aggregate numbers obscure performance on the slices that matter most.
This guide covers the metrics you need to know, the evaluation mistakes that cost real money, and how the RelBench benchmark established a new standard for evaluating ML on relational data.
Classification metrics: what each one reveals and hides
Accuracy
What it measures: The fraction of predictions that are correct. A model that predicts 90 out of 100 labels correctly has 90% accuracy.
What it hides: Class imbalance. In fraud detection, where 0.1% of transactions are fraudulent, a model that labels everything as legitimate achieves 99.9% accuracy while catching zero fraud. Accuracy is useful only when classes are roughly balanced, which is rare in enterprise ML.
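The failure mode is easy to reproduce. A minimal sketch using a made-up dataset with the 0.1% fraud rate described above:

```python
# Hypothetical imbalanced dataset: 100 fraud cases in 100,000 transactions
n = 100_000
labels = [1] * 100 + [0] * (n - 100)
predictions = [0] * n  # a "model" that labels every transaction legitimate

accuracy = sum(int(p == y) for p, y in zip(predictions, labels)) / n
fraud_caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(f"accuracy: {accuracy:.3f}")      # 0.999 -- looks excellent
print(f"fraud caught: {fraud_caught}")  # 0 -- catches nothing
```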
Precision
What it measures: Of all the cases the model flagged as positive, how many actually were positive. If the model flags 100 transactions as fraud and 80 are actually fraudulent, precision is 80%.
What it hides: Missed positives. A model that only flags the 10 most obvious fraud cases might achieve 100% precision while missing 990 out of 1,000 actual fraud cases. High precision tells you the model's alerts are reliable, not that it catches everything.
When to prioritize: When false positives are expensive. Marketing campaigns where each contacted lead costs $50. Account lockouts where false positives drive customer attrition. Medical diagnoses where false positives trigger invasive procedures.
Recall (sensitivity)
What it measures: Of all the actual positive cases, how many did the model catch. If there are 1,000 fraudulent transactions and the model catches 700, recall is 70%.
What it hides: False positive volume. A model that flags every transaction as fraud achieves 100% recall while generating millions of false alerts. High recall tells you the model is comprehensive, not that it is precise.
When to prioritize: When missed positives are catastrophic. Anti-money laundering where missing a suspicious transaction can result in regulatory fines of $10M or more. Cancer screening where a missed diagnosis has irreversible consequences. Security intrusion detection where one missed breach compromises the entire system.
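Precision and recall drop straight out of the confusion-matrix counts. A quick sketch using numbers consistent with the examples above (the counts themselves are made up):

```python
# Hypothetical fraud model: flags 100 transactions, 80 of them correctly;
# another 300 fraud cases go unflagged.
tp, fp, fn = 80, 20, 300

precision = tp / (tp + fp)   # 0.80 -- flagged cases are mostly real fraud
recall = tp / (tp + fn)      # ~0.21 -- but most fraud slips through

print(f"precision {precision:.2f}, recall {recall:.2f}")
```

The same model looks strong through one lens and weak through the other, which is exactly why both numbers belong in any evaluation report.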
F1 score
What it measures: The harmonic mean of precision and recall. It balances both concerns into a single number between 0 and 1.
What it hides: Which side of the trade-off matters more to your business. An F1 of 0.75 could come from precision 0.90 and recall 0.64, or from precision 0.64 and recall 0.90. Those are very different models with very different operational consequences. F1 is convenient for model comparison but insufficient for production deployment decisions.
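The harmonic-mean arithmetic for the two models described above makes the point concrete (a quick stdlib-only sketch):

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean: punishes whichever of the two inputs is lower
    return 2 * precision * recall / (precision + recall)

precision_heavy = f1(0.90, 0.64)  # reliable alerts, misses cases
recall_heavy = f1(0.64, 0.90)     # comprehensive, noisier alerts

print(round(precision_heavy, 2), round(recall_heavy, 2))  # 0.75 0.75
```

Two operationally different models, one indistinguishable score.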
AUROC
What it measures: The probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. A score of 0.5 means random ranking and 1.0 means perfect ranking (scores below 0.5 indicate a model that ranks worse than chance). It evaluates the model across all possible classification thresholds simultaneously.
What it hides: Performance at specific operating points. An AUROC of 0.85 tells you the model has good overall ranking ability, but not whether it works well at the threshold your application uses. Two models with identical AUROC can have very different precision-recall curves.
When to use: AUROC is the standard metric for comparing models when you have not yet decided on an operating threshold. RelBench uses AUROC as its primary classification metric across all 30 tasks. On that benchmark, KumoRFM achieves 76.71 zero-shot and 81.14 fine-tuned, vs. 75.83 for a supervised GNN and 62.44 for LightGBM with manual features.
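The ranking interpretation can be computed directly from the definition. A small self-contained sketch (the O(n·m) pairwise form, fine for illustration but not for large datasets):

```python
def pairwise_auroc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# 8 of 9 pairs ranked correctly: the 0.4 positive falls below the 0.7 negative
score = pairwise_auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(score, 3))  # 0.889
```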
Classification metrics comparison
| Metric | Range | Best For | Misleads When | RelBench Standard |
|---|---|---|---|---|
| Accuracy | 0-100% | Balanced classes | Imbalanced data (99.9% = trivial) | No |
| Precision | 0-1.0 | Cost of false positives high | Model flags very few cases | No |
| Recall | 0-1.0 | Cost of missed cases high | Model flags everything | No |
| F1 Score | 0-1.0 | Balancing precision/recall | Hides which side matters more | No |
| AUROC | 0.5-1.0 | Model comparison, ranking | Extreme class imbalance | Yes |
| AUPRC | 0-1.0 | Rare event detection | Balanced datasets | Supplementary |
No single metric tells the full story. AUROC is the standard for model comparison; precision/recall matter at your chosen operating threshold.
AUPRC (average precision)
What it measures: The area under the precision-recall curve. Unlike AUROC, AUPRC is sensitive to class imbalance. For a dataset with 0.1% positive rate, random guessing gives AUROC of 0.5 but AUPRC of only 0.001.
When to use: Any task with severe class imbalance: fraud detection (0.1% fraud rate), rare disease prediction (0.01% prevalence), anomaly detection. AUPRC gives a more honest picture of model performance in these scenarios.
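Average precision can be computed directly from labels sorted by model score. A minimal sketch (binary relevance, no library dependencies):

```python
def average_precision(ranked_labels):
    # ranked_labels: 1/0 labels sorted by descending model score.
    # Average the precision measured at each positive's rank.
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Positives at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1, 0, 0]), 3))  # 0.833
```

Unlike AUROC, this score drops toward the positive rate as positives sink down the ranking, which is why it is the more honest metric for rare events.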
Regression metrics: beyond the average error
MAE (Mean Absolute Error)
What it measures: The average absolute difference between predicted and actual values. If you predict customer lifetime value and your MAE is $50, your predictions are off by $50 on average.
What it hides: The distribution of errors. An MAE of $50 could mean every prediction is off by exactly $50, or it could mean 90% are off by $5 and 10% are off by $455. For business decisions, the tail matters more than the average.
RMSE (Root Mean Squared Error)
What it measures: The square root of the average squared error. RMSE penalizes large errors more than MAE. If most predictions are close but a few are wildly off, RMSE will be much higher than MAE.
When to use: When large errors are disproportionately costly. Demand forecasting where overestimating by 10,000 units is far worse than overestimating by 100 units. Financial risk where tail errors represent systemic risk.
MAPE (Mean Absolute Percentage Error)
What it measures: The average percentage error, making it scale-independent. A MAPE of 10% means predictions are off by 10% on average regardless of whether the underlying values are $10 or $10,000.
What it hides: MAPE explodes when actual values are near zero. A prediction of $5 for an actual value of $1 gives 400% error, which dominates the average. Use symmetric MAPE (sMAPE) or weighted MAPE for datasets with values near zero.
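The explosion is easy to see numerically. A quick sketch with three made-up points, one of them near zero:

```python
import math

actual = [1.0, 100.0, 200.0]      # one tiny actual value
predicted = [5.0, 110.0, 190.0]

errors = [p - a for p, a in zip(predicted, actual)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors)

print(f"MAE  {mae:.1f}")    # 8.0
print(f"RMSE {rmse:.1f}")   # 8.5
print(f"MAPE {mape:.1f}%")  # 138.3% -- the near-zero row alone contributes 400/3
```

Two dollar-scale errors of 10 and one small-dollar error of 4, yet MAPE reports the model as catastrophically wrong.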
Ranking metrics: for recommendations and ordered outputs
MAP@k (Mean Average Precision at k)
What it measures: For each user, compute the precision at each position in the top k where a relevant item appears, average those values to get that user's average precision, then take the mean across users. A MAP@10 of 0.35 means that on average, relevant items appear in the top 10 and are reasonably well-positioned.
When to use: Product recommendations, content suggestions, search ranking. Any task where you present an ordered list and care about both which items appear and where they appear.
NDCG@k (Normalized Discounted Cumulative Gain at k)
What it measures: Similar to MAP@k but uses a logarithmic discount so items ranked lower contribute progressively less to the score. NDCG@k handles graded relevance (not just binary relevant/not relevant).
When to use: When relevance has degrees. A 5-star product shown first is better than a 3-star product shown first. NDCG captures this distinction; MAP does not.
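A minimal NDCG@k sketch with graded relevance (treating star ratings as relevance grades is an assumption for illustration):

```python
import math

def dcg(relevances):
    # Item at 1-based rank i is discounted by log2(i + 1)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances[:k]) / dcg(ideal[:k])

print(ndcg_at_k([5, 3, 2], k=3))           # 1.0 -- already ideally ordered
print(round(ndcg_at_k([3, 2, 5], k=3), 3)) # ~0.857 -- 5-star item ranked last
```

Swapping the 5-star item to the bottom costs the score even though the same items appear, which is exactly the position sensitivity MAP's binary view cannot express.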
Regression and ranking metrics
| Metric | What It Measures | Use When | Watch Out For |
|---|---|---|---|
| MAE | Average absolute error | General regression | Hides tail errors |
| RMSE | Root mean squared error | Large errors are costly | Dominated by outliers |
| MAPE | Average percentage error | Scale-independent comparison | Explodes near zero values |
| MAP@k | Ranking quality (top k) | Recommendations, search | Binary relevance only |
| NDCG@k | Graded ranking quality | Recommendations with ratings | Position weighting assumptions |
Choose regression metrics based on your error cost function. Choose ranking metrics based on whether relevance is binary or graded.
The evaluation mistake that costs millions
The single most common evaluation mistake in enterprise ML is using random train/test splits on time-series or sequential data. This is not a minor technicality. It inflates model accuracy by 5 to 20 percentage points and leads teams to deploy models that perform far worse in production than in evaluation.
How random splits cause data leakage
Consider a churn prediction model. With a random 80/20 split, the training set contains data from January through December 2025, and so does the test set. The model can learn patterns from December data and apply them to predict October outcomes. In production, you can only predict the future from the past, never the reverse.
Random split vs. temporal split
| Split Method | Training Data | Test Data | AUROC | Production AUROC |
|---|---|---|---|---|
| Random 80/20 | Jan-Dec 2025 (random 80%) | Jan-Dec 2025 (random 20%) | 0.88 | 0.72 |
| Temporal | Jan-Jun 2025 | Jul-Dec 2025 | 0.74 | 0.73 |
The random split inflates AUROC by 16 points (0.88 vs 0.72 in production) because future patterns leak into training. The temporal split matches production performance within 1 point.
The RelBench benchmark enforces temporal splits across all 30 tasks. Training data comes from one time period, validation from a later period, and test from the latest period. No future information leaks into training. This is why RelBench numbers are lower than results published on Kaggle-style benchmarks and closer to what you will see in production.
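A temporal split is straightforward to implement once every row carries its event timestamp. A minimal sketch (the three-tuple row shape is a hypothetical, not any particular library's API):

```python
from datetime import date

def temporal_split(rows, train_end, valid_end):
    # rows: (event_date, features, label) tuples.
    # Everything before train_end trains; later data can never leak backwards.
    train = [r for r in rows if r[0] < train_end]
    valid = [r for r in rows if train_end <= r[0] < valid_end]
    test = [r for r in rows if r[0] >= valid_end]
    return train, valid, test

# One synthetic row per month of 2025
rows = [(date(2025, m, 15), {"spend": m * 10}, m % 2) for m in range(1, 13)]
train, valid, test = temporal_split(rows, date(2025, 7, 1), date(2025, 10, 1))
print(len(train), len(valid), len(test))  # 6 3 3
```

The invariant worth asserting in any pipeline: the newest training row must predate the oldest validation row.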
Common evaluation mistakes
- Random train/test splits on time-series data
- Single aggregate metric hides slice-level failures
- Evaluating on one dataset, deploying on different distribution
- Ignoring cold-start entity performance
- Reporting test accuracy without temporal splits
Proper evaluation practice
- Temporal splits: train on past, test on future
- Report metrics per slice (cold-start, power users, etc.)
- Evaluate on the distribution you will deploy on
- Separate cold-start metrics from established entities
- Use RelBench-style multi-task, temporal evaluation
How RelBench changed model evaluation
Before RelBench, there was no standard benchmark for ML on relational databases. Teams evaluated models on single-table Kaggle datasets (Titanic, house prices, credit default) that bear no resemblance to enterprise relational databases with 10 to 50 interconnected tables.
RelBench, published at NeurIPS 2024 by researchers at Stanford and Kumo.ai, established four principles for relational ML evaluation.
1. Multi-table structure
All 7 databases in RelBench have 3 to 15 interconnected tables. Tasks require information from multiple tables to solve well. This forces models to handle the relational structure rather than operating on pre-flattened data.
2. Temporal splits
Every task uses temporal splits. Training, validation, and test periods are non-overlapping time windows. Models cannot use future information to predict past outcomes. This eliminates the most common source of inflated benchmark results.
3. Multi-task evaluation
A single database supports multiple prediction tasks. The Amazon dataset has product rating prediction, user churn, and item-to-item recommendation. The clinical trial dataset has patient outcome prediction, adverse event detection, and treatment response forecasting. This tests whether an approach generalizes across tasks, not just optimizes one.
4. Scale
With 103 million rows across 7 databases, RelBench tests whether approaches work at enterprise scale. Algorithms that perform well on 10,000-row toy datasets often fail at 10 million rows due to memory limitations or computational cost.
RelBench results by approach
| Model | Avg AUROC | Evaluation Method | Feature Engineering | Training |
|---|---|---|---|---|
| LightGBM (manual FE) | 62.44 | Temporal splits | 12.3 hrs / 878 LOC | Per task |
| Llama 3.2 3B (LLM) | 68.06 | Temporal splits | None (serialized) | Pre-trained |
| Supervised GNN | 75.83 | Temporal splits | None (graph) | Per task |
| KumoRFM zero-shot | 76.71 | Temporal splits | None (graph) | None |
| KumoRFM fine-tuned | 81.14 | Temporal splits | None (graph) | 2-8 hours |
All results from the same benchmark (7 databases, 30 tasks, 103M+ rows) with the same temporal splits. Directly comparable.
PQL Query
PREDICT churn_30d
FOR EACH customers.customer_id
EVALUATE WITH temporal_split('2025-01-01', '2025-04-01', '2025-07-01')

PQL supports built-in temporal evaluation. The model trains on data before Jan 2025, validates on Jan-Apr, and tests on Apr-Jul. No future information leaks into training.
Output
| split | AUROC | precision@10% | recall@10% | samples |
|---|---|---|---|---|
| validation | 0.81 | 0.74 | 0.42 | 145,000 |
| test | 0.79 | 0.71 | 0.39 | 152,000 |
| delta | -0.02 | -0.03 | -0.03 | -- |
Building your evaluation framework
For enterprise deployment, a single number is never sufficient. Here is the evaluation framework that production ML teams use.
Step 1: Choose your primary metric based on the business cost
Map the prediction task to a cost model. For fraud detection, calculate the cost of a false positive (customer friction, investigation time) and false negative (financial loss). This tells you whether to optimize for precision, recall, or a weighted combination.
Step 2: Evaluate on temporal splits
Split your data by time, not randomly. Train on data before a cutoff date, validate on the next period, test on the most recent period. If your model's test AUROC is more than 5 points higher than what you see in production, your splits are likely leaking information.
Step 3: Report slice-level metrics
Break your evaluation into meaningful segments: new customers vs. established, high-value vs. low-value, different product categories or geographies. A model with 0.80 overall AUROC might have 0.90 on established customers and 0.55 on new customers. The 0.55 is where you are losing money.
Slice-level evaluation example
| Customer Segment | Count | AUROC | Precision@10% | Recall@10% | Revenue at Risk |
|---|---|---|---|---|---|
| Overall | 500,000 | 0.80 | 0.72 | 0.38 | $120M |
| Established (2+ years) | 320,000 | 0.89 | 0.84 | 0.45 | $78M |
| New (< 6 months) | 85,000 | 0.55 | 0.31 | 0.12 | $28M |
| Enterprise tier | 12,000 | 0.83 | 0.78 | 0.41 | $52M |
| SMB tier | 83,000 | 0.62 | 0.48 | 0.22 | $14M |
The overall 0.80 AUROC hides a 0.55 on new customers and 0.62 on SMBs. New customers represent $28M in at-risk revenue with only 12% recall -- most churn is undetected in this segment.
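Slice-level reporting needs no special tooling. A sketch that computes recall per segment at a fixed score threshold (the record shape and numbers are made up):

```python
from collections import defaultdict

def recall_by_slice(records, threshold=0.5):
    # records: (segment, model_score, true_label) tuples
    caught = defaultdict(int)
    positives = defaultdict(int)
    for segment, score, label in records:
        if label == 1:
            positives[segment] += 1
            if score >= threshold:
                caught[segment] += 1
    return {seg: caught[seg] / positives[seg] for seg in positives}

records = [
    ("established", 0.9, 1), ("established", 0.8, 1),
    ("established", 0.6, 1), ("established", 0.4, 1),
    ("new", 0.7, 1), ("new", 0.4, 1), ("new", 0.2, 1),
]
print(recall_by_slice(records))  # established 0.75, new ~0.33
```

The same pattern works for any metric: group the records by segment first, then compute the metric per group instead of once overall.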
Step 4: Measure stability over time
Run your evaluation on multiple consecutive time periods, not just one. A model that achieves 0.82 AUROC in January, 0.78 in February, and 0.65 in March has a drift problem that a single-period evaluation would miss. Report the mean and standard deviation across periods.
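Rolling evaluation is the same metric computed per period, then summarized. A sketch using the AUROC figures from the example above:

```python
import statistics

# Per-period AUROC from consecutive monthly evaluations
monthly_auroc = {"2025-01": 0.82, "2025-02": 0.78, "2025-03": 0.65}

values = list(monthly_auroc.values())
mean = statistics.mean(values)
spread = max(values) - min(values)

print(f"mean {mean:.2f}, stdev {statistics.stdev(values):.3f}, spread {spread:.2f}")
# mean 0.75, stdev 0.089, spread 0.17 -- the March drop is the signal
```

A single-period evaluation would have reported 0.82 and called it done; the spread is what exposes the drift.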
Step 5: Compare against the right baseline
Your baseline should be the current production model or business heuristic, not random guessing. If the current churn model has 0.72 AUROC and the new model has 0.78, the relevant improvement is 6 points, not 28 points above random. Convert metric improvements to dollar values using your cost model from Step 1.
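The baseline arithmetic from this step, spelled out (AUROC numbers from the text; the per-point dollar value is a placeholder for your own cost model):

```python
new_model, current_model, random_baseline = 0.78, 0.72, 0.50

lift_vs_random = round((new_model - random_baseline) * 100)   # 28 points -- misleading
lift_vs_current = round((new_model - current_model) * 100)    # 6 points -- the real gain

# Hypothetical cost model: each AUROC point on this task is worth $400k/year
value_per_point = 400_000
print(f"expected annual value: ${lift_vs_current * value_per_point:,}")
```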
From metrics to business impact
The gap between a good metric and a good business outcome is smaller than most people think, but only if you measure the right thing. A 5-point AUROC improvement on fraud detection at a financial institution processing $100 billion in annual transactions translates to tens of millions in recovered losses. A 3-point improvement on churn prediction at a SaaS company with $500M ARR and 15% annual churn translates to $2M to $5M in retained revenue.
The foundation model results on RelBench are not just benchmark numbers. They represent the accuracy floor for relational predictions without any feature engineering or model training. Every point above your current baseline is money: recovered fraud, retained customers, optimized inventory, higher conversion rates.
Measure correctly. Use temporal splits. Report slice-level results. Then convert the delta to dollars. That is how you turn model evaluation from a technical exercise into a business decision.