A model with 95% accuracy sounds impressive. Until you learn that 95% of the data belongs to one class, and the model just predicted that class every time. A model with 0.82 AUROC sounds mediocre. Until you learn it was evaluated with temporal splits on a 30-task benchmark covering 103 million rows of real-world relational data.
Model evaluation is where most ML projects go wrong. Not because teams pick the wrong metric, but because they pick the right metric and measure it the wrong way. Random splits inflate accuracy. Single-task evaluations hide generalization failures. Aggregate numbers obscure performance on the slices that matter most.
This guide covers the metrics you need to know, the evaluation mistakes that cost real money, and how the RelBench benchmark established a new standard for evaluating ML on relational data.
Classification metrics: what each one reveals and hides
Accuracy
What it measures: The fraction of predictions that are correct. A model that predicts 90 out of 100 labels correctly has 90% accuracy.
What it hides: Class imbalance. In fraud detection, where 0.1% of transactions are fraudulent, a model that labels everything as legitimate achieves 99.9% accuracy while catching zero fraud. Accuracy is useful only when classes are roughly balanced, which is rare in enterprise ML.
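The failure mode is easy to reproduce. A minimal sketch using a made-up dataset with the 0.1% fraud rate described above:

```python
# Hypothetical imbalanced dataset: 100 fraud cases in 100,000 transactions
n = 100_000
labels = [1] * 100 + [0] * (n - 100)
predictions = [0] * n  # a "model" that labels every transaction legitimate

accuracy = sum(int(p == y) for p, y in zip(predictions, labels)) / n
fraud_caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(f"accuracy: {accuracy:.3f}")      # 0.999 -- looks excellent
print(f"fraud caught: {fraud_caught}")  # 0 -- catches nothing
```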
Precision
What it measures: Of all the cases the model flagged as positive, how many actually were positive. If the model flags 100 transactions as fraud and 80 are actually fraudulent, precision is 80%.
What it hides: Missed positives. A model that only flags the 10 most obvious fraud cases might achieve 100% precision while missing 990 out of 1,000 actual fraud cases. High precision tells you the model's alerts are reliable, not that it catches everything.
When to prioritize: When false positives are expensive. Marketing campaigns where each contacted lead costs $50. Account lockouts where false positives drive customer attrition. Medical diagnoses where false positives trigger invasive procedures.
Recall (sensitivity)
What it measures: Of all the actual positive cases, how many did the model catch. If there are 1,000 fraudulent transactions and the model catches 700, recall is 70%.
What it hides: False positive volume. A model that flags every transaction as fraud achieves 100% recall while generating millions of false alerts. High recall tells you the model is comprehensive, not that it is precise.
When to prioritize: When missed positives are catastrophic. Anti-money laundering where missing a suspicious transaction can result in regulatory fines of $10M or more. Cancer screening where a missed diagnosis has irreversible consequences. Security intrusion detection where one missed breach compromises the entire system.
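Precision and recall drop straight out of the confusion-matrix counts. A quick sketch using numbers consistent with the examples above (the counts themselves are made up):

```python
# Hypothetical fraud model: flags 100 transactions, 80 of them correctly;
# another 300 fraud cases go unflagged.
tp, fp, fn = 80, 20, 300

precision = tp / (tp + fp)   # 0.80 -- flagged cases are mostly real fraud
recall = tp / (tp + fn)      # ~0.21 -- but most fraud slips through

print(f"precision {precision:.2f}, recall {recall:.2f}")
```

The same model looks strong through one lens and weak through the other, which is exactly why both numbers belong in any evaluation report.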
F1 score
What it measures: The harmonic mean of precision and recall. It balances both concerns into a single number between 0 and 1.
What it hides: Which side of the trade-off matters more to your business. An F1 of 0.75 could come from precision 0.90 and recall 0.64, or from precision 0.64 and recall 0.90. Those are very different models with very different operational consequences. F1 is convenient for model comparison but insufficient for production deployment decisions.
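The harmonic-mean arithmetic for the two models described above makes the point concrete (a quick stdlib-only sketch):

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean: punishes whichever of the two inputs is lower
    return 2 * precision * recall / (precision + recall)

precision_heavy = f1(0.90, 0.64)  # reliable alerts, misses cases
recall_heavy = f1(0.64, 0.90)     # comprehensive, noisier alerts

print(round(precision_heavy, 2), round(recall_heavy, 2))  # 0.75 0.75
```

Two operationally different models, one indistinguishable score.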
AUROC
What it measures: The probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. A score of 0.5 means random ranking and 1.0 means perfect ranking (scores below 0.5 indicate a model that ranks worse than chance). It evaluates the model across all possible classification thresholds simultaneously.
What it hides: Performance at specific operating points. An AUROC of 0.85 tells you the model has good overall ranking ability, but not whether it works well at the threshold your application uses. Two models with identical AUROC can have very different precision-recall curves.
When to use: AUROC is the standard metric for comparing models when you have not yet decided on an operating threshold. RelBench uses AUROC as its primary classification metric across all 30 tasks. On that benchmark, KumoRFM achieves 76.71 zero-shot and 81.14 fine-tuned, vs. 75.83 for a supervised GNN and 62.44 for LightGBM with manual features.
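The ranking interpretation can be computed directly from the definition. A small self-contained sketch (the O(n·m) pairwise form, fine for illustration but not for large datasets):

```python
def pairwise_auroc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# 8 of 9 pairs ranked correctly: the 0.4 positive falls below the 0.7 negative
score = pairwise_auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(score, 3))  # 0.889
```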
Classification metrics comparison
| Metric | Range | Best For | Misleads When | RelBench Standard |
|---|---|---|---|---|
| Accuracy | 0-100% | Balanced classes | Imbalanced data (99.9% = trivial) | No |
| Precision | 0-1.0 | Cost of false positives high | Model flags very few cases | No |
| Recall | 0-1.0 | Cost of missed cases high | Model flags everything | No |
| F1 Score | 0-1.0 | Balancing precision/recall | Hides which side matters more | No |
| AUROC | 0.5-1.0 | Model comparison, ranking | Extreme class imbalance | Yes |
| AUPRC | 0-1.0 | Rare event detection | Balanced datasets | Supplementary |
No single metric tells the full story. AUROC is the standard for model comparison; precision/recall matter at your chosen operating threshold.
AUPRC (average precision)
What it measures: The area under the precision-recall curve. Unlike AUROC, AUPRC is sensitive to class imbalance. For a dataset with 0.1% positive rate, random guessing gives AUROC of 0.5 but AUPRC of only 0.001.
When to use: Any task with severe class imbalance: fraud detection (0.1% fraud rate), rare disease prediction (0.01% prevalence), anomaly detection. AUPRC gives a more honest picture of model performance in these scenarios.
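Average precision can be computed directly from labels sorted by model score. A minimal sketch (binary relevance, no library dependencies):

```python
def average_precision(ranked_labels):
    # ranked_labels: 1/0 labels sorted by descending model score.
    # Average the precision measured at each positive's rank.
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Positives at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(round(average_precision([1, 0, 1, 0, 0]), 3))  # 0.833
```

Unlike AUROC, this score drops toward the positive rate as positives sink down the ranking, which is why it is the more honest metric for rare events.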
Regression metrics: beyond the average error
MAE (Mean Absolute Error)
What it measures: The average absolute difference between predicted and actual values. If you predict customer lifetime value and your MAE is $50, your predictions are off by $50 on average.
What it hides: The distribution of errors. An MAE of $50 could mean every prediction is off by exactly $50, or it could mean 90% are off by $5 and 10% are off by $455. For business decisions, the tail matters more than the average.
RMSE (Root Mean Squared Error)
What it measures: The square root of the average squared error. RMSE penalizes large errors more than MAE. If most predictions are close but a few are wildly off, RMSE will be much higher than MAE.
When to use: When large errors are disproportionately costly. Demand forecasting where overestimating by 10,000 units is far worse than overestimating by 100 units. Financial risk where tail errors represent systemic risk.
MAPE (Mean Absolute Percentage Error)
What it measures: The average percentage error, making it scale-independent. A MAPE of 10% means predictions are off by 10% on average regardless of whether the underlying values are $10 or $10,000.
What it hides: MAPE explodes when actual values are near zero. A prediction of $5 for an actual value of $1 gives 400% error, which dominates the average. Use symmetric MAPE (sMAPE) or weighted MAPE for datasets with values near zero.
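The explosion is easy to see numerically. A quick sketch with three made-up points, one of them near zero:

```python
import math

actual = [1.0, 100.0, 200.0]      # one tiny actual value
predicted = [5.0, 110.0, 190.0]

errors = [p - a for p, a in zip(predicted, actual)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors)

print(f"MAE  {mae:.1f}")    # 8.0
print(f"RMSE {rmse:.1f}")   # 8.5
print(f"MAPE {mape:.1f}%")  # 138.3% -- the near-zero row alone contributes 400/3
```

Two dollar-scale errors of 10 and one small-dollar error of 4, yet MAPE reports the model as catastrophically wrong.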
Ranking metrics: for recommendations and ordered outputs
MAP@k (Mean Average Precision at k)
What it measures: For each user, compute the precision at each position in the top k where a relevant item appears, average those values to get that user's average precision, then take the mean across users. A MAP@10 of 0.35 means that on average, relevant items appear in the top 10 and are reasonably well-positioned.
When to use: Product recommendations, content suggestions, search ranking. Any task where you present an ordered list and care about both which items appear and where they appear.
NDCG@k (Normalized Discounted Cumulative Gain at k)
What it measures: Similar to MAP@k but uses a logarithmic discount so items ranked lower contribute progressively less to the score. NDCG@k handles graded relevance (not just binary relevant/not relevant).
When to use: When relevance has degrees. A 5-star product shown first is better than a 3-star product shown first. NDCG captures this distinction; MAP does not.
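A minimal NDCG@k sketch with graded relevance (treating star ratings as relevance grades is an assumption for illustration):

```python
import math

def dcg(relevances):
    # Item at 1-based rank i is discounted by log2(i + 1)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances[:k]) / dcg(ideal[:k])

print(ndcg_at_k([5, 3, 2], k=3))           # 1.0 -- already ideally ordered
print(round(ndcg_at_k([3, 2, 5], k=3), 3)) # ~0.857 -- 5-star item ranked last
```

Swapping the 5-star item to the bottom costs the score even though the same items appear, which is exactly the position sensitivity MAP's binary view cannot express.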
Regression and ranking metrics
| Metric | What It Measures | Use When | Watch Out For |
|---|---|---|---|
| MAE | Average absolute error | General regression | Hides tail errors |
| RMSE | Root mean squared error | Large errors are costly | Dominated by outliers |
| MAPE | Average percentage error | Scale-independent comparison | Explodes near zero values |
| MAP@k | Ranking quality (top k) | Recommendations, search | Binary relevance only |
| NDCG@k | Graded ranking quality | Recommendations with ratings | Position weighting assumptions |
Choose regression metrics based on your error cost function. Choose ranking metrics based on whether relevance is binary or graded.
The evaluation mistake that costs millions
The single most common evaluation mistake in enterprise ML is using random train/test splits on time-series or sequential data. This is not a minor technicality. It inflates model accuracy by 5 to 20 percentage points and leads teams to deploy models that perform far worse in production than in evaluation.
How random splits cause data leakage
Consider a churn prediction model. With a random 80/20 split, the training set contains data from January through December 2025, and so does the test set. The model can learn patterns from December data and apply them to predict October outcomes. In production, you can only predict the future from the past, never the reverse.
Random split vs. temporal split
| Split Method | Training Data | Test Data | AUROC | Production AUROC |
|---|---|---|---|---|
| Random 80/20 | Jan-Dec 2025 (random 80%) | Jan-Dec 2025 (random 20%) | 0.88 | 0.72 |
| Temporal | Jan-Jun 2025 | Jul-Dec 2025 | 0.74 | 0.73 |
The random split inflates AUROC by 16 points (0.88 vs 0.72 in production) because future patterns leak into training. The temporal split matches production performance within 1 point.
The RelBench benchmark enforces temporal splits across all 30 tasks. Training data comes from one time period, validation from a later period, and test from the latest period. No future information leaks into training. This is why RelBench numbers are lower than results published on Kaggle-style benchmarks and closer to what you will see in production.
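A temporal split is straightforward to implement once every row carries its event timestamp. A minimal sketch (the three-tuple row shape is a hypothetical, not any particular library's API):

```python
from datetime import date

def temporal_split(rows, train_end, valid_end):
    # rows: (event_date, features, label) tuples.
    # Everything before train_end trains; later data can never leak backwards.
    train = [r for r in rows if r[0] < train_end]
    valid = [r for r in rows if train_end <= r[0] < valid_end]
    test = [r for r in rows if r[0] >= valid_end]
    return train, valid, test

# One synthetic row per month of 2025
rows = [(date(2025, m, 15), {"spend": m * 10}, m % 2) for m in range(1, 13)]
train, valid, test = temporal_split(rows, date(2025, 7, 1), date(2025, 10, 1))
print(len(train), len(valid), len(test))  # 6 3 3
```

The invariant worth asserting in any pipeline: the newest training row must predate the oldest validation row.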
Common evaluation mistakes
- Random train/test splits on time-series data
- Single aggregate metric hides slice-level failures
- Evaluating on one dataset, deploying on different distribution
- Ignoring cold-start entity performance
- Reporting test accuracy without temporal splits
Proper evaluation practice
- Temporal splits: train on past, test on future
- Report metrics per slice (cold-start, power users, etc.)
- Evaluate on the distribution you will deploy on
- Separate cold-start metrics from established entities
- Use RelBench-style multi-task, temporal evaluation
How RelBench changed model evaluation
Before RelBench, there was no standard benchmark for ML on relational databases. Teams evaluated models on single-table Kaggle datasets (Titanic, house prices, credit default) that bear no resemblance to enterprise relational databases with 10 to 50 interconnected tables.
RelBench, published at NeurIPS 2024 by researchers at Stanford and Kumo.ai, established four principles for relational ML evaluation.
1. Multi-table structure
All 7 databases in RelBench have 3 to 15 interconnected tables. Tasks require information from multiple tables to solve well. This forces models to handle the relational structure rather than operating on pre-flattened data.
2. Temporal splits
Every task uses temporal splits. Training, validation, and test periods are non-overlapping time windows. Models cannot use future information to predict past outcomes. This eliminates the most common source of inflated benchmark results.
3. Multi-task evaluation
A single database supports multiple prediction tasks. The Amazon dataset has product rating prediction, user churn, and item-to-item recommendation. The clinical trial dataset has patient outcome prediction, adverse event detection, and treatment response forecasting. This tests whether an approach generalizes across tasks, not just optimizes one.
4. Scale
With 103 million rows across 7 databases, RelBench tests whether approaches work at enterprise scale. Algorithms that perform well on 10,000-row toy datasets often fail at 10 million rows due to memory limitations or computational cost.
RelBench results by approach
| Model | Avg AUROC | Evaluation Method | Feature Engineering | Training |
|---|---|---|---|---|
| LightGBM (manual FE) | 62.44 | Temporal splits | 12.3 hrs / 878 LOC | Per task |
| Llama 3.2 3B (LLM) | 68.06 | Temporal splits | None (serialized) | Pre-trained |
| Supervised GNN | 75.83 | Temporal splits | None (graph) | Per task |
| KumoRFM zero-shot | 76.71 | Temporal splits | None (graph) | None |
| KumoRFM fine-tuned | 81.14 | Temporal splits | None (graph) | 2-8 hours |
All results from the same benchmark (7 databases, 30 tasks, 103M+ rows) with the same temporal splits. Directly comparable.
PQL Query
PREDICT churn_30d
FOR EACH customers.customer_id
EVALUATE WITH temporal_split('2025-01-01', '2025-04-01', '2025-07-01')

PQL supports built-in temporal evaluation. The model trains on data before Jan 2025, validates on Jan-Apr, and tests on Apr-Jul. No future information leaks into training.
Output
| split | AUROC | precision@10% | recall@10% | samples |
|---|---|---|---|---|
| validation | 0.81 | 0.74 | 0.42 | 145,000 |
| test | 0.79 | 0.71 | 0.39 | 152,000 |
| delta | -0.02 | -0.03 | -0.03 | -- |
Building your evaluation framework
For enterprise deployment, a single number is never sufficient. Here is the evaluation framework that production ML teams use.
Step 1: Choose your primary metric based on the business cost
Map the prediction task to a cost model. For fraud detection, calculate the cost of a false positive (customer friction, investigation time) and false negative (financial loss). This tells you whether to optimize for precision, recall, or a weighted combination.
Step 2: Evaluate on temporal splits
Split your data by time, not randomly. Train on data before a cutoff date, validate on the next period, test on the most recent period. If your model's test AUROC is more than 5 points higher than what you see in production, your splits are likely leaking information.
Step 3: Report slice-level metrics
Break your evaluation into meaningful segments: new customers vs. established, high-value vs. low-value, different product categories or geographies. A model with 0.80 overall AUROC might have 0.90 on established customers and 0.55 on new customers. The 0.55 is where you are losing money.
Slice-level evaluation example
| Customer Segment | Count | AUROC | Precision@10% | Recall@10% | Revenue at Risk |
|---|---|---|---|---|---|
| Overall | 500,000 | 0.80 | 0.72 | 0.38 | $120M |
| Established (2+ years) | 320,000 | 0.89 | 0.84 | 0.45 | $78M |
| New (< 6 months) | 85,000 | 0.55 | 0.31 | 0.12 | $28M |
| Enterprise tier | 12,000 | 0.83 | 0.78 | 0.41 | $52M |
| SMB tier | 83,000 | 0.62 | 0.48 | 0.22 | $14M |
The overall 0.80 AUROC hides a 0.55 on new customers and 0.62 on SMBs. New customers represent $28M in at-risk revenue with only 12% recall -- most churn is undetected in this segment.
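Slice-level reporting needs no special tooling. A sketch that computes recall per segment at a fixed score threshold (the record shape and numbers are made up):

```python
from collections import defaultdict

def recall_by_slice(records, threshold=0.5):
    # records: (segment, model_score, true_label) tuples
    caught = defaultdict(int)
    positives = defaultdict(int)
    for segment, score, label in records:
        if label == 1:
            positives[segment] += 1
            if score >= threshold:
                caught[segment] += 1
    return {seg: caught[seg] / positives[seg] for seg in positives}

records = [
    ("established", 0.9, 1), ("established", 0.8, 1),
    ("established", 0.6, 1), ("established", 0.4, 1),
    ("new", 0.7, 1), ("new", 0.4, 1), ("new", 0.2, 1),
]
print(recall_by_slice(records))  # established 0.75, new ~0.33
```

The same pattern works for any metric: group the records by segment first, then compute the metric per group instead of once overall.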
Step 4: Measure stability over time
Run your evaluation on multiple consecutive time periods, not just one. A model that achieves 0.82 AUROC in January, 0.78 in February, and 0.65 in March has a drift problem that a single-period evaluation would miss. Report the mean and standard deviation across periods.
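Rolling evaluation is the same metric computed per period, then summarized. A sketch using the AUROC figures from the example above:

```python
import statistics

# Per-period AUROC from consecutive monthly evaluations
monthly_auroc = {"2025-01": 0.82, "2025-02": 0.78, "2025-03": 0.65}

values = list(monthly_auroc.values())
mean = statistics.mean(values)
spread = max(values) - min(values)

print(f"mean {mean:.2f}, stdev {statistics.stdev(values):.3f}, spread {spread:.2f}")
# mean 0.75, stdev 0.089, spread 0.17 -- the March drop is the signal
```

A single-period evaluation would have reported 0.82 and called it done; the spread is what exposes the drift.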
Step 5: Compare against the right baseline
Your baseline should be the current production model or business heuristic, not random guessing. If the current churn model has 0.72 AUROC and the new model has 0.78, the relevant improvement is 6 points, not 28 points above random. Convert metric improvements to dollar values using your cost model from Step 1.
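The baseline arithmetic from this step, spelled out (AUROC numbers from the text; the per-point dollar value is a placeholder for your own cost model):

```python
new_model, current_model, random_baseline = 0.78, 0.72, 0.50

lift_vs_random = round((new_model - random_baseline) * 100)   # 28 points -- misleading
lift_vs_current = round((new_model - current_model) * 100)    # 6 points -- the real gain

# Hypothetical cost model: each AUROC point on this task is worth $400k/year
value_per_point = 400_000
print(f"expected annual value: ${lift_vs_current * value_per_point:,}")
```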
From metrics to business impact
The gap between a good metric and a good business outcome is smaller than most people think, but only if you measure the right thing. A 5-point AUROC improvement on fraud detection at a financial institution processing $100 billion in annual transactions translates to tens of millions in recovered losses. A 3-point improvement on churn prediction at a SaaS company with $500M ARR and 15% annual churn translates to $2M to $5M in retained revenue.
The foundation model results on RelBench are not just benchmark numbers. They represent the accuracy floor for relational predictions without any feature engineering or model training. Every point above your current baseline is money: recovered fraud, retained customers, optimized inventory, higher conversion rates.
Measure correctly. Use temporal splits. Report slice-level results. Then convert the delta to dollars. That is how you turn model evaluation from a technical exercise into a business decision.