Churn prediction is among the most common ML use cases in the enterprise. Every subscription business, every marketplace, every SaaS company has a version of the same question: which customers are about to leave?
The answer is usually the same too. A data scientist spends two weeks writing SQL joins, building aggregate features, training a gradient boosted tree, and deploying it behind a feature store. Two months later, the model drifts and someone rebuilds it.
There are now three fundamentally different ways to solve this problem. They differ not in the model architecture but in how they consume data. Here is the same churn prediction task solved three ways, with real code and real numbers.
saas_customers
| customer_id | plan | MRR | tenure_months | contract_type |
|---|---|---|---|---|
| C-5501 | Enterprise | $4,200 | 18 | Annual |
| C-5502 | Pro | $890 | 7 | Monthly |
| C-5503 | Enterprise | $6,800 | 24 | Annual |
| C-5504 | Starter | $149 | 3 | Monthly |
| C-5505 | Pro | $890 | 11 | Monthly |
usage_events (last 30 days)
| customer_id | logins | API_calls | features_used | support_tickets | trend |
|---|---|---|---|---|---|
| C-5501 | 142 | 28,400 | 12 of 15 | 0 | Stable |
| C-5502 | 8 | 340 | 3 of 10 | 4 | Declining (-60%/week) |
| C-5503 | 89 | 18,200 | 9 of 15 | 2 | Stable |
| C-5504 | 31 | 1,200 | 6 of 8 | 0 | Growing (+40%/week) |
| C-5505 | 3 | 42 | 1 of 10 | 6 | Near-zero (was 45 logins/mo) |
Highlighted: C-5502 is declining rapidly with rising support tickets. C-5505 has nearly stopped using the product. A flat feature table showing 'logins_30d = 8' and 'logins_30d = 3' misses that C-5502 is accelerating downward while C-5505 has already flatlined.
The setup
Imagine a B2B SaaS company with a standard relational database. Five tables: customers, subscriptions, usage_events, support_tickets, and invoices. The goal: predict which customers will churn in the next 30 days.
This is a textbook scenario. The signal is distributed across all five tables. A customer who logged in 50 times last month but filed 8 support tickets and downgraded their plan is very different from one who logged in 50 times with no tickets and an annual contract. The patterns that predict churn live in the relationships between these tables, not in any single one.
Approach 1: Traditional ML
What you build
The traditional approach requires flattening all five tables into a single feature table with one row per customer. This means writing SQL to join, aggregate, and compute features like:
- `total_logins_last_30d`, `total_logins_last_90d`
- `avg_session_duration_7d`, `avg_session_duration_30d`
- `num_support_tickets_30d`, `avg_ticket_severity`
- `days_since_last_login`, `days_since_last_payment`
- `subscription_tier`, `months_as_customer`, `contract_type`
- `invoice_amount_trend_90d`, `payment_delay_count`
A typical churn model uses 80 to 200 features like these. Each one requires a SQL query with JOINs, GROUP BYs, and window functions. Then you need to handle nulls, normalize numerical columns, encode categoricals, and split by time to avoid leakage.
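To make that concrete, here is a dependency-free sketch of one such feature query against a toy two-table slice of the schema, using an in-memory SQLite database. All rows, column names, and the cutoff date are invented for illustration; a production version would be one of dozens of such queries.

```python
# Miniature of the feature-engineering step: JOIN + GROUP BY over a
# 30-day window, per customer. Data and cutoff are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, plan TEXT);
CREATE TABLE usage_events (customer_id TEXT, event_date TEXT);
CREATE TABLE support_tickets (customer_id TEXT, created TEXT, severity INTEGER);
INSERT INTO customers VALUES ('C-5502', 'Pro'), ('C-5504', 'Starter');
INSERT INTO usage_events VALUES
  ('C-5502', '2024-03-01'), ('C-5502', '2024-03-02'),
  ('C-5504', '2024-03-10'), ('C-5504', '2024-03-11'), ('C-5504', '2024-03-12');
INSERT INTO support_tickets VALUES
  ('C-5502', '2024-03-05', 3), ('C-5502', '2024-03-08', 4);
""")

CUTOFF = "2024-03-15"  # features are computed as of this date
rows = conn.execute("""
SELECT c.customer_id,
       -- COUNT(DISTINCT ...) guards against the fan-out of joining
       -- usage_events and support_tickets in the same query
       COUNT(DISTINCT u.event_date) AS logins_30d,
       COUNT(DISTINCT t.created)    AS tickets_30d,
       COALESCE(AVG(t.severity), 0) AS avg_severity
FROM customers c
LEFT JOIN usage_events u
  ON u.customer_id = c.customer_id
 AND u.event_date BETWEEN date(?, '-30 days') AND ?
LEFT JOIN support_tickets t
  ON t.customer_id = c.customer_id
 AND t.created BETWEEN date(?, '-30 days') AND ?
GROUP BY c.customer_id
ORDER BY c.customer_id
""", (CUTOFF, CUTOFF, CUTOFF, CUTOFF)).fetchall()

for r in rows:
    print(r)
```

Note the subtlety already visible at this tiny scale: joining two child tables in one query fans out the row count, which silently corrupts naive COUNTs and SUMs. Multiply that kind of care by 80 to 200 features and the 12.3-hour figure stops being surprising.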
After all that, you train a LightGBM or XGBoost model, tune hyperparameters, evaluate on a holdout set, and push it to production.
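The time-based split that avoids leakage can be sketched in a few lines of plain Python: features come only from activity before a cutoff, and the label from a fixed window after it. Customer IDs, dates, and the single feature here are invented.

```python
# Point-in-time correctness sketch: no information from after the cutoff
# may leak into the features. All data is invented.
from datetime import date, timedelta

cutoff = date(2024, 3, 15)
label_window = timedelta(days=30)

logins = {  # customer_id -> login dates
    "C-5502": [date(2024, 3, 1), date(2024, 3, 2)],
    "C-5504": [date(2024, 3, 10), date(2024, 4, 2)],  # one login AFTER cutoff
}
churn_dates = {"C-5502": date(2024, 4, 4)}  # C-5504 did not churn

examples = []
for cid, dates in logins.items():
    # Leakage guard: drop anything at or after the cutoff.
    feats = {"logins_before_cutoff": sum(d < cutoff for d in dates)}
    churned = churn_dates.get(cid)
    # Label: churned within the 30 days following the cutoff.
    label = churned is not None and cutoff <= churned < cutoff + label_window
    examples.append((cid, feats, int(label)))

print(examples)
```

Every one of the 80 to 200 features has to respect this cutoff individually, which is one reason leakage bugs are so common in hand-built pipelines.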
What it costs
The Stanford RelBench study measured this rigorously. Experienced data scientists spent an average of 12.3 hours and wrote 878 lines of SQL and Python per prediction task, with a standard deviation of 77 lines: even experts take noticeably different paths to the same solution.
Multiply that by the 15 to 40 prediction tasks a typical enterprise data science team maintains, and you are looking at person-months of feature engineering work just to keep models current.
Approach 2: Relational Deep Learning (RDL)
What you build
Relational Deep Learning, published at ICML 2024, takes a different approach. Instead of flattening the database into a feature table, you represent it as a temporal heterogeneous graph. Each row becomes a node. Each foreign key relationship becomes an edge. Timestamps are preserved as temporal attributes.
Using PyTorch Geometric and the RelBench framework, the setup looks roughly like this: you define your database schema, specify the prediction target (churn within 30 days for the customers table), and let a graph neural network learn which cross-table patterns are predictive. The GNN passes messages along edges (foreign key relationships), aggregating information from neighboring nodes across multiple hops.
The code is approximately 56 lines of Python. You load the dataset, define the task, configure a heterogeneous GNN (typically a GraphSAGE or GAT variant), train for a few epochs, and evaluate.
What changes
No feature engineering at all. The GNN discovers which patterns across tables are predictive by learning message-passing functions over the graph structure. It can find multi-hop signals (a customer's churn risk depends on the satisfaction scores of products they bought, which depends on the return rates of those products) that no human would enumerate.
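A minimal, dependency-free sketch of the message-passing idea: rows are nodes, each foreign key is an edge, and a customer node aggregates signals from its tickets. A real RDL model (e.g. GraphSAGE in PyTorch Geometric) learns the aggregation rather than hard-coding a mean, and stacks several hops; the IDs and scores below are invented.

```python
# One hop of message passing over a foreign-key graph, hard-coded as a
# mean for illustration. A trained GNN learns these functions instead.
tickets = {"T-903": 1.0, "T-909": 1.0, "T-904": 0.1}   # per-row "risk" signal
customers = {"C-5502": 0.0, "C-5505": 0.0, "C-5501": 0.0}

# Foreign-key edges: ticket row -> customer row.
fk = {"T-903": "C-5502", "T-909": "C-5505", "T-904": "C-5505"}

def message_pass(customers, tickets, fk):
    """Every customer aggregates (mean) the signals of its tickets."""
    inbox = {cid: [] for cid in customers}
    for tid, cid in fk.items():
        inbox[cid].append(tickets[tid])
    return {
        cid: (sum(msgs) / len(msgs) if msgs else customers[cid])
        for cid, msgs in inbox.items()
    }

updated = message_pass(customers, tickets, fk)
print(updated)  # C-5501 has no tickets, so its state is unchanged
```

Repeating this step propagates information further out: a second hop would let a customer's representation reflect, say, the products behind its tickets, which is exactly the multi-hop signal described above.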
The time drops from 12.3 hours to roughly 30 minutes, mostly spent on data loading and training. The accuracy improves because the model has access to the full relational structure. On the RelBench benchmark, the GNN baseline achieves 75.83 AUROC on classification tasks, compared to 62.44 for LightGBM with manual features.
What stays the same
You still need to train a model from scratch for each prediction task. Want to predict churn? Train a GNN. Want to predict upsell? Train another GNN. Want to predict support ticket escalation? Train another one. The feature engineering is gone, but the per-task training cycle remains.
Approach 3: Foundation model (KumoRFM)
What you build
KumoRFM is a foundation model pre-trained on billions of relational patterns across thousands of diverse databases. It has already learned the universal structures that recur across relational data: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table propagation.
To predict churn, you write one line of PQL (Predictive Query Language):
```
PREDICT churn FOR EACH customers.customer_id WHERE subscriptions.status = 'active' WITHIN 30 days
```
That is it. No feature engineering. No model training. No pipeline. The foundation model reads your database schema, constructs the temporal graph internally, and returns predictions.
What changes
Everything. The time drops from 12.3 hours to under 1 second. The code drops from 878 lines to 1 line. And because the foundation model has seen thousands of databases, it generalizes to your data without any task-specific training.
On the RelBench benchmark, KumoRFM zero-shot achieves 76.71 AUROC, outperforming both the manual feature engineering approach (62.44) and the supervised GNN baseline (75.83). Fine-tuning on your data pushes this to 81.14 AUROC.
Traditional ML pipeline
- 12.3 hours of data scientist time
- 878 lines of SQL and Python
- 80-200 hand-crafted features
- Rebuild from scratch for each task
- 62.44 AUROC on RelBench classification
Foundation model (KumoRFM)
- Under 1 second to prediction
- 1 line of PQL
- Full relational structure preserved
- Same model handles any prediction task
- 76.71 AUROC zero-shot, 81.14 fine-tuned
approach_comparison
| dimension | Traditional ML | RDL (PyG) | KumoRFM |
|---|---|---|---|
| Time to first prediction | 12.3 hours | ~30 minutes | <1 second |
| Lines of code | 878 | 56 | 1 |
| Feature engineering required | Yes (80% of time) | No | No |
| Per-task training required | Yes | Yes | No (zero-shot) |
| AUROC (RelBench classification) | 62.44 | 75.83 | 76.71 (81.14 fine-tuned) |
| Multi-hop pattern discovery | Manual only | Automatic | Automatic + pre-trained |
Note the 14-point AUROC gap between traditional ML (62.44) and KumoRFM (76.71); the section below traces exactly where it comes from.
PQL Query
```
PREDICT churn FOR EACH customers.customer_id WHERE subscriptions.status = 'active' WITHIN 30 days
```
One line of PQL replaces the entire traditional ML pipeline: SQL joins, feature computation, model training, and deployment. The foundation model reads all five tables directly.
Output
| customer_id | churn_prob | top_signal | urgency |
|---|---|---|---|
| C-5505 | 0.94 | Usage collapsed + 6 tickets + monthly contract | Immediate |
| C-5502 | 0.82 | Usage declining 60%/week + 4 tickets | This week |
| C-5503 | 0.21 | 2 tickets but usage stable, annual lock-in | Monitor |
| C-5504 | 0.09 | New but usage growing 40%/week | Low |
| C-5501 | 0.04 | High engagement, annual contract, 0 tickets | Low |
Where the accuracy difference comes from
The 14-point AUROC gap between LightGBM with manual features (62.44) and KumoRFM zero-shot (76.71) is not about the model architecture. It is about the data the model can see.
support_tickets (raw relational data)
| ticket_id | customer_id | date | category | severity | resolved_hours |
|---|---|---|---|---|---|
| T-901 | C-5502 | Feb 18 | Bug report | High | 48 |
| T-902 | C-5502 | Feb 22 | Bug report | Critical | 72 |
| T-903 | C-5502 | Mar 1 | Cancellation request | Critical | Pending |
| T-904 | C-5505 | Feb 5 | Feature request | Low | 4 |
| T-905 | C-5505 | Feb 28 | Bug report | High | 36 |
| T-906 | C-5505 | Mar 5 | Bug report | Critical | Pending |
| T-907 | C-5505 | Mar 8 | Bug report | Critical | Pending |
| T-908 | C-5505 | Mar 10 | Bug report | Critical | Pending |
| T-909 | C-5505 | Mar 12 | Cancellation request | Critical | Pending |
Highlighted: both C-5502 and C-5505 escalated from feature requests / bug reports to cancellation requests. The severity trajectory (Low to High to Critical) and the category shift (bugs to cancellation) tell the churn story.
flat_feature_table (what LightGBM sees)
| customer_id | tickets_30d | avg_severity | logins_30d | api_calls_30d | churn_signal |
|---|---|---|---|---|---|
| C-5502 | 4 | 2.5 | 8 | 340 | Moderate (by numbers) |
| C-5505 | 6 | 3.2 | 3 | 42 | High (by numbers) |
| C-5504 | 0 | 0 | 31 | 1,200 | Low |
The flat table reduces 9 ticket rows to a `tickets_30d` count and an `avg_severity`. It cannot see that C-5502's tickets escalated from 'Bug report' to 'Cancellation request'. It cannot see that C-5505 filed three critical bug reports in a single week, all still unresolved, before requesting cancellation. The temporal escalation pattern is destroyed.
A LightGBM model sees whatever features you build. If you wrote `avg_logins_last_30d` but not `login_frequency_acceleration`, the model cannot use that signal. If you aggregated support tickets into a count but lost the temporal sequence (three tickets in one day vs one per week), that pattern is gone.
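A tiny illustration of that loss, with invented ticket histories: two customers with opposite severity trajectories collapse to identical flat features.

```python
# Two invented ticket histories with the SAME aggregates but opposite
# trajectories: one customer is escalating, the other is cooling off.
escalating = [("Feature request", 1), ("Bug report", 3), ("Cancellation request", 4)]
cooling_off = [("Cancellation request", 4), ("Bug report", 3), ("Feature request", 1)]

def flatten(tickets):
    """The feature-table view: count and mean severity, order discarded."""
    sev = [s for _, s in tickets]
    return {"tickets_30d": len(sev), "avg_severity": sum(sev) / len(sev)}

print(flatten(escalating) == flatten(cooling_off))  # identical flat features
print(escalating[-1][0], "vs", cooling_off[-1][0])  # very different endings
```

Any model trained only on the flattened view must score these two customers identically, even though one is about to cancel and the other has already calmed down.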
A foundation model sees the raw relational structure. Every row, every timestamp, every foreign key relationship. It discovers patterns that span multiple tables and time horizons, including the multi-hop and temporal patterns that humans do not enumerate.
The accuracy difference is a coverage difference. The model that sees more data finds more patterns.
When to use which approach
The three approaches are not equally suited to every situation.
Traditional ML still works when
- You have a single flat table with well-understood features
- The prediction task is simple and does not span multiple tables
- You have a mature feature store and the features already exist
- Regulatory requirements demand that every feature be explicitly interpretable
RDL makes sense when
- You have multi-table data and need higher accuracy than manual features can deliver
- You have ML engineering capacity to train and deploy GNNs
- You want to invest in a single high-value prediction task (not dozens)
Foundation models are the right choice when
- You have relational data across multiple tables and need predictions fast
- You have dozens of prediction tasks and cannot staff a team to build each one manually
- Speed to first prediction matters more than squeezing the last 0.5% of accuracy (though fine-tuning closes that gap)
- Your data science team is drowning in feature engineering and you want them working on business problems instead
The bottom line
Churn prediction has not changed in a decade. The databases got bigger, the models got slightly better, but the process stayed the same: flatten, aggregate, engineer, train, deploy, repeat. The bottleneck was never the model. It was the 12.3 hours of feature engineering that came before it.
Relational deep learning removed the feature engineering step. Foundation models removed the training step. What is left is a single query that returns predictions from raw relational data in under a second.
The question is not whether this approach works. The benchmarks are clear. The question is how much longer you want your data science team spending 80% of their time writing SQL joins.