Churn prediction is among the most common ML use cases in the enterprise. Every subscription business, every marketplace, every SaaS company has a version of the same question: which customers are about to leave?
The answer is usually the same too. A data scientist spends two weeks writing SQL joins, building aggregate features, training a gradient boosted tree, and deploying it behind a feature store. Two months later, the model drifts and someone rebuilds it.
There are now three fundamentally different ways to solve this problem. They differ not in the model architecture but in how they consume data. Here is the same churn prediction task solved three ways, with real code and real numbers.
saas_customers
| customer_id | plan | MRR | tenure_months | contract_type |
|---|---|---|---|---|
| C-5501 | Enterprise | $4,200 | 18 | Annual |
| C-5502 | Pro | $890 | 7 | Monthly |
| C-5503 | Enterprise | $6,800 | 24 | Annual |
| C-5504 | Starter | $149 | 3 | Monthly |
| C-5505 | Pro | $890 | 11 | Monthly |
usage_events (last 30 days)
| customer_id | logins | API_calls | features_used | support_tickets | trend |
|---|---|---|---|---|---|
| C-5501 | 142 | 28,400 | 12 of 15 | 0 | Stable |
| C-5502 | 8 | 340 | 3 of 10 | 4 | Declining (-60%/week) |
| C-5503 | 89 | 18,200 | 9 of 15 | 2 | Stable |
| C-5504 | 31 | 1,200 | 6 of 8 | 0 | Growing (+40%/week) |
| C-5505 | 3 | 42 | 1 of 10 | 6 | Near-zero (was 45 logins/mo) |
Highlighted: C-5502 is declining rapidly with rising support tickets. C-5505 has nearly stopped using the product. A flat feature table showing 'logins_30d = 8' and 'logins_30d = 3' misses that C-5502 is accelerating downward while C-5505 has already flatlined.
The setup
Imagine a B2B SaaS company with a standard relational database. Five tables: customers, subscriptions, usage_events, support_tickets, and invoices. The goal: predict which customers will churn in the next 30 days.
This is a textbook scenario. The signal is distributed across all five tables. A customer who logged in 50 times last month but filed 8 support tickets and downgraded their plan is very different from one who logged in 50 times with no tickets and an annual contract. The patterns that predict churn live in the relationships between these tables, not in any single one.
Approach 1: Traditional ML
What you build
The traditional approach requires flattening all five tables into a single feature table with one row per customer. This means writing SQL to join, aggregate, and compute features like:
- `total_logins_last_30d`, `total_logins_last_90d`
- `avg_session_duration_7d`, `avg_session_duration_30d`
- `num_support_tickets_30d`, `avg_ticket_severity`
- `days_since_last_login`, `days_since_last_payment`
- `subscription_tier`, `months_as_customer`, `contract_type`
- `invoice_amount_trend_90d`, `payment_delay_count`
A typical churn model uses 80 to 200 features like these. Each one requires a SQL query with JOINs, GROUP BYs, and window functions. Then you need to handle nulls, normalize numerical columns, encode categoricals, and split by time to avoid leakage.
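To make that concrete, here is a dependency-free sketch of one such feature query against a toy two-table slice of the schema, using an in-memory SQLite database. All rows, column names, and the cutoff date are invented for illustration; a production version would be one of dozens of such queries.

```python
# Miniature of the feature-engineering step: JOIN + GROUP BY over a
# 30-day window, per customer. Data and cutoff are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, plan TEXT);
CREATE TABLE usage_events (customer_id TEXT, event_date TEXT);
CREATE TABLE support_tickets (customer_id TEXT, created TEXT, severity INTEGER);
INSERT INTO customers VALUES ('C-5502', 'Pro'), ('C-5504', 'Starter');
INSERT INTO usage_events VALUES
  ('C-5502', '2024-03-01'), ('C-5502', '2024-03-02'),
  ('C-5504', '2024-03-10'), ('C-5504', '2024-03-11'), ('C-5504', '2024-03-12');
INSERT INTO support_tickets VALUES
  ('C-5502', '2024-03-05', 3), ('C-5502', '2024-03-08', 4);
""")

CUTOFF = "2024-03-15"  # features are computed as of this date
rows = conn.execute("""
SELECT c.customer_id,
       -- COUNT(DISTINCT ...) guards against the fan-out of joining
       -- usage_events and support_tickets in the same query
       COUNT(DISTINCT u.event_date) AS logins_30d,
       COUNT(DISTINCT t.created)    AS tickets_30d,
       COALESCE(AVG(t.severity), 0) AS avg_severity
FROM customers c
LEFT JOIN usage_events u
  ON u.customer_id = c.customer_id
 AND u.event_date BETWEEN date(?, '-30 days') AND ?
LEFT JOIN support_tickets t
  ON t.customer_id = c.customer_id
 AND t.created BETWEEN date(?, '-30 days') AND ?
GROUP BY c.customer_id
ORDER BY c.customer_id
""", (CUTOFF, CUTOFF, CUTOFF, CUTOFF)).fetchall()

for r in rows:
    print(r)
```

Note the subtlety already visible at this tiny scale: joining two child tables in one query fans out the row count, which silently corrupts naive COUNTs and SUMs. Multiply that kind of care by 80 to 200 features and the 12.3-hour figure stops being surprising.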
After all that, you train a LightGBM or XGBoost model, tune hyperparameters, evaluate on a holdout set, and push it to production.
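The time-based split that avoids leakage can be sketched in a few lines of plain Python: features come only from activity before a cutoff, and the label from a fixed window after it. Customer IDs, dates, and the single feature here are invented.

```python
# Point-in-time correctness sketch: no information from after the cutoff
# may leak into the features. All data is invented.
from datetime import date, timedelta

cutoff = date(2024, 3, 15)
label_window = timedelta(days=30)

logins = {  # customer_id -> login dates
    "C-5502": [date(2024, 3, 1), date(2024, 3, 2)],
    "C-5504": [date(2024, 3, 10), date(2024, 4, 2)],  # one login AFTER cutoff
}
churn_dates = {"C-5502": date(2024, 4, 4)}  # C-5504 did not churn

examples = []
for cid, dates in logins.items():
    # Leakage guard: drop anything at or after the cutoff.
    feats = {"logins_before_cutoff": sum(d < cutoff for d in dates)}
    churned = churn_dates.get(cid)
    # Label: churned within the 30 days following the cutoff.
    label = churned is not None and cutoff <= churned < cutoff + label_window
    examples.append((cid, feats, int(label)))

print(examples)
```

Every one of the 80 to 200 features has to respect this cutoff individually, which is one reason leakage bugs are so common in hand-built pipelines.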
What it costs
The Stanford RelBench study measured this rigorously. Experienced data scientists spent an average of 12.3 hours and wrote 878 lines of SQL and Python per prediction task, with a standard deviation of 77 lines: even experts take noticeably different paths to the same solution.
Multiply that by the 15 to 40 prediction tasks a typical enterprise data science team maintains, and you are looking at person-months of feature engineering work just to keep models current.
Approach 2: Relational Deep Learning (RDL)
What you build
Relational Deep Learning, published at ICML 2024, takes a different approach. Instead of flattening the database into a feature table, you represent it as a temporal heterogeneous graph. Each row becomes a node. Each foreign key relationship becomes an edge. Timestamps are preserved as temporal attributes.
Using PyTorch Geometric and the RelBench framework, the setup looks roughly like this: you define your database schema, specify the prediction target (churn within 30 days for the customers table), and let a graph neural network learn which cross-table patterns are predictive. The GNN passes messages along edges (foreign key relationships), aggregating information from neighboring nodes across multiple hops.
The code is approximately 56 lines of Python. You load the dataset, define the task, configure a heterogeneous GNN (typically a GraphSAGE or GAT variant), train for a few epochs, and evaluate.
What changes
No feature engineering at all. The GNN discovers which patterns across tables are predictive by learning message-passing functions over the graph structure. It can find multi-hop signals (a customer's churn risk depends on the satisfaction scores of products they bought, which depends on the return rates of those products) that no human would enumerate.
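A minimal, dependency-free sketch of the message-passing idea: rows are nodes, each foreign key is an edge, and a customer node aggregates signals from its tickets. A real RDL model (e.g. GraphSAGE in PyTorch Geometric) learns the aggregation rather than hard-coding a mean, and stacks several hops; the IDs and scores below are invented.

```python
# One hop of message passing over a foreign-key graph, hard-coded as a
# mean for illustration. A trained GNN learns these functions instead.
tickets = {"T-903": 1.0, "T-909": 1.0, "T-904": 0.1}   # per-row "risk" signal
customers = {"C-5502": 0.0, "C-5505": 0.0, "C-5501": 0.0}

# Foreign-key edges: ticket row -> customer row.
fk = {"T-903": "C-5502", "T-909": "C-5505", "T-904": "C-5505"}

def message_pass(customers, tickets, fk):
    """Every customer aggregates (mean) the signals of its tickets."""
    inbox = {cid: [] for cid in customers}
    for tid, cid in fk.items():
        inbox[cid].append(tickets[tid])
    return {
        cid: (sum(msgs) / len(msgs) if msgs else customers[cid])
        for cid, msgs in inbox.items()
    }

updated = message_pass(customers, tickets, fk)
print(updated)  # C-5501 has no tickets, so its state is unchanged
```

Repeating this step propagates information further out: a second hop would let a customer's representation reflect, say, the products behind its tickets, which is exactly the multi-hop signal described above.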
The time drops from 12.3 hours to roughly 30 minutes, mostly spent on data loading and training. The accuracy improves because the model has access to the full relational structure. On the RelBench benchmark, the GNN baseline achieves 75.83 AUROC on classification tasks, compared to 62.44 for LightGBM with manual features.
What stays the same
You still need to train a model from scratch for each prediction task. Want to predict churn? Train a GNN. Want to predict upsell? Train another GNN. Want to predict support ticket escalation? Train another one. The feature engineering is gone, but the per-task training cycle remains.
Approach 3: Foundation model (KumoRFM)
What you build
KumoRFM is a foundation model pre-trained on billions of relational patterns across thousands of diverse databases. It has already learned the universal structures that recur across relational data: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table propagation.
To predict churn, you write one line of PQL (Predictive Query Language):
```
PREDICT churn FOR EACH customers.customer_id WHERE subscriptions.status = 'active' WITHIN 30 days
```
That is it. No feature engineering. No model training. No pipeline. The foundation model reads your database schema, constructs the temporal graph internally, and returns predictions.
What changes
Everything. The time drops from 12.3 hours to under 1 second. The code drops from 878 lines to 1 line. And because the foundation model has seen thousands of databases, it generalizes to your data without any task-specific training.
On the RelBench benchmark, KumoRFM zero-shot achieves 76.71 AUROC, outperforming both the manual feature engineering approach (62.44) and the supervised GNN baseline (75.83). Fine-tuning on your data pushes this to 81.14 AUROC.
Traditional ML pipeline
- 12.3 hours of data scientist time
- 878 lines of SQL and Python
- 80-200 hand-crafted features
- Rebuild from scratch for each task
- 62.44 AUROC on RelBench classification
Foundation model (KumoRFM)
- Under 1 second to prediction
- 1 line of PQL
- Full relational structure preserved
- Same model handles any prediction task
- 76.71 AUROC zero-shot, 81.14 fine-tuned
approach_comparison
| dimension | Traditional ML | RDL (PyG) | KumoRFM |
|---|---|---|---|
| Time to first prediction | 12.3 hours | ~30 minutes | <1 second |
| Lines of code | 878 | 56 | 1 |
| Feature engineering required | Yes (80% of time) | No | No |
| Per-task training required | Yes | Yes | No (zero-shot) |
| AUROC (RelBench classification) | 62.44 | 75.83 | 76.71 (81.14 fine-tuned) |
| Multi-hop pattern discovery | Manual only | Automatic | Automatic + pre-trained |
Note the 14-point AUROC gap between traditional ML (62.44) and KumoRFM (76.71); the section below traces exactly where it comes from.
PQL Query
```
PREDICT churn FOR EACH customers.customer_id WHERE subscriptions.status = 'active' WITHIN 30 days
```
One line of PQL replaces the entire traditional ML pipeline: SQL joins, feature computation, model training, and deployment. The foundation model reads all five tables directly.
Output
| customer_id | churn_prob | top_signal | urgency |
|---|---|---|---|
| C-5505 | 0.94 | Usage collapsed + 6 tickets + monthly contract | Immediate |
| C-5502 | 0.82 | Usage declining 60%/week + 4 tickets | This week |
| C-5503 | 0.21 | 2 tickets but usage stable, annual lock-in | Monitor |
| C-5504 | 0.09 | New but usage growing 40%/week | Low |
| C-5501 | 0.04 | High engagement, annual contract, 0 tickets | Low |
Where the accuracy difference comes from
The 14-point AUROC gap between LightGBM with manual features (62.44) and KumoRFM zero-shot (76.71) is not about the model architecture. It is about the data the model can see.
support_tickets (raw relational data)
| ticket_id | customer_id | date | category | severity | resolved_hours |
|---|---|---|---|---|---|
| T-901 | C-5502 | Feb 18 | Bug report | High | 48 |
| T-902 | C-5502 | Feb 22 | Bug report | Critical | 72 |
| T-903 | C-5502 | Mar 1 | Cancellation request | Critical | Pending |
| T-904 | C-5505 | Feb 5 | Feature request | Low | 4 |
| T-905 | C-5505 | Feb 28 | Bug report | High | 36 |
| T-906 | C-5505 | Mar 5 | Bug report | Critical | Pending |
| T-907 | C-5505 | Mar 8 | Bug report | Critical | Pending |
| T-908 | C-5505 | Mar 10 | Bug report | Critical | Pending |
| T-909 | C-5505 | Mar 12 | Cancellation request | Critical | Pending |
Highlighted: both C-5502 and C-5505 escalated from feature requests / bug reports to cancellation requests. The severity trajectory (Low to High to Critical) and the category shift (bugs to cancellation) tell the churn story.
flat_feature_table (what LightGBM sees)
| customer_id | tickets_30d | avg_severity | logins_30d | api_calls_30d | churn_signal |
|---|---|---|---|---|---|
| C-5502 | 4 | 2.5 | 8 | 340 | Moderate (by numbers) |
| C-5505 | 6 | 3.2 | 3 | 42 | High (by numbers) |
| C-5504 | 0 | 0 | 31 | 1,200 | Low |
The flat table reduces 9 ticket rows to a `tickets_30d` count and an `avg_severity`. It cannot see that C-5502's tickets escalated from 'Bug report' to 'Cancellation request'. It cannot see that C-5505 filed three critical bug reports in a single week, all still unresolved, before requesting cancellation. The temporal escalation pattern is destroyed.
A LightGBM model sees whatever features you build. If you wrote `avg_logins_last_30d` but not `login_frequency_acceleration`, the model cannot use that signal. If you aggregated support tickets into a count but lost the temporal sequence (three tickets in one day vs one per week), that pattern is gone.
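A tiny illustration of that loss, with invented ticket histories: two customers with opposite severity trajectories collapse to identical flat features.

```python
# Two invented ticket histories with the SAME aggregates but opposite
# trajectories: one customer is escalating, the other is cooling off.
escalating = [("Feature request", 1), ("Bug report", 3), ("Cancellation request", 4)]
cooling_off = [("Cancellation request", 4), ("Bug report", 3), ("Feature request", 1)]

def flatten(tickets):
    """The feature-table view: count and mean severity, order discarded."""
    sev = [s for _, s in tickets]
    return {"tickets_30d": len(sev), "avg_severity": sum(sev) / len(sev)}

print(flatten(escalating) == flatten(cooling_off))  # identical flat features
print(escalating[-1][0], "vs", cooling_off[-1][0])  # very different endings
```

Any model trained only on the flattened view must score these two customers identically, even though one is about to cancel and the other has already calmed down.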
A foundation model sees the raw relational structure. Every row, every timestamp, every foreign key relationship. It discovers patterns that span multiple tables and time horizons, including the multi-hop and temporal patterns that humans do not enumerate.
The accuracy difference is a coverage difference. The model that sees more data finds more patterns.
When to use which approach
The three approaches are not equally suited to every situation.
Traditional ML still works when
- You have a single flat table with well-understood features
- The prediction task is simple and does not span multiple tables
- You have a mature feature store and the features already exist
- Regulatory requirements demand that every feature be explicitly interpretable
RDL makes sense when
- You have multi-table data and need higher accuracy than manual features can deliver
- You have ML engineering capacity to train and deploy GNNs
- You want to invest in a single high-value prediction task (not dozens)
Foundation models are the right choice when
- You have relational data across multiple tables and need predictions fast
- You have dozens of prediction tasks and cannot staff a team to build each one manually
- Speed to first prediction matters more than squeezing the last 0.5% of accuracy (though fine-tuning closes that gap)
- Your data science team is drowning in feature engineering and you want them working on business problems instead
The bottom line
Churn prediction has not changed in a decade. The databases got bigger, the models got slightly better, but the process stayed the same: flatten, aggregate, engineer, train, deploy, repeat. The bottleneck was never the model. It was the 12.3 hours of feature engineering that came before it.
Relational deep learning removed the feature engineering step. Foundation models removed the training step. What is left is a single query that returns predictions from raw relational data in under a second.
The question is not whether this approach works. The benchmarks are clear. The question is how much longer you want your data science team spending 80% of their time writing SQL joins.