How to Predict Churn Without Feature Engineering

Three approaches to the same churn prediction task. Traditional ML: 12.3 hours, 878 lines. RDL: 30 minutes, 56 lines. Foundation model: 1 second, 1 line. Here's what each one looks like.

TL;DR

  • Three approaches to the same churn task: Traditional ML (12.3 hours, 878 lines, 62.44 AUROC), RDL with PyG (30 min, 56 lines, 75.83 AUROC), KumoRFM (1 second, 1 line, 76.71 zero-shot / 81.14 fine-tuned).
  • The 14-point AUROC gap between LightGBM and KumoRFM is not a model gap. It is a data coverage gap. Flat feature tables explore 4-17% of the combinatorial feature space. The foundation model sees the full relational structure.
  • Usage trajectory matters more than usage counts. A customer declining 60%/week is a different signal than one who always had low usage. Flattening into logins_30d destroys this temporal information.
  • Traditional ML rebuilds from scratch for every task. RDL retrains per task. A foundation model handles any prediction task zero-shot. For teams maintaining 20+ models, the economics are fundamentally different.
  • The bottleneck was never the model. It was the 12.3 hours of feature engineering that came before it. Eliminating that step changes what is possible for enterprise ML teams.

Churn prediction is the most common ML use case in enterprise. Every subscription business, every marketplace, every SaaS company has a version of the same question: which customers are about to leave?

The answer is usually the same too. A data scientist spends two weeks writing SQL joins, building aggregate features, training a gradient boosted tree, and deploying it behind a feature store. Two months later, the model drifts and someone rebuilds it.

There are now three fundamentally different ways to solve this problem. They differ not in the model architecture but in how they consume data. Here is the same churn prediction task solved three ways, with real code and real numbers.

saas_customers

| customer_id | plan | MRR | tenure_months | contract_type |
| --- | --- | --- | --- | --- |
| C-5501 | Enterprise | $4,200 | 18 | Annual |
| C-5502 | Pro | $890 | 7 | Monthly |
| C-5503 | Enterprise | $6,800 | 24 | Annual |
| C-5504 | Starter | $149 | 3 | Monthly |
| C-5505 | Pro | $890 | 11 | Monthly |

usage_events (last 30 days)

| customer_id | logins | API_calls | features_used | support_tickets | trend |
| --- | --- | --- | --- | --- | --- |
| C-5501 | 142 | 28,400 | 12 of 15 | 0 | Stable |
| C-5502 | 8 | 340 | 3 of 10 | 4 | Declining (-60%/week) |
| C-5503 | 89 | 18,200 | 9 of 15 | 2 | Stable |
| C-5504 | 31 | 1,200 | 6 of 8 | 0 | Growing (+40%/week) |
| C-5505 | 3 | 42 | 1 of 10 | 6 | Near-zero (was 45 logins/mo) |

Highlighted: C-5502 is declining rapidly with rising support tickets. C-5505 has nearly stopped using the product. A flat feature table showing 'logins_30d = 8' and 'logins_30d = 3' misses that C-5502 is accelerating downward while C-5505 has already flatlined.
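A flat count cannot separate these two customers, but a trajectory feature can. Here is a minimal sketch of what "Declining (-60%/week)" means, using hypothetical weekly login counts (invented for illustration, not taken from the table above):

```python
# Hypothetical weekly login counts, oldest week first.
weekly_logins = {
    "C-5502": [20, 8, 3.2],  # each week ~40% of the previous: -60%/week
    "C-5505": [2, 2, 2],     # already flat at near-zero
}

def weekly_change(counts):
    """Average week-over-week ratio minus 1 (-0.6 means -60%/week)."""
    ratios = [after / before for before, after in zip(counts, counts[1:]) if before]
    return sum(ratios) / len(ratios) - 1

print(round(weekly_change(weekly_logins["C-5502"]), 2))  # -0.6
print(round(weekly_change(weekly_logins["C-5505"]), 2))  # 0.0
```

In a flat feature table, this slope exists only if someone thought to engineer it.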

The setup

Imagine a B2B SaaS company with a standard relational database. Five tables: customers, subscriptions, usage_events, support_tickets, and invoices. The goal: predict which customers will churn in the next 30 days.

This is a textbook scenario. The signal is distributed across all five tables. A customer who logged in 50 times last month but filed 8 support tickets and downgraded their plan is very different from one who logged in 50 times with no tickets and an annual contract. The patterns that predict churn live in the relationships between these tables, not in any single one.

Approach 1: Traditional ML

What you build

The traditional approach requires flattening all five tables into a single feature table with one row per customer. This means writing SQL to join, aggregate, and compute features like:

  • total_logins_last_30d, total_logins_last_90d
  • avg_session_duration_7d, avg_session_duration_30d
  • num_support_tickets_30d, avg_ticket_severity
  • days_since_last_login, days_since_last_payment
  • subscription_tier, months_as_customer, contract_type
  • invoice_amount_trend_90d, payment_delay_count

A typical churn model uses 80 to 200 features like these. Each one requires a SQL query with JOINs, GROUP BYs, and window functions. Then you need to handle nulls, normalize numerical columns, encode categoricals, and split by time to avoid leakage.
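As a sketch of what two of these features look like, here they are in plain Python (the event rows and dates are hypothetical; a real pipeline would compute this in SQL over usage_events):

```python
from datetime import date, timedelta

# Hypothetical raw usage_events rows: (customer_id, event_date).
usage_events = [
    ("C-5502", date(2025, 3, 1)),
    ("C-5502", date(2025, 2, 20)),
    ("C-5502", date(2025, 1, 5)),  # falls outside the 30-day window below
]

def total_logins_last_n_days(events, customer_id, as_of, n):
    """Count login events for one customer in the trailing n-day window."""
    cutoff = as_of - timedelta(days=n)
    return sum(1 for cid, d in events if cid == customer_id and cutoff < d <= as_of)

def days_since_last_login(events, customer_id, as_of):
    """Days elapsed since the customer's most recent login, or None if no logins."""
    dates = [d for cid, d in events if cid == customer_id and d <= as_of]
    return (as_of - max(dates)).days if dates else None

as_of = date(2025, 3, 15)
print(total_logins_last_n_days(usage_events, "C-5502", as_of, 30))  # 2
print(days_since_last_login(usage_events, "C-5502", as_of))         # 14
```

Now repeat that, with time windows and null handling, 80 to 200 times.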

After all that, you train a LightGBM or XGBoost model, tune hyperparameters, evaluate on a holdout set, and push it to production.

What it costs

The Stanford RelBench study measured this rigorously. Experienced data scientists spent an average of 12.3 hours and wrote 878 lines of SQL and Python per prediction task. The standard deviation was 77 lines, meaning even experts diverge significantly in their approach.

Multiply that by the 15 to 40 prediction tasks a typical enterprise data science team maintains, and you are looking at person-months of feature engineering work just to keep models current.

Approach 2: Relational Deep Learning (RDL)

What you build

Relational Deep Learning, published at ICML 2024, takes a different approach. Instead of flattening the database into a feature table, you represent it as a temporal heterogeneous graph. Each row becomes a node. Each foreign key relationship becomes an edge. Timestamps are preserved as temporal attributes.

Using PyTorch Geometric and the RelBench framework, the setup looks roughly like this: you define your database schema, specify the prediction target (churn within 30 days for the customers table), and let a graph neural network learn which cross-table patterns are predictive. The GNN passes messages along edges (foreign key relationships), aggregating information from neighboring nodes across multiple hops.

The code is approximately 56 lines of Python. You load the dataset, define the task, configure a heterogeneous GNN (typically a GraphSAGE or GAT variant), train for a few epochs, and evaluate.

What changes

No feature engineering at all. The GNN discovers which patterns across tables are predictive by learning message-passing functions over the graph structure. It can find multi-hop signals (a customer's churn risk depends on the satisfaction scores of products they bought, which depends on the return rates of those products) that no human would enumerate.
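The mechanism can be sketched in a few lines of plain Python. A real RDL model uses learned transforms in PyTorch Geometric; the graph, severity encoding, and values below are invented for this sketch:

```python
# Foreign keys become edges: each ticket node links to its customer node.
tickets_of = {"C-5502": ["T-901", "T-902", "T-903"]}

# Hypothetical node features: ticket severity encoded as Low=1, High=2, Critical=3.
severity = {"T-901": 2, "T-902": 3, "T-903": 3}

def aggregate_neighbors(node, edges, values):
    """One message-passing step: pool feature values from a node's neighbors.

    A GNN layer does this for every node at once, with learned weights instead
    of a plain mean; stacking layers propagates signal across multiple hops.
    """
    neighbors = edges.get(node, [])
    return sum(values[n] for n in neighbors) / len(neighbors) if neighbors else 0.0

print(aggregate_neighbors("C-5502", tickets_of, severity))  # ~2.67: mostly Critical
```

No human wrote an "average ticket severity" feature here; the aggregation falls out of the graph structure itself.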

The time drops from 12.3 hours to roughly 30 minutes, mostly spent on data loading and training. The accuracy improves because the model has access to the full relational structure. On the RelBench benchmark, the GNN baseline achieves 75.83 AUROC on classification tasks, compared to 62.44 for LightGBM with manual features.

What stays the same

You still need to train a model from scratch for each prediction task. Want to predict churn? Train a GNN. Want to predict upsell? Train another GNN. Want to predict support ticket escalation? Train another one. The feature engineering is gone, but the per-task training cycle remains.

Approach 3: Foundation model (KumoRFM)

What you build

KumoRFM is a foundation model pre-trained on billions of relational patterns across thousands of diverse databases. It has already learned the universal structures that recur across relational data: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table propagation.

To predict churn, you write one line of PQL (Predictive Query Language):

PREDICT churn FOR customers WHERE subscriptions.status = 'active' WITHIN 30 days

That is it. No feature engineering. No model training. No pipeline. The foundation model reads your database schema, constructs the temporal graph internally, and returns predictions.

What changes

Everything. The time drops from 12.3 hours to under 1 second. The code drops from 878 lines to 1 line. And because the foundation model has seen thousands of databases, it generalizes to your data without any task-specific training.

On the RelBench benchmark, KumoRFM zero-shot achieves 76.71 AUROC, outperforming both the manual feature engineering approach (62.44) and the supervised GNN baseline (75.83). Fine-tuning on your data pushes this to 81.14 AUROC.

Traditional ML pipeline

  • 12.3 hours of data scientist time
  • 878 lines of SQL and Python
  • 80-200 hand-crafted features
  • Rebuild from scratch for each task
  • 62.44 AUROC on RelBench classification

Foundation model (KumoRFM)

  • Under 1 second to prediction
  • 1 line of PQL
  • Full relational structure preserved
  • Same model handles any prediction task
  • 76.71 AUROC zero-shot, 81.14 fine-tuned

approach_comparison

| dimension | Traditional ML | RDL (PyG) | KumoRFM |
| --- | --- | --- | --- |
| Time to first prediction | 12.3 hours | ~30 minutes | <1 second |
| Lines of code | 878 | 56 | 1 |
| Feature engineering required | Yes (80% of time) | No | No |
| Per-task training required | Yes | Yes | No (zero-shot) |
| AUROC (RelBench classification) | 62.44 | 75.83 | 76.71 (81.14 fine-tuned) |
| Multi-hop pattern discovery | Manual only | Automatic | Automatic + pre-trained |

The 14-point AUROC gap between traditional ML (62.44) and KumoRFM (76.71) is not about model architecture. It is about data access: the foundation model sees the full relational structure that flattening destroys.

PQL Query

PREDICT churn
FOR EACH customers.customer_id
WHERE subscriptions.status = 'active'
WITHIN 30 days

One line of PQL replaces the entire traditional ML pipeline: SQL joins, feature computation, model training, and deployment. The foundation model reads all five tables directly.

Output

| customer_id | churn_prob | top_signal | urgency |
| --- | --- | --- | --- |
| C-5505 | 0.94 | Usage collapsed + 6 tickets + monthly contract | Immediate |
| C-5502 | 0.82 | Usage declining 60%/week + 4 tickets | This week |
| C-5503 | 0.21 | 2 tickets but usage stable, annual lock-in | Monitor |
| C-5504 | 0.09 | New but usage growing 40%/week | Low |
| C-5501 | 0.04 | High engagement, annual contract, 0 tickets | Low |

Where the accuracy difference comes from

The 14-point AUROC gap between LightGBM with manual features (62.44) and KumoRFM zero-shot (76.71) is not about the model architecture. It is about the data the model can see.

support_tickets (raw relational data)

| ticket_id | customer_id | date | category | severity | resolved_hours |
| --- | --- | --- | --- | --- | --- |
| T-901 | C-5502 | Feb 18 | Bug report | High | 48 |
| T-902 | C-5502 | Feb 22 | Bug report | Critical | 72 |
| T-903 | C-5502 | Mar 1 | Cancellation request | Critical | Pending |
| T-904 | C-5505 | Feb 5 | Feature request | Low | 4 |
| T-905 | C-5505 | Feb 28 | Bug report | High | 36 |
| T-906 | C-5505 | Mar 5 | Bug report | Critical | Pending |
| T-907 | C-5505 | Mar 8 | Bug report | Critical | Pending |
| T-908 | C-5505 | Mar 10 | Bug report | Critical | Pending |
| T-909 | C-5505 | Mar 12 | Cancellation request | Critical | Pending |

Highlighted: both C-5502 and C-5505 escalated from feature requests / bug reports to cancellation requests. The severity trajectory (Low to High to Critical) and the category shift (bugs to cancellation) tell the churn story.

flat_feature_table (what LightGBM sees)

| customer_id | tickets_30d | avg_severity | logins_30d | api_calls_30d | churn_signal |
| --- | --- | --- | --- | --- | --- |
| C-5502 | 4 | 2.5 | 8 | 340 | Moderate (by numbers) |
| C-5505 | 6 | 3.2 | 3 | 42 | High (by numbers) |
| C-5504 | 0 | 0 | 31 | 1,200 | Low |

The flat table reduces 9 ticket rows to a 'tickets_30d' count and an 'avg_severity'. It cannot see that C-5502's tickets escalated from 'Bug report' to 'Cancellation request'. It cannot see that C-5505 filed three Critical bug reports in a single week before requesting cancellation. The temporal escalation pattern is destroyed.

A LightGBM model sees whatever features you build. If you wrote avg_logins_last_30d but not login_frequency_acceleration, the model cannot use that signal. If you aggregated support tickets into a count but lost the temporal sequence (three tickets in one day vs. one per week), that pattern is gone.

A foundation model sees the raw relational structure. Every row, every timestamp, every foreign key relationship. It discovers patterns that span multiple tables and time horizons, including the multi-hop and temporal patterns that humans do not enumerate.

The accuracy difference is a coverage difference. The model that sees more data finds more patterns.
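To make the coverage point concrete, here is a toy case (with the same hypothetical Low=1/High=2/Critical=3 severity encoding) in which two opposite ticket histories flatten to identical features:

```python
# Severity sequences over time, oldest first.
escalating = [1, 2, 3]  # Low -> High -> Critical: classic churn trajectory
cooling    = [3, 2, 1]  # Critical -> High -> Low: issue being resolved

def flat_features(seq):
    """What a flattened feature table keeps: count and average severity."""
    return {"tickets": len(seq), "avg_severity": sum(seq) / len(seq)}

# The aggregates are identical, so a model trained on them
# cannot tell the two customers apart.
print(flat_features(escalating) == flat_features(cooling))  # True

# Keep the sequence and the direction survives.
def trend(seq):
    return seq[-1] - seq[0]  # positive = escalating, negative = cooling

print(trend(escalating), trend(cooling))  # 2 -2
```

A model that consumes the raw rows gets the trend for free; a model that consumes the flat table never sees it.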

When to use which approach

The three approaches are not equally suited to every situation.

Traditional ML still works when

  • You have a single flat table with well-understood features
  • The prediction task is simple and does not span multiple tables
  • You have a mature feature store and the features already exist
  • Regulatory requirements demand that every feature be explicitly interpretable

RDL makes sense when

  • You have multi-table data and need higher accuracy than manual features can deliver
  • You have ML engineering capacity to train and deploy GNNs
  • You want to invest in a single high-value prediction task (not dozens)

Foundation models are the right choice when

  • You have relational data across multiple tables and need predictions fast
  • You have dozens of prediction tasks and cannot staff a team to build each one manually
  • Speed to first prediction matters more than squeezing the last 0.5% of accuracy (though fine-tuning closes that gap)
  • Your data science team is drowning in feature engineering and you want them working on business problems instead

The bottom line

Churn prediction has not changed in a decade. The databases got bigger, the models got slightly better, but the process stayed the same: flatten, aggregate, engineer, train, deploy, repeat. The bottleneck was never the model. It was the 12.3 hours of feature engineering that came before it.

Relational deep learning removed the feature engineering step. Foundation models removed the training step. What is left is a single query that returns predictions from raw relational data in under a second.

The question is not whether this approach works. The benchmarks are clear. The question is how much longer you want your data science team spending 80% of their time writing SQL joins.

Frequently asked questions

Can you predict churn without building features manually?

Yes. Relational deep learning and foundation models like KumoRFM learn directly from raw relational tables. They represent your database as a temporal graph and discover predictive patterns across tables automatically, eliminating the need to hand-engineer features like days_since_last_order or avg_spend_90d.

How accurate is churn prediction without feature engineering?

On the RelBench benchmark, KumoRFM zero-shot achieves 76.71 AUROC on classification tasks, outperforming LightGBM with manually engineered features (62.44 AUROC) and matching or exceeding supervised GNN baselines (75.83 AUROC). Fine-tuning pushes accuracy to 81.14 AUROC.

What is PQL (Predictive Query Language)?

PQL is a query language developed by Kumo.ai that lets you express prediction tasks in a single line. Instead of writing hundreds of lines of SQL and Python to engineer features and train a model, you write a statement like PREDICT churn FOR customers WITHIN 30 days. The foundation model handles everything else.

How long does it take to build a churn model with KumoRFM vs traditional ML?

Traditional ML takes an average of 12.3 hours and 878 lines of code per prediction task, according to a Stanford study. The RDL approach with PyTorch Geometric reduces that to roughly 30 minutes and 56 lines of Python. KumoRFM reduces it to 1 second and 1 line of PQL.

Do I need to flatten my database into a single table for churn prediction?

With traditional ML, yes. You must join and aggregate multiple tables into a flat feature table with one row per customer. With relational deep learning or KumoRFM, no. The model reads your tables directly, preserving the multi-table structure, temporal sequences, and graph relationships that flattening destroys.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.