Every enterprise ML project starts the same way: a prediction question (who will churn? what will they buy? will this loan default?) and a relational database with the data to answer it. The database has 5, 10, maybe 50 interconnected tables. The ML model needs a flat table.
How you bridge that gap determines how long it takes, how accurate the result is, and how much ongoing maintenance your team absorbs. There are three viable approaches today, each with fundamentally different tradeoffs.
e_commerce_db — schema overview
| Table | Rows | Key Columns | Foreign Keys |
|---|---|---|---|
| customers | 120,000 | customer_id, name, signup_date, segment | — |
| orders | 3,400,000 | order_id, customer_id, total, order_date | customer_id → customers |
| products | 45,000 | product_id, category, brand, price | — |
| order_items | 9,200,000 | item_id, order_id, product_id, qty | order_id → orders, product_id → products |
| support_tickets | 890,000 | ticket_id, customer_id, category, resolved | customer_id → customers |
Five tables, four foreign-key relationships. The question: which customers will churn in the next 90 days?
Approach 1: Flatten and engineer features manually
This is the default for most data science teams today. The process has five steps.
Step 1: Understand the schema. Map the tables, foreign keys, and relationships. For a customer churn prediction on an e-commerce database with customers, orders, order_items, products, and support tickets, this means understanding 5 tables and their join paths.
Step 2: Define the entity and target. One row per customer, target variable is "churned in next 90 days." This seems simple but requires careful temporal handling to avoid data leakage: features must be computed from data available before the prediction point.
Step 3: Write the feature SQL. For each customer, compute aggregates from each related table. This is where the bulk of the work lives.
A typical feature set might include: total orders in the last 30, 60, 90 days; average order value over the same windows; number of distinct product categories purchased; number of support tickets filed; average review score given; days since last order; days since last support ticket; ratio of returns to orders; number of sessions in the last 14 days.
Each feature requires a SQL query with joins, aggregations, and time window logic. Fifty features means fifty queries. A Stanford study measured this workflow at 878 lines of code and 12.3 hours per task for experienced data scientists.
Step 4: Train the model. Feed the flat feature table to XGBoost or LightGBM. Tune hyperparameters. Cross-validate. This takes 30-60 minutes of compute time and a few hours of data scientist time.
Step 5: Validate and deploy. Check for data leakage, test on held-out data, build a serving pipeline, set up monitoring for data drift and model decay.
support_tickets (raw relational data for C-002)
| ticket_id | customer_id | date | category | severity | resolved_hrs |
|---|---|---|---|---|---|
| TK-201 | C-002 | Jan 8 | Shipping delay | Medium | 12 |
| TK-202 | C-002 | Feb 2 | Wrong item received | High | 48 |
| TK-203 | C-002 | Feb 28 | Refund not processed | Critical | Pending |
Highlighted: C-002's tickets escalated from 'Shipping delay' to 'Wrong item' to 'Refund not processed'. Severity rose from Medium to Critical. The third ticket is still unresolved. This escalation trajectory predicts churn far better than 'tickets = 3'.
flattened_feature_table (what LightGBM actually receives)
| customer_id | orders_30d | orders_90d | avg_order_value | distinct_categories | support_tickets | days_since_last_order |
|---|---|---|---|---|---|---|
| C-001 | 4 | 11 | $67.30 | 5 | 0 | 3 |
| C-002 | 0 | 2 | $112.50 | 2 | 3 | 74 |
| C-003 | 6 | 18 | $45.80 | 8 | 1 | 1 |
| C-004 | 1 | 4 | $89.00 | 3 | 0 | 22 |
C-002 shows support_tickets = 3 and days_since_last_order = 74. The flat table cannot distinguish 3 escalating unresolved complaints from 3 resolved feature requests. It cannot show that C-002's last order was returned. The relational context is gone.
Approach 2: Graph neural networks on relational data
Relational deep learning (RDL), published at ICML 2024, introduced a fundamentally different approach. Instead of flattening the database into a table, represent it as a graph and train a GNN directly on the structure.
Step 1: Convert to a temporal heterogeneous graph. Each row in each table becomes a node. Foreign key relationships become edges. Timestamps create a temporal ordering. The e-commerce database becomes a graph with customer nodes, order nodes, product nodes, review nodes, and support ticket nodes, all connected by their natural relationships.
Step 2: Define the prediction task. Which nodes do you want to predict for (customers), and what is the target (churn in 90 days)? This is the same definition as Approach 1, but no feature engineering follows.
Step 3: Train the GNN. The graph neural network learns by message passing: each node aggregates information from its neighbors, then updates its own representation. After multiple rounds (typically 2-4 layers), each customer node's representation contains information from its orders, the products in those orders, the reviews of those products by other customers, and the support tickets associated with those orders. The model decides what information is predictive. No human specifies features.
Step 4: Validate and deploy. Same validation process as Approach 1, but no feature pipeline to maintain. The model consumes the raw relational structure directly.
Approach 1: Flatten + LightGBM
- 12.3 hours, 878 lines of code per task
- 50-200 manually engineered features
- 62.44 AUROC on RelBench classification tasks
- Multi-hop patterns lost in aggregation
- Full rewrite needed for each new prediction question
Approach 2: GNN on relational graph
- ~30 minutes for graph construction and training
- Zero manual feature engineering
- 75.83 AUROC on RelBench classification tasks
- Multi-hop patterns captured through message passing
- Same architecture for any prediction task on same schema
gnn_message_passing (what the model sees for C-002)
| hop | source_node | edge_type | information_received |
|---|---|---|---|
| 1 | C-002 (customer) | placed_order | 2 orders, last one 74 days ago, $112.50 avg |
| 1 | C-002 (customer) | filed_ticket | 3 tickets, escalating severity, 1 pending |
| 2 | Order O-088 (order) | contains_item | Product P-231 (Electronics), returned |
| 2 | Order O-091 (order) | contains_item | Product P-445 (Electronics), kept |
| 3 | Product P-231 | reviewed_by | Avg 2.1 stars from other buyers (was 4.3) |
| 3 | Product P-445 | reviewed_by | Avg 4.6 stars from other buyers (stable) |
Highlighted: at hop 3, the GNN discovers that P-231 (the product C-002 returned) has declining reviews from other customers. This 3-hop signal (customer to order to product to reviews) explains both the return and the churn risk. No human engineer would write this feature.
The accuracy difference is significant. On the RelBench benchmark, GNNs scored 75.83 AUROC compared to 62.44 for LightGBM with manual features. The GNN outperformed manual feature engineering on 11 of 12 classification tasks, despite the data scientists having unlimited time and full domain knowledge. The patterns that the GNN discovers automatically are patterns that humans do not think to encode.
Approach 3: Foundation model on relational data
KumoRFM takes the GNN approach and adds pre-training. Instead of training a GNN from scratch on your specific database, the foundation model has been pre-trained on billions of relational patterns across thousands of diverse databases. It has already learned the universal patterns that recur across relational data: recency and frequency dynamics, graph topology effects, temporal decay, cross-table signal propagation.
Step 1: Connect your database. Point KumoRFM at your relational database. The model reads the schema and maps it to its internal graph representation.
Step 2: Write a predictive query. One line of PQL (Predictive Query Language):
PREDICT churn_90d FOR EACH customers.customer_id
Step 3: Receive predictions. The model returns a prediction for every customer, based on the full relational context of your database. No feature engineering, no model training, no hyperparameter tuning.
The time from connected database to predictions is measured in seconds. The code is one line. And the accuracy is higher than either alternative.
PQL Query
PREDICT churn_90d FOR EACH customers.customer_id
One line replaces the entire feature engineering pipeline. KumoRFM reads the relational schema, discovers cross-table patterns, and returns predictions in seconds.
Output
| customer_id | churn_probability | confidence | top_signal |
|---|---|---|---|
| C-001 | 0.12 | 0.94 | High recent purchase frequency |
| C-002 | 0.87 | 0.91 | Unresolved support escalation + 74-day gap |
| C-003 | 0.05 | 0.96 | Accelerating order cadence across categories |
| C-004 | 0.34 | 0.88 | Moderate recency, low engagement breadth |
Why the foundation model outperforms
The accuracy advantage of KumoRFM over task-specific GNNs comes from pre-training. Like GPT for text or CLIP for images, a foundation model that has seen thousands of databases has learned patterns that transfer across domains.
Cross-domain pattern transfer
Recency effects work the same way in e-commerce (how recently a customer ordered) and in banking (how recently a borrower made a payment). Frequency dynamics transfer between telecoms (call frequency) and SaaS (login frequency). Graph topology effects (how a node's neighbors influence its behavior) are universal. The foundation model has learned these patterns from thousands of databases and applies them to yours.
Better generalization on small data
Task-specific GNNs need enough labeled data to learn from scratch. For rare events (fraud, churn of high-value customers, unusual defaults), labeled examples are scarce. The foundation model's pre-trained understanding of relational patterns gives it a strong prior that compensates for limited task-specific data.
Richer temporal understanding
Pre-training across databases with different temporal granularities (daily transactions, weekly logins, monthly payments) gives the foundation model a deeper understanding of temporal dynamics than any single-database GNN can develop.
The three approaches side by side
| Metric | Flatten + LightGBM | Task-specific GNN | KumoRFM |
|---|---|---|---|
| Time to prediction | 12.3 hours | ~30 minutes | ~1 second |
| Lines of code | 878 | ~50-100 | 1 |
| AUROC (RelBench) | 62.44 | 75.83 | 76.71 (zero-shot) |
| Feature engineering | Manual, 80% of effort | Automatic | Automatic |
| Multi-hop patterns | Lost in aggregation | Captured | Captured + pre-trained |
| New prediction task | Start from scratch | Retrain GNN | New query, same model |
| Ongoing maintenance | Feature pipelines + retraining | Retraining | None |
Which approach should you use?
The answer depends on your constraints, but the trajectory is clear.
If you have a large data science team and 3-6 months, Approach 1 (flatten + LightGBM) will produce a working model. It is well-understood, interpretable, and your team already knows how to do it. The cost is time and ongoing maintenance. The accuracy ceiling is lower.
If you need better accuracy and have ML engineering capacity, Approach 2 (task-specific GNN) is a significant upgrade. It eliminates feature engineering, captures multi-hop patterns, and improves accuracy by 10-15 points on most tasks. It requires familiarity with graph ML frameworks and GPU infrastructure.
If you want the best accuracy with the least effort, Approach 3 (foundation model) delivers both: one line of PQL replaces months of pipeline work, with zero data science effort. The tradeoff is that you are using a pre-trained model rather than a custom one, which means less control over model internals.
The direction of the field is clear: from manual to automatic, from flat to relational, from task-specific to foundation. The data has always been relational. The models are finally catching up.