
How to Build ML on Relational Databases (Three Approaches Compared)

Your data lives in a relational database with multiple connected tables. You want a prediction. Here are three ways to get there, ranging from 12.3 hours of manual work to 1 second of query time. The accuracy numbers may surprise you.

TL;DR

  • Three approaches to ML on relational data: flatten + LightGBM (12.3 hours, 878 lines, 62.44 AUROC), graph neural networks (30 minutes, zero feature engineering, 75.83 AUROC), and the KumoRFM foundation model (1 second, 1 line of PQL, 76.71 AUROC zero-shot / 81.14 fine-tuned).
  • Flattening relational data into a single table destroys multi-hop relationships, temporal sequences, and graph topology. These are exactly the patterns that drive the 13-19 point AUROC gap on RelBench.
  • GNNs represent the database as a temporal graph and learn patterns through message passing. No feature engineering needed. They outperform manual features on 11 of 12 RelBench classification tasks.
  • KumoRFM adds pre-training on billions of relational patterns across thousands of databases. Cross-domain pattern transfer (recency, frequency, graph topology) explains why it beats task-specific GNNs without any training on your data.
  • For organizations running multiple prediction tasks on the same relational data, foundation models amortize cost across all tasks. One model, one data connection, unlimited prediction questions.

Every enterprise ML project starts the same way: a prediction question (who will churn? what will they buy? will this loan default?) and a relational database with the data to answer it. The database has 5, 10, maybe 50 interconnected tables. The ML model needs a flat table.

How you bridge that gap determines how long it takes, how accurate the result is, and how much ongoing maintenance your team absorbs. There are three viable approaches today, each with fundamentally different tradeoffs.

e_commerce_db — schema overview

| Table | Rows | Key Columns | Foreign Keys |
| --- | --- | --- | --- |
| customers | 120,000 | customer_id, name, signup_date, segment | |
| orders | 3,400,000 | order_id, customer_id, total, order_date | customer_id → customers |
| products | 45,000 | product_id, category, brand, price | |
| order_items | 9,200,000 | item_id, order_id, product_id, qty | order_id → orders, product_id → products |
| support_tickets | 890,000 | ticket_id, customer_id, category, resolved | customer_id → customers |

Five tables, three foreign-key join paths. The question: which customers will churn in the next 90 days?

Approach 1: Flatten and engineer features manually

This is what 90% of data science teams do today. The process has five steps.

Step 1: Understand the schema. Map the tables, foreign keys, and relationships. For churn prediction on the e-commerce database above (customers, orders, products, order_items, and support_tickets), this means understanding 5 tables and their join paths.

Step 2: Define the entity and target. One row per customer, target variable is "churned in next 90 days." This seems simple but requires careful temporal handling to avoid data leakage: features must be computed from data available before the prediction point.
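The cutoff-based labeling in Step 2 can be sketched in pandas. This is a toy dataset with hypothetical customer IDs and dates; a real pipeline would derive labels from the full orders table, but the leakage rule is the same: the label window starts at the prediction point, and nothing after the cutoff may feed the features.

```python
import pandas as pd

# Hypothetical orders data; in practice this comes from the orders table.
orders = pd.DataFrame({
    "customer_id": ["C-001", "C-001", "C-002", "C-003"],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-04-02", "2024-01-20", "2024-03-28"]),
})

cutoff = pd.Timestamp("2024-04-01")        # prediction point
horizon = cutoff + pd.Timedelta(days=90)   # label window: the next 90 days

# A customer "churned" if they placed no order inside the label window.
active = orders[(orders["order_date"] >= cutoff)
                & (orders["order_date"] < horizon)]
labels = pd.DataFrame({"customer_id": orders["customer_id"].unique()})
labels["churned_90d"] = ~labels["customer_id"].isin(active["customer_id"])

print(labels)
```

Note that only the label looks forward from the cutoff; every feature in Step 3 must look strictly backward from it.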

Step 3: Write the feature SQL. For each customer, compute aggregates from each related table. This is where the bulk of the work lives.

A typical feature set might include: total orders in the last 30, 60, 90 days; average order value over the same windows; number of distinct product categories purchased; number of support tickets filed; average review score given; days since last order; days since last support ticket; ratio of returns to orders; number of sessions in the last 14 days.

Each feature requires a SQL query with joins, aggregations, and time window logic. Fifty features means fifty queries. A Stanford study measured this at 878 lines of code and 12.3 hours per task for experienced data scientists.

Step 4: Train the model. Feed the flat feature table to XGBoost or LightGBM. Tune hyperparameters. Cross-validate. This takes 30-60 minutes of compute time and a few hours of data scientist time.

Step 5: Validate and deploy. Check for data leakage, test on held-out data, build a serving pipeline, set up monitoring for data drift and model decay.

support_tickets (raw relational data for C-002)

| ticket_id | customer_id | date | category | severity | resolved_hrs |
| --- | --- | --- | --- | --- | --- |
| TK-201 | C-002 | Jan 8 | Shipping delay | Medium | 12 |
| TK-202 | C-002 | Feb 2 | Wrong item received | High | 48 |
| TK-203 | C-002 | Feb 28 | Refund not processed | Critical | Pending |

Highlighted: C-002's tickets escalated from 'Shipping delay' to 'Wrong item' to 'Refund not processed'. Severity rose from Medium to Critical. The third ticket is still unresolved. This escalation trajectory predicts churn far better than 'tickets = 3'.

flattened_feature_table (what LightGBM actually receives)

| customer_id | orders_30d | orders_90d | avg_order_value | distinct_categories | support_tickets | days_since_last_order |
| --- | --- | --- | --- | --- | --- | --- |
| C-001 | 4 | 11 | $67.30 | 5 | 0 | 3 |
| C-002 | 0 | 2 | $112.50 | 2 | 3 | 74 |
| C-003 | 6 | 18 | $45.80 | 8 | 1 | 1 |
| C-004 | 1 | 4 | $89.00 | 3 | 0 | 22 |

C-002 shows support_tickets = 3 and days_since_last_order = 74. The flat table cannot distinguish 3 escalating unresolved complaints from 3 resolved feature requests. It cannot show that C-002's last order was returned. The relational context is gone.

Approach 2: Graph neural networks on relational data

Relational deep learning (RDL), published at ICML 2024, introduced a fundamentally different approach. Instead of flattening the database into a table, represent it as a graph and train a GNN directly on the structure.

Step 1: Convert to a temporal heterogeneous graph. Each row in each table becomes a node. Foreign key relationships become edges. Timestamps create a temporal ordering. The e-commerce database becomes a graph with customer nodes, order nodes, product nodes, review nodes, and support ticket nodes, all connected by their natural relationships.
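The row-to-node, foreign-key-to-edge conversion can be sketched with plain Python structures. The rows here are hypothetical, and a production pipeline would build the graph in a framework such as PyTorch Geometric rather than dicts; the sketch only shows the mapping itself:

```python
# Minimal sketch: rows become typed nodes, foreign keys become typed edges.
# Table contents are hypothetical; a real pipeline reads them from the DB.
customers = [{"customer_id": "C-002", "segment": "retail"}]
orders = [
    {"order_id": "O-088", "customer_id": "C-002", "order_date": "2024-01-17"},
    {"order_id": "O-091", "customer_id": "C-002", "order_date": "2024-01-09"},
]
support_tickets = [
    {"ticket_id": "TK-203", "customer_id": "C-002", "severity": "Critical"},
]

# Node sets, keyed by (table, primary_key); row attributes become features.
nodes = {("customers", c["customer_id"]): c for c in customers}
nodes.update({("orders", o["order_id"]): o for o in orders})
nodes.update({("support_tickets", t["ticket_id"]): t for t in support_tickets})

# One edge per foreign-key reference, labeled by relation type.
edges = [("placed_order", ("customers", o["customer_id"]),
          ("orders", o["order_id"])) for o in orders]
edges += [("filed_ticket", ("customers", t["customer_id"]),
           ("support_tickets", t["ticket_id"])) for t in support_tickets]

print(len(nodes), "nodes,", len(edges), "edges")
```

Timestamps on the order and ticket nodes give the graph its temporal ordering, so the model can be trained only on edges that existed before each prediction point.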

Step 2: Define the prediction task. Which nodes do you want to predict for (customers), and what is the target (churn in 90 days)? This is the same definition as Approach 1, but no feature engineering follows.

Step 3: Train the GNN. The graph neural network learns by message passing: each node aggregates information from its neighbors, then updates its own representation. After multiple rounds (typically 2-4 layers), each customer node's representation contains information from its orders, the products in those orders, the reviews of those products by other customers, and the support tickets associated with those orders. The model decides what information is predictive. No human specifies features.
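A toy illustration of that aggregate-then-update loop, stripped of learned parameters: each node mixes its own state with the mean of its neighbors' states. After two rounds, signal that started on product nodes (here, an average review score) reaches the customer node two hops away. Real GNNs replace the fixed averaging with learned weight matrices and nonlinearities; the graph and values below are invented:

```python
import numpy as np

# Toy graph: customer (0) -> two orders (1, 2) -> one product each (3, 4).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}

# One scalar "feature" per node; only products carry signal at the start
# (e.g. average review score), so the customer node starts blind to it.
h = np.array([0.0, 0.0, 0.0, 2.1, 4.6])

for layer in range(2):  # two rounds of message passing = two hops
    h_new = h.copy()
    for node, nbrs in neighbors.items():
        # Aggregate neighbor states, then mix with the node's own state.
        h_new[node] = 0.5 * h[node] + 0.5 * np.mean([h[n] for n in nbrs])
    h = h_new

print(f"customer representation after 2 hops: {h[0]:.3f}")
```

After round one the customer's value is still 0; after round two it is positive, because the product signal has propagated through the order nodes. Stacking more layers widens the receptive field in exactly this way.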

Step 4: Validate and deploy. Same validation process as Approach 1, but no feature pipeline to maintain. The model consumes the raw relational structure directly.

Approach 1: Flatten + LightGBM

  • 12.3 hours, 878 lines of code per task
  • 50-200 manually engineered features
  • 62.44 AUROC on RelBench classification tasks
  • Multi-hop patterns lost in aggregation
  • Full rewrite needed for each new prediction question

Approach 2: GNN on relational graph

  • ~30 minutes for graph construction and training
  • Zero manual feature engineering
  • 75.83 AUROC on RelBench classification tasks
  • Multi-hop patterns captured through message passing
  • Same architecture for any prediction task on same schema

gnn_message_passing (what the model sees for C-002)

| hop | source_node | edge_type | information_received |
| --- | --- | --- | --- |
| 1 | C-002 (customer) | placed_order | 2 orders, last one 74 days ago, $112.50 avg |
| 1 | C-002 (customer) | filed_ticket | 3 tickets, escalating severity, 1 pending |
| 2 | O-088 (order) | contains_item | Product P-231 (Electronics), returned |
| 2 | O-091 (order) | contains_item | Product P-445 (Electronics), kept |
| 3 | P-231 (product) | reviewed_by | Avg 2.1 stars from other buyers (was 4.3) |
| 3 | P-445 (product) | reviewed_by | Avg 4.6 stars from other buyers (stable) |

Highlighted: at hop 3, the GNN discovers that P-231 (the product C-002 returned) has declining reviews from other customers. This 3-hop signal (customer to order to product to reviews) explains both the return and the churn risk. No human engineer would write this feature.

The accuracy difference is significant. On the RelBench benchmark, GNNs scored 75.83 AUROC compared to 62.44 for LightGBM with manual features. The GNN outperformed manual feature engineering on 11 of 12 classification tasks, despite the data scientists having unlimited time and full domain knowledge. The patterns that the GNN discovers automatically are patterns that humans do not think to encode.

Approach 3: Foundation model on relational data

KumoRFM takes the GNN approach and adds pre-training. Instead of training a GNN from scratch on your specific database, the foundation model has been pre-trained on billions of relational patterns across thousands of diverse databases. It has already learned the universal patterns that recur across relational data: recency and frequency dynamics, graph topology effects, temporal decay, cross-table signal propagation.

Step 1: Connect your database. Point KumoRFM at your relational database. The model reads the schema and maps it to its internal graph representation.

Step 2: Write a predictive query. One line of PQL (Predictive Query Language):

```
PREDICT churn_90d FOR customers
```

Step 3: Receive predictions. The model returns a prediction for every customer, based on the full relational context of your database. No feature engineering, no model training, no hyperparameter tuning.

The time from connected database to predictions is measured in seconds. The code is one line. And the accuracy is higher than with either alternative.

PQL Query

```
PREDICT churn_90d
FOR EACH customers.customer_id
```

One line replaces the entire feature engineering pipeline. KumoRFM reads the relational schema, discovers cross-table patterns, and returns predictions in seconds.

Output

| customer_id | churn_probability | confidence | top_signal |
| --- | --- | --- | --- |
| C-001 | 0.12 | 0.94 | High recent purchase frequency |
| C-002 | 0.87 | 0.91 | Unresolved support escalation + 74-day gap |
| C-003 | 0.05 | 0.96 | Accelerating order cadence across categories |
| C-004 | 0.34 | 0.88 | Moderate recency, low engagement breadth |

Why the foundation model outperforms

The accuracy advantage of KumoRFM over task-specific GNNs comes from pre-training. Like GPT for text or CLIP for images, a foundation model that has seen thousands of databases has learned patterns that transfer across domains.

Cross-domain pattern transfer

Recency effects work the same way in e-commerce (how recently a customer ordered) and in banking (how recently a borrower made a payment). Frequency dynamics transfer between telecoms (call frequency) and SaaS (login frequency). Graph topology effects (how a node's neighbors influence its behavior) are universal. The foundation model has learned these patterns from thousands of databases and applies them to yours.

Better generalization on small data

Task-specific GNNs need enough labeled data to learn from scratch. For rare events (fraud, churn of high-value customers, unusual defaults), labeled examples are scarce. The foundation model's pre-trained understanding of relational patterns gives it a strong prior that compensates for limited task-specific data.

Richer temporal understanding

Pre-training across databases with different temporal granularities (daily transactions, weekly logins, monthly payments) gives the foundation model a deeper understanding of temporal dynamics than any single-database GNN can develop.

The three approaches side by side

| Metric | Flatten + LightGBM | Task-specific GNN | KumoRFM |
| --- | --- | --- | --- |
| Time to prediction | 12.3 hours | ~30 minutes | ~1 second |
| Lines of code | 878 | ~50-100 | 1 |
| AUROC (RelBench) | 62.44 | 75.83 | 76.71 (zero-shot) |
| Feature engineering | Manual, 80% of effort | Automatic | Automatic |
| Multi-hop patterns | Lost in aggregation | Captured | Captured + pre-trained |
| New prediction task | Start from scratch | Retrain GNN | New query, same model |
| Ongoing maintenance | Feature pipelines + retraining | Retraining | None |

Which approach should you use?

The answer depends on your constraints, but the trajectory is clear.

If you have a large data science team and 3-6 months, Approach 1 (flatten + LightGBM) will produce a working model. It is well-understood, interpretable, and your team already knows how to do it. The cost is time and ongoing maintenance. The accuracy ceiling is lower.

If you need better accuracy and have ML engineering capacity, Approach 2 (task-specific GNN) is a significant upgrade. It eliminates feature engineering, captures multi-hop patterns, and improves accuracy by 10-15 points on most tasks. It requires familiarity with graph ML frameworks and GPU infrastructure.

If you want the best accuracy with the least effort, Approach 3 (foundation model) delivers both: zero data science effort and the highest measured accuracy. One line of PQL replaces months of pipeline work. The tradeoff is that you are using a pre-trained model rather than a custom one, which means less control over model internals.

The direction of the field is clear: from manual to automatic, from flat to relational, from task-specific to foundation. The data has always been relational. The models are finally catching up.

Frequently asked questions

Can machine learning work directly on relational databases?

Yes, but until recently, the only option was to flatten the relational structure into a single table through feature engineering. Relational deep learning (published at ICML 2024) showed that you can represent a relational database as a temporal graph and train graph neural networks directly on it. Foundation models like KumoRFM go further: they are pre-trained on relational patterns and can produce predictions on any relational database without task-specific training.

Why is flattening relational data into a single table a problem?

Flattening destroys three types of information: multi-hop relationships (patterns that span 3-4 tables), temporal sequences (the order and timing of events), and graph topology (how entities are connected). A Stanford study found that this process takes 12.3 hours and 878 lines of code per prediction task, and the resulting flat table performs worse than models that learn directly from the relational structure.

What is a graph neural network and how does it work on relational data?

A graph neural network (GNN) represents data as nodes and edges. For relational databases, rows become nodes and foreign keys become edges. The GNN learns by passing messages along edges: each node aggregates information from its neighbors, then updates its own representation. After multiple rounds of message passing, each node's representation captures information from multiple hops in the graph, including cross-table patterns that flat models miss.

What is PQL (Predictive Query Language)?

PQL is the query language used by KumoRFM to define prediction tasks on relational databases. Instead of writing SQL to extract features and Python to train models, you write a single line like 'PREDICT churn FOR customers' or 'PREDICT revenue_next_90d FOR accounts.' The foundation model handles feature discovery, pattern learning, and prediction generation automatically from the raw relational data.

How do the three approaches compare on accuracy?

On the RelBench benchmark (7 databases, 30 tasks, 103M+ rows): LightGBM with manual feature engineering scored 62.44 AUROC, supervised GNNs scored 75.83 AUROC, and KumoRFM zero-shot scored 76.71 AUROC. With fine-tuning, KumoRFM reached 81.14 AUROC. The foundation model outperforms manual feature engineering despite requiring zero human effort.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.