
How to Build ML on Relational Databases (Three Approaches Compared)

Your data lives in a relational database with multiple connected tables. You want a prediction. Here are three ways to get there, ranging from 12.3 hours of manual work to 1 second of query time. The accuracy numbers may surprise you.

TL;DR

  • Three approaches to ML on relational data: flatten + LightGBM (12.3 hours, 878 lines, 62.44 AUROC), graph neural networks (30 minutes, zero feature engineering, 75.83 AUROC), and the KumoRFM foundation model (1 second, 1 line of PQL, 76.71 AUROC zero-shot / 81.14 fine-tuned).
  • Flattening relational data into a single table destroys multi-hop relationships, temporal sequences, and graph topology. These are exactly the patterns that drive the 13-19 point AUROC gap on RelBench.
  • GNNs represent the database as a temporal graph and learn patterns through message passing. No feature engineering needed. They outperform manual features on 11 of 12 RelBench classification tasks.
  • KumoRFM adds pre-training on billions of relational patterns across thousands of databases. Cross-domain pattern transfer (recency, frequency, graph topology) explains why it beats task-specific GNNs without any training on your data.
  • For organizations running multiple prediction tasks on the same relational data, foundation models amortize cost across all tasks. One model, one data connection, unlimited prediction questions.

Every enterprise ML project starts the same way: a prediction question (who will churn? what will they buy? will this loan default?) and a relational database with the data to answer it. The database has 5, 10, maybe 50 interconnected tables. The ML model needs a flat table.

How you bridge that gap determines how long it takes, how accurate the result is, and how much ongoing maintenance your team absorbs. There are three viable approaches today, each with fundamentally different tradeoffs.

e_commerce_db — schema overview

| Table | Rows | Key Columns | Foreign Keys |
| --- | --- | --- | --- |
| customers | 120,000 | customer_id, name, signup_date, segment | |
| orders | 3,400,000 | order_id, customer_id, total, order_date | customer_id → customers |
| products | 45,000 | product_id, category, brand, price | |
| order_items | 9,200,000 | item_id, order_id, product_id, qty | order_id → orders, product_id → products |
| support_tickets | 890,000 | ticket_id, customer_id, category, resolved | customer_id → customers |

Five tables, three foreign-key join paths. The question: which customers will churn in the next 90 days?

Approach 1: Flatten and engineer features manually

This is what 90% of data science teams do today. The process has five steps.

Step 1: Understand the schema. Map the tables, foreign keys, and relationships. For churn prediction on the e-commerce database above (customers, orders, products, order_items, and support_tickets), this means understanding 5 tables and their join paths.

Step 2: Define the entity and target. One row per customer, target variable is "churned in next 90 days." This seems simple but requires careful temporal handling to avoid data leakage: features must be computed from data available before the prediction point.
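The cutoff-based labeling in Step 2 can be sketched in pandas. This is a toy dataset with hypothetical customer IDs and dates; a real pipeline would derive labels from the full orders table, but the leakage rule is the same: the label window starts at the prediction point, and nothing after the cutoff may feed the features.

```python
import pandas as pd

# Hypothetical orders data; in practice this comes from the orders table.
orders = pd.DataFrame({
    "customer_id": ["C-001", "C-001", "C-002", "C-003"],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-04-02", "2024-01-20", "2024-03-28"]),
})

cutoff = pd.Timestamp("2024-04-01")        # prediction point
horizon = cutoff + pd.Timedelta(days=90)   # label window: the next 90 days

# A customer "churned" if they placed no order inside the label window.
active = orders[(orders["order_date"] >= cutoff)
                & (orders["order_date"] < horizon)]
labels = pd.DataFrame({"customer_id": orders["customer_id"].unique()})
labels["churned_90d"] = ~labels["customer_id"].isin(active["customer_id"])

print(labels)
```

Note that only the label looks forward from the cutoff; every feature in Step 3 must look strictly backward from it.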

Step 3: Write the feature SQL. For each customer, compute aggregates from each related table. This is where the bulk of the work lives.

A typical feature set might include: total orders in the last 30, 60, 90 days; average order value over the same windows; number of distinct product categories purchased; number of support tickets filed; average review score given; days since last order; days since last support ticket; ratio of returns to orders; number of sessions in the last 14 days.

Each feature requires a SQL query with joins, aggregations, and time window logic. Fifty features means fifty queries. A Stanford study measured this at 878 lines of code and 12.3 hours per task for experienced data scientists.

Step 4: Train the model. Feed the flat feature table to XGBoost or LightGBM. Tune hyperparameters. Cross-validate. This takes 30-60 minutes of compute time and a few hours of data scientist time.

Step 5: Validate and deploy. Check for data leakage, test on held-out data, build a serving pipeline, set up monitoring for data drift and model decay.

support_tickets (raw relational data for C-002)

| ticket_id | customer_id | date | category | severity | resolved_hrs |
| --- | --- | --- | --- | --- | --- |
| TK-201 | C-002 | Jan 8 | Shipping delay | Medium | 12 |
| TK-202 | C-002 | Feb 2 | Wrong item received | High | 48 |
| TK-203 | C-002 | Feb 28 | Refund not processed | Critical | Pending |

Highlighted: C-002's tickets escalated from 'Shipping delay' to 'Wrong item' to 'Refund not processed'. Severity rose from Medium to Critical. The third ticket is still unresolved. This escalation trajectory predicts churn far better than 'tickets = 3'.

flattened_feature_table (what LightGBM actually receives)

| customer_id | orders_30d | orders_90d | avg_order_value | distinct_categories | support_tickets | days_since_last_order |
| --- | --- | --- | --- | --- | --- | --- |
| C-001 | 4 | 11 | $67.30 | 5 | 0 | 3 |
| C-002 | 0 | 2 | $112.50 | 2 | 3 | 74 |
| C-003 | 6 | 18 | $45.80 | 8 | 1 | 1 |
| C-004 | 1 | 4 | $89.00 | 3 | 0 | 22 |

C-002 shows support_tickets = 3 and days_since_last_order = 74. The flat table cannot distinguish 3 escalating unresolved complaints from 3 resolved feature requests. It cannot show that C-002's last order was returned. The relational context is gone.

Approach 2: Graph neural networks on relational data

Relational deep learning (RDL), published at ICML 2024, introduced a fundamentally different approach. Instead of flattening the database into a table, represent it as a graph and train a GNN directly on the structure.

Step 1: Convert to a temporal heterogeneous graph. Each row in each table becomes a node. Foreign key relationships become edges. Timestamps create a temporal ordering. The e-commerce database becomes a graph with customer nodes, order nodes, product nodes, review nodes, and support ticket nodes, all connected by their natural relationships.
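The row-to-node, foreign-key-to-edge conversion can be sketched with plain Python structures. The rows here are hypothetical, and a production pipeline would build the graph in a framework such as PyTorch Geometric rather than dicts; the sketch only shows the mapping itself:

```python
# Minimal sketch: rows become typed nodes, foreign keys become typed edges.
# Table contents are hypothetical; a real pipeline reads them from the DB.
customers = [{"customer_id": "C-002", "segment": "retail"}]
orders = [
    {"order_id": "O-088", "customer_id": "C-002", "order_date": "2024-01-17"},
    {"order_id": "O-091", "customer_id": "C-002", "order_date": "2024-01-09"},
]
support_tickets = [
    {"ticket_id": "TK-203", "customer_id": "C-002", "severity": "Critical"},
]

# Node sets, keyed by (table, primary_key); row attributes become features.
nodes = {("customers", c["customer_id"]): c for c in customers}
nodes.update({("orders", o["order_id"]): o for o in orders})
nodes.update({("support_tickets", t["ticket_id"]): t for t in support_tickets})

# One edge per foreign-key reference, labeled by relation type.
edges = [("placed_order", ("customers", o["customer_id"]),
          ("orders", o["order_id"])) for o in orders]
edges += [("filed_ticket", ("customers", t["customer_id"]),
           ("support_tickets", t["ticket_id"])) for t in support_tickets]

print(len(nodes), "nodes,", len(edges), "edges")
```

Timestamps on the order and ticket nodes give the graph its temporal ordering, so the model can be trained only on edges that existed before each prediction point.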

Step 2: Define the prediction task. Which nodes do you want to predict for (customers), and what is the target (churn in 90 days)? This is the same definition as Approach 1, but no feature engineering follows.

Step 3: Train the GNN. The graph neural network learns by message passing: each node aggregates information from its neighbors, then updates its own representation. After multiple rounds (typically 2-4 layers), each customer node's representation contains information from its orders, the products in those orders, the reviews of those products by other customers, and the support tickets associated with those orders. The model decides what information is predictive. No human specifies features.
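A toy illustration of that aggregate-then-update loop, stripped of learned parameters: each node mixes its own state with the mean of its neighbors' states. After two rounds, signal that started on product nodes (here, an average review score) reaches the customer node two hops away. Real GNNs replace the fixed averaging with learned weight matrices and nonlinearities; the graph and values below are invented:

```python
import numpy as np

# Toy graph: customer (0) -> two orders (1, 2) -> one product each (3, 4).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}

# One scalar "feature" per node; only products carry signal at the start
# (e.g. average review score), so the customer node starts blind to it.
h = np.array([0.0, 0.0, 0.0, 2.1, 4.6])

for layer in range(2):  # two rounds of message passing = two hops
    h_new = h.copy()
    for node, nbrs in neighbors.items():
        # Aggregate neighbor states, then mix with the node's own state.
        h_new[node] = 0.5 * h[node] + 0.5 * np.mean([h[n] for n in nbrs])
    h = h_new

print(f"customer representation after 2 hops: {h[0]:.3f}")
```

After round one the customer's value is still 0; after round two it is positive, because the product signal has propagated through the order nodes. Stacking more layers widens the receptive field in exactly this way.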

Step 4: Validate and deploy. Same validation process as Approach 1, but no feature pipeline to maintain. The model consumes the raw relational structure directly.

Approach 1: Flatten + LightGBM

  • 12.3 hours, 878 lines of code per task
  • 50-200 manually engineered features
  • 62.44 AUROC on RelBench classification tasks
  • Multi-hop patterns lost in aggregation
  • Full rewrite needed for each new prediction question

Approach 2: GNN on relational graph

  • ~30 minutes for graph construction and training
  • Zero manual feature engineering
  • 75.83 AUROC on RelBench classification tasks
  • Multi-hop patterns captured through message passing
  • Same architecture for any prediction task on same schema

gnn_message_passing (what the model sees for C-002)

| hop | source_node | edge_type | information_received |
| --- | --- | --- | --- |
| 1 | C-002 (customer) | placed_order | 2 orders, last one 74 days ago, $112.50 avg |
| 1 | C-002 (customer) | filed_ticket | 3 tickets, escalating severity, 1 pending |
| 2 | O-088 (order) | contains_item | Product P-231 (Electronics), returned |
| 2 | O-091 (order) | contains_item | Product P-445 (Electronics), kept |
| 3 | P-231 (product) | reviewed_by | Avg 2.1 stars from other buyers (was 4.3) |
| 3 | P-445 (product) | reviewed_by | Avg 4.6 stars from other buyers (stable) |

Highlighted: at hop 3, the GNN discovers that P-231 (the product C-002 returned) has declining reviews from other customers. This 3-hop signal (customer to order to product to reviews) explains both the return and the churn risk. No human engineer would write this feature.

The accuracy difference is significant. On the RelBench benchmark, GNNs scored 75.83 AUROC compared to 62.44 for LightGBM with manual features. The GNN outperformed manual feature engineering on 11 of 12 classification tasks, despite the data scientists having unlimited time and full domain knowledge. The patterns that the GNN discovers automatically are patterns that humans do not think to encode.

Approach 3: Foundation model on relational data

KumoRFM takes the GNN approach and adds pre-training. Instead of training a GNN from scratch on your specific database, the foundation model has been pre-trained on billions of relational patterns across thousands of diverse databases. It has already learned the universal patterns that recur across relational data: recency and frequency dynamics, graph topology effects, temporal decay, cross-table signal propagation.

Step 1: Connect your database. Point KumoRFM at your relational database. The model reads the schema and maps it to its internal graph representation.

Step 2: Write a predictive query. One line of PQL (Predictive Query Language):

```
PREDICT churn_90d FOR customers
```

Step 3: Receive predictions. The model returns a prediction for every customer, based on the full relational context of your database. No feature engineering, no model training, no hyperparameter tuning.

The time from connected database to predictions is measured in seconds. The code is one line. And the accuracy is higher than with either alternative.

PQL Query

```
PREDICT churn_90d
FOR EACH customers.customer_id
```

One line replaces the entire feature engineering pipeline. KumoRFM reads the relational schema, discovers cross-table patterns, and returns predictions in seconds.

Output

| customer_id | churn_probability | confidence | top_signal |
| --- | --- | --- | --- |
| C-001 | 0.12 | 0.94 | High recent purchase frequency |
| C-002 | 0.87 | 0.91 | Unresolved support escalation + 74-day gap |
| C-003 | 0.05 | 0.96 | Accelerating order cadence across categories |
| C-004 | 0.34 | 0.88 | Moderate recency, low engagement breadth |

Why the foundation model outperforms

The accuracy advantage of KumoRFM over task-specific GNNs comes from pre-training. Like GPT for text or CLIP for images, a foundation model that has seen thousands of databases has learned patterns that transfer across domains.

Cross-domain pattern transfer

Recency effects work the same way in e-commerce (how recently a customer ordered) and in banking (how recently a borrower made a payment). Frequency dynamics transfer between telecoms (call frequency) and SaaS (login frequency). Graph topology effects (how a node's neighbors influence its behavior) are universal. The foundation model has learned these patterns from thousands of databases and applies them to yours.

Better generalization on small data

Task-specific GNNs need enough labeled data to learn from scratch. For rare events (fraud, churn of high-value customers, unusual defaults), labeled examples are scarce. The foundation model's pre-trained understanding of relational patterns gives it a strong prior that compensates for limited task-specific data.

Richer temporal understanding

Pre-training across databases with different temporal granularities (daily transactions, weekly logins, monthly payments) gives the foundation model a deeper understanding of temporal dynamics than any single-database GNN can develop.

The three approaches side by side

| Metric | Flatten + LightGBM | Task-specific GNN | KumoRFM |
| --- | --- | --- | --- |
| Time to prediction | 12.3 hours | ~30 minutes | ~1 second |
| Lines of code | 878 | ~50-100 | 1 |
| AUROC (RelBench) | 62.44 | 75.83 | 76.71 (zero-shot) |
| Feature engineering | Manual, 80% of effort | Automatic | Automatic |
| Multi-hop patterns | Lost in aggregation | Captured | Captured + pre-trained |
| New prediction task | Start from scratch | Retrain GNN | New query, same model |
| Ongoing maintenance | Feature pipelines + retraining | Retraining | None |

Which approach should you use?

The answer depends on your constraints, but the trajectory is clear.

If you have a large data science team and 3-6 months, Approach 1 (flatten + LightGBM) will produce a working model. It is well-understood, interpretable, and your team already knows how to do it. The cost is time and ongoing maintenance. The accuracy ceiling is lower.

If you need better accuracy and have ML engineering capacity, Approach 2 (task-specific GNN) is a significant upgrade. It eliminates feature engineering, captures multi-hop patterns, and improves accuracy by 10-15 points on most tasks. It requires familiarity with graph ML frameworks and GPU infrastructure.

If you want the best accuracy with the least effort, Approach 3 (foundation model) delivers both: zero data science effort and the highest measured accuracy. One line of PQL replaces months of pipeline work. The tradeoff is that you are using a pre-trained model rather than a custom one, which means less control over model internals.

The direction of the field is clear: from manual to automatic, from flat to relational, from task-specific to foundation. The data has always been relational. The models are finally catching up.

Frequently asked questions

Can machine learning work directly on relational databases?

Yes, but until recently, the only option was to flatten the relational structure into a single table through feature engineering. Relational deep learning (published at ICML 2024) showed that you can represent a relational database as a temporal graph and train graph neural networks directly on it. Foundation models like KumoRFM go further: they are pre-trained on relational patterns and can produce predictions on any relational database without task-specific training.

Why is flattening relational data into a single table a problem?

Flattening destroys three types of information: multi-hop relationships (patterns that span 3-4 tables), temporal sequences (the order and timing of events), and graph topology (how entities are connected). A Stanford study found that this process takes 12.3 hours and 878 lines of code per prediction task, and the resulting flat table performs worse than models that learn directly from the relational structure.

What is a graph neural network and how does it work on relational data?

A graph neural network (GNN) represents data as nodes and edges. For relational databases, rows become nodes and foreign keys become edges. The GNN learns by passing messages along edges: each node aggregates information from its neighbors, then updates its own representation. After multiple rounds of message passing, each node's representation captures information from multiple hops in the graph, including cross-table patterns that flat models miss.

What is PQL (Predictive Query Language)?

PQL is the query language used by KumoRFM to define prediction tasks on relational databases. Instead of writing SQL to extract features and Python to train models, you write a single line like 'PREDICT churn FOR customers' or 'PREDICT revenue_next_90d FOR accounts.' The foundation model handles feature discovery, pattern learning, and prediction generation automatically from the raw relational data.

How do the three approaches compare on accuracy?

On the RelBench benchmark (7 databases, 30 tasks, 103M+ rows): LightGBM with manual feature engineering scored 62.44 AUROC, supervised GNNs scored 75.83 AUROC, and KumoRFM zero-shot scored 76.71 AUROC. With fine-tuning, KumoRFM reached 81.14 AUROC. The foundation model outperforms manual feature engineering despite requiring zero human effort.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.