Open any database schema diagram. You will see boxes connected by lines. Tables connected by foreign keys. That is a graph. It has always been a graph. But for the entire history of machine learning, we have been ripping that graph apart, flattening it into spreadsheets, and feeding the flattened version to models that have no idea the graph ever existed.
Relational deep learning stops doing that. It takes your database, recognizes the graph structure that was always there, and trains directly on it. The approach, published at ICML 2024 by researchers at Stanford and Kumo.ai, outperforms manual feature engineering on 11 of 12 classification benchmarks in a fraction of the time.
This is not a marginal improvement. It is a different way of thinking about what a database is and how models should consume it.
The graph that was always there
Take a simple e-commerce database. You have a customers table, an orders table, a products table, and a reviews table. The orders table has a foreign key to customers and another to products. The reviews table has a foreign key to customers and another to products.
Draw it out. Customer Alice placed Order #1042, which contained Product #887. Customer Bob also bought Product #887 and left a review. Alice and Bob are connected through a shared product. That product is connected to a category, which is connected to other products, which are connected to other customers.
This is a heterogeneous graph with four node types (customers, orders, products, reviews) and four edge types (placed, contains, wrote, about). Every relational database produces a graph like this. The more tables and foreign keys, the richer the graph.
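The mapping from schema to graph is mechanical enough to sketch in a few lines. The snippet below is a minimal illustration, not a real library API: the rows are toy data matching the Alice/Bob example, and the node and edge conventions are assumptions for this sketch.

```python
# Minimal sketch: derive a heterogeneous graph from relational rows.
# Table and column names mirror the e-commerce example above.
customers = [{"customer_id": 1, "name": "Alice"},
             {"customer_id": 2, "name": "Bob"}]
orders    = [{"order_id": 1042, "customer_id": 1, "product_id": 887}]
reviews   = [{"review_id": 7, "customer_id": 2, "product_id": 887}]

# Rows become nodes, keyed by (table, primary_key).
nodes = {("customers", r["customer_id"]) for r in customers}
nodes |= {("orders", r["order_id"]) for r in orders}
nodes |= {("reviews", r["review_id"]) for r in reviews}
nodes |= {("products", 887)}

# Foreign keys become typed edges: the four edge types above.
edges = []
for r in orders:
    edges.append(("placed",   ("customers", r["customer_id"]), ("orders", r["order_id"])))
    edges.append(("contains", ("orders", r["order_id"]),       ("products", r["product_id"])))
for r in reviews:
    edges.append(("wrote", ("customers", r["customer_id"]), ("reviews", r["review_id"])))
    edges.append(("about", ("reviews", r["review_id"]),     ("products", r["product_id"])))

print(len(nodes), len(edges))  # 5 nodes, 4 typed edges
```

Alice and Bob end up connected through product 887, exactly the shared-product path described above.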
Here is a concrete example from clinical trial data, one of the RelBench benchmark datasets. The graph connects studies to sites to patient outcomes.
studies
| study_id | title | phase | condition | start_date |
|---|---|---|---|---|
| NCT-4401 | ARIA-3 Cardiovascular | Phase III | Heart Failure | 2023-06-01 |
| NCT-4402 | BEACON Oncology | Phase II | Non-Small Cell Lung | 2024-01-15 |
| NCT-4403 | CLARITY Neuro | Phase III | Alzheimer's | 2023-09-10 |
sites
| site_id | study_id | institution | country | enrolled |
|---|---|---|---|---|
| SITE-01 | NCT-4401 | Cleveland Clinic | US | 142 |
| SITE-02 | NCT-4401 | Charite Berlin | Germany | 98 |
| SITE-03 | NCT-4402 | MD Anderson | US | 67 |
| SITE-04 | NCT-4403 | Mayo Clinic | US | 211 |
| SITE-05 | NCT-4403 | Karolinska Institute | Sweden | 84 |
outcomes
| outcome_id | site_id | endpoint | result | date |
|---|---|---|---|---|
| OUT-01 | SITE-01 | Primary: MACE reduction | Positive | 2025-06-15 |
| OUT-02 | SITE-02 | Primary: MACE reduction | Positive | 2025-07-02 |
| OUT-03 | SITE-03 | Primary: ORR | Inconclusive | 2025-09-20 |
| OUT-04 | SITE-04 | Primary: Cognitive decline | Negative | 2025-10-01 |
| OUT-05 | SITE-05 | Primary: Cognitive decline | Inconclusive | 2025-10-15 |
Note the Alzheimer's trial: negative results at Mayo Clinic, inconclusive at Karolinska. A relational model sees the study-site-outcome graph and can predict trial success probability by propagating signals across sites.
Why flattening destroys information
Traditional ML requires a flat feature table: one row per entity, one column per feature. To get there from a relational database, you write SQL joins and aggregations. "Count of orders in the last 90 days." "Average order value." "Max product rating."
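To make the flattening step concrete, here is a sketch of what that SQL typically computes, written in plain Python. The order history is made-up illustrative data; the feature names are arbitrary.

```python
from datetime import date

# Hypothetical order history for one customer (illustrative data only).
orders = [
    {"customer_id": 1, "amount": 40.0, "placed": date(2025, 1, 5)},
    {"customer_id": 1, "amount": 60.0, "placed": date(2025, 1, 19)},
    {"customer_id": 1, "amount": 25.0, "placed": date(2025, 2, 2)},
]
cutoff = date(2025, 3, 1)

# Classic feature engineering: join + aggregate down to one row per entity.
recent = [o for o in orders if (cutoff - o["placed"]).days <= 90]
flat_row = {
    "customer_id": 1,
    "order_count_90d": len(recent),
    "avg_order_value": sum(o["amount"] for o in recent) / len(recent),
}
print(flat_row)  # one row; the sequence and the graph context are gone
```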
Three categories of information are destroyed in this process.
Multi-hop relationships
Consider the clinical trial data above. Suppose you want to predict whether a new site will meet its enrollment target. The signal spans multiple hops.
enrollment_performance
| site_id | study_id | target | actual | met_target |
|---|---|---|---|---|
| SITE-01 | NCT-4401 | 150 | 142 | No |
| SITE-02 | NCT-4401 | 100 | 98 | No |
| SITE-03 | NCT-4402 | 80 | 67 | No |
| SITE-04 | NCT-4403 | 200 | 211 | Yes |
| SITE-05 | NCT-4403 | 100 | 84 | No |
Note that Mayo Clinic (SITE-04) exceeded its target. The question: will a new Mayo Clinic site on a future study also exceed its target? The answer depends on the multi-hop path: new site to institution to historical sites to historical outcomes.
flat_feature_table (what LightGBM sees for a new site)
| site_id | institution | country | study_phase | condition |
|---|---|---|---|---|
| SITE-06 | Mayo Clinic | US | Phase II | Oncology |
One row with basic attributes. No information about Mayo Clinic's historical enrollment performance across other studies (211/200 on NCT-4403), or that Phase III Alzheimer's trials at Mayo tend to exceed targets while Phase II oncology trials at MD Anderson tend to underperform. These multi-hop institutional patterns are invisible.
Temporal sequences
“5 orders in 30 days” could mean steady weekly purchases, a burst followed by silence, or an accelerating pattern. The aggregate destroys the sequence. A graph that preserves timestamps on each edge retains the full temporal signal.
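The loss is easy to demonstrate. Below, three deliberately different purchase histories (invented for this sketch) all produce the identical aggregate:

```python
from datetime import date, timedelta

# Three purchase histories that all flatten to "5 orders in 30 days".
start = date(2025, 1, 1)
steady = [start + timedelta(days=7 * i) for i in range(5)]          # weekly cadence
burst  = [start + timedelta(days=i) for i in range(5)]              # 5 days, then silence
accel  = [start + timedelta(days=d) for d in (0, 14, 21, 25, 28)]   # accelerating

counts = []
for seq in (steady, burst, accel):
    window = [d for d in seq if (d - start).days < 30]
    counts.append(len(window))

print(counts)  # [5, 5, 5] - the aggregate cannot tell the sequences apart
```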
Structural patterns
Some nodes are hubs. Some are bridges between communities. Some form tight clusters. These topological features carry predictive signal that no column aggregation can capture. In fraud detection, for example, suspicious accounts often share structural signatures in the transaction graph that are invisible in flat feature tables.
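A hub is the simplest such signature: high degree in the graph, invisible in any single account's row. The transaction graph below is invented for illustration; the account names mean nothing.

```python
# Toy transaction graph: fraud rings often funnel money through a shared
# hub account. Account names here are made up for illustration.
transfers = [("a1", "mule"), ("a2", "mule"), ("a3", "mule"),
             ("mule", "cashout"), ("b1", "b2")]

degree = {}
for src, dst in transfers:
    degree[src] = degree.get(src, 0) + 1
    degree[dst] = degree.get(dst, 0) + 1

hub = max(degree, key=degree.get)
print(hub, degree[hub])  # the hub's degree is a purely structural signal
```

Each account's own row (balance, age, country) can look perfectly ordinary; the anomaly only exists in the topology.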
Flat feature table
- One row per entity, losing all graph structure
- Aggregates destroy temporal sequences
- Multi-hop patterns invisible beyond 1-2 joins
- Human decides which features to engineer
- 878 lines of SQL per task (Stanford study)
Relational deep learning
- Full graph structure preserved across all tables
- Timestamps retained on every edge
- Model traverses 3-4 hop paths automatically
- Patterns discovered by the network, not a human
- Direct learning from raw relational schema
How RDL works
The RDL framework has three components, each of which maps directly to existing database concepts.
1. Graph construction
The database schema defines the graph. Rows become nodes, foreign keys become edges, column values become node features. Categorical columns are embedded, numerical columns are normalized, and timestamps are converted to positional encodings. This step is fully automated. Given a schema, the graph is deterministic.
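A sketch of that column-to-feature mapping, applied to one row of the sites table above. The specific encodings here (hash-bucket index for categoricals, z-score for numerics, sinusoidal timestamp encoding) are illustrative stand-ins for what a real RDL pipeline would learn or configure.

```python
import math

# One row from the sites table in the clinical-trial example.
row = {"institution": "Mayo Clinic", "country": "US", "enrolled": 211}

# Normalization stats computed over the sites table's enrolled column.
enrolled_values = [142, 98, 67, 211, 84]
mean = sum(enrolled_values) / len(enrolled_values)
std = (sum((v - mean) ** 2 for v in enrolled_values) / len(enrolled_values)) ** 0.5

def categorical_index(value, num_buckets=1024):
    # A learned embedding table would be looked up by this bucket id.
    return hash(value) % num_buckets

def timestamp_encoding(day, dim=4):
    # Sinusoidal positional encoding over a day count, one value per dimension.
    return [math.sin(day / 10000 ** (i / dim)) for i in range(dim)]

features = {
    "institution_idx": categorical_index(row["institution"]),
    "country_idx": categorical_index(row["country"]),
    "enrolled_z": (row["enrolled"] - mean) / std,         # ~1.75: well above average
    "start_enc": timestamp_encoding(19600),               # days since an epoch
}
```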
2. Message passing
A graph neural network (GNN) learns by passing messages along edges. Each node aggregates information from its neighbors, then updates its own representation. After k layers of message passing, each node's representation encodes information from its k-hop neighborhood. With 3 layers, a customer node "knows" about its orders, the products in those orders, and the other customers who bought those same products.
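The propagation is easy to see on a toy version of the e-commerce path (alice, her order, the product, bob). This is mean aggregation with a fixed 0.5/0.5 mix of self and neighbors, a hand-rolled stand-in for a learned GNN layer:

```python
# Mean-aggregation message passing on a tiny graph: after k layers,
# a node's value reflects its k-hop neighborhood.
edges = [("alice", "order"), ("order", "product"), ("bob", "product")]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, []).append(v)
    neighbors.setdefault(v, []).append(u)

# Start with a unit signal on alice only.
h = {"alice": 1.0, "order": 0.0, "product": 0.0, "bob": 0.0}
for layer in range(3):  # 3 layers -> information travels 3 hops
    h = {n: 0.5 * h[n] + 0.5 * sum(h[m] for m in neighbors[n]) / len(neighbors[n])
         for n in h}

print(h["bob"])  # nonzero only now: alice is exactly 3 hops from bob
```

After two layers bob's value is still zero; the third layer is what carries alice's signal across the order-product-bob path.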
3. Temporal filtering
This is where RDL diverges from standard GNN approaches. Every edge has a timestamp, and the model only passes messages along edges that occurred before the prediction time. This prevents data leakage and ensures the model learns causal patterns, not future information. It also means the same graph supports different prediction horizons without reprocessing.
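The filtering itself is a one-line mask over timestamped edges. The edge list below is invented; the point is that the same graph answers two different prediction times without rebuilding anything:

```python
from datetime import date

# Temporal masking: only edges that existed before the prediction time
# may carry messages. Edge data here is illustrative.
edges = [
    ("alice", "order_1", date(2025, 1, 10)),
    ("alice", "order_2", date(2025, 3, 2)),
    ("bob",   "order_3", date(2025, 2, 20)),
]

def visible_edges(edges, prediction_time):
    # Dropping future edges prevents leakage: the model never sees
    # events that had not yet happened at prediction time.
    return [(u, v, t) for u, v, t in edges if t < prediction_time]

jan = visible_edges(edges, date(2025, 2, 1))   # only order_1 exists yet
mar = visible_edges(edges, date(2025, 4, 1))   # all three are visible
print(len(jan), len(mar))
```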
The RelBench evidence
Claims about ML methods are easy to make and hard to verify. The RDL paper addressed this by releasing RelBench, a benchmark designed specifically for ML on relational databases. It includes:
- 7 databases spanning e-commerce (Amazon), Q&A forums (Stack Exchange), healthcare (clinical trials), sports (F1), and social networks (event recommendation)
- 30 prediction tasks across classification and regression
- 103 million+ rows of real-world data
- Temporal train/validation/test splits to prevent leakage
The benchmark compared four approaches on these tasks:
- LightGBM with manual features (engineered by a Stanford-trained data scientist): 62.44 average AUROC on classification tasks
- LLM on serialized tables (Llama 3.2 3B): 68.06 AUROC
- Task-specific GNN (trained from scratch per task): 75.83 AUROC
- KumoRFM zero-shot (no task-specific training): 76.71 AUROC
Here is what a prediction looks like in practice, as a predictive query against the clinical trial schema above.
PQL Query
PREDICT outcomes.result = 'Positive' FOR EACH studies.study_id
The model reads studies, sites, and outcomes as a graph. It discovers that site-level enrollment velocity, cross-study institution performance, and endpoint type all propagate to predict trial success.
Output
| study_id | success_probability | top_signal |
|---|---|---|
| NCT-4401 | 0.84 | Positive results at both sites, strong enrollment |
| NCT-4402 | 0.47 | Single site, inconclusive interim data |
| NCT-4403 | 0.18 | Negative at primary site, Alzheimer's high failure rate |
From task-specific GNNs to a foundation model
The RDL paper proved the concept: graph neural networks outperform manual feature engineering on relational data. But training a new GNN from scratch for each task still requires ML expertise, GPU resources, and 30 minutes of compute.
KumoRFM is the foundation model extension of this idea. Pre-trained on billions of relational patterns across thousands of diverse databases, it learns universal primitives: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table propagation. At inference time, it reads your schema, constructs the graph, and generates predictions without any task-specific training.
The analogy to language models is direct. GPT was pre-trained on text from the entire internet and can answer questions about new documents it has never seen. KumoRFM was pre-trained on relational data from thousands of databases and can make predictions on new databases it has never seen. Both work because the underlying patterns (grammar in text, relational dynamics in databases) are transferable.
On RelBench, KumoRFM zero-shot (76.71 AUROC) outperformed the task-specific GNN (75.83) without any training on the target database. Fine-tuning pushed performance to 81.14 AUROC, roughly a 30% relative improvement over the manual-feature baseline (62.44).
What this changes in practice
If you run a data science team today, the implications are concrete.
The feature engineering step disappears. Not automated, not accelerated. Gone. The model consumes the database directly. Your team stops writing SQL joins and starts asking questions in Predictive Query Language (PQL): "For each customer, what is the probability of churn in the next 30 days?" One line. One second. Done.
New prediction tasks take seconds, not months. When a business stakeholder asks "can we predict which accounts will expand next quarter?", the answer is not "we will scope a 3-month project." The answer is "let me run that query." The foundation model already understands the relational patterns. It just needs the question.
The accuracy ceiling rises because the model sees more. A human data scientist exploring a 15-table database will test maybe 200 feature combinations. The model traverses the full graph. It finds the 4-hop patterns, the temporal sequences, and the structural signatures that no human would enumerate.
DoorDash deployed this on 30 million users and measured a 1.8% engagement lift. Snowflake saw a 3.2x expansion revenue lift. Databricks measured a 5.4x conversion lift. These numbers came from patterns that were always in the data, hiding in the graph structure that traditional ML could not see.
Your database has been a graph this entire time. The only question is whether your ML stack treats it like one.