Open any database schema diagram. You will see boxes connected by lines. Tables connected by foreign keys. That is a graph. It has always been a graph. But for the entire history of machine learning, we have been ripping that graph apart, flattening it into spreadsheets, and feeding the flattened version to models that have no idea the graph ever existed.
Relational deep learning stops doing that. It takes your database, recognizes the graph structure that was always there, and trains directly on it. The approach, published at ICML 2024 by researchers at Stanford and Kumo.ai, outperforms manual feature engineering on 11 of 12 classification benchmarks in a fraction of the time.
This is not a marginal improvement. It is a different way of thinking about what a database is and how models should consume it.
The graph that was always there
Take a simple e-commerce database. You have a customers table, an orders table, a products table, and a reviews table. The orders table has a foreign key to customers and another to products. The reviews table has a foreign key to customers and another to products.
Draw it out. Customer Alice placed Order #1042, which contained Product #887. Customer Bob also bought Product #887 and left a review. Alice and Bob are connected through a shared product. That product is connected to a category, which is connected to other products, which are connected to other customers.
This is a heterogeneous graph with four node types (customers, orders, products, reviews) and four edge types (placed, contains, wrote, about). Every relational database produces a graph like this. The more tables and foreign keys, the richer the graph.
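The mapping from schema to graph is mechanical enough to sketch in a few lines. The snippet below is a minimal illustration, not a real library API: the rows are toy data matching the Alice/Bob example, and the node and edge conventions are assumptions for this sketch.

```python
# Minimal sketch: derive a heterogeneous graph from relational rows.
# Table and column names mirror the e-commerce example above.
customers = [{"customer_id": 1, "name": "Alice"},
             {"customer_id": 2, "name": "Bob"}]
orders    = [{"order_id": 1042, "customer_id": 1, "product_id": 887}]
reviews   = [{"review_id": 7, "customer_id": 2, "product_id": 887}]

# Rows become nodes, keyed by (table, primary_key).
nodes = {("customers", r["customer_id"]) for r in customers}
nodes |= {("orders", r["order_id"]) for r in orders}
nodes |= {("reviews", r["review_id"]) for r in reviews}
nodes |= {("products", 887)}

# Foreign keys become typed edges: the four edge types above.
edges = []
for r in orders:
    edges.append(("placed",   ("customers", r["customer_id"]), ("orders", r["order_id"])))
    edges.append(("contains", ("orders", r["order_id"]),       ("products", r["product_id"])))
for r in reviews:
    edges.append(("wrote", ("customers", r["customer_id"]), ("reviews", r["review_id"])))
    edges.append(("about", ("reviews", r["review_id"]),     ("products", r["product_id"])))

print(len(nodes), len(edges))  # 5 nodes, 4 typed edges
```

Alice and Bob end up connected through product 887, exactly the shared-product path described above.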
Here is a concrete example from clinical trial data, one of the RelBench benchmark datasets. The graph connects studies to sites to patient outcomes.
studies
| study_id | title | phase | condition | start_date |
|---|---|---|---|---|
| NCT-4401 | ARIA-3 Cardiovascular | Phase III | Heart Failure | 2023-06-01 |
| NCT-4402 | BEACON Oncology | Phase II | Non-Small Cell Lung | 2024-01-15 |
| NCT-4403 | CLARITY Neuro | Phase III | Alzheimer's | 2023-09-10 |
sites
| site_id | study_id | institution | country | enrolled |
|---|---|---|---|---|
| SITE-01 | NCT-4401 | Cleveland Clinic | US | 142 |
| SITE-02 | NCT-4401 | Charite Berlin | Germany | 98 |
| SITE-03 | NCT-4402 | MD Anderson | US | 67 |
| SITE-04 | NCT-4403 | Mayo Clinic | US | 211 |
| SITE-05 | NCT-4403 | Karolinska Institute | Sweden | 84 |
outcomes
| outcome_id | site_id | endpoint | result | date |
|---|---|---|---|---|
| OUT-01 | SITE-01 | Primary: MACE reduction | Positive | 2025-06-15 |
| OUT-02 | SITE-02 | Primary: MACE reduction | Positive | 2025-07-02 |
| OUT-03 | SITE-03 | Primary: ORR | Inconclusive | 2025-09-20 |
| OUT-04 | SITE-04 | Primary: Cognitive decline | Negative | 2025-10-01 |
| OUT-05 | SITE-05 | Primary: Cognitive decline | Inconclusive | 2025-10-15 |
Note the Alzheimer's trial: negative results at Mayo Clinic, inconclusive at Karolinska. A relational model sees the study-site-outcome graph and can predict trial success probability by propagating signals across sites.
Why flattening destroys information
Traditional ML requires a flat feature table: one row per entity, one column per feature. To get there from a relational database, you write SQL joins and aggregations. "Count of orders in the last 90 days." "Average order value." "Max product rating."
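To make the flattening step concrete, here is a sketch of what that SQL typically computes, written in plain Python. The order history is made-up illustrative data; the feature names are arbitrary.

```python
from datetime import date

# Hypothetical order history for one customer (illustrative data only).
orders = [
    {"customer_id": 1, "amount": 40.0, "placed": date(2025, 1, 5)},
    {"customer_id": 1, "amount": 60.0, "placed": date(2025, 1, 19)},
    {"customer_id": 1, "amount": 25.0, "placed": date(2025, 2, 2)},
]
cutoff = date(2025, 3, 1)

# Classic feature engineering: join + aggregate down to one row per entity.
recent = [o for o in orders if (cutoff - o["placed"]).days <= 90]
flat_row = {
    "customer_id": 1,
    "order_count_90d": len(recent),
    "avg_order_value": sum(o["amount"] for o in recent) / len(recent),
}
print(flat_row)  # one row; the sequence and the graph context are gone
```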
Three categories of information are destroyed in this process.
Multi-hop relationships
Consider the clinical trial data above. Suppose you want to predict whether a new site will meet its enrollment target. The signal spans multiple hops.
enrollment_performance
| site_id | study_id | target | actual | met_target |
|---|---|---|---|---|
| SITE-01 | NCT-4401 | 150 | 142 | No |
| SITE-02 | NCT-4401 | 100 | 98 | No |
| SITE-03 | NCT-4402 | 80 | 67 | No |
| SITE-04 | NCT-4403 | 200 | 211 | Yes |
| SITE-05 | NCT-4403 | 100 | 84 | No |
Note that Mayo Clinic (SITE-04) exceeded its target. The question: will a new Mayo Clinic site on a future study also exceed its target? The answer depends on the multi-hop path: new site to institution to historical sites to historical outcomes.
flat_feature_table (what LightGBM sees for a new site)
| site_id | institution | country | study_phase | condition |
|---|---|---|---|---|
| SITE-06 | Mayo Clinic | US | Phase II | Oncology |
One row with basic attributes. No information about Mayo Clinic's historical enrollment performance across other studies (211/200 on NCT-4403), or that Phase III Alzheimer's trials at Mayo tend to exceed targets while Phase II oncology trials at MD Anderson tend to underperform. These multi-hop institutional patterns are invisible.
Temporal sequences
“5 orders in 30 days” could mean steady weekly purchases, a burst followed by silence, or an accelerating pattern. The aggregate destroys the sequence. A graph that preserves timestamps on each edge retains the full temporal signal.
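The loss is easy to demonstrate. Below, three deliberately different purchase histories (invented for this sketch) all produce the identical aggregate:

```python
from datetime import date, timedelta

# Three purchase histories that all flatten to "5 orders in 30 days".
start = date(2025, 1, 1)
steady = [start + timedelta(days=7 * i) for i in range(5)]          # weekly cadence
burst  = [start + timedelta(days=i) for i in range(5)]              # 5 days, then silence
accel  = [start + timedelta(days=d) for d in (0, 14, 21, 25, 28)]   # accelerating

counts = []
for seq in (steady, burst, accel):
    window = [d for d in seq if (d - start).days < 30]
    counts.append(len(window))

print(counts)  # [5, 5, 5] - the aggregate cannot tell the sequences apart
```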
Structural patterns
Some nodes are hubs. Some are bridges between communities. Some form tight clusters. These topological features carry predictive signal that no column aggregation can capture. In fraud detection, for example, suspicious accounts often share structural signatures in the transaction graph that are invisible in flat feature tables.
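A hub is the simplest such signature: high degree in the graph, invisible in any single account's row. The transaction graph below is invented for illustration; the account names mean nothing.

```python
# Toy transaction graph: fraud rings often funnel money through a shared
# hub account. Account names here are made up for illustration.
transfers = [("a1", "mule"), ("a2", "mule"), ("a3", "mule"),
             ("mule", "cashout"), ("b1", "b2")]

degree = {}
for src, dst in transfers:
    degree[src] = degree.get(src, 0) + 1
    degree[dst] = degree.get(dst, 0) + 1

hub = max(degree, key=degree.get)
print(hub, degree[hub])  # the hub's degree is a purely structural signal
```

Each account's own row (balance, age, country) can look perfectly ordinary; the anomaly only exists in the topology.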
Flat feature table
- One row per entity, losing all graph structure
- Aggregates destroy temporal sequences
- Multi-hop patterns invisible beyond 1-2 joins
- Human decides which features to engineer
- 878 lines of SQL per task (Stanford study)
Relational deep learning
- Full graph structure preserved across all tables
- Timestamps retained on every edge
- Model traverses 3-4 hop paths automatically
- Patterns discovered by the network, not a human
- Direct learning from raw relational schema
How RDL works
The RDL framework has three components, each of which maps directly to existing database concepts.
1. Graph construction
The database schema defines the graph. Rows become nodes, foreign keys become edges, column values become node features. Categorical columns are embedded, numerical columns are normalized, and timestamps are converted to positional encodings. This step is fully automated. Given a schema, the graph is deterministic.
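A sketch of that column-to-feature mapping, applied to one row of the sites table above. The specific encodings here (hash-bucket index for categoricals, z-score for numerics, sinusoidal timestamp encoding) are illustrative stand-ins for what a real RDL pipeline would learn or configure.

```python
import math

# One row from the sites table in the clinical-trial example.
row = {"institution": "Mayo Clinic", "country": "US", "enrolled": 211}

# Normalization stats computed over the sites table's enrolled column.
enrolled_values = [142, 98, 67, 211, 84]
mean = sum(enrolled_values) / len(enrolled_values)
std = (sum((v - mean) ** 2 for v in enrolled_values) / len(enrolled_values)) ** 0.5

def categorical_index(value, num_buckets=1024):
    # A learned embedding table would be looked up by this bucket id.
    return hash(value) % num_buckets

def timestamp_encoding(day, dim=4):
    # Sinusoidal positional encoding over a day count, one value per dimension.
    return [math.sin(day / 10000 ** (i / dim)) for i in range(dim)]

features = {
    "institution_idx": categorical_index(row["institution"]),
    "country_idx": categorical_index(row["country"]),
    "enrolled_z": (row["enrolled"] - mean) / std,         # ~1.75: well above average
    "start_enc": timestamp_encoding(19600),               # days since an epoch
}
```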
2. Message passing
A graph neural network (GNN) learns by passing messages along edges. Each node aggregates information from its neighbors, then updates its own representation. After k layers of message passing, each node's representation encodes information from its k-hop neighborhood. With 3 layers, a customer node "knows" about its orders, the products in those orders, and the other customers who bought those same products.
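The propagation is easy to see on a toy version of the e-commerce path (alice, her order, the product, bob). This is mean aggregation with a fixed 0.5/0.5 mix of self and neighbors, a hand-rolled stand-in for a learned GNN layer:

```python
# Mean-aggregation message passing on a tiny graph: after k layers,
# a node's value reflects its k-hop neighborhood.
edges = [("alice", "order"), ("order", "product"), ("bob", "product")]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, []).append(v)
    neighbors.setdefault(v, []).append(u)

# Start with a unit signal on alice only.
h = {"alice": 1.0, "order": 0.0, "product": 0.0, "bob": 0.0}
for layer in range(3):  # 3 layers -> information travels 3 hops
    h = {n: 0.5 * h[n] + 0.5 * sum(h[m] for m in neighbors[n]) / len(neighbors[n])
         for n in h}

print(h["bob"])  # nonzero only now: alice is exactly 3 hops from bob
```

After two layers bob's value is still zero; the third layer is what carries alice's signal across the order-product-bob path.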
3. Temporal filtering
This is where RDL diverges from standard GNN approaches. Every edge has a timestamp, and the model only passes messages along edges that occurred before the prediction time. This prevents data leakage and ensures the model learns causal patterns, not future information. It also means the same graph supports different prediction horizons without reprocessing.
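The filtering itself is a one-line mask over timestamped edges. The edge list below is invented; the point is that the same graph answers two different prediction times without rebuilding anything:

```python
from datetime import date

# Temporal masking: only edges that existed before the prediction time
# may carry messages. Edge data here is illustrative.
edges = [
    ("alice", "order_1", date(2025, 1, 10)),
    ("alice", "order_2", date(2025, 3, 2)),
    ("bob",   "order_3", date(2025, 2, 20)),
]

def visible_edges(edges, prediction_time):
    # Dropping future edges prevents leakage: the model never sees
    # events that had not yet happened at prediction time.
    return [(u, v, t) for u, v, t in edges if t < prediction_time]

jan = visible_edges(edges, date(2025, 2, 1))   # only order_1 exists yet
mar = visible_edges(edges, date(2025, 4, 1))   # all three are visible
print(len(jan), len(mar))
```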
The RelBench evidence
Claims about ML methods are easy to make and hard to verify. The RDL paper addressed this by releasing RelBench, a benchmark designed specifically for ML on relational databases. It includes:
- 7 databases spanning e-commerce (Amazon), Q&A forums (Stack Exchange), healthcare (clinical trials), sports (F1), and social networks (event recommendation)
- 30 prediction tasks across classification and regression
- 103 million+ rows of real-world data
- Temporal train/validation/test splits to prevent leakage
The benchmark compared four approaches on these tasks:
- LightGBM with manual features (engineered by a Stanford-trained data scientist): 62.44 average AUROC on classification tasks
- LLM on serialized tables (Llama 3.2 3B): 68.06 AUROC
- Task-specific GNN (trained from scratch per task): 75.83 AUROC
- KumoRFM zero-shot (no task-specific training): 76.71 AUROC
Here is what a prediction looks like in practice, as a predictive query against the clinical trial schema above.
PQL Query
PREDICT outcomes.result = 'Positive' FOR EACH studies.study_id
The model reads studies, sites, and outcomes as a graph. It discovers that site-level enrollment velocity, cross-study institution performance, and endpoint type all propagate to predict trial success.
Output
| study_id | success_probability | top_signal |
|---|---|---|
| NCT-4401 | 0.84 | Positive results at both sites, strong enrollment |
| NCT-4402 | 0.47 | Single site, inconclusive interim data |
| NCT-4403 | 0.18 | Negative at primary site, Alzheimer's high failure rate |
From task-specific GNNs to a foundation model
The RDL paper proved the concept: graph neural networks outperform manual feature engineering on relational data. But training a new GNN from scratch for each task still requires ML expertise, GPU resources, and 30 minutes of compute.
KumoRFM is the foundation model extension of this idea. Pre-trained on billions of relational patterns across thousands of diverse databases, it learns universal primitives: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table propagation. At inference time, it reads your schema, constructs the graph, and generates predictions without any task-specific training.
The analogy to language models is direct. GPT was pre-trained on text from the entire internet and can answer questions about new documents it has never seen. KumoRFM was pre-trained on relational data from thousands of databases and can make predictions on new databases it has never seen. Both work because the underlying patterns (grammar in text, relational dynamics in databases) are transferable.
On RelBench, KumoRFM zero-shot (76.71 AUROC) outperformed the task-specific GNN (75.83) without any training on the target database. Fine-tuning pushed performance to 81.14 AUROC, roughly a 30% relative improvement over the manual-feature baseline (62.44).
What this changes in practice
If you run a data science team today, the implications are concrete.
The feature engineering step disappears. Not automated, not accelerated. Gone. The model consumes the database directly. Your team stops writing SQL joins and starts asking questions in Predictive Query Language (PQL): "For each customer, what is the probability of churn in the next 30 days?" One line. One second. Done.
New prediction tasks take seconds, not months. When a business stakeholder asks "can we predict which accounts will expand next quarter?", the answer is not "we will scope a 3-month project." The answer is "let me run that query." The foundation model already understands the relational patterns. It just needs the question.
The accuracy ceiling rises because the model sees more. A human data scientist exploring a 15-table database will test maybe 200 feature combinations. The model traverses the full graph. It finds the 4-hop patterns, the temporal sequences, and the structural signatures that no human would enumerate.
DoorDash deployed this on 30 million users and measured a 1.8% engagement lift. Snowflake saw a 3.2x expansion revenue lift. Databricks measured a 5.4x conversion lift. These numbers came from patterns that were always in the data, hiding in the graph structure that traditional ML could not see.
Your database has been a graph this entire time. The only question is whether your ML stack treats it like one.