
Relational Deep Learning: Why Your Database Is Already a Graph

Every relational database is a graph hiding in plain sight. Foreign keys are edges. Rows are nodes. RDL formalizes what should have been obvious: train the model on the graph, not on a flattened shadow of it.

TL;DR

  • Every relational database is a graph hiding in plain sight. Rows become nodes, foreign keys become edges, timestamps establish temporal ordering. The mapping is mechanical.
  • Published at ICML 2024 by Stanford and Kumo.ai researchers, RDL outperforms manual feature engineering on 11 of 12 classification benchmarks on RelBench.
  • Flattening destroys three signal categories: multi-hop relationships (3-4 table paths), temporal sequences (order of events), and structural patterns (graph topology).
  • On RelBench (7 databases, 30 tasks, 103M+ rows), a GNN trained via RDL scored 75.83 AUROC vs 62.44 for LightGBM, in 30 minutes vs 12.3 hours per task.
  • KumoRFM extends RDL into a foundation model: pre-trained on billions of relational patterns, it delivers 76.71 AUROC zero-shot in about 1 second per task.

Open any database schema diagram. You will see boxes connected by lines. Tables connected by foreign keys. That is a graph. It has always been a graph. But for the entire history of machine learning, we have been ripping that graph apart, flattening it into spreadsheets, and feeding the flattened version to models that have no idea the graph ever existed.

Relational deep learning stops doing that. It takes your database, recognizes the graph structure that was always there, and trains directly on it. The result, published at ICML 2024 by researchers at Stanford and Kumo.ai, outperforms manual feature engineering on 11 of 12 classification benchmarks while taking a fraction of the time.

This is not a marginal improvement. It is a different way of thinking about what a database is and how models should consume it.

The graph that was always there

Take a simple e-commerce database. You have a customers table, an orders table, a products table, and a reviews table. The orders table has a foreign key to customers and another to products. The reviews table has a foreign key to customers and another to products.

Draw it out. Customer Alice placed Order #1042, which contained Product #887. Customer Bob also bought Product #887 and left a review. Alice and Bob are connected through a shared product. That product is connected to a category, which is connected to other products, which are connected to other customers.

This is a heterogeneous graph with four node types (customers, orders, products, reviews) and four edge types (placed, contains, wrote, about). Every relational database produces a graph like this. The more tables and foreign keys, the richer the graph.
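The row-to-node, foreign-key-to-edge mapping can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation; the table and column names mirror the toy e-commerce schema above.

```python
# Minimal sketch: turn relational rows into a heterogeneous graph.
# Each row becomes a node keyed by (table, primary_key); each foreign
# key becomes a typed edge. Table/column names are illustrative.
customers = [{"id": "alice"}, {"id": "bob"}]
orders = [{"id": 1042, "customer_id": "alice", "product_id": 887}]
reviews = [{"id": 7, "customer_id": "bob", "product_id": 887}]

nodes = set()
edges = []  # (src_node, edge_type, dst_node)

for row in customers:
    nodes.add(("customers", row["id"]))
for row in orders:
    nodes.add(("orders", row["id"]))
    nodes.add(("products", row["product_id"]))
    edges.append((("customers", row["customer_id"]), "placed", ("orders", row["id"])))
    edges.append((("orders", row["id"]), "contains", ("products", row["product_id"])))
for row in reviews:
    nodes.add(("reviews", row["id"]))
    edges.append((("customers", row["customer_id"]), "wrote", ("reviews", row["id"])))
    edges.append((("reviews", row["id"]), "about", ("products", row["product_id"])))

# Alice and Bob are now connected through product 887 via short paths
# that a flat per-customer feature table cannot represent.
```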

Here is a concrete example from clinical trial data, one of the RelBench benchmark datasets. The graph connects studies to sites to patient outcomes.

studies

| study_id | title | phase | condition | start_date |
|---|---|---|---|---|
| NCT-4401 | ARIA-3 Cardiovascular | Phase III | Heart Failure | 2023-06-01 |
| NCT-4402 | BEACON Oncology | Phase II | Non-Small Cell Lung | 2024-01-15 |
| NCT-4403 | CLARITY Neuro | Phase III | Alzheimer's | 2023-09-10 |

sites

| site_id | study_id | institution | country | enrolled |
|---|---|---|---|---|
| SITE-01 | NCT-4401 | Cleveland Clinic | US | 142 |
| SITE-02 | NCT-4401 | Charité Berlin | Germany | 98 |
| SITE-03 | NCT-4402 | MD Anderson | US | 67 |
| SITE-04 | NCT-4403 | Mayo Clinic | US | 211 |
| SITE-05 | NCT-4403 | Karolinska Institute | Sweden | 84 |

outcomes

| outcome_id | site_id | endpoint | result | date |
|---|---|---|---|---|
| OUT-01 | SITE-01 | Primary: MACE reduction | Positive | 2025-06-15 |
| OUT-02 | SITE-02 | Primary: MACE reduction | Positive | 2025-07-02 |
| OUT-03 | SITE-03 | Primary: ORR | Inconclusive | 2025-09-20 |
| OUT-04 | SITE-04 | Primary: Cognitive decline | Negative | 2025-10-01 |
| OUT-05 | SITE-05 | Primary: Cognitive decline | Inconclusive | 2025-10-15 |

Note the Alzheimer's trial (NCT-4403): negative results at Mayo Clinic and inconclusive at Karolinska. A relational model sees the study-site-outcome graph and can predict trial success probability by propagating signals across sites.

Why flattening destroys information

Traditional ML requires a flat feature table: one row per entity, one column per feature. To get there from a relational database, you write SQL joins and aggregations. "Count of orders in the last 90 days." "Average order value." "Max product rating."

Three categories of information are destroyed in this process.

Multi-hop relationships

Consider the clinical trial data above. Suppose you want to predict whether a new site will meet its enrollment target. The signal spans multiple hops.

enrollment_performance

| site_id | study_id | target | actual | met_target |
|---|---|---|---|---|
| SITE-01 | NCT-4401 | 150 | 142 | No |
| SITE-02 | NCT-4401 | 100 | 98 | No |
| SITE-03 | NCT-4402 | 80 | 67 | No |
| SITE-04 | NCT-4403 | 200 | 211 | Yes |
| SITE-05 | NCT-4403 | 100 | 84 | No |

Note that Mayo Clinic (SITE-04) exceeded its target. The question: will a new Mayo Clinic site on a future study also exceed target? The answer depends on the multi-hop path: new site to institution to historical sites to historical outcomes.

flat_feature_table (what LightGBM sees for a new site)

| site_id | institution | country | study_phase | condition |
|---|---|---|---|---|
| SITE-06 | Mayo Clinic | US | Phase II | Oncology |

One row with basic attributes. No information about Mayo Clinic's historical enrollment performance across other studies (211/200 on NCT-4403), or that Phase III Alzheimer's trials at Mayo tend to exceed targets while Phase II oncology trials at MD Anderson tend to underperform. These multi-hop institutional patterns are invisible.
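The multi-hop feature the flat row cannot carry can be sketched explicitly: walk from the new site to its institution, then to that institution's historical sites, then to their enrollment results. A minimal sketch using made-up data that mirrors the toy tables above:

```python
# Sketch: the multi-hop signal a flat row for SITE-06 cannot carry.
# Path: new site -> institution -> historical sites -> outcomes.
# Rows mirror the illustrative tables in the text.
sites = [
    {"site_id": "SITE-01", "institution": "Cleveland Clinic"},
    {"site_id": "SITE-04", "institution": "Mayo Clinic"},
]
enrollment = [
    {"site_id": "SITE-01", "target": 150, "actual": 142},
    {"site_id": "SITE-04", "target": 200, "actual": 211},
]

def institution_hit_rate(institution):
    """Fraction of an institution's historical sites that met target."""
    hist = {s["site_id"] for s in sites if s["institution"] == institution}
    results = [e["actual"] >= e["target"] for e in enrollment if e["site_id"] in hist]
    return sum(results) / len(results) if results else None

# A new Mayo Clinic site inherits Mayo's historical hit rate; none of
# this appears in a single flat row of site attributes.
```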

Temporal sequences

“5 orders in 30 days” could mean steady weekly purchases, a burst followed by silence, or an accelerating pattern. The aggregate destroys the sequence. A graph that preserves timestamps on each edge retains the full temporal signal.
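A small sketch makes the loss concrete. The three purchase histories below (day offsets are illustrative) all produce the identical aggregate "5 orders in 30 days", yet their inter-order gaps tell three different stories:

```python
# Three purchase histories, identical to a count-aggregate,
# with very different temporal shapes. Day offsets are illustrative.
steady = [2, 9, 16, 23, 30]         # weekly rhythm
burst = [1, 2, 3, 4, 5]             # early burst, then silence
accelerating = [1, 12, 20, 26, 30]  # gaps shrinking over time

def inter_order_gaps(days):
    """The sequence information the count-aggregate throws away."""
    return [b - a for a, b in zip(days, days[1:])]

# All three collapse to the same flat feature: a count of 5.
assert len(steady) == len(burst) == len(accelerating) == 5
# But the gap sequences differ completely: [7,7,7,7] vs [1,1,1,1]
# vs [11,8,6,4] -- signal a timestamped edge list preserves.
```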

Structural patterns

Some nodes are hubs. Some are bridges between communities. Some form tight clusters. These topological features carry predictive signal that no column aggregation can capture. In fraud detection, for example, suspicious accounts often share structural signatures in the transaction graph that are invisible in flat feature tables.
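One of the simplest such structural features, node degree, already illustrates the point. The sketch below uses made-up account IDs; a flat per-account table has no column for "how central is this account in the transaction graph":

```python
# Sketch: a structural feature (node degree) computed from transaction
# edges. Account IDs are made up. Hubs and bridges fall out of the
# topology, not out of any per-row column.
from collections import Counter

transactions = [("a1", "a2"), ("a1", "a3"), ("a1", "a4"), ("a2", "a3"), ("a5", "a6")]

degree = Counter()
for src, dst in transactions:
    degree[src] += 1
    degree[dst] += 1

hub = max(degree, key=degree.get)  # the most connected account
```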

Flat feature table

  • One row per entity, losing all graph structure
  • Aggregates destroy temporal sequences
  • Multi-hop patterns invisible beyond 1-2 joins
  • Human decides which features to engineer
  • 878 lines of SQL per task (Stanford study)

Relational deep learning

  • Full graph structure preserved across all tables
  • Timestamps retained on every edge
  • Model traverses 3-4 hop paths automatically
  • Patterns discovered by the network, not a human
  • Direct learning from raw relational schema

How RDL works

The RDL framework has three components, each of which maps directly to existing database concepts.

1. Graph construction

The database schema defines the graph. Rows become nodes, foreign keys become edges, column values become node features. Categorical columns are embedded, numerical columns are normalized, and timestamps are converted to positional encodings. This step is fully automated. Given a schema, the graph is deterministic.

2. Message passing

A graph neural network (GNN) learns by passing messages along edges. Each node aggregates information from its neighbors, then updates its own representation. After k layers of message passing, each node's representation encodes information from its k-hop neighborhood. With 3 layers, a customer node "knows" about its orders, the products in those orders, and the other customers who bought those same products.
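A toy version of mean-aggregation message passing shows how the receptive field widens with depth. Real RDL applies learned, edge-type-specific transformations; this sketch only propagates a scalar, seeding the graph at the "bob" node:

```python
# Minimal mean-aggregation message passing in plain Python.
# Each layer replaces a node's value with the mean of its neighbors'
# values, so k layers give each node a k-hop receptive field.
neighbors = {
    "alice": ["order_1042"],
    "order_1042": ["alice", "prod_887"],
    "prod_887": ["order_1042", "review_7"],
    "review_7": ["prod_887", "bob"],
    "bob": ["review_7"],
}
h = {n: (1.0 if n == "bob" else 0.0) for n in neighbors}  # seed at bob

for _ in range(3):  # 3 layers => 3-hop receptive field
    h = {n: sum(h[m] for m in nbrs) / len(nbrs) for n, nbrs in neighbors.items()}

# order_1042 (3 hops from bob) now carries signal from bob;
# alice (4 hops away) still reads 0 after only 3 layers.
```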

3. Temporal filtering

This is where RDL diverges from standard GNN approaches. Every edge has a timestamp, and the model only passes messages along edges that occurred before the prediction time. This prevents data leakage and ensures the model learns causal patterns, not future information. It also means the same graph supports different prediction horizons without reprocessing.
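The filtering rule itself is a one-liner. A minimal sketch with made-up edges; the point is that changing `prediction_time` re-slices the same graph without reprocessing:

```python
# Sketch of leakage-free message passing: only edges whose timestamp
# precedes the prediction time are visible. Edge tuples are made up.
from datetime import date

edges = [
    ("alice", "order_1042", date(2024, 3, 1)),
    ("alice", "order_1099", date(2024, 6, 20)),
    ("bob", "review_7", date(2024, 5, 5)),
]

def visible_edges(edges, prediction_time):
    """Edges the model may pass messages along at prediction_time."""
    return [(s, d, t) for s, d, t in edges if t < prediction_time]

# Predicting as of 2024-06-01: the June order is invisible, so the
# same graph serves multiple horizons by moving prediction_time.
train_view = visible_edges(edges, date(2024, 6, 1))
```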

The RelBench evidence

Claims about ML methods are easy to make and hard to verify. The RDL paper addressed this by releasing RelBench, a benchmark designed specifically for ML on relational databases. It includes:

  • 7 databases spanning e-commerce (Amazon), Q&A (Stack Exchange), healthcare (clinical trials), sports (F1), and social networks
  • 30 prediction tasks across classification and regression
  • 103 million+ rows of real-world data
  • Temporal train/validation/test splits to prevent leakage
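The temporal split in the last bullet is the opposite of random shuffling: examples are partitioned by cutoff timestamps so that validation and test labels always lie after the data used to predict them. A minimal sketch with illustrative cutoff dates:

```python
# Sketch of a temporal train/validation/test split by cutoff dates,
# so later labels never leak into earlier training data.
# Example rows and cutoffs are illustrative.
from datetime import date

examples = [
    {"id": 1, "label_time": date(2024, 1, 10)},
    {"id": 2, "label_time": date(2024, 5, 2)},
    {"id": 3, "label_time": date(2024, 9, 18)},
]
VAL_CUTOFF, TEST_CUTOFF = date(2024, 4, 1), date(2024, 8, 1)

train = [e for e in examples if e["label_time"] < VAL_CUTOFF]
val = [e for e in examples if VAL_CUTOFF <= e["label_time"] < TEST_CUTOFF]
test_set = [e for e in examples if e["label_time"] >= TEST_CUTOFF]
```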

The benchmark compared four approaches on these tasks:

  • LightGBM with manual features (engineered by a Stanford-trained data scientist): 62.44 average AUROC on classification tasks
  • LLM on serialized tables (Llama 3.2 3B): 68.06 AUROC
  • Task-specific GNN (trained from scratch per task): 75.83 AUROC
  • KumoRFM zero-shot (no task-specific training): 76.71 AUROC

PQL Query

PREDICT outcomes.result = 'Positive'
FOR EACH studies.study_id

The model reads studies, sites, and outcomes as a graph. It discovers that site-level enrollment velocity, cross-study institution performance, and endpoint type all propagate to predict trial success.

Output

| study_id | success_probability | top_signal |
|---|---|---|
| NCT-4401 | 0.84 | Positive results at both sites, strong enrollment |
| NCT-4402 | 0.47 | Single site, inconclusive interim data |
| NCT-4403 | 0.18 | Negative at primary site, Alzheimer's high failure rate |

From task-specific GNNs to a foundation model

The RDL paper proved the concept: graph neural networks outperform manual feature engineering on relational data. But training a new GNN from scratch for each task still requires ML expertise, GPU resources, and 30 minutes of compute.

KumoRFM is the foundation model extension of this idea. Pre-trained on billions of relational patterns across thousands of diverse databases, it learns universal primitives: recency effects, frequency patterns, temporal dynamics, graph topology, cross-table propagation. At inference time, it reads your schema, constructs the graph, and generates predictions without any task-specific training.

The analogy to language models is direct. GPT was pre-trained on text from the entire internet and can answer questions about new documents it has never seen. KumoRFM was pre-trained on relational data from thousands of databases and can make predictions on new databases it has never seen. Both work because the underlying patterns (grammar in text, relational dynamics in databases) are transferable.

On RelBench, KumoRFM zero-shot (76.71 AUROC) outperformed the task-specific GNN (75.83) without any training on the target database. Fine-tuning pushed performance to 81.14 AUROC, roughly a 30% relative improvement over the manual-feature baseline (62.44).

What this changes in practice

If you run a data science team today, the implications are concrete.

The feature engineering step disappears. Not automated, not accelerated. Gone. The model consumes the database directly. Your team stops writing SQL joins and starts asking questions in Predictive Query Language (PQL): "For each customer, what is the probability of churn in the next 30 days?" One line. One second. Done.

New prediction tasks take seconds, not months. When a business stakeholder asks "can we predict which accounts will expand next quarter?", the answer is not "we will scope a 3-month project." The answer is "let me run that query." The foundation model already understands the relational patterns. It just needs the question.

The accuracy ceiling rises because the model sees more. A human data scientist exploring a 15-table database will test maybe 200 feature combinations. The model traverses the full graph. It finds the 4-hop patterns, the temporal sequences, and the structural signatures that no human would enumerate.

DoorDash deployed this on 30 million users and measured a 1.8% engagement lift. Snowflake saw a 3.2x expansion revenue lift. Databricks measured a 5.4x conversion lift. These numbers came from patterns that were always in the data, hiding in the graph structure that traditional ML could not see.

Your database has been a graph this entire time. The only question is whether your ML stack treats it like one.

Frequently asked questions

What is relational deep learning?

Relational deep learning (RDL) is a framework for training machine learning models directly on relational databases by representing them as temporal heterogeneous graphs. Rows become nodes, foreign keys become edges, and timestamps order the graph in time. Instead of manually flattening tables into feature vectors, the model learns predictive patterns by traversing the graph structure. RDL was published at ICML 2024 by researchers at Stanford and Kumo.ai.

How does RDL represent a relational database as a graph?

Every row in every table becomes a node. Every foreign key relationship becomes a directed edge connecting the referencing row to the referenced row. Columns become node attributes. Timestamps on rows establish temporal ordering, so the model never sees future data during training. A database with 10 tables and 6 foreign keys produces a heterogeneous graph with 10 node types and 6 edge types.

What is the RelBench benchmark?

RelBench is a standardized benchmark for evaluating machine learning on relational databases. It contains 7 databases spanning domains like e-commerce, healthcare, sports, and social networks, with 30 prediction tasks and over 103 million rows of data. Tasks include classification (e.g., will this customer churn?) and regression (e.g., what will this product's rating be?). It was released alongside the RDL paper to provide reproducible comparisons.

How does RDL compare to traditional feature engineering?

On the RelBench benchmark, a graph neural network trained using RDL achieved 75.83 average AUROC on classification tasks, compared to 62.44 for LightGBM with manual features engineered by a Stanford-trained data scientist. The GNN outperformed manual features on 11 of 12 classification tasks. More importantly, the GNN approach took roughly 30 minutes per task compared to 12.3 hours for manual feature engineering.

What is the difference between RDL and KumoRFM?

RDL is the scientific framework that shows how to train models directly on relational data by converting databases to graphs. KumoRFM is a foundation model built on that framework. Where RDL trains a new GNN from scratch for each task (roughly 30 minutes), KumoRFM is pre-trained on billions of relational patterns and delivers zero-shot predictions in about 1 second. KumoRFM achieved 76.71 AUROC zero-shot on RelBench, outperforming even task-specific GNNs.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.