In 2021, DoorDash published a blog post describing how they rebuilt their recommendation system using graph neural networks. The result was a 1.8% engagement lift across 30 million users. That may sound small until you calculate the revenue impact at DoorDash's scale: roughly $50 million in incremental annual GMV.
Two years later, graph ML is running in production at Pinterest (content recommendations for 450 million monthly users), Visa (fraud detection on billions of transactions), Snap (friend suggestions), and dozens of Fortune 500 companies that do not publicize their implementations. Snowflake used it internally for expansion revenue prediction and saw a 3.2x lift over their previous gradient-boosted model.
Yet most enterprise data science teams have never shipped a graph ML model. The gap is not technical. It is informational. Teams do not know when graph ML adds value, what the evaluation criteria should be, or what deployment actually requires. This guide addresses all three questions.
What graph ML actually does
Every machine learning model needs to find patterns in data. Traditional models (logistic regression, XGBoost, random forests) find patterns in flat tables: rows of features, one per entity. Graph ML finds patterns in connected structures: entities as nodes, relationships as edges.
The distinction matters because most enterprise data is inherently relational. A customer is connected to orders, orders to products, products to categories, categories to seasonal trends. A patient is connected to diagnoses, prescriptions, lab results, providers, and insurance claims. These connections carry predictive signal that flat tables cannot represent.
Consider churn prediction. A traditional model might use features like days since last purchase, total spend, and number of support tickets. A graph ML model sees that plus the following: the customer's neighbors (people who bought similar products) are churning at 3x the normal rate. The products they recently purchased have a 40% return rate. The support agent they interacted with has a resolution rate 20 points below average.
None of those signals exist in a flat feature table unless someone manually engineers them. And nobody does, because the combinatorial space of possible multi-hop features is too large to explore by hand.
Production graph ML deployments
| Company | Use Case | Graph Scale | Result | Year |
|---|---|---|---|---|
| DoorDash | Recommendations | 30M users, heterogeneous | 1.8% engagement lift | 2021 |
| Pinterest | Content recommendations | 18B pins, 450M MAU | Core ranking system | 2018 |
| Visa | Fraud detection | Billions of transactions | Fraud ring detection | 2020 |
| Snowflake | Expansion revenue | Accounts-users-queries graph | 3.2x lift over GBT | 2023 |
| Databricks | Lead scoring | Companies-contacts-usage graph | 5.4x conversion lift | 2023 |
These are published production deployments, not research prototypes. Graph ML is running at Fortune 500 scale.
How GNNs learn from structure
A graph neural network works through message passing. Each node collects information from its neighbors, aggregates it, and updates its own representation. After multiple rounds, each node's embedding encodes information from its entire local neighborhood, not just its own attributes.
With 3 rounds of message passing, a node's representation captures information from every entity within 3 hops. For a customer in an e-commerce graph, that includes their orders, the products in those orders, other customers who bought those products, and those customers' order histories. That is the kind of signal that takes a data scientist weeks to engineer manually and typically still misses the most informative patterns.
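The collect-aggregate-update loop is easy to sketch. The toy example below (the four-node graph, feature dimensions, and random weights are all invented for illustration) runs three rounds of mean-aggregation message passing in plain NumPy; real GNN layers in libraries like PyTorch Geometric follow the same pattern with learned weights and larger graphs:

```python
import numpy as np

# Toy graph: 4 nodes, undirected adjacency list (invented for illustration).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}

# Initial node features: one 3-dimensional vector per node.
h = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

rng = np.random.default_rng(0)
W_self = rng.normal(size=(3, 3))   # transform for the node's own state
W_neigh = rng.normal(size=(3, 3))  # transform for the aggregated neighbors

def message_passing_round(h):
    """One round: each node averages its neighbors' vectors, then
    combines that with its own vector through a nonlinearity."""
    new_h = np.zeros_like(h)
    for v, nbrs in neighbors.items():
        agg = h[nbrs].mean(axis=0)  # collect + aggregate neighbor messages
        new_h[v] = np.tanh(h[v] @ W_self + agg @ W_neigh)  # update
    return new_h

# After 3 rounds, each node's vector reflects its 3-hop neighborhood.
for _ in range(3):
    h = message_passing_round(h)
print(h.shape)  # (4, 3)
```

Three applications of the round correspond to the three layers in the table below: each round widens the neighborhood a node's embedding can see by one hop.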
GNN message passing: a concrete example
| Layer | Customer C-401 Sees | New Information | Embedding Updates |
|---|---|---|---|
| Input | Own attributes only | segment=Enterprise, tenure=3yr, ARR=$180K | Initial node vector |
| Layer 1 | 5 orders, 2 support tickets | avg_order=$12K, 1 escalated ticket | Adds interaction patterns |
| Layer 2 | 8 products, 3 support agents, 4 invoices | 2 products have 30%+ churn rate among buyers | Adds product-risk signal |
| Layer 3 | 52 similar customers (via shared products) | 38 of 52 are active; 14 churned last quarter | Adds peer-behavior signal |
By layer 3, the model knows that 27% of similar customers (those who bought the same products) churned recently. This peer-churn signal is the strongest predictor, but it requires a three-hop traversal across tables -- no flat feature table captures it.
When graph ML adds value over traditional approaches
Graph ML is not universally superior to traditional ML. There are specific conditions where it provides measurable uplift, and conditions where the added complexity is not justified.
High-value scenarios
Graph ML consistently outperforms traditional approaches in four situations:
- Multi-table relational data. If your prediction depends on information spread across 3 or more tables, graph ML eliminates the feature engineering bottleneck. The RelBench benchmark showed that GNNs outperformed LightGBM with manual features on 11 of 12 classification tasks across 7 multi-table databases.
- Network effects matter. Fraud detection, social recommendations, and marketplace dynamics all depend on how entities relate to each other. A fraudulent transaction is not just about the transaction attributes; it is about the network of accounts, devices, and merchants connected to it.
- Cold-start problems. New users with no history are impossible for traditional models. Graph ML can predict their behavior based on the entities they are connected to: the product they first viewed, the channel they came from, the referrer who invited them.
- Feature engineering is the bottleneck. If your data science team spends 60% or more of their time writing SQL joins and aggregations, graph ML eliminates that step entirely. The RelBench study measured the cost: 12.3 hours and 878 lines of code per prediction task for experienced data scientists.
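The cold-start idea above can be made concrete. Here is a minimal sketch (entity names and rates are hypothetical) of scoring a brand-new user purely from the historical conversion rates of the entities it is connected to, shrunk toward a global prior -- no user history required:

```python
# Hypothetical data: conversion rates of entities a new user is connected to.
entity_rates = {
    "product:P-100": 0.12,        # conversion rate of the first product viewed
    "channel:paid_search": 0.08,  # conversion rate of the acquisition channel
    "referrer:U-7781": 0.21,      # conversion rate of the inviter's referrals
}

def cold_start_score(connected_entities, rates, prior=0.05, prior_weight=2.0):
    """Estimate a new user's conversion probability from its graph
    neighbors, shrunk toward a global prior when evidence is thin."""
    vals = [rates[e] for e in connected_entities if e in rates]
    total = sum(vals) + prior * prior_weight
    return total / (len(vals) + prior_weight)

score = cold_start_score(entity_rates.keys(), entity_rates)
print(round(score, 4))
```

A trained GNN does this with learned, weighted aggregation instead of a plain average, but the underlying mechanism is the same: signal propagates from neighbors to a node that has no history of its own.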
Low-value scenarios
Graph ML adds less value in these situations:
- Single-table data. If your prediction depends on one flat table with well-defined features, XGBoost or a neural network will perform comparably with less infrastructure.
- Extremely sparse graphs. If entities have very few connections (less than 2 edges per node on average), the graph structure carries minimal signal.
- Real-time latency under 10ms. GNN inference on large graphs can take 50 to 200ms. If you need sub-10ms latency, you will need a pre-computed embedding approach or a simpler model.
Traditional ML approach
- Flatten 10-50 tables into a single feature table
- Engineer 100-500 features manually per use case
- Miss multi-hop and network-effect signals
- Rebuild pipeline for every new prediction task
- 3-6 month cycle per model
Graph ML approach
- Represent database as a graph directly
- Model learns features from relational structure
- Captures multi-hop, temporal, and network patterns
- Same architecture handles any prediction task
- Days to weeks per model with foundation models
Graph ML value by scenario
| Scenario | Traditional ML AUROC | Graph ML AUROC | Uplift | Why Graph Wins |
|---|---|---|---|---|
| Multi-table relational | 62.44 | 75.83 | +13.4 pts | Cross-table patterns |
| Fraud with ring patterns | 71.2 | 78.4 | +7.2 pts | Network topology |
| Cold-start users | ~50 (random) | 65-75 | +15-25 pts | Neighbor signal propagation |
| Single flat table | 82-85 | 82-85 | ~0 pts | No structural advantage |
| Sparse graph (<2 edges/node) | 75-80 | 76-81 | +1-2 pts | Minimal graph signal |
Graph ML's advantage scales with relational complexity. On single-table data, XGBoost remains competitive.
Evaluating graph ML: the right benchmarks
Most ML benchmarks use single-table datasets (UCI, Kaggle competitions) where graph ML has no structural advantage. To evaluate graph ML properly, you need benchmarks designed for relational data.
RelBench: the standard benchmark
RelBench is the first standardized benchmark for ML on relational databases. Published at NeurIPS 2024 by researchers at Stanford and Kumo.ai, it includes 7 databases, 30 prediction tasks, and over 103 million rows. Each database has 3 to 15 interconnected tables. Tasks include classification (churn, fraud, conversion) and regression (lifetime value, demand forecasting).
The benchmark enforces temporal splits: training data comes before the evaluation period, and test data comes after. This prevents data leakage, which inflates accuracy in many published results. On RelBench, GNNs achieve an average AUROC of 75.83 on classification tasks, compared to 62.44 for LightGBM with features engineered by a Stanford-trained data scientist. KumoRFM zero-shot reaches 76.71, and fine-tuned reaches 81.14.
What to measure for your use case
Do not evaluate graph ML on aggregate metrics alone. The real value shows up in specific slices:
- Cold-start entities. Measure accuracy on entities with fewer than 5 historical events. This is where graph ML's structural advantage is largest, often 15 to 25 percentage points of AUROC improvement.
- Multi-hop signal tasks. Pick a prediction where the correct answer depends on information 2 or more hops away. For example, predicting whether a customer will return a product based on the return rates of similar products bought by similar customers.
- Temporal dynamics. Measure on tasks where the pattern changes over time. Graph ML with temporal encoding captures shifts that static feature tables miss.
PQL Query

```
PREDICT COUNT(transactions.*, 0, 7) > 3
   AND AVG(transactions.amount, 0, 7) > 5 * AVG(transactions.amount, 0, 90)
FOR EACH accounts.account_id
```
Fraud detection via PQL. The model uses graph structure to identify accounts with anomalous transaction patterns relative to their network of connected merchants, devices, and counterparties.
Output
| account_id | fraud_risk | confidence | graph_signal |
|---|---|---|---|
| ACC-88291 | 0.94 | high | Connected to 3 flagged merchants |
| ACC-12047 | 0.08 | high | Normal pattern for network cluster |
| ACC-55103 | 0.82 | medium | New device + unusual merchant graph |
| ACC-37820 | 0.03 | high | Established pattern, trusted network |
Three paths to production graph ML
Enterprise teams have three viable paths to deploy graph ML, each with different trade-offs on control, speed, and required expertise.
Path 1: Build a custom GNN pipeline
Use PyTorch Geometric or DGL to build a graph neural network from scratch. You control every architectural decision: message passing layers, aggregation functions, attention mechanisms, training procedure.
Requirements: 2 to 3 ML engineers with GNN experience, 6 to 12 months for the first production model, GPU infrastructure for training. Pinterest and DoorDash took this path.
Best for: Organizations with deep ML teams, unique graph structures that differ significantly from standard relational databases, and prediction tasks where architectural customization matters.
Path 2: Use a graph ML platform
Managed platforms handle graph construction, training, and serving. You provide data and define the prediction task. The platform handles the GNN architecture, hyperparameter tuning, and deployment.
Requirements: 1 data scientist, 2 to 4 weeks for the first model. Infrastructure is managed.
Best for: Teams that want graph ML without building GNN expertise in-house. Good for standard relational prediction tasks (churn, fraud, recommendations).
Path 3: Relational foundation model
A pre-trained foundation model like KumoRFM that already understands relational patterns. You connect your database, define the prediction task in one line of PQL (Predictive Query Language), and get predictions. No training, no graph construction, no feature engineering.
Requirements: Any team member who can write SQL. Minutes to first prediction. Zero-shot predictions available immediately; fine-tuning takes hours for higher accuracy.
Best for: Organizations that want the accuracy benefits of graph ML without the infrastructure investment. Ideal for teams running many different prediction tasks across the same relational database.
Comparing the three paths
| Dimension | Custom GNN | Graph ML Platform | Foundation Model |
|---|---|---|---|
| Team Required | 2-3 GNN specialists | 1 data scientist | SQL-literate analyst |
| Time to First Model | 6-12 months | 2-4 weeks | Minutes (zero-shot) |
| First Model Cost | $500K-1M | $50K-150K | $5K-20K |
| Marginal Cost/Task | $50K-200K | $20K-50K | Near-zero |
| Architectural Control | Full | Limited | None |
| Best For | 1-2 unique models | 5-10 standard tasks | 10+ tasks at scale |
The right path depends on how many prediction tasks you need. For most enterprises running 5+ tasks, the foundation model path dominates on economics.
Deployment architecture for graph ML
Production graph ML systems have four components that differ from traditional ML deployments.
1. Graph construction layer
Your relational database needs to be represented as a graph. In custom pipelines, this means ETL jobs that extract entities and relationships and build adjacency matrices. For a 100-million-row database, initial graph construction takes 2 to 8 hours; incremental updates take minutes.
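As a sketch of that ETL step, the snippet below (schema and values are invented for illustration) turns one foreign-key relationship in a toy SQLite database into an edge list, the raw material for an adjacency structure; a production pipeline repeats this for every foreign key in the schema:

```python
import sqlite3

# Minimal relational schema (illustrative): customers place orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id));
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Each foreign-key row becomes one edge: (customer node, order node).
edges = conn.execute(
    "SELECT customer_id, id FROM orders ORDER BY id"
).fetchall()
print(edges)  # [(1, 10), (1, 11), (2, 12)]
```

Incremental updates then reduce to appending edges for new rows rather than re-running the full extraction.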
Foundation models handle this automatically. KumoRFM reads your database schema, identifies entity types and relationships from foreign keys, and constructs the temporal graph on the fly.
2. Embedding computation
GNNs produce embeddings (dense vector representations) for each node. These embeddings encode the node's attributes and its graph neighborhood. For batch predictions, embeddings are computed offline and stored. For real-time predictions, you need a serving layer that can compute or retrieve embeddings in under 100ms.
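A minimal sketch of the batch pattern, with randomly generated stand-ins for the GNN's output embeddings and the task head (all names and dimensions are hypothetical): the serving path is just a dictionary lookup plus a dot product, which comfortably fits a 100ms budget:

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for the offline batch job: one precomputed embedding per node.
node_ids = ["ACC-001", "ACC-002", "ACC-003"]
embeddings = {nid: rng.normal(size=64) for nid in node_ids}

# Task-specific scoring head (in practice, learned alongside the GNN).
task_head = rng.normal(size=64)

def score(node_id):
    """Serving path: retrieve the precomputed embedding, apply the head."""
    emb = embeddings[node_id]
    return float(1 / (1 + np.exp(-emb @ task_head)))  # sigmoid score in (0, 1)

print(score("ACC-001"))
```

In a real deployment the dictionary is a feature store or key-value cache, and the batch job that fills it is the expensive GNN inference run offline.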
3. Prediction serving
Batch predictions are straightforward: run the GNN overnight, store scores in a database, serve them through your existing application. Real-time predictions require a model serving infrastructure (Triton, TensorFlow Serving, or a custom gRPC service) that can handle graph lookups and GNN inference at request time.
Most enterprise deployments start with batch. Fraud detection is the primary use case that requires real-time serving. Recommendations, churn, and lead scoring typically run daily or hourly batches.
4. Graph update pipeline
Graphs change as new transactions, users, and interactions arrive. Your pipeline needs to handle incremental graph updates without rebuilding the entire graph. For custom GNNs, this is the hardest engineering challenge. For platform and foundation model approaches, updates are handled by the platform.
Production results across industries
The following results come from published case studies and benchmark results, not synthetic examples.
E-commerce and marketplaces
DoorDash: 1.8% engagement lift on recommendations across 30 million users using a heterogeneous graph of customers, restaurants, menu items, and delivery interactions. Pinterest: graph ML powers content recommendations for 450 million monthly active users, with the graph containing over 18 billion pins and 200 million boards.
Financial services
Visa: graph-based fraud detection processes billions of transactions, identifying fraud rings that transaction-level models miss. On the RelBench credit card fraud benchmark, GNNs achieve 78.4 AUROC compared to 71.2 for LightGBM with manual features, a 7.2-point improvement that translates to millions in recovered fraud losses.
B2B SaaS
Snowflake: 3.2x lift in expansion revenue prediction by modeling the graph of accounts, users, queries, datasets, and feature usage. Databricks: 5.4x conversion lift on lead scoring by incorporating the relationship graph between companies, contacts, product usage events, and support interactions.
Healthcare
On the RelBench clinical trial benchmark (15 tables, 2.3 million rows), GNNs predict adverse drug events with 12 points higher AUROC than flat models, by learning from the graph of patients, conditions, medications, and treatment protocols.
Common objections and honest answers
"Our data scientists do not know graph ML"
Foundation models remove this barrier. KumoRFM requires zero graph ML expertise. You write a prediction query in PQL, which looks like SQL with a PREDICT clause. If your team can write SQL, they can use a relational foundation model.
"We tried a knowledge graph and it did not work"
Knowledge graphs and graph ML are different things. Knowledge graphs are symbolic (RDF triples, SPARQL queries, ontology engineering). Graph ML is statistical (learned embeddings, message passing, gradient descent). The failure of a knowledge graph project says nothing about graph ML's viability.
"We cannot afford GPU infrastructure"
Custom GNN training requires GPUs. Foundation model inference does not require you to own GPUs because the model runs on managed infrastructure. KumoRFM's zero-shot predictions run in seconds without any training step.
"Our graph is too large"
GNN scaling has improved dramatically. Mini-batch training with neighbor sampling allows training on graphs with billions of edges using a single GPU. Pinterest trains on a graph with 18 billion pins. The RelBench benchmark includes datasets with 41 million rows. If your data fits in a relational database, it fits in a graph.
Getting started: a 30-day evaluation plan
If you are considering graph ML for your organization, here is a practical evaluation plan that takes 30 days and requires no infrastructure investment.
- Week 1: Identify 3 prediction tasks where your current models underperform or where feature engineering is the bottleneck. Prioritize tasks that depend on multi-table data.
- Week 2: Run zero-shot predictions using a relational foundation model on all 3 tasks. Compare AUROC, precision, and recall against your production models.
- Week 3: Fine-tune on the most promising task. Measure the accuracy improvement from fine-tuning and estimate the business impact using your existing conversion models.
- Week 4: Build the business case. Calculate the TCO of your current pipeline (team time, infrastructure, maintenance) versus the foundation model approach. Include the time-to-value difference: months versus days.
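For the Week 2 comparison, something as small as the following is enough. The labels and scores are fabricated for illustration, and the AUROC helper uses the rank-sum (Mann-Whitney) formulation rather than any particular library:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the rank-sum formulation: the fraction of
    (positive, negative) pairs the model ranks correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Hypothetical holdout: labels, plus scores from the incumbent model
# and the zero-shot graph model under evaluation.
y = [1, 0, 1, 1, 0, 0, 1, 0]
incumbent = [0.9, 0.4, 0.3, 0.8, 0.5, 0.2, 0.6, 0.7]
graph_zero_shot = [0.8, 0.3, 0.7, 0.9, 0.2, 0.1, 0.6, 0.75]

print(f"incumbent AUROC:  {auroc(y, incumbent):.3f}")
print(f"zero-shot AUROC:  {auroc(y, graph_zero_shot):.3f}")
```

Run the same comparison on all three tasks, and slice by the cold-start and multi-hop segments described earlier, where the gap should be widest.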
The 30-day evaluation costs nothing beyond team time. If graph ML does not outperform your current approach on any of the 3 tasks, you have a definitive answer in a month. If it does, you have the data to fund a production deployment.