In 2021, DoorDash published a blog post describing how they rebuilt their recommendation system using graph neural networks. The result was a 1.8% engagement lift across 30 million users. That may sound small until you calculate the revenue impact at DoorDash's scale: roughly $50 million in incremental annual GMV.
Two years later, graph ML is running in production at Pinterest (content recommendations for 450 million monthly users), Visa (fraud detection on billions of transactions), Snap (friend suggestions), and dozens of Fortune 500 companies that do not publicize their implementations. Snowflake used it internally for expansion revenue prediction and saw a 3.2x lift over their previous gradient-boosted model.
Yet most enterprise data science teams have never shipped a graph ML model. The gap is not technical. It is informational. Teams do not know when graph ML adds value, what the evaluation criteria should be, or what deployment actually requires. This guide addresses all three questions.
What graph ML actually does
Every machine learning model needs to find patterns in data. Traditional models (logistic regression, XGBoost, random forests) find patterns in flat tables: rows of features, one per entity. Graph ML finds patterns in connected structures: entities as nodes, relationships as edges.
The distinction matters because most enterprise data is inherently relational. A customer is connected to orders, orders to products, products to categories, categories to seasonal trends. A patient is connected to diagnoses, prescriptions, lab results, providers, and insurance claims. These connections carry predictive signal that flat tables cannot represent.
Consider churn prediction. A traditional model might use features like days since last purchase, total spend, and number of support tickets. A graph ML model sees that plus the following: the customer's neighbors (people who bought similar products) are churning at 3x the normal rate. The products they recently purchased have a 40% return rate. The support agent they interacted with has a resolution rate 20 points below average.
None of those signals exist in a flat feature table unless someone manually engineers them. And nobody does, because the combinatorial space of possible multi-hop features is too large to explore by hand.
Production graph ML deployments
| Company | Use Case | Graph Scale | Result | Year |
|---|---|---|---|---|
| DoorDash | Recommendations | 30M users, heterogeneous | 1.8% engagement lift | 2021 |
| Pinterest | Content recommendations | 18B pins, 450M MAU | Core ranking system | 2018 |
| Visa | Fraud detection | Billions of transactions | Fraud ring detection | 2020 |
| Snowflake | Expansion revenue | Accounts-users-queries graph | 3.2x lift over GBT | 2023 |
| Databricks | Lead scoring | Companies-contacts-usage graph | 5.4x conversion lift | 2023 |
These are published production deployments, not research prototypes. Graph ML is running at Fortune 500 scale.
How GNNs learn from structure
A graph neural network works through message passing. Each node collects information from its neighbors, aggregates it, and updates its own representation. After multiple rounds, each node's embedding encodes information from its entire local neighborhood, not just its own attributes.
With 3 rounds of message passing, a node's representation captures information from every entity within 3 hops. For a customer in an e-commerce graph, that includes their orders, the products in those orders, other customers who bought those products, and those customers' order histories. That is the kind of signal that takes a data scientist weeks to engineer manually and typically still misses the most informative patterns.
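The collect-aggregate-update loop is easy to sketch. The toy example below (the four-node graph, feature dimensions, and random weights are all invented for illustration) runs three rounds of mean-aggregation message passing in plain NumPy; real GNN layers in libraries like PyTorch Geometric follow the same pattern with learned weights and larger graphs:

```python
import numpy as np

# Toy graph: 4 nodes, undirected adjacency list (invented for illustration).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}

# Initial node features: one 3-dimensional vector per node.
h = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

rng = np.random.default_rng(0)
W_self = rng.normal(size=(3, 3))   # transform for the node's own state
W_neigh = rng.normal(size=(3, 3))  # transform for the aggregated neighbors

def message_passing_round(h):
    """One round: each node averages its neighbors' vectors, then
    combines that with its own vector through a nonlinearity."""
    new_h = np.zeros_like(h)
    for v, nbrs in neighbors.items():
        agg = h[nbrs].mean(axis=0)  # collect + aggregate neighbor messages
        new_h[v] = np.tanh(h[v] @ W_self + agg @ W_neigh)  # update
    return new_h

# After 3 rounds, each node's vector reflects its 3-hop neighborhood.
for _ in range(3):
    h = message_passing_round(h)
print(h.shape)  # (4, 3)
```

Three applications of the round correspond to the three layers in the table below: each round widens the neighborhood a node's embedding can see by one hop.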
GNN message passing: a concrete example
| Layer | Customer C-401 Sees | New Information | Embedding Updates |
|---|---|---|---|
| Input | Own attributes only | segment=Enterprise, tenure=3yr, ARR=$180K | Initial node vector |
| Layer 1 | 5 orders, 2 support tickets | avg_order=$12K, 1 escalated ticket | Adds interaction patterns |
| Layer 2 | 8 products, 3 support agents, 4 invoices | 2 products have 30%+ churn rate among buyers | Adds product-risk signal |
| Layer 3 | 52 similar customers (via shared products) | 38 of 52 are active; 14 churned last quarter | Adds peer-behavior signal |
By layer 3, the model knows that 27% of similar customers (those who bought the same products) churned recently. This peer-churn signal is the strongest predictor, but it requires a three-hop traversal across tables -- no flat feature table captures it.
When graph ML adds value over traditional approaches
Graph ML is not universally superior to traditional ML. There are specific conditions where it provides measurable uplift, and conditions where the added complexity is not justified.
High-value scenarios
Graph ML consistently outperforms traditional approaches in four situations:
- Multi-table relational data. If your prediction depends on information spread across 3 or more tables, graph ML eliminates the feature engineering bottleneck. The RelBench benchmark showed that GNNs outperformed LightGBM with manual features on 11 of 12 classification tasks across 7 multi-table databases.
- Network effects matter. Fraud detection, social recommendations, and marketplace dynamics all depend on how entities relate to each other. A fraudulent transaction is not just about the transaction attributes; it is about the network of accounts, devices, and merchants connected to it.
- Cold-start problems. New users with no history are impossible for traditional models. Graph ML can predict their behavior based on the entities they are connected to: the product they first viewed, the channel they came from, the referrer who invited them.
- Feature engineering is the bottleneck. If your data science team spends 60% or more of their time writing SQL joins and aggregations, graph ML eliminates that step entirely. The RelBench study measured the cost: 12.3 hours and 878 lines of code per prediction task for experienced data scientists.
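The cold-start idea above can be made concrete. Here is a minimal sketch (entity names and rates are hypothetical) of scoring a brand-new user purely from the historical conversion rates of the entities it is connected to, shrunk toward a global prior -- no user history required:

```python
# Hypothetical data: conversion rates of entities a new user is connected to.
entity_rates = {
    "product:P-100": 0.12,        # conversion rate of the first product viewed
    "channel:paid_search": 0.08,  # conversion rate of the acquisition channel
    "referrer:U-7781": 0.21,      # conversion rate of the inviter's referrals
}

def cold_start_score(connected_entities, rates, prior=0.05, prior_weight=2.0):
    """Estimate a new user's conversion probability from its graph
    neighbors, shrunk toward a global prior when evidence is thin."""
    vals = [rates[e] for e in connected_entities if e in rates]
    total = sum(vals) + prior * prior_weight
    return total / (len(vals) + prior_weight)

score = cold_start_score(entity_rates.keys(), entity_rates)
print(round(score, 4))
```

A trained GNN does this with learned, weighted aggregation instead of a plain average, but the underlying mechanism is the same: signal propagates from neighbors to a node that has no history of its own.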
Low-value scenarios
Graph ML adds less value in these situations:
- Single-table data. If your prediction depends on one flat table with well-defined features, XGBoost or a neural network will perform comparably with less infrastructure.
- Extremely sparse graphs. If entities have very few connections (less than 2 edges per node on average), the graph structure carries minimal signal.
- Real-time latency under 10ms. GNN inference on large graphs can take 50 to 200ms. If you need sub-10ms latency, you will need a pre-computed embedding approach or a simpler model.
Traditional ML approach
- Flatten 10-50 tables into a single feature table
- Engineer 100-500 features manually per use case
- Miss multi-hop and network-effect signals
- Rebuild pipeline for every new prediction task
- 3-6 month cycle per model
Graph ML approach
- Represent database as a graph directly
- Model learns features from relational structure
- Captures multi-hop, temporal, and network patterns
- Same architecture handles any prediction task
- Days to weeks per model with foundation models
Graph ML value by scenario
| Scenario | Traditional ML AUROC | Graph ML AUROC | Uplift | Why Graph Wins |
|---|---|---|---|---|
| Multi-table relational | 62.44 | 75.83 | +13.4 pts | Cross-table patterns |
| Fraud with ring patterns | 71.2 | 78.4 | +7.2 pts | Network topology |
| Cold-start users | ~50 (random) | 65-75 | +15-25 pts | Neighbor signal propagation |
| Single flat table | 82-85 | 82-85 | ~0 pts | No structural advantage |
| Sparse graph (<2 edges/node) | 75-80 | 76-81 | +1-2 pts | Minimal graph signal |
Graph ML's advantage scales with relational complexity. On single-table data, XGBoost remains competitive.
Evaluating graph ML: the right benchmarks
Most ML benchmarks use single-table datasets (UCI, Kaggle competitions) where graph ML has no structural advantage. To evaluate graph ML properly, you need benchmarks designed for relational data.
RelBench: the standard benchmark
RelBench is the first standardized benchmark for ML on relational databases. Published at NeurIPS 2024 by researchers at Stanford and Kumo.ai, it includes 7 databases, 30 prediction tasks, and over 103 million rows. Each database has 3 to 15 interconnected tables. Tasks include classification (churn, fraud, conversion) and regression (lifetime value, demand forecasting).
The benchmark enforces temporal splits: training data comes before the evaluation period, and test data comes after. This prevents data leakage, which inflates accuracy in many published results. On RelBench, GNNs achieve an average AUROC of 75.83 on classification tasks, compared to 62.44 for LightGBM with features engineered by a Stanford-trained data scientist. KumoRFM zero-shot reaches 76.71, and fine-tuned reaches 81.14.
What to measure for your use case
Do not evaluate graph ML on aggregate metrics alone. The real value shows up in specific slices:
- Cold-start entities. Measure accuracy on entities with fewer than 5 historical events. This is where graph ML's structural advantage is largest, often 15 to 25 percentage points of AUROC improvement.
- Multi-hop signal tasks. Pick a prediction where the correct answer depends on information 2 or more hops away. For example, predicting whether a customer will return a product based on the return rates of similar products bought by similar customers.
- Temporal dynamics. Measure on tasks where the pattern changes over time. Graph ML with temporal encoding captures shifts that static feature tables miss.
PQL Query

```
PREDICT COUNT(transactions.*, 0, 7) > 3
   AND AVG(transactions.amount, 0, 7) > 5 * AVG(transactions.amount, 0, 90)
FOR EACH accounts.account_id
```
Fraud detection via PQL. The model uses graph structure to identify accounts with anomalous transaction patterns relative to their network of connected merchants, devices, and counterparties.
Output
| account_id | fraud_risk | confidence | graph_signal |
|---|---|---|---|
| ACC-88291 | 0.94 | high | Connected to 3 flagged merchants |
| ACC-12047 | 0.08 | high | Normal pattern for network cluster |
| ACC-55103 | 0.82 | medium | New device + unusual merchant graph |
| ACC-37820 | 0.03 | high | Established pattern, trusted network |
Three paths to production graph ML
Enterprise teams have three viable paths to deploy graph ML, each with different trade-offs on control, speed, and required expertise.
Path 1: Build a custom GNN pipeline
Use PyTorch Geometric or DGL to build a graph neural network from scratch. You control every architectural decision: message passing layers, aggregation functions, attention mechanisms, training procedure.
Requirements: 2 to 3 ML engineers with GNN experience, 6 to 12 months for the first production model, GPU infrastructure for training. Pinterest and DoorDash took this path.
Best for: Organizations with deep ML teams, unique graph structures that differ significantly from standard relational databases, and prediction tasks where architectural customization matters.
Path 2: Use a graph ML platform
Managed platforms handle graph construction, training, and serving. You provide data and define the prediction task. The platform handles the GNN architecture, hyperparameter tuning, and deployment.
Requirements: 1 data scientist, 2 to 4 weeks for the first model. Infrastructure is managed.
Best for: Teams that want graph ML without building GNN expertise in-house. Good for standard relational prediction tasks (churn, fraud, recommendations).
Path 3: Relational foundation model
A pre-trained foundation model like KumoRFM that already understands relational patterns. You connect your database, define the prediction task in one line of PQL (Predictive Query Language), and get predictions. No training, no graph construction, no feature engineering.
Requirements: Any team member who can write SQL. Minutes to first prediction. Zero-shot predictions available immediately; fine-tuning takes hours for higher accuracy.
Best for: Organizations that want the accuracy benefits of graph ML without the infrastructure investment. Ideal for teams running many different prediction tasks across the same relational database.
Comparing the three paths
| Dimension | Custom GNN | Graph ML Platform | Foundation Model |
|---|---|---|---|
| Team Required | 2-3 GNN specialists | 1 data scientist | SQL-literate analyst |
| Time to First Model | 6-12 months | 2-4 weeks | Minutes (zero-shot) |
| First Model Cost | $500K-1M | $50K-150K | $5K-20K |
| Marginal Cost/Task | $50K-200K | $20K-50K | Near-zero |
| Architectural Control | Full | Limited | None |
| Best For | 1-2 unique models | 5-10 standard tasks | 10+ tasks at scale |
The right path depends on how many prediction tasks you need. For most enterprises running 5+ tasks, the foundation model path dominates on economics.
Deployment architecture for graph ML
Production graph ML systems have four components that differ from traditional ML deployments.
1. Graph construction layer
Your relational database needs to be represented as a graph. In custom pipelines, this means ETL jobs that extract entities and relationships and build adjacency matrices. For a 100-million-row database, initial graph construction takes 2 to 8 hours; incremental updates take minutes.
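As a sketch of that ETL step, the snippet below (schema and values are invented for illustration) turns one foreign-key relationship in a toy SQLite database into an edge list, the raw material for an adjacency structure; a production pipeline repeats this for every foreign key in the schema:

```python
import sqlite3

# Minimal relational schema (illustrative): customers place orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id));
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Each foreign-key row becomes one edge: (customer node, order node).
edges = conn.execute(
    "SELECT customer_id, id FROM orders ORDER BY id"
).fetchall()
print(edges)  # [(1, 10), (1, 11), (2, 12)]
```

Incremental updates then reduce to appending edges for new rows rather than re-running the full extraction.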
Foundation models handle this automatically. KumoRFM reads your database schema, identifies entity types and relationships from foreign keys, and constructs the temporal graph on the fly.
2. Embedding computation
GNNs produce embeddings (dense vector representations) for each node. These embeddings encode the node's attributes and its graph neighborhood. For batch predictions, embeddings are computed offline and stored. For real-time predictions, you need a serving layer that can compute or retrieve embeddings in under 100ms.
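A minimal sketch of the batch pattern, with randomly generated stand-ins for the GNN's output embeddings and the task head (all names and dimensions are hypothetical): the serving path is just a dictionary lookup plus a dot product, which comfortably fits a 100ms budget:

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for the offline batch job: one precomputed embedding per node.
node_ids = ["ACC-001", "ACC-002", "ACC-003"]
embeddings = {nid: rng.normal(size=64) for nid in node_ids}

# Task-specific scoring head (in practice, learned alongside the GNN).
task_head = rng.normal(size=64)

def score(node_id):
    """Serving path: retrieve the precomputed embedding, apply the head."""
    emb = embeddings[node_id]
    return float(1 / (1 + np.exp(-emb @ task_head)))  # sigmoid score in (0, 1)

print(score("ACC-001"))
```

In a real deployment the dictionary is a feature store or key-value cache, and the batch job that fills it is the expensive GNN inference run offline.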
3. Prediction serving
Batch predictions are straightforward: run the GNN overnight, store scores in a database, serve them through your existing application. Real-time predictions require a model serving infrastructure (Triton, TensorFlow Serving, or a custom gRPC service) that can handle graph lookups and GNN inference at request time.
Most enterprise deployments start with batch. Fraud detection is the primary use case that requires real-time serving. Recommendations, churn, and lead scoring typically run daily or hourly batches.
4. Graph update pipeline
Graphs change as new transactions, users, and interactions arrive. Your pipeline needs to handle incremental graph updates without rebuilding the entire graph. For custom GNNs, this is the hardest engineering challenge. For platform and foundation model approaches, updates are handled by the platform.
Production results across industries
The following results come from published case studies and benchmark results, not synthetic examples.
E-commerce and marketplaces
DoorDash: 1.8% engagement lift on recommendations across 30 million users using a heterogeneous graph of customers, restaurants, menu items, and delivery interactions. Pinterest: graph ML powers content recommendations for 450 million monthly active users, with the graph containing over 18 billion pins and 200 million boards.
Financial services
Visa: graph-based fraud detection processes billions of transactions, identifying fraud rings that transaction-level models miss. On the RelBench credit card fraud benchmark, GNNs achieve 78.4 AUROC compared to 71.2 for LightGBM with manual features, a 7.2-point improvement that translates to millions in recovered fraud losses.
B2B SaaS
Snowflake: 3.2x lift in expansion revenue prediction by modeling the graph of accounts, users, queries, datasets, and feature usage. Databricks: 5.4x conversion lift on lead scoring by incorporating the relationship graph between companies, contacts, product usage events, and support interactions.
Healthcare
On the RelBench clinical trial benchmark (15 tables, 2.3 million rows), GNNs predict adverse drug events with 12 points higher AUROC than flat models, by learning from the graph of patients, conditions, medications, and treatment protocols.
Common objections and honest answers
"Our data scientists do not know graph ML"
Foundation models remove this barrier. KumoRFM requires zero graph ML expertise. You write a prediction query in PQL, which looks like SQL with a PREDICT clause. If your team can write SQL, they can use a relational foundation model.
"We tried a knowledge graph and it did not work"
Knowledge graphs and graph ML are different things. Knowledge graphs are symbolic (RDF triples, SPARQL queries, ontology engineering). Graph ML is statistical (learned embeddings, message passing, gradient descent). The failure of a knowledge graph project says nothing about graph ML's viability.
"We cannot afford GPU infrastructure"
Custom GNN training requires GPUs. Foundation model inference does not require you to own GPUs because the model runs on managed infrastructure. KumoRFM's zero-shot predictions run in seconds without any training step.
"Our graph is too large"
GNN scaling has improved dramatically. Mini-batch training with neighbor sampling allows training on graphs with billions of edges using a single GPU. Pinterest trains on a graph with 18 billion pins. The RelBench benchmark includes datasets with 41 million rows. If your data fits in a relational database, it fits in a graph.
Getting started: a 30-day evaluation plan
If you are considering graph ML for your organization, here is a practical evaluation plan that takes 30 days and requires no infrastructure investment.
- Week 1: Identify 3 prediction tasks where your current models underperform or where feature engineering is the bottleneck. Prioritize tasks that depend on multi-table data.
- Week 2: Run zero-shot predictions using a relational foundation model on all 3 tasks. Compare AUROC, precision, and recall against your production models.
- Week 3: Fine-tune on the most promising task. Measure the accuracy improvement from fine-tuning and estimate the business impact using your existing conversion models.
- Week 4: Build the business case. Calculate the TCO of your current pipeline (team time, infrastructure, maintenance) versus the foundation model approach. Include the time-to-value difference: months versus days.
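For the Week 2 comparison, something as small as the following is enough. The labels and scores are fabricated for illustration, and the AUROC helper uses the rank-sum (Mann-Whitney) formulation rather than any particular library:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the rank-sum formulation: the fraction of
    (positive, negative) pairs the model ranks correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Hypothetical holdout: labels, plus scores from the incumbent model
# and the zero-shot graph model under evaluation.
y = [1, 0, 1, 1, 0, 0, 1, 0]
incumbent = [0.9, 0.4, 0.3, 0.8, 0.5, 0.2, 0.6, 0.7]
graph_zero_shot = [0.8, 0.3, 0.7, 0.9, 0.2, 0.1, 0.6, 0.75]

print(f"incumbent AUROC:  {auroc(y, incumbent):.3f}")
print(f"zero-shot AUROC:  {auroc(y, graph_zero_shot):.3f}")
```

Run the same comparison on all three tasks, and slice by the cold-start and multi-hop segments described earlier, where the gap should be widest.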
The 30-day evaluation costs nothing beyond team time. If graph ML does not outperform your current approach on any of the 3 tasks, you have a definitive answer in a month. If it does, you have the data to fund a production deployment.