
5 Approaches to Enterprise ML: A Practical Comparison

Manual pipelines, AutoML, LLMs on tables, graph neural networks, and relational foundation models. No vendor names. Just approaches, trade-offs, and when each makes sense.

TL;DR

  • Five approaches compared on RelBench (7 databases, 30 tasks, 103M+ rows): Manual ML 62.44 AUROC, AutoML ~63-65, LLMs on tables 68.06, GNNs 75.83, foundation models 76.71 zero-shot / 81.14 fine-tuned.
  • Manual ML and AutoML both require feature engineering (80% of time). AutoML automates model selection and tuning (the last 20%) but leaves the bottleneck -- 12.3 hours and 878 lines of code per task -- fully intact.
  • LLMs on tables are fast to prototype but structurally limited: 68.06 AUROC reflects a deep mismatch between text-token processing and numerical relational patterns. The 8.6-point gap vs. foundation models is architectural.
  • Cost at 10 tasks: Manual ML $1.5M-5M, AutoML $500K-2M, GNNs $1M-3M, foundation models $100K-300K. Near-zero marginal cost per task is what separates foundation models at scale.
  • Three questions determine the right approach: Is your data relational (3+ tables)? How many prediction tasks do you need (5+ favors foundation models)? Does your team have GNN expertise (if no, foundation models deliver comparable accuracy)?

Enterprise ML has a fragmentation problem. Five distinct approaches compete for the same budget, and each vendor claims theirs is best. The truth is that each approach has a genuine sweet spot and genuine limitations. This comparison strips away the marketing and evaluates all five on the dimensions that matter: accuracy, time to value, team requirements, cost per prediction task, and which data structures they handle.

All accuracy numbers come from the RelBench benchmark (7 databases, 30 tasks, 103M+ rows, temporal splits). This is the only benchmark designed for multi-table relational data with proper temporal evaluation.

Five approaches head-to-head

| Metric | Manual ML | AutoML | LLMs on Tables | Custom GNN | Foundation Model |
|---|---|---|---|---|---|
| AUROC (RelBench) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 / 81.14 |
| Time to 1st Prediction | 3-6 months | 1-3 months | Hours | 3-6 months | Minutes |
| Cost for 10 Models | $1.5M-5M | $500K-2M | $200K-500K | $1M-3M | $100K-300K |
| Feature Engineering | 100% manual | 100% manual | None (serialized) | None (learned) | None (learned) |
| Team Required | 2-3 data scientists | 1 data scientist | 1 ML engineer | 2-3 GNN specialists | SQL-literate analyst |
| Multi-table Support | Manual joins | Manual joins | Serialization | Native graph | Native graph |
| Cold-start Entities | No | No | Limited | Yes | Yes |
| Marginal Cost/Task | $150K-500K | $100K-300K | $50K-100K | $50K-200K | Near-zero |

All accuracy numbers from RelBench (7 databases, 30 tasks, 103M+ rows, temporal splits). Foundation models lead on accuracy, speed, and cost simultaneously.

Approach 1: Manual ML pipelines

A team of data scientists writes SQL to engineer features from your relational database, builds a flat feature table, trains a gradient-boosted model (XGBoost, LightGBM), and deploys it through a serving layer.

How it works

The data scientist studies the database schema, writes SQL joins across relevant tables, computes aggregate features (count, sum, average, max, min across time windows), trains a model on the flat output, tunes hyperparameters, validates with temporal splits, and deploys. Each new prediction task repeats this cycle.
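As a minimal sketch of this cycle (the toy tables, column names, and sklearn's gradient-boosted model stand in for a real schema and XGBoost/LightGBM), the manual pipeline reduces to aggregate, join, and train:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Toy relational data: a customers table and an orders table.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 45, 29, 52],
    "churned": [0, 1, 0, 1],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 12.5, 9.0, 14.0, 30.0],
})

# Manual feature engineering: aggregate the child table per customer.
feats = orders.groupby("customer_id")["amount"].agg(
    order_count="count", total_spend="sum",
    avg_spend="mean", max_spend="max",
).reset_index()

# Build the flat feature table (customer 4 has no orders -> fill with 0).
flat = customers.merge(feats, on="customer_id", how="left").fillna(0)

# Train a gradient-boosted model on the flat output.
X = flat[["age", "order_count", "total_spend", "avg_spend", "max_spend"]]
y = flat["churned"]
model = GradientBoostingClassifier(random_state=0).fit(X, y)
probs = model.predict_proba(X)[:, 1]
```

Every aggregate in `feats` is a human decision; each new prediction task means rewriting this file from scratch against a different slice of the schema.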

Accuracy

On RelBench, LightGBM with features engineered by a Stanford-trained data scientist achieves 62.44 average AUROC on classification tasks. This is the best-effort result with unlimited time, full domain knowledge, and experienced practitioners. The accuracy ceiling is set by what the human can engineer, not by the model's capacity.

Time and cost

3 to 6 months per prediction model. Team of 2 to 3 data scientists at $200K to $300K fully loaded cost each. Per-model cost: $150K to $500K including infrastructure and opportunity cost. For 10 models: $1.5M to $5M over 2 to 3 years.

When it makes sense

  • Single-table data where feature engineering is minimal
  • Highly regulated domains requiring full feature transparency
  • Established teams with deep domain expertise in the specific prediction
  • One or two high-value models that justify the investment

When it breaks down

  • Multi-table data requiring complex cross-table features
  • More than 5 prediction tasks (cost scales linearly with tasks)
  • Cold-start entities with no historical features
  • Teams that cannot hire or retain data scientists

What each approach sees for customer C-482

| Signal | Manual ML | AutoML | LLM | GNN | Foundation Model |
|---|---|---|---|---|---|
| Own attributes (age, balance) | Yes | Yes | Yes | Yes | Yes |
| 30-day order count = 5 | Yes | Yes | Yes | Yes | Yes |
| Orders declining (5, 4, 3, 2, 1) | No (aggregated) | No (aggregated) | Partial | Yes | Yes |
| Bought same products as churners | No (3-hop) | No (3-hop) | No | Yes | Yes |
| Support agent has low resolution rate | No (2-hop) | No (2-hop) | No | Yes | Yes |
| Prediction accuracy (AUROC) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 |

Highlighted: signals that only graph-based approaches capture. The 14-point AUROC gap between manual ML and foundation models comes from these invisible multi-hop and temporal patterns.

Approach 2: AutoML platforms

Upload a flat feature table to an AutoML platform. The platform automatically tests hundreds of model architectures, tunes hyperparameters, selects features, and produces a deployable model.

How it works

You prepare a flat feature table (this step is still manual). The platform runs automated experiments: trying logistic regression, random forests, gradient-boosted trees, neural networks, and ensembles. It selects the best model based on cross-validation performance and provides a deployment endpoint.
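In miniature (a synthetic flat table and three candidates stand in for a platform's search over hundreds of configurations), the automated part of the loop looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The flat feature table -- still prepared manually in a real pipeline.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate model zoo; a platform would also sweep hyperparameters.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score each candidate with cross-validation and keep the best.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
```

Note what the loop never touches: the construction of `X`. That is exactly the 80% that stays manual.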

Accuracy

On single flat tables, AutoML matches expert-tuned models within 1 to 2% accuracy. The platform optimizes the last 20% of the pipeline (model selection, hyperparameters) effectively. But on multi-table relational data, accuracy is capped by the quality of the input feature table, which is still manually engineered. Expected RelBench-equivalent: roughly 63 to 65 AUROC with the same manual features (marginally better model selection does not overcome feature engineering limitations).

Time and cost

Feature engineering: still 4 to 8 weeks per task. Model building: reduced from weeks to hours. Per-model cost: $100K to $300K (feature engineering dominates). Platform license: $50K to $200K per year. For 10 models: $500K to $2M.

When it makes sense

  • Team has feature engineering capacity but limited modeling expertise
  • Multiple similar prediction tasks on the same feature table
  • Need to quickly iterate on model selection and tuning
  • Compliance requires model comparison documentation

When it breaks down

  • Feature engineering is the bottleneck (AutoML does not help)
  • Multi-table data requiring new features for each task
  • Tasks where feature quality, not model choice, limits accuracy

Manual ML pipelines

  • Full control over every decision
  • 62.44 AUROC on RelBench (feature-limited)
  • $150K-500K per model, 3-6 months
  • Requires 2-3 data scientists per model
  • Each new task starts from scratch

AutoML platforms

  • Automates model selection and tuning
  • ~63-65 AUROC (still feature-limited)
  • $100K-300K per model, 1-2 months faster
  • Requires 1 data scientist for features
  • Still needs manual feature engineering

Approach 3: LLMs on tables

Serialize your tables as CSV or JSON text, feed them to a large language model, and prompt it to make predictions.

How it works

Convert table rows into text strings. Feed them to an LLM with a prompt like "Based on this customer's transaction history, will they churn?" The LLM processes the serialized data as a text sequence and outputs a prediction. Some approaches fine-tune the LLM on serialized tabular data.
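A serializer for this step can be sketched in a few lines (the field names and prompt wording are invented for illustration):

```python
def serialize_customer(customer, orders):
    """Flatten one customer's rows into a text prompt for an LLM."""
    lines = [f"{k}: {v}" for k, v in customer.items()]
    lines.append("Recent orders:")
    lines += [f"- date={o['date']}, amount={o['amount']}" for o in orders]
    lines.append("Based on this customer's transaction history, "
                 "will they churn? Answer yes or no.")
    return "\n".join(lines)

prompt = serialize_customer(
    {"customer_id": "C-482", "age": 34, "segment": "Enterprise"},
    [{"date": "2024-05-01", "amount": 20.00},
     {"date": "2024-05-20", "amount": 12.50}],
)
```

The limitation is visible in the output: the numbers arrive as tokens in a string, so "20.00 then 12.50" carries no more numerical structure to the model than any other pair of words, and every foreign-key relationship has been flattened away.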

Accuracy

On RelBench, Llama 3.2 3B achieves 68.06 average AUROC on classification tasks. This is better than the manual LightGBM baseline (62.44) but well below GNNs (75.83) and relational foundation models (76.71). The LLM can apply some patterns from its language pre-training (understanding that "high return rate" is negative), but it misses numerical relationships and graph structure.

Time and cost

Fast to prototype (hours). But inference cost is high: processing serialized tables through a large LLM consumes significant compute. At enterprise scale (millions of predictions), inference costs $50K to $200K per month. Fine-tuning adds $10K to $50K per task.

When it makes sense

  • Quick prototyping when you need a prediction in hours, not months
  • Data with significant text content (product descriptions, customer notes)
  • Low-stakes predictions where 68 AUROC is acceptable
  • Teams with LLM infrastructure but no tabular ML expertise

When it breaks down

  • Numerical precision matters (financial data, sensor readings)
  • Multi-table relational structure carries signal
  • High-volume predictions where inference cost matters
  • Accuracy requirements above 70 AUROC

Approach 4: Graph neural networks

Represent your relational database as a graph (rows as nodes, foreign keys as edges) and train a GNN to learn directly from the connected structure.

How it works

Build an ETL pipeline that converts your relational database into a heterogeneous temporal graph. Design a GNN architecture (message passing layers, aggregation functions, temporal encoding). Train on your data with GPU infrastructure. Deploy through a graph serving layer.
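Before any message passing happens, the ETL step amounts to turning rows into typed nodes and foreign-key references into edges. A minimal sketch with invented toy tables:

```python
# Toy relational database: two tables, one foreign-key relationship.
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "orders": [
        {"order_id": 10, "customer_id": 1},
        {"order_id": 11, "customer_id": 1},
        {"order_id": 12, "customer_id": 2},
    ],
}
primary_keys = {"customers": "customer_id", "orders": "order_id"}
# (child table, foreign-key column) -> parent table
foreign_keys = {("orders", "customer_id"): "customers"}

# One typed node per row.
nodes = [(table, row[primary_keys[table]])
         for table, rows in tables.items() for row in rows]

# One edge per foreign-key value, linking child row to parent row.
edges = []
for (child, fk_col), parent in foreign_keys.items():
    for row in tables[child]:
        edges.append(((child, row[primary_keys[child]]),
                      (parent, row[fk_col])))
```

A production pipeline adds node features, timestamps for temporal correctness, and a graph store; but the schema-to-graph mapping itself is this mechanical, which is also why foundation models can automate it.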

Accuracy

On RelBench, a supervised GNN achieves 75.83 average AUROC on classification tasks. That is a 13.4-point improvement over manual feature engineering, reflecting the GNN's ability to discover multi-hop patterns, temporal sequences, and cross-table interactions that humans cannot enumerate.

Time and cost

First model: 3 to 6 months, team of 2 to 3 ML engineers with GNN expertise. Cost: $500K to $1M. Incremental models: $50K to $200K each (graph and architecture are reusable). GPU infrastructure: $5K to $20K per month. For 10 models: $1M to $3M.

When it makes sense

  • Multi-table relational data with rich connection patterns
  • Prediction tasks where network effects matter (fraud, recommendations)
  • Team with GNN expertise and 6+ months of runway
  • 1 to 3 high-value models that justify the infrastructure investment

When it breaks down

  • No GNN expertise on the team and unable to hire
  • More than 5 prediction tasks (custom training per task)
  • Rapid iteration needed (weeks, not months)
  • Budget constraints on GPU infrastructure

LLMs on tables

  • Fast to prototype (hours)
  • 68.06 AUROC on RelBench
  • High inference cost at scale
  • Misses numerical and relational patterns
  • Good for text-heavy data

Graph neural networks

  • 3-6 months for first model
  • 75.83 AUROC on RelBench
  • Efficient inference after training
  • Captures multi-hop and temporal patterns
  • Requires specialized GNN expertise

Approach 5: Relational foundation models

A pre-trained model that has already learned universal patterns from thousands of relational databases. Connect your data, write a prediction query, get results. No feature engineering, no model training, no GNN expertise.

How it works

The model is pre-trained on data from 5,000+ diverse relational databases. At inference, you connect your database, and the model reads your schema, constructs a temporal graph internally, and makes predictions. You define the task in PQL (Predictive Query Language), which looks like SQL with a PREDICT clause. Zero-shot predictions are immediate. Fine-tuning takes hours for higher accuracy.

Accuracy

On RelBench, zero-shot achieves 76.71 average AUROC, outperforming the supervised GNN (75.83) without any task-specific training. Fine-tuned achieves 81.14 AUROC. The zero-shot result is the key number: it means the pre-training captured enough universal patterns that task-specific training is optional for many use cases.

Time and cost

Zero-shot: minutes. Fine-tuning: 2 to 8 hours. No ML expertise required (SQL is sufficient). Platform cost: varies by data volume and query frequency. For 10 models: $100K to $300K total, because the marginal cost per additional task approaches zero.

When it makes sense

  • Multiple prediction tasks (5+) on the same relational database
  • Time to value matters more than architectural control
  • Team lacks ML or GNN expertise
  • Need to evaluate graph ML potential before committing to custom build
  • Budget-constrained: highest accuracy per dollar spent

When it breaks down

  • Data is not relational (single flat table, images, text-only)
  • Need full architectural control for competitive differentiation
  • Extreme regulatory requirements that prohibit pre-trained models

Head-to-head summary

| Dimension | Manual ML | AutoML | LLMs | GNNs | Foundation model |
|---|---|---|---|---|---|
| AUROC (RelBench) | 62.44 | ~63-65 | 68.06 | 75.83 | 76.71 / 81.14 |
| Time to first prediction | 3-6 months | 1-3 months | Hours | 3-6 months | Minutes |
| Cost for 10 models | $1.5M-5M | $500K-2M | $200K-500K | $1M-3M | $100K-300K |
| Team required | 2-3 data scientists | 1 data scientist | 1 ML engineer | 2-3 GNN specialists | SQL-literate analyst |
| Multi-table handling | Manual joins | Manual joins | Serialization | Native graph | Native graph |
| Cold-start support | No | No | Limited | Yes | Yes |
| Feature engineering | 100% manual | 100% manual | None (serialized) | None (learned) | None (learned) |

PQL Query

```sql
PREDICT COUNT(orders.*, 0, 30) > 0
FOR EACH customers.customer_id
WHERE customers.segment = 'Enterprise'
```

This single PQL query delivers what takes 3-6 months and $150K-500K via manual ML. The foundation model reads the relational schema, constructs the graph, and predicts -- no feature engineering, no training, no pipeline.

Output

| customer_id | prediction | confidence | approach_comparison |
|---|---|---|---|
| ENT-4821 | 0.87 | high | Manual ML: 3-6 months to match |
| ENT-1093 | 0.34 | high | AutoML: still needs feature table |
| ENT-7756 | 0.15 | high | LLM: 68 vs 77 AUROC on this task |
| ENT-3302 | 0.94 | high | GNN: matches accuracy, 100x slower |

Decision framework

Ask three questions to determine which approach fits:

  • Is your data relational (3+ connected tables)? If no, manual ML or AutoML on a single table is sufficient. If yes, graph-based approaches (GNN or foundation model) provide a structural accuracy advantage.
  • How many prediction tasks do you need? For 1 to 2, any approach works. For 5+, the marginal cost per task matters, and foundation models win on economics. For 10+, manual approaches become impractical.
  • Does your team have GNN expertise? If yes and you need maximum architectural control, custom GNNs are justified. If no, a foundation model delivers comparable accuracy without the hiring challenge.
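One way to encode the three questions above as a routing function (thresholds taken directly from this article; the function itself is illustrative, not a product feature):

```python
def recommend_approach(num_tables, num_tasks, has_gnn_team):
    """Route to an approach using the three-question framework."""
    # Q1: is the data relational (3+ connected tables)?
    if num_tables < 3:
        return "manual ML or AutoML on a single table"
    # Q2: at 5+ tasks, marginal cost per task dominates.
    if num_tasks >= 5:
        return "relational foundation model"
    # Q3: GNN expertise justifies a custom build for a few high-value tasks.
    if has_gnn_team:
        return "custom GNN"
    return "relational foundation model"
```

For example, a team with one flat table and two tasks routes to manual ML or AutoML, while a team with a six-table schema and ten tasks routes to a foundation model regardless of in-house expertise.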

The trend line is clear: enterprise ML is moving from manual, single-task pipelines toward pre-trained, multi-task foundation models. Not because foundation models are always better on a single task, but because the economics of running 10 to 100 predictions make per-task approaches untenable.

KumoRFM was built by the team behind the ML systems at Pinterest, Airbnb, and LinkedIn: Vanja Josifovski (CEO, former CTO at Airbnb and Pinterest), Jure Leskovec (Chief Scientist, Stanford professor, co-creator of GraphSAGE), and Hema Raghavan (Head of Engineering, former Sr. Director at LinkedIn). Backed by Sequoia Capital.

Frequently asked questions

Which ML approach is best for enterprise use?

It depends on your data structure and team. For single-table data with an established team, manual ML pipelines remain effective. For multi-table relational data with multiple prediction tasks, relational foundation models provide the best accuracy-to-effort ratio: 76.71 AUROC zero-shot on RelBench with no feature engineering, compared to 62.44 for manual LightGBM with full feature engineering.

Can AutoML replace a data science team?

AutoML automates model selection and hyperparameter tuning, which is the last 20% of the ML pipeline. It does not automate feature engineering (80% of time), data quality assessment, or business integration. You still need a data scientist to prepare the flat feature table that AutoML requires as input. AutoML reduces the modeling effort, not the total effort.

When should I use a graph neural network instead of XGBoost?

Use a GNN when your prediction depends on information spread across 3 or more connected tables, when network effects matter (fraud, recommendations, social), or when cold-start entities are common. On RelBench, GNNs outperform XGBoost by 13+ AUROC points on multi-table tasks. For single flat tables, XGBoost remains competitive.

Are LLMs good at structured data prediction?

No. On the RelBench benchmark, Llama 3.2 3B achieved 68.06 AUROC on classification tasks by serializing tables as text. GNNs achieved 75.83 and KumoRFM zero-shot achieved 76.71. LLMs process structured data as tokens and miss the numerical relationships, schema structure, and relational patterns that purpose-built models capture.

What is the total cost difference between these approaches?

For 10 prediction tasks: Manual ML pipelines cost $1.5M-5M (team time, 3-6 months per model). AutoML costs $500K-2M (reduces modeling time but not feature engineering). Custom GNNs cost $1M-3M (expensive first model, cheaper incremental). LLMs on tables cost $200K-500K (compute-heavy, lower accuracy). Foundation models cost $100K-300K (near-zero marginal cost per task).

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.