May 2025

Understanding KumoRFM

The first foundation model for relational data: why it was needed, how it works, and what the results show.

Matthias Fey, Vid Kocijan, Federico Lopez, Jan Eric Lenssen, Jure Leskovec
01

The Problem

Foundation models have transformed how we work with unstructured data. Large language models handle arbitrary text tasks. Diffusion models generate images from descriptions. Code models autocomplete entire functions. In each case, a single pre-trained model generalizes across tasks without retraining.

But the most common data type inside enterprises, structured relational data stored across interconnected database tables, has no equivalent. Every prediction task (churn, fraud, recommendations, demand forecasting) still requires a team to manually engineer features, train a task-specific model, and maintain a bespoke pipeline.

A typical enterprise database has 10-50 interconnected tables: users → orders → products → categories; customers → transactions → accounts; patients → visits → doctors. Billions of rows, all connected through foreign keys. The data is rich, relational, and temporal.

Now ask a question: “Which customers will churn next month?” To answer this with conventional ML, you need to flatten those tables into a single feature table, engineer hundreds of aggregate features, train a model, validate it, and deploy it. The paper measured this precisely: a data scientist with a Stanford CS Master's degree and five years of experience needs 12.3 hours and 878 ± 77 lines of code for a single prediction task. And the process restarts from scratch for every new question.
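
To make the manual flattening step concrete, here is a minimal sketch in plain Python. The table, column names, and data are illustrative (not from the paper's datasets); real pipelines do this with SQL joins and a feature store, but the shape of the work is the same: collapse a one-to-many transactions table into one aggregate feature row per entity.

```python
from datetime import date

# Hypothetical transactions table (one row per purchase); data and column
# names are illustrative, not from the paper's datasets.
transactions = [
    {"user_id": 1, "order_date": date(2025, 4, 2), "value": 30.0},
    {"user_id": 1, "order_date": date(2025, 4, 20), "value": 45.0},
    {"user_id": 2, "order_date": date(2025, 1, 15), "value": 12.0},
]

def flatten_features(rows, as_of):
    """Collapse a one-to-many transactions table into one feature row per
    user: the manual flattening step described above."""
    feats = {}
    for r in rows:
        f = feats.setdefault(r["user_id"],
                             {"order_count": 0, "total_value": 0.0, "last_order": None})
        f["order_count"] += 1
        f["total_value"] += r["value"]
        if f["last_order"] is None or r["order_date"] > f["last_order"]:
            f["last_order"] = r["order_date"]
    for f in feats.values():
        f["avg_order_value"] = f["total_value"] / f["order_count"]
        f["days_since_last_order"] = (as_of - f.pop("last_order")).days
    return feats

features = flatten_features(transactions, as_of=date(2025, 5, 1))
```

Multiply this by hundreds of candidate features and dozens of tables, and the 12.3-hour figure becomes plausible.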

The real cost: what gets lost

Yes, feature engineering is time-consuming. But many teams have already paid that cost. They have production pipelines, mature feature stores, and data scientists who know the domain. The time argument alone doesn't explain why these models still underperform.

The deeper problem is what you miss. No human can enumerate the combinatorial space of patterns that exist across interconnected tables. Feature engineering forces you to guess which relationships matter and manually encode them. You always miss more than you find. The paper's benchmark results show this across real datasets:

Example: fashion retail churn (H&M dataset, 3 tables, 16.6M rows)

The paper evaluates on H&M's actual retail data: customers, transactions, and articles tables. The task: predict which customers will churn.

A typical feature table captures purchase frequency over time windows, average order value, days since last purchase, preferred product categories. Good features. Standard practice. But when KumoRFM analyzes a customer predicted to churn, its explainability output surfaces a specific combination of signals:

  • Order count combined with fashion news subscription and club membership status. Users with few past orders who also don't subscribe to fashion news and lack active club membership show dramatically higher churn probability. A feature engineer might include each of these as separate columns, but the interaction between low order count + no fashion news + no club membership is what drives the prediction. That three-way combination across columns is the kind of pattern that gets missed.
  • Order date recency at specific granularities. Not just “days since last order” but the temporal pattern of orders: are they accelerating, decelerating, or clustering? KumoRFM's time encoder captures these dynamics natively because it operates on the raw transaction sequence, not a pre-aggregated number.

Example: Formula 1 race outcomes (9 tables, 67 columns)

The rel-f1 dataset connects drivers, results, races, circuits, constructors, qualifying, and more across 9 tables. The task: predict whether a driver will finish in the top 3.

A feature table might capture career wins, recent finishes, qualifying position, constructor championship standing. But the multi-hop relationships carry signal no human would manually encode: a driver's qualifying performance at circuits with specific characteristics, their constructor's reliability history at those track types, weather-correlated performance of drivers with similar racing profiles. That requires traversing drivers → results → races → circuits → constructors → qualifying across multiple hops and time windows.

KumoRFM scores 91.07 AUROC on this task zero-shot. A human data scientist with full access to the data and unlimited time achieves 82.40. The gap isn't speed. It's coverage. The model finds patterns across 9 tables that a human simply doesn't explore.

Example: clinical trial outcomes (15 tables, 140 columns)

The rel-trial dataset contains studies, sponsors, sites, conditions, and interventions across 15 interconnected tables and 140 columns. Predicting whether a clinical trial will succeed depends on the sponsor's track record with similar conditions, the performance of trial sites, enrollment patterns, and how related conditions responded to similar interventions. This is 15 tables of context that a flat feature table compresses into a handful of aggregates, losing most of the relational signal in the process.

These are real datasets from the paper's benchmark (RelBench). In every case, the multi-table, multi-hop, time-aware patterns exist in the relational structure. Feature engineering requires a human to imagine each pattern, write the SQL to extract it, and validate that it helps. The combinatorial space is enormous, and humans explore a tiny corner of it.

The research question is direct: can we build a foundation model for relational data? One that learns general patterns across databases and generalizes to new schemas and new tasks without retraining? KumoRFM is the first model to demonstrate that the answer is yes.

02

Why Not XGBoost?

Gradient-boosted trees (XGBoost, LightGBM, CatBoost) are the default tool for tabular prediction. They're fast, robust, and well-understood. But they have a structural limitation when applied to relational data: they require a flat, single-table input. One row per entity, every feature in a column.

Real enterprise data isn't flat. It's a web of tables connected by foreign keys. To use XGBoost, you must first flatten this web: write SQL joins across all relevant tables, compute aggregate features (counts, sums, averages over time windows), and collapse everything into one row per entity.

The context and nuance that gets lost

This flattening process isn't just slow. It fundamentally destroys information. Here is what a flat feature table cannot represent:

  • Multi-hop relationships. A customer's behavior is influenced by the products they bought, which categories those belong to, which other customers bought similar products, and what those customers did next. These transitive, multi-hop patterns contain some of the strongest predictive signals, and they disappear entirely when you collapse everything into aggregate counts and averages.
  • Temporal sequences and ordering. A flat table might capture “this user placed 3 orders in the last 30 days.” But it loses the sequence: did they place all 3 in the first week and then go silent? Or did they place one per week with increasing order value? These temporal patterns carry completely different signals, but the aggregate (count = 3) is the same.
  • Graph topology. Fraud rings share structural signatures: tightly connected clusters of accounts transacting with each other. Supply chain cascades propagate through specific network topologies. Social influence patterns follow community structures. All of this lives in the shape of connections. Flattening destroys topology entirely.
  • Entity-level context. A flat feature row for user #4271 says “avg_order_value = $47, order_count = 12, days_since_last_order = 8.” But it doesn't capture that their last 3 orders were returns, that they contacted support twice about the same issue, or that the product category they browse most has been out of stock. Each of these is a row in a different table, and each carries nuance that an aggregate feature flattens away.
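
The temporal point above can be shown in a few lines. Two hypothetical users (illustrative data, not from any dataset) produce the identical flat feature "3 orders in the last 30 days", while even a single extra sequence-level statistic separates them:

```python
from datetime import date

# Two illustrative users, both with "3 orders in the last 30 days".
burst  = [date(2025, 4, 1), date(2025, 4, 2), date(2025, 4, 3)]    # then silence
steady = [date(2025, 4, 5), date(2025, 4, 15), date(2025, 4, 25)]  # evenly spaced

def order_count(dates, as_of, window_days=30):
    """The flat aggregate a feature table would store."""
    return sum((as_of - d).days <= window_days for d in dates)

def mean_gap_days(dates):
    """Average spacing between consecutive orders: one temporal signal the
    flat count throws away."""
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    return sum(gaps) / len(gaps)

as_of = date(2025, 4, 30)
# Identical flat feature (count = 3), sharply different underlying behavior.
assert order_count(burst, as_of) == order_count(steady, as_of) == 3
```

A model operating on the raw event sequence never has to choose which such statistics to precompute.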

No reuse across tasks

The other fundamental problem is that every new prediction task needs its own feature engineering. Churn prediction needs different aggregations than fraud detection. Recommendation models need different join paths than lead-scoring models. There is no transfer, no reuse. Each task restarts the feature engineering cycle, which is why teams spend months on pipelines that only answer one question.

03

Why Not GNNs?

Graph Neural Networks address the representation problem directly. They model the relational database as a graph (rows become nodes, foreign-key relationships become edges) and learn via message passing between connected nodes. Multi-hop patterns, temporal dependencies, and graph topology are preserved in the learned representation.

Methods like GraphSAGE, GCN, and Graph Transformers demonstrated that learning directly on the relational graph outperforms flat-table approaches. The Relational Deep Learning (RDL) framework formalized this into reproducible pipelines and showed consistent improvements over manual feature engineering.

Where GNNs fall short

  1. No generalization across tasks or databases. A GNN trained for churn prediction cannot be reused for fraud detection. A model trained on one database cannot transfer to another. The architecture, input layer, and edge types are all tied to the specific schema. Each task and each database requires training from scratch.
  2. Expensive iteration. Even with the streamlined RDL pipeline, each task requires ~30 minutes of training and ~56 lines of setup code. Better than 12 hours of manual feature engineering, but still too slow when you have hundreds of tasks to explore.
  3. No in-context learning. GNNs have no mechanism to adapt to a new task at inference time. They learn a fixed mapping during training. You cannot point a trained GNN at a new question and get a useful answer.
  4. Limited explainability. Standard GNN architectures provide limited insight into why they made a specific prediction, which matters for regulated industries.

The key insight from GNNs is that the graph representation is correct. Relational data should be modeled as a graph, not flattened. What's missing is a way to learn universal patterns across many graphs so the model can generalize without retraining.

04

Why Not Build Your Own?

If the approach is sound, why not train your own foundation model on your data? Or use an existing LLM? Or use one of the emerging tabular foundation models? Each has been tried. Each runs into fundamental limitations.

Using LLMs on serialized tables

One approach is to serialize relational data into text (JSON, CSV, or natural language descriptions) and feed it into an LLM. The paper evaluates this directly using Llama 3.2 3B, and the results are clear: the LLM baseline averages 68.06 AUROC on classification tasks. Better than LightGBM (62.44), but significantly worse than KumoRFM zero-shot (76.71).

The reasons are structural. LLMs were pre-trained to predict the next token, not to minimize prediction error on structured data. They struggle with numerical patterns, suffer from context window limitations when processing large relational subgraphs, and are prone to hallucination on factual data. The training objective itself is mismatched: next-token prediction is fundamentally different from forecasting a numerical target or ranking items.

Tabular foundation models

Recent work on tabular foundation models (models pre-trained on collections of flat tables) addresses single-table prediction. These models learn column-wise and row-wise relationships within a single table. But they face critical limitations:

  • They are restricted to small-scale datasets due to context length and feature count constraints
  • They rely on complex input normalization and feature shuffling
  • Most importantly, they are confined to single, flat tables. Multiple data tables still need to be joined and flattened via manual feature engineering

In other words, tabular foundation models inherit the same flattening problem that limits XGBoost. They operate on the output of feature engineering, not on the raw relational data.

Training from scratch

Building a foundation model for relational data from scratch requires solving several hard problems simultaneously: a schema-agnostic encoder that handles arbitrary table structures, a graph-based architecture that scales across different relationship topologies, a training procedure that spans diverse database domains, and an in-context learning mechanism adapted to structured (not textual) data.

KumoRFM was pre-trained on a large and diverse mix of real-world databases and synthetic relational data, spanning e-commerce, social, medical, financial, and other domains. This breadth is what allows the model to learn universal relational patterns (recency, frequency, temporal dynamics, graph topology) rather than domain-specific shortcuts. Replicating this pre-training (the data curation, the architecture design, the training infrastructure) represents years of research and engineering.

05

The Foundation Model Insight

The reason LLMs generalize is pre-training at scale. By training on diverse text, the model learns patterns that recur across domains: grammar, logic, reasoning structures. At inference time, in-context examples in the prompt tell the model what specific task to perform.

The KumoRFM paper observes that relational databases share analogous universal patterns:

  • Recency, frequency, monetary (RFM) patterns. Recent and frequent interactions predict future behavior across every domain (e-commerce, finance, healthcare, social)
  • Temporal dynamics. Seasonality, trends, and decay patterns appear in timestamps and event sequences regardless of the specific schema
  • Graph topology. Hub-and-spoke structures, clusters, bridges between communities carry predictive signal across fraud detection, social networks, and supply chains
  • Cross-table propagation. An entity's behavior is influenced by connected entities, which are influenced by their connections. This multi-hop influence pattern is domain-agnostic

If you pre-train a model on a large and diverse collection of relational databases, it should learn these recurring patterns. Then, given a new database it has never seen, it should be able to leverage in-context examples (labeled historical data from that database) to make accurate predictions without any retraining.

This is exactly the hypothesis the paper tests, and the benchmark results confirm it.

06

What Is KumoRFM?

KumoRFM (Kumo Relational Foundation Model) is a foundation model pre-trained on a mix of publicly available real-world databases and synthetic relational data. No private enterprise data was used during pre-training.

It combines three components: a schema-agnostic row encoder that handles any table structure, a relational graph transformer that models cross-table relationships, and an in-context learning module that adapts to new tasks at inference time.

Key properties

  1. Schema-agnostic. Works on any relational database schema: any number of tables, columns, and relationship types (one-to-many, many-to-many). No schema-specific configuration needed.
  2. Multi-modal columns. Handles numerical, categorical, timestamp, text, embedding, and even hashed/anonymized identifier columns natively.
  3. In-context learning. Generalizes to entirely new databases and tasks without retraining. Reads labeled examples from your data at prediction time and adapts.
  4. Multi-task. Binary classification, multi-class, multi-label, regression, and link prediction (recommendations) from the same model weights.
  5. Explainable. Global (dataset-level) and local (entity-level) explanations for every prediction.
  6. Fine-tunable. Zero-shot predictions out of the box; fine-tune on your data for 10-30% additional accuracy.

Comparison with existing approaches

XGBoost / LightGBM

Flat tables only

  • + Fast on pre-engineered features
  • + Mature, well-understood
  • − Requires flat feature tables
  • − One model per task
  • − Manual feature engineering
  • − Loses relational structure

GNNs (GraphSAGE, RDL)

Right structure, no transfer

  • + Models relational graph directly
  • + Captures multi-hop patterns
  • − One model per task and database
  • − Schema-specific architecture
  • − ~30 min training per task
  • − No in-context generalization

KumoRFM

Foundation model

  • + Any schema, any task, zero-shot
  • + Pre-trained relational patterns
  • + In-context learning at inference
  • + Built-in explainability
  • + Fine-tune for production accuracy

07

Predictive Query Language (PQL)

PQL is a declarative language for specifying prediction tasks on relational databases. Where SQL retrieves and manipulates existing data, PQL describes what to predict about the future. The model handles everything else: subgraph sampling, context generation, feature extraction, and inference.

Structure

  • PREDICT: the target, an aggregation over a column within a future time window
  • FOR EACH: the entity to predict for
  • WHERE (optional): filters on the entity set

Supported aggregations include SUM, COUNT, AVG, MAX, MIN, FIRST, and LIST_DISTINCT. Time windows are specified in hours, days, or months.

Mapping from PQL to task types

The aggregation scheme and the semantic type of the target column uniquely determine the underlying ML task. The model automatically infers the correct task type, loss function, and evaluation metric from the PQL query:

  • First categorical value → (a) node multi-class / multi-label classification
    PREDICT FIRST(orders.type, 0, 7)
    FOR EACH users.user_id IN (0, 1, 2)
  • Logical value → (b) node binary classification
    PREDICT COUNT(orders.*, 0, 7) > 0
    FOR EACH users.user_id IN (0, 1, 2)
  • Sum of numerical values → (c) node regression
    PREDICT SUM(orders.value, 0, 7)
    FOR EACH users.user_id IN (0, 1, 2)
  • Set of foreign keys → (d) link prediction
    PREDICT LIST_DISTINCT(orders.item_id, 0, 7)
    FOR EACH users.user_id IN (0, 1, 2)

Mapping from predictive queries to task types. The aggregation scheme and semantic type of the target column uniquely determine the ML task.

Why PQL matters

Three design properties of PQL enable robust foundation model predictions:

  1. Label independence. The label definition is independent of the entity and time, which allows the system to generate additional labeled input data for context, even for entities not specified in the query.
  2. Temporal safety. The language is intentionally restrictive around time manipulation, ensuring that past labels can be safely computed for any timestamp with unambiguous forward- and backward-looking time frames. This prevents data leakage.
  3. Automatic task typing. The syntax and graph specification alone define the task type, allowing the system to automatically select the correct model head for downstream processing.

Each PQL query is parsed into an abstract syntax tree, validated, then executed: the system samples subgraphs, generates in-context labels, runs inference, and returns predictions. No feature engineering, no model selection, no training loop.
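
The automatic task-typing rule can be sketched as a small dispatch function. The function and type names below are ours, chosen for illustration (this is not Kumo's actual API); it simply mirrors the (a)-(d) mapping above:

```python
# Illustrative sketch of the automatic task-typing rule; function and type
# names are ours, not Kumo's API. Mirrors the (a)-(d) examples above.
def infer_task_type(aggregation, target_semantic_type, is_boolean_expr=False):
    if is_boolean_expr:  # e.g. PREDICT COUNT(orders.*, 0, 7) > 0
        return "binary_classification"
    if aggregation == "LIST_DISTINCT" and target_semantic_type == "foreign_key":
        return "link_prediction"
    if aggregation == "FIRST" and target_semantic_type == "categorical":
        return "multiclass_classification"
    if aggregation in {"SUM", "COUNT", "AVG", "MAX", "MIN"} \
            and target_semantic_type == "numerical":
        return "regression"
    raise ValueError(f"unsupported combination: {aggregation}/{target_semantic_type}")
```

Because the mapping is total over the legal query space, the system can pick the loss function and model head without any user input.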

08

Architecture

KumoRFM's architecture has three stages that transform a raw relational database and a PQL query into a prediction.

  1. Row Encoder. Encodes any table row into a dense vector, regardless of schema or column types.
  2. Graph Transformer. Builds a temporal heterogeneous graph and performs attention-based message passing across tables.
  3. In-Context Learning. Stacks context representations with ground-truth labels alongside test representations.
  4. Prediction. Outputs classification probabilities, regression values, or ranked recommendation lists.

Stage 1: Table-invariant row encoder

Each row from any table is encoded into a fixed-dimensional vector. Columns are processed based on their semantic type: numerical values (e.g., price, age) are normalized and projected; categorical values (e.g., gender, genre) are embedded; multi-categorical values (e.g., movie genres) are embedded and pooled; timestamps (e.g., order dates) receive temporal encodings capturing periodicity and relative time; text columns (e.g., product descriptions) are encoded via a language model; and embeddings (e.g., custom upstream embeddings) are projected directly.

The encoder uses a Transformer over the two-dimensional cell grid of each table, which makes it agnostic to both table width (number of columns) and table size (number of rows). This is how KumoRFM handles arbitrary schemas without configuration.
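
As a toy illustration of the per-type dispatch (the real encoder uses learned projections and embeddings, not these hand-written stand-ins), a cell encoder might look like:

```python
import math

DIM = 4  # toy width; the real encoder projects into a much larger hidden size

def encode_cell(value, semantic_type):
    """Toy per-cell encoder: dispatch on semantic type, emit a DIM-vector.
    Hand-written stand-ins for the learned projections described above."""
    if semantic_type == "numerical":
        x = math.log1p(abs(value))            # crude normalization
        return [x] * DIM
    if semantic_type == "categorical":
        h = (hash(value) % 1000) / 1000.0     # stand-in for an embedding lookup
        return [h] * DIM
    if semantic_type == "timestamp":
        angle = 2 * math.pi * value.weekday() / 7   # day-of-week periodicity
        return [math.sin(angle), math.cos(angle), 0.0, 0.0]
    raise ValueError(semantic_type)

def encode_row(row, schema):
    """A row of any width becomes a sequence of cell vectors, which is what
    lets a Transformer over the cell grid ignore the number of columns."""
    return [encode_cell(row[col], t) for col, t in schema.items()]

row = {"price": 19.99, "genre": "jazz"}
schema = {"price": "numerical", "genre": "categorical"}
encoded = encode_row(row, schema)
```

The point of the sketch: every column type reduces to a fixed-width vector, so downstream attention never needs to know the schema.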

Stage 2: Relational graph transformer

The database is represented as a temporal heterogeneous graph: rows are nodes, primary-foreign key relationships are edges. A Relational Graph Transformer performs self-attention over this graph, including dynamically attached context tables.

Four types of positional encoding capture the full relational structure:

  • Node type encoder: encodes the table type (user vs. order vs. product) for each node
  • Hop encoder: captures the structural proximity between the entity node and other nodes in the subgraph
  • Time encoder: encodes the relative time of events with respect to the prediction timestamp
  • Subgraph encoder: captures local fine-grained graph structure, including parent-child relationships and structural patterns like cycles

This is where KumoRFM differs from standard GNNs. A GNN uses fixed message-passing rules. The graph transformer uses learned attention patterns that adapt based on node type, hop distance, temporal context, and local topology. The positional encodings make it possible to apply the same transformer weights to any schema.
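
The hop encoder's input is just structural distance in the sampled subgraph. A minimal sketch of that computation, with a toy undirected adjacency (node names are illustrative):

```python
from collections import deque

def hop_distances(edges, seed):
    """BFS distance from the seed entity to every reachable row-node: the
    kind of structural proximity a hop encoder feeds into the transformer.
    Toy undirected adjacency; illustrative only."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist, queue = {seed: 0}, deque([seed])
    while queue:
        n = queue.popleft()
        for m in adj.get(n, ()):
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist

# user -> order -> product: the product row sits 2 hops from the seed user
edges = [("user:1", "order:7"), ("order:7", "product:9")]
hops = hop_distances(edges, "user:1")
```

Each node's hop count is then embedded and added to its representation, alongside the node-type, time, and subgraph encodings.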

Stage 3: In-context learning module

At prediction time, the model generates context automatically. It samples historical entities from the database, retrieves their ground-truth labels for the specified PQL task using a forward-looking graph sampler, and encodes them as context representations. These context-label pairs are stacked alongside the test entities and processed by a transformer-based in-context learning (ICL) head.

This is the same principle as few-shot prompting in LLMs: the model sees examples of the task (entities with known outcomes) and uses them to calibrate predictions for new entities. The key innovation is that the context is generated online. The system dynamically constructs a training table, groups historical labels, and attaches it to the entity table via primary key-foreign key connections. This allows it to model both temporal proximity (e.g., the seed user's own monthly transactions over the past year) and relational proximity (e.g., transactions of nearby users).

On average, the system generates approximately 2 million in-context labels in under 1 second.
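
The temporal-safety logic behind this context generation can be sketched in a few lines (toy event log, illustrative function names): labels look strictly forward from an anchor timestamp, inputs look strictly backward, so no label information can leak into the features.

```python
from datetime import date, timedelta

# Toy event log: (user_id, event_date). Illustrative only.
events = [(1, date(2025, 3, 5)), (1, date(2025, 4, 10)), (2, date(2025, 2, 1))]

def context_label(user, anchor, horizon_days, log):
    """Forward-looking label at a past anchor time: did the user act within
    (anchor, anchor + horizon]? Mirrors PREDICT COUNT(orders.*, 0, h) > 0."""
    end = anchor + timedelta(days=horizon_days)
    return any(u == user and anchor < d <= end for u, d in log)

def context_inputs(user, anchor, log):
    """Backward-looking inputs: only events strictly before the anchor are
    visible to the model, which is what rules out label leakage."""
    return [d for u, d in log if u == user and d < anchor]

anchor = date(2025, 4, 1)
label = context_label(1, anchor, 30, events)   # order on Apr 10 -> positive
inputs = context_inputs(1, anchor, events)     # only the Mar 5 order is visible
```

Because the anchor can be any past timestamp, every historical entity yields a free labeled example, which is how millions of in-context labels become available without a training table being hand-built.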

Link prediction

For recommendation tasks (link prediction), KumoRFM uses fully-inductive pair-wise representations: both user and item representations are read out from the user-centric subgraph via the Relational Graph Transformer. Item representations are uniformly sampled to a fixed context size according to sampling depth, so the model learns signals from items across a diverse range of hops (repeated purchases, collaborative patterns).

09

Benchmark Results

KumoRFM was evaluated on RelBench, a benchmark covering 7 relational databases, 30 tasks, and over 103 million rows across 51 tables. The databases span e-commerce, social, medical, and sports domains, with significantly varying scales:

RelBench dataset statistics. Databases vary in scale from 74K to 41M rows.
| Dataset | Domain | #Tasks | #Tables | #Rows | #Columns |
| --- | --- | --- | --- | --- | --- |
| rel-amazon | E-commerce | 7 | 3 | 15,000,713 | 15 |
| rel-avito | E-commerce | 4 | 8 | 20,679,117 | 42 |
| rel-event | Social | 3 | 5 | 41,328,337 | 128 |
| rel-f1 | Sports | 3 | 9 | 74,063 | 67 |
| rel-hm | E-commerce | 3 | 3 | 16,664,809 | 37 |
| rel-stack | Social | 5 | 7 | 4,247,264 | 52 |
| rel-trial | Medical | 5 | 15 | 5,434,924 | 140 |
| Total | | 30 | 51 | 103,466,370 | 489 |

KumoRFM was never trained on any RelBench dataset. All results below are true zero-shot. Baselines include LightGBM (manual feature engineering), a human data scientist, supervised RDL (GNN), and Llama 3.2 3B (LLM on serialized data).

Entity classification (12 tasks)

AUROC scores (higher is better). KumoRFM zero-shot outperforms all supervised baselines on average.
| Dataset | Task | LightGBM | Data Sci. | RDL | LLM | KumoRFM (0-shot) | KumoRFM (tuned) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rel-amazon | user-churn | 52.22 | 67.60 | 70.42 | 62.55 | 67.29 | 70.47 |
| rel-amazon | item-churn | 62.54 | 81.80 | 82.81 | 73.41 | 79.93 | 82.83 |
| rel-avito | user-visits | 53.05 | – | 66.20 | 53.36 | 64.85 | 78.30 |
| rel-avito | user-clicks | 53.60 | – | 65.90 | 54.07 | 64.11 | 66.83 |
| rel-event | user-repeat | 53.05 | – | 76.89 | 53.36 | 76.08 | 80.64 |
| rel-event | user-ignore | 79.93 | – | 81.62 | 68.65 | 89.20 | 89.43 |
| rel-f1 | driver-dnf | 68.86 | 69.80 | 72.62 | 80.03 | 82.41 | 82.63 |
| rel-f1 | driver-top3 | 73.93 | 82.40 | 75.54 | 87.11 | 91.07 | 99.62 |
| rel-hm | user-churn | 55.21 | 69.00 | 69.88 | 63.81 | 67.71 | 71.23 |
| rel-stack | user-engage. | 63.39 | 90.30 | 90.59 | 81.23 | 87.09 | 90.70 |
| rel-stack | user-badge | 63.43 | 86.20 | 88.86 | 79.99 | 80.00 | 89.86 |
| rel-trial | study-outcome | 70.09 | 72.00 | 68.60 | 59.17 | 70.79 | 71.16 |
| Average | | 62.44 | – | 75.83 | 68.06 | 76.71 | 81.14 |

KumoRFM zero-shot (with no training on any of these datasets) averages 76.71 AUROC, outperforming LightGBM (62.44, with manual features), the LLM baseline (68.06), and even supervised RDL (75.83, which trains a GNN per task). Fine-tuning pushes the average to 81.14.

Recommendation tasks (9 tasks)

MAP@k scores (higher is better). Fine-tuned KumoRFM achieves state-of-the-art on all 9 tasks.
| Dataset | Task | LightGBM | GraphSAGE | NBFNet | KumoRFM (0-shot) | KumoRFM (tuned) |
| --- | --- | --- | --- | --- | --- | --- |
| rel-amazon | user-item-purchase | 0.16 | 0.74 | 0.10 | 1.72 | 2.93 |
| rel-amazon | user-item-rate | 0.17 | 0.87 | 0.12 | 1.14 | 2.25 |
| rel-amazon | user-item-review | 0.09 | 0.47 | 0.09 | 0.22 | 1.63 |
| rel-avito | user-ad-visit | 0.06 | 0.02 | 3.66 | 4.02 | 4.17 |
| rel-hm | user-item-purchase | 0.38 | 0.80 | 2.81 | 2.73 | 3.14 |
| rel-stack | user-post-comment | 0.04 | 0.11 | 12.72 | 11.83 | 13.34 |
| rel-stack | post-post-related | 2.00 | 0.07 | 10.83 | 11.80 | 12.21 |
| rel-trial | cond.-sponsor-run | 4.82 | 2.89 | 11.36 | 11.29 | 11.65 |
| rel-trial | site-sponsor-run | 8.40 | 10.70 | 19.00 | 20.83 | 28.02 |
| Average | | 1.79 | 1.85 | 6.74 | 7.29 | 8.82 |

Entity regression (9 tasks)

MAE scores (lower is better). Data Scientist baseline wins 3 of 9 tasks, exclusively on high-MAE tasks where fine-grained graph reasoning offers limited value.
| Dataset | Task | LightGBM | Data Sci. | RDL | KumoRFM (0-shot) | KumoRFM (tuned) |
| --- | --- | --- | --- | --- | --- | --- |
| rel-amazon | user-ltv | 16.783 | 13.928 | 14.313 | 16.161 | 14.226 |
| rel-amazon | item-ltv | 60.569 | 41.122 | 50.053 | 55.254 | 48.670 |
| rel-avito | ad-ctr | 0.041 | – | 0.041 | 0.035 | 0.034 |
| rel-event | user-attend. | 0.264 | – | 0.258 | 0.264 | 0.238 |
| rel-f1 | driver-position | 4.170 | 3.963 | 4.022 | 2.747 | 2.731 |
| rel-hm | item-sales | 0.076 | 0.036 | 0.056 | 0.040 | 0.034 |
| rel-stack | post-votes | 0.068 | 0.065 | 0.065 | 0.065 | 0.065 |
| rel-trial | study-adverse | 44.011 | 40.581 | 44.473 | 58.231 | 44.225 |
| rel-trial | site-success | 0.425 | 0.407 | 0.400 | 0.417 | 0.301 |

Regression is where the results are most nuanced. KumoRFM in-context mode gets a 1.6% relative gain over the RDL baseline on average. But the Data Scientist baseline performs best on 3 of 9 tasks, exclusively on tasks with high MAE where fine-grained graph reasoning offers limited additional value. Once fine-tuned, KumoRFM outperforms all baselines on 5 of 9 tasks. This is the honest picture: KumoRFM's zero-shot mode is strongest on classification and recommendation, and fine-tuning closes the gap on regression.

Time to first prediction

  • KumoRFM zero-shot: ~1 second, 1 line of PQL
  • RDL (GNN pipeline): ~30 minutes, 56 lines of code
  • Manual data scientist: ~12.3 hours, 878 lines of code

10

Explainability

KumoRFM provides two levels of explanation for every prediction.

Global explanations

At the dataset level, the model organizes column-level context data into cohorts and links their distributions to ground-truth labels. For example: “Across all users, how does order frequency correlate with churn?” or “How do purchased product categories influence lifetime value predictions?” The framework accommodates column-level data across all tables by leveraging weighted cohorts in adjacent tables. To quantify importance, it computes the variance of model predictions across cohorts. Higher variance suggests greater relevance.

Local explanations

At the entity level, gradient-based saliency methods compute importance scores for individual cells within the input subgraph. A key novelty is the adaptation to multi-modal, cell-level inputs: instead of assigning scores at the feature level, scores are computed per cell using specialized aggregation routines tailored to each semantic type. This provides actionable insights at the level of individual data points. For example, which product categories of past purchases most strongly influenced a recommendation, or which transaction patterns triggered a fraud risk indicator.

Prediction accuracy evaluation

Beyond explanations, KumoRFM supports quantitative prediction accuracy evaluation. For temporal predictions, it evaluates performance using recent historical snapshots where ground-truth labels are known. It reports both performance-oriented metrics (AUROC, AP, MAE, MAPE, MAP@k) and behavioral metrics that capture qualities like diversity and popularity bias. This lets users assess prediction quality before deploying to production.
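
As a concrete (if toy) illustration of the backtest idea, here is a minimal AUROC scored over a historical snapshot where labels are known. The function is a generic sketch of the metric, not Kumo's evaluation code:

```python
def auroc(labels, scores):
    """Minimal AUROC: the probability that a random positive outscores a
    random negative (ties count half). Generic sketch for scoring a
    historical snapshot backtest; not Kumo's evaluation code."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Running this over a recent snapshot (predict at time t, compare against outcomes observed after t) gives a deployment-free estimate of how the model will perform on the live task.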

Example: churn prediction

PREDICT COUNT(orders.*, 0, 30) > 0
FOR EACH users.user_id = 1

For this user, the strongest signals are order count (few past orders correlates with low purchase likelihood), order date recency (recent activity increases the probability), and club membership status (active members show higher purchase rates). Each factor has a quantified importance score.

Example: recommendations

PREDICT LIST_DISTINCT(orders.item_id, 0, 7)
FOR EACH users.user_id = 2

The recommended items are driven primarily by recent browsing history (jackets and t-shirts the user viewed, importance >95%) and past order patterns (repeat purchase behavior in similar categories).

11

Zero-Shot to Production

KumoRFM operates in two modes:

In-context (zero-shot)

Point it at any database, write a PQL query, get predictions in seconds. No training, no configuration. This mode is useful for rapid hypothesis testing, data exploration, and evaluating whether a prediction task is viable before committing engineering resources.

Fine-tuned

For production deployment, fine-tuning specializes the model on a single dataset and a single task. The process replaces the table-invariant encoders with dataset-specific ones and substitutes the ICL head with a task-specific head (e.g., a link prediction head for recommendation tasks). The model is then trained in a supervised fashion on a pre-generated training table.

Fine-tuning retains the pre-trained relational representations while optimizing for your specific data, yielding 10-30% additional accuracy over zero-shot. It also enables efficient deployment at scale: instead of attending to all in-context examples at inference time, the fine-tuned model runs predictions in a single forward pass, scaling to billions of predictions.

Availability

  • KumoRFM (zero-shot) is available at kumorfm.ai
  • KumoRFM Fine-Tune is available at kumo.ai/try
  • Native deployment on Snowflake (Snowpark Container Services) and Databricks (Lakehouse App)

Try KumoRFM on your own data

Zero-shot predictions are free. Fine-tuning is available with a trial.