You have terabytes of data in Snowflake. Customer transactions, product interactions, support tickets, account hierarchies. You know there are predictive signals buried in those tables. You want ML predictions: who will churn, which leads will convert, where is the fraud, what will demand look like next quarter.
The traditional approach is to extract data from Snowflake, move it to an ML platform, build features, train models, and push predictions back. This works, but it creates a cascade of problems that get worse at scale.
The headline result: SAP SALT benchmark
The SAP SALT benchmark is an enterprise-grade evaluation in which approaches ranging from expert data scientists to automated systems attempt prediction tasks on real SAP enterprise data. It measures how accurately each approach predicts real business outcomes on production-quality databases with multiple related tables.
SAP SALT enterprise benchmark results
| Approach | Accuracy | What it means |
|---|---|---|
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |
SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points on real enterprise prediction tasks.
KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.
The data gravity problem
Data gravity is a simple concept: as data accumulates in one place, it becomes increasingly expensive and risky to move it somewhere else. Your Snowflake warehouse is not just storage. It is a governed environment with access controls, audit trails, encryption at rest, role-based permissions, and compliance certifications.
Every time you move data out of Snowflake for ML, you break that governance chain:
- Security risk. Data in transit to external ML platforms creates new attack surfaces. Data at rest in a second system doubles your compliance burden. If the ML platform stores customer PII, you now have two systems to audit, two systems to patch, and two systems to include in your SOC 2 report.
- Staleness. ETL pipelines run on schedules. If your pipeline runs nightly, your predictions are always based on yesterday's data. For fraud detection and real-time personalization, a 24-hour delay can be the difference between catching a fraudulent transaction and missing it.
- Latency. Moving terabytes of data takes time. Even with optimized connectors, extracting a large Snowflake dataset to S3 and loading it into a training platform adds hours to every iteration cycle.
- Governance gaps. Data lineage breaks when data leaves Snowflake. Your data catalog tracks the Snowflake tables, but does it track the transformed features in the ML platform? Column-level access controls in Snowflake do not propagate to external tools.
Every option for ML predictions on Snowflake data
There are seven realistic approaches for running ML predictions on Snowflake data. They differ dramatically in whether data moves, how much engineering is required, and what types of predictions they support.
ML on Snowflake: options compared
| Option | Runs Natively in Snowflake | Data Movement Required | Feature Engineering Required | Multi-Table | Time to First Prediction | Best For |
|---|---|---|---|---|---|---|
| Kumo.ai (Snowpark Container Services) | Yes | None | None | Yes (reads relational tables directly) | Minutes | Multi-table relational predictions (churn, fraud, conversion, demand) |
| Snowflake ML Functions | Yes | None | Minimal | No (single table only) | Minutes | Single-table time series forecasting and anomaly detection |
| Snowflake Cortex | Yes | None | None | No | Minutes | Text analysis, sentiment, summarization, LLM-based tasks |
| DataRobot on Snowflake | Partial (connector) | Some (to DataRobot) | Yes (flat table required) | No | Days to weeks | AutoML on pre-engineered feature tables |
| H2O on Snowflake | Partial (Snowpark) | Some (to H2O runtime) | Yes (feature engineering manual) | No | Days to weeks | AutoML with Snowpark integration |
| Custom (Python + Snowpark) | Yes (Snowpark) | Minimal | Yes (fully manual) | Yes (if you build it) | Weeks to months | Full flexibility, custom models, unique requirements |
| dbt + ML | Partial | Yes (model training external) | Yes (dbt transforms) | Partial (dbt joins) | Weeks to months | Teams already using dbt for data transformation |
Highlighted: Kumo is the only option that runs natively in Snowflake, requires no data movement, requires no feature engineering, and handles multi-table relational data. Snowflake's native options (ML Functions and Cortex) are excellent but limited to single-table time series and text analysis respectively.
Option 1: Kumo.ai on Snowpark Container Services
Kumo runs as a container inside Snowpark Container Services. This is not a connector or API integration. Kumo's compute runs within your Snowflake account, governed by your Snowflake access controls and network policies. Your data never leaves your Snowflake perimeter.
How it works
Kumo reads your Snowflake tables directly using its Predictive Query Language (PQL). It builds an in-memory graph representation of your relational schema, where each table becomes a set of nodes and each foreign key relationship becomes a set of edges. The KumoRFM foundation model, pre-trained on thousands of diverse relational databases, processes this graph and generates predictions without any feature engineering or model training.
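The graph construction step can be pictured with a small sketch. This is plain Python with illustrative table and column names, not Kumo's actual internals: each row becomes a node keyed by (table, primary key), and each foreign-key reference becomes an edge.

```python
# Toy relational data (illustrative rows, not a real schema).
customers = [{"customer_id": "C-1"}, {"customer_id": "C-2"}]
orders = [
    {"order_id": "O-1", "customer_id": "C-1"},
    {"order_id": "O-2", "customer_id": "C-1"},
    {"order_id": "O-3", "customer_id": "C-2"},
]

# Each row becomes a node, keyed by (table, primary key).
nodes = [("customers", c["customer_id"]) for c in customers]
nodes += [("orders", o["order_id"]) for o in orders]

# Each foreign-key reference becomes an edge between two nodes.
edges = [(("orders", o["order_id"]), ("customers", o["customer_id"]))
         for o in orders]

print(len(nodes), len(edges))  # 5 nodes, 3 edges
```

The point of the graph view is that multi-hop patterns (customer to orders to products) stay reachable as paths, rather than being destroyed by flattening into one wide table.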
PQL Query
```
PREDICT churn_30d
FOR EACH customers.customer_id
USING
  TABLE snowflake_db.customers
  TABLE snowflake_db.orders
  TABLE snowflake_db.products
  TABLE snowflake_db.support_tickets
```
This PQL query reads four Snowflake tables directly and predicts 30-day churn for every customer. No feature engineering. No data movement. No model training. Kumo discovers multi-table patterns (purchase frequency, product return rates, support ticket escalations) automatically from the raw relational data.
Output
| customer_id | churn_30d_prob | key_signals | data_sources_used |
|---|---|---|---|
| C-4401 | 0.87 | 3 support escalations, declining order frequency | customers, orders, support_tickets |
| C-4402 | 0.12 | Increasing AOV, new product categories | customers, orders, products |
| C-4403 | 0.64 | High-return products, single category | customers, orders, products |
| C-4404 | 0.03 | Multi-category buyer, zero tickets | All 4 tables |
Predictions are written back to a Snowflake table. Downstream consumers (dashboards, reverse ETL, applications) read them from Snowflake just like any other table. There is no new integration to build.
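Because the output is an ordinary table, downstream consumption is ordinary code. A minimal sketch in plain Python, using the sample rows from the output table above (in practice you would query the Snowflake table directly; the 0.5 threshold is illustrative):

```python
# Sample rows matching the prediction output above; in practice these
# would be read from the Snowflake table Kumo writes back to.
predictions = [
    {"customer_id": "C-4401", "churn_30d_prob": 0.87},
    {"customer_id": "C-4402", "churn_30d_prob": 0.12},
    {"customer_id": "C-4403", "churn_30d_prob": 0.64},
    {"customer_id": "C-4404", "churn_30d_prob": 0.03},
]

# Flag customers above an illustrative risk threshold for outreach.
THRESHOLD = 0.5
at_risk = [p["customer_id"] for p in predictions
           if p["churn_30d_prob"] >= THRESHOLD]
print(at_risk)  # ['C-4401', 'C-4403']
```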
Snowflake Intelligence integration
KumoRFM also powers predictions inside Snowflake Intelligence, Snowflake's AI-powered analytics layer. This means business users who work inside Snowflake can access foundation-model-quality predictions without ever building an ML pipeline. The predictions are generated by the same KumoRFM model, surfaced through Snowflake's native interface.
Option 2: Snowflake ML Functions
Snowflake provides built-in ML functions for forecasting and anomaly detection. These are SQL-callable functions that run entirely inside Snowflake with zero data movement.
The limitation is scope. Snowflake ML Functions operate on a single table with a time series structure (entity, timestamp, value). They are excellent for straightforward forecasting (predict next month's revenue per product) and anomaly detection (flag unusual transaction volumes). They cannot handle multi-table relational predictions, classification tasks, or problems that require cross-table pattern discovery.
Option 3: Snowflake Cortex
Snowflake Cortex provides LLM-based functions for text data: sentiment analysis, summarization, classification, and extraction. It runs inside Snowflake and is callable via SQL.
Cortex is not designed for structured data prediction. If you need to analyze customer support ticket text or classify product reviews, Cortex is excellent. If you need to predict churn based on transaction patterns across relational tables, Cortex is not the right tool.
Option 4: DataRobot on Snowflake
DataRobot offers a Snowflake integration that reads data via connector and runs AutoML on it. The integration has improved significantly, but the fundamental architecture still requires a flat feature table as input.
This means someone on your team still needs to join your Snowflake tables into a single flat table, compute aggregations, encode categorical variables, and handle time windows. DataRobot automates model selection and tuning (the last 20% of the pipeline), but the feature engineering (the first 80%) remains manual. Data also moves to DataRobot's compute environment for training, which means it leaves your Snowflake perimeter.
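What that manual flattening looks like in miniature, as a plain-Python sketch with made-up rows (real pipelines do this in SQL or Spark over hundreds of columns and time windows):

```python
from collections import defaultdict

# Toy source tables (illustrative rows).
customers = [{"customer_id": "C-1", "plan": "pro"},
             {"customer_id": "C-2", "plan": "basic"}]
orders = [{"customer_id": "C-1", "amount": 120.0},
          {"customer_id": "C-1", "amount": 80.0},
          {"customer_id": "C-2", "amount": 30.0}]

# Step 1: aggregate the child table per customer.
agg = defaultdict(lambda: {"order_count": 0, "total_spend": 0.0})
for o in orders:
    agg[o["customer_id"]]["order_count"] += 1
    agg[o["customer_id"]]["total_spend"] += o["amount"]

# Step 2: join aggregates onto the parent table and encode categoricals.
# This hand-built flat table is what AutoML platforms expect as input.
flat = []
for c in customers:
    row = {"customer_id": c["customer_id"],
           "plan_is_pro": int(c["plan"] == "pro"),
           **agg[c["customer_id"]]}
    flat.append(row)

print(flat[0])
# {'customer_id': 'C-1', 'plan_is_pro': 1, 'order_count': 2, 'total_spend': 200.0}
```

Every new prediction task means repeating this join-aggregate-encode work against a different target, which is where the weeks of lead time come from.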
Option 5: H2O on Snowflake
H2O provides AutoML capabilities with Snowpark integration. Like DataRobot, H2O requires a pre-engineered feature table. The Snowpark integration means some processing can happen inside Snowflake, but model training typically runs on H2O's compute.
The same feature engineering bottleneck applies: someone needs to build the flat table before H2O can select and tune models on it. For a 5-table schema, that is still 12+ hours of data science work per prediction task.
Option 6: Custom Python + Snowpark
Snowpark allows you to run custom Python code inside Snowflake. You can write scikit-learn, XGBoost, or PyTorch code that executes on Snowflake's compute. This gives you full flexibility with minimal data movement.
The catch is that you are building everything from scratch: feature engineering, model architecture, training loops, evaluation, serving. This is the right choice if you have a dedicated ML team and unique requirements that no platform covers. It is the wrong choice if you want predictions in days rather than months.
Option 7: dbt + ML
Some teams use dbt to build feature engineering pipelines inside Snowflake, then export the feature table to an external ML platform for training. The features are computed natively in Snowflake (good for governance), but the model training happens externally (data movement required).
This approach is popular with teams already using dbt for data transformation. The downside is pipeline complexity: you now maintain a dbt project for features, an ML platform for training, a serving layer for inference, and an orchestrator to tie them together. Each prediction task adds more dbt models, more training jobs, and more things that can break.
Traditional: Move data out of Snowflake
- ETL pipeline extracts data to S3 or external platform
- Data leaves your Snowflake governance perimeter
- Feature engineering takes 12+ hours per task
- Predictions based on stale data (ETL lag)
- Two systems to audit, secure, and maintain
Native: Run ML inside Snowflake
- Kumo reads Snowflake tables directly via Snowpark
- Data never leaves your Snowflake account
- Zero feature engineering (PQL interface)
- Predictions on current data (no ETL lag)
- One system for data, governance, and predictions
Why multi-table matters
The most important column in the comparison table above is "Multi-Table." Most valuable enterprise predictions depend on patterns that span multiple tables:
- Churn depends on transactions, support tickets, product usage, and account metadata. No single table contains enough signal.
- Fraud depends on transaction patterns, account history, merchant relationships, and device fingerprints across multiple linked tables.
- Conversion depends on marketing touches, content engagement, account firmographics, and sales activities stored across CRM tables.
- Demand forecasting depends on order history, product attributes, seasonal patterns, and promotional calendars spread across the schema.
Single-table tools (Snowflake ML Functions, Cortex, standard AutoML) miss these cross-table patterns entirely. They can only use signals that exist in one table. The predictive power locked in table relationships (multi-hop patterns such as a customer's churn risk depending on the return rates of the products they bought) is invisible to them.
AUROC (Area Under the Receiver Operating Characteristic curve) measures how well a model distinguishes between positive and negative outcomes. An AUROC of 50 means random guessing, 100 means perfect prediction. Moving from 65 to 77 AUROC means the model correctly ranks a true positive above a true negative 77% of the time instead of 65%.
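That pairwise-ranking interpretation of AUROC can be checked directly. A short sketch with toy scores (the function is standard, the scores are made up):

```python
def auroc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs where the positive outranks
    the negative; ties count as half. This is exactly the probabilistic
    interpretation of AUROC described above."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Toy model scores for churners (positives) and non-churners (negatives).
pos = [0.9, 0.7, 0.6]
neg = [0.8, 0.3, 0.2]
print(round(auroc(pos, neg) * 100))  # 78 (7 of 9 pairs ranked correctly)
```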
The value of multi-table signal
| Prediction Task | Tables Needed | Single-Table Accuracy | Multi-Table Accuracy | Accuracy Gain |
|---|---|---|---|---|
| Customer churn | Customers, Orders, Products, Support | ~65 AUROC | ~81 AUROC | +16 points |
| Fraud detection | Transactions, Accounts, Merchants, Devices | ~70 AUROC | ~88 AUROC | +18 points |
| Lead conversion | Leads, Activities, Accounts, Opportunities | ~62 AUROC | ~79 AUROC | +17 points |
| Demand forecasting | Orders, Products, Promotions, Seasonality | ~58 MAE | ~41 MAE | 29% lower error |
On the classification tasks, multi-table predictions outperform single-table predictions by 16-18 AUROC points; for demand forecasting, error drops 29%. The signals locked in table relationships are often more predictive than any single-table feature.
The architecture decision
Choosing an ML approach for Snowflake data comes down to three questions:
- Does your data need to stay in Snowflake? If security, governance, or compliance requires that data not leave your Snowflake account, your options narrow to Snowflake ML Functions, Snowflake Cortex, Kumo (Snowpark Container Services), and custom Snowpark code.
- Do your predictions span multiple tables? If yes, Snowflake ML Functions and Cortex are out. Your options are Kumo, custom Snowpark code, or an external platform with data movement.
- How fast do you need predictions? If you need a first prediction in minutes rather than weeks, Kumo and Snowflake's native functions are the only options. Custom code and AutoML platforms require feature engineering time.
For multi-table relational predictions on Snowflake data without data movement, Kumo via Snowpark Container Services is the only option that checks all three boxes simultaneously: native execution, multi-table support, and minutes to first prediction.
Decision matrix
| Requirement | Kumo (SPCS) | SF ML Functions | SF Cortex | DataRobot | H2O | Custom Snowpark | dbt + ML |
|---|---|---|---|---|---|---|---|
| Data stays in Snowflake | Yes | Yes | Yes | No | No | Yes | Partial |
| Multi-table relational | Yes | No | No | No | No | If built | Partial |
| Zero feature engineering | Yes | Mostly | Yes | No | No | No | No |
| Minutes to first prediction | Yes | Yes | Yes | No | No | No | No |
| Production-grade accuracy | Yes | Limited scope | Limited scope | Yes | Yes | Depends | Depends |
Highlighted: multi-table relational prediction is the key differentiator. Most Snowflake ML options are single-table only. Kumo is the only native option that handles the multi-table relational predictions that drive the highest business value.
What this means for your Snowflake investment
If you chose Snowflake as your data warehouse, you made a bet on centralized, governed data. Moving data out for ML undermines that bet. Every external ML platform you add creates a shadow copy of your data outside your governance perimeter.
Running ML natively inside Snowflake is not just a technical preference. It is the logical extension of the data gravity decision you already made. Your data is in Snowflake because you want it governed, secure, and centralized. Your ML predictions should be there too.
With Kumo on Snowpark Container Services, you get foundation-model-quality predictions on multi-table relational data, running inside your Snowflake account, accessible via PQL, with results written back to Snowflake tables. No data movement. No feature engineering. No new systems to secure.