If you are on Databricks, you already have the hardest part of data infrastructure figured out. Your data lands in Delta Lake. Unity Catalog governs access. Spark handles compute. Notebooks let your team explore and transform data.
But when it comes time to add predictive ML, the options multiply and the complexity returns. Do you use Databricks AutoML? Write custom models with MLflow? Try the new Genie Code agent? Bring in DataRobot or H2O? Build a Feature Store pipeline?
Each approach makes different trade-offs on the same fundamental question: who builds the features? The answer to that question determines whether your first prediction takes minutes or months.
The headline result: SAP SALT benchmark
The SAP SALT benchmark is an enterprise-grade evaluation in which business analysts and data scientists attempt prediction tasks on SAP enterprise data. It measures how accurately different approaches predict real business outcomes on production-quality databases with multiple related tables.
| approach | accuracy | what_it_means |
|---|---|---|
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |
SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points on real enterprise prediction tasks.
KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.
| Option | Reads Delta Tables | Feature Engineering Required | Multi-Table Native | Autonomous | Time to First Prediction | Best For |
|---|---|---|---|---|---|---|
| Kumo.ai (Lakehouse App) | Yes, natively via Unity Catalog | None | Yes | Yes | Minutes | Multi-table predictions at scale |
| Databricks AutoML | Single table only | Full (joins + aggregations) | No | Partial (model selection only) | Days to weeks | Single-table problems with existing features |
| Databricks Genie Code | Yes (generates code to read them) | AI-generated (still flat-table) | No | Workflow only | Hours to days | Accelerating notebook-based workflows |
| MLflow + custom code | Yes (manual Spark reads) | Full (manual pipelines) | No | No | Weeks to months | Full control with experienced ML team |
| DataRobot on Databricks | Via connector | Full (requires flat features) | No | Partial (model selection only) | Days to weeks | Enterprise AutoML with governance |
| H2O Sparkling Water | Via Spark integration | Full (manual pipelines) | No | Partial | Weeks | Spark-native distributed training |
| Feature Store + AutoML | Feature Store reads Delta | Full (most complex setup) | No | Partial | Weeks to months | Mature orgs with dedicated ML platform team |
Kumo.ai is the only option that reads multiple Delta tables natively and generates predictions without feature engineering. Every other approach requires building a flat feature table first.
The feature engineering divide
Every option in the table above falls into one of two categories: approaches that require you to build a flat feature table from your Delta tables, and approaches that read your relational Delta tables directly.
Six of the seven options require a flat feature table. That means someone on your team has to write the Spark SQL or PySpark to join your customers table with your orders table with your products table, compute aggregations like avg_order_value_last_90d and count_support_tickets_last_30d, handle temporal leakage, and produce one row per entity. This is the step that consumes 80% of the effort in every ML project.
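To make that manual step concrete, here is a minimal sketch using SQLite from the standard library in place of Spark SQL, with an invented two-table schema. In a real project this would be a PySpark job over Delta tables, but the shape of the work is the same: join, window to a cutoff date to avoid temporal leakage, aggregate to one row per entity.

```python
import sqlite3

# Toy stand-in for the Spark SQL feature job: invented schema, SQLite in
# place of Delta tables. One row per customer; aggregates are windowed to
# the 90 days before a cutoff date so no future data leaks into features.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT, status TEXT);
CREATE TABLE orders (customer_id TEXT, order_value REAL, order_date TEXT);
INSERT INTO customers VALUES ('C-1', 'active'), ('C-2', 'active');
INSERT INTO orders VALUES
  ('C-1', 120.0, '2026-01-10'),
  ('C-1',  80.0, '2026-02-20'),
  ('C-2',  40.0, '2025-11-02');  -- outside the 90-day window
""")

CUTOFF = "2026-03-01"  # prediction date: only history before this is usable
features = conn.execute("""
SELECT c.customer_id,
       AVG(o.order_value)  AS avg_order_value_last_90d,
       COUNT(o.order_date) AS order_count_last_90d
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id
 AND o.order_date >= DATE(?, '-90 days')
 AND o.order_date <  ?
GROUP BY c.customer_id
""", (CUTOFF, CUTOFF)).fetchall()

for row in features:
    print(row)  # ('C-1', 100.0, 2) then ('C-2', None, 0)
```

Multiply this pattern across a dozen tables and a few hundred candidate aggregations and you have the pipeline that consumes most of a project's schedule.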
Option 1: Kumo.ai as a Lakehouse App
Kumo.ai is available in the Databricks Marketplace as a Lakehouse App. The integration path is: install from marketplace, connect to Unity Catalog, write a PQL query, get predictions back as a Delta table.
What makes Kumo different from every other option is what happens under the hood. Kumo's relational foundation model reads your Delta tables as a temporal heterogeneous graph. Each row in each table becomes a node. Each foreign key becomes an edge. Timestamps are preserved. The model discovers predictive patterns across tables, time windows, and relationship hops without any feature engineering.
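Kumo has not published its internals, so the following is only an illustrative sketch of the general mapping with toy rows and invented column names: each row becomes a node keyed by its primary key, each foreign key value becomes an edge to the parent table's row, and timestamps stay attached as node attributes.

```python
from collections import defaultdict

# Illustrative only: how relational tables map onto a temporal
# heterogeneous graph. Nodes are keyed by (table, primary key); every
# foreign key value yields an edge; attributes (timestamps included)
# remain on the node rather than being flattened away.
tables = {
    "customers": [{"customer_id": "C-1", "signup_date": "2025-06-01"}],
    "orders": [
        {"order_id": "O-1", "customer_id": "C-1", "ts": "2026-01-10"},
        {"order_id": "O-2", "customer_id": "C-1", "ts": "2026-02-20"},
    ],
}
foreign_keys = {("orders", "customer_id"): "customers"}  # child column -> parent table
primary_keys = {"customers": "customer_id", "orders": "order_id"}

nodes, edges = {}, defaultdict(list)
for table, rows in tables.items():
    for row in rows:
        node_id = (table, row[primary_keys[table]])
        nodes[node_id] = row
        for (child, col), parent in foreign_keys.items():
            if table == child:
                edges[node_id].append((parent, row[col]))

print(len(nodes), "nodes")       # 3 nodes
print(edges[("orders", "O-1")])  # [('customers', 'C-1')]
```

Nothing in this construction requires a human to decide which aggregations matter; the relationships the flat-table approaches discard are exactly what the graph preserves.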
PQL Query
```
PREDICT churn FOR EACH unity_catalog.sales.customers.customer_id WHERE customers.status = 'active'
```
This PQL query reads directly from Delta tables registered in Unity Catalog. Kumo's foundation model traverses the relational structure (customers, orders, products, support_tickets) and generates churn predictions without any feature engineering, joins, or aggregations.
Output
| customer_id | churn_probability | key_signals | confidence |
|---|---|---|---|
| C-4401 | 0.89 | Order frequency declining + support tickets rising | High |
| C-4402 | 0.12 | Stable purchase pattern + no support issues | High |
| C-4403 | 0.67 | Category shift + payment method changed | Medium |
| C-4404 | 0.03 | Increasing order value + new product adoption | High |
No notebooks. No feature tables. No Spark jobs to maintain. The predictions land in a Delta table that any downstream process (dashboards, reverse ETL, operational systems) can consume directly.
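Because the output is an ordinary Delta table, downstream consumption is trivial. A toy sketch of one such consumer, hard-coding the sample rows from the table above (in practice you would read the predictions table with Spark or SQL):

```python
# Toy downstream consumer of the predictions table: route high-risk,
# high-confidence customers into a retention campaign. Sample rows are
# hard-coded; a real job would read the Delta table directly.
predictions = [
    {"customer_id": "C-4401", "churn_probability": 0.89, "confidence": "High"},
    {"customer_id": "C-4402", "churn_probability": 0.12, "confidence": "High"},
    {"customer_id": "C-4403", "churn_probability": 0.67, "confidence": "Medium"},
    {"customer_id": "C-4404", "churn_probability": 0.03, "confidence": "High"},
]

retention_list = [
    p["customer_id"]
    for p in predictions
    if p["churn_probability"] >= 0.60 and p["confidence"] == "High"
]
print(retention_list)  # ['C-4401']
```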
Option 2: Databricks AutoML
Databricks AutoML is built into the workspace. You point it at a single table, it tries multiple algorithms (LightGBM, XGBoost, sklearn, Prophet), tunes hyperparameters, and produces a notebook with the winning model. It is genuinely good at model selection.
The limitation is the input requirement: a single flat table. If your prediction depends on patterns across customers, orders, and products, you must join and aggregate those tables yourself before AutoML sees the data. AutoML automates the last 20% of the pipeline. The first 80% (feature engineering) remains manual.
Option 3: Databricks Genie Code
Genie Code is Databricks' new AI agent that generates notebook code. You describe what you want in natural language, and Genie writes PySpark, SQL, and ML code to accomplish it. It can generate feature engineering code, training scripts, and evaluation logic.
This is a genuine productivity improvement. Instead of writing feature pipelines by hand, you describe them and Genie writes the code. But the underlying approach is unchanged: Genie still produces a flat feature table and trains a single-table model. It automates the workflow (writing code, running notebooks). It does not automate the prediction (understanding relational structure).
Genie Code (automates the workflow)
- Generates PySpark code to join tables
- Writes feature engineering logic
- Produces a flat feature table
- Trains a single-table model
- Still requires human review of generated features
Kumo.ai (automates the prediction)
- Reads Delta tables as relational graph
- Discovers cross-table patterns automatically
- No flat feature table needed
- Foundation model understands relational structure
- Predictions in minutes with zero code review
Option 4: MLflow + custom models
MLflow is the backbone of ML operations on Databricks. It tracks experiments, versions models, manages artifacts, and handles deployment. If you have a strong ML team that wants full control, MLflow + custom PySpark/sklearn/PyTorch code gives you maximum flexibility.
The trade-off is effort. Your team writes the feature pipelines, selects the algorithms, tunes hyperparameters, and maintains everything. MLflow tracks all of this beautifully. But the 80% of time spent on feature engineering happens before MLflow enters the picture. MLflow tracks what you built. It does not build it for you.
| Stage | Hours per task | % of total | MLflow helps? |
|---|---|---|---|
| Delta table joins & prep | 2.5 hours | 17% | No |
| Feature computation (Spark) | 5.0 hours | 34% | No |
| Feature iteration & selection | 4.2 hours | 29% | Tracks experiments only |
| Model training & tuning | 1.8 hours | 12% | Yes (full tracking) |
| Evaluation & deployment | 1.2 hours | 8% | Yes (model registry) |
80% of the work happens before MLflow's tracking capabilities become relevant. MLflow is excellent infrastructure for the last 20%. It does not address the first 80%.
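The 80% figure follows directly from the hours in the table:

```python
# Hours from the table above, in pipeline order.
stages = {
    "joins_and_prep": 2.5,
    "feature_computation": 5.0,
    "feature_iteration": 4.2,
    "training_and_tuning": 1.8,
    "eval_and_deploy": 1.2,
}
total = sum(stages.values())  # 14.7 hours per task
pre_mlflow = 2.5 + 5.0 + 4.2  # everything before model training
print(f"{pre_mlflow / total:.0%} of time precedes model training")  # 80%
```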
Option 5: DataRobot on Databricks
DataRobot integrates with Databricks via Spark connectors and can read from Unity Catalog. It brings enterprise AutoML with strong governance, explainability, and deployment features. Like Databricks AutoML, it requires a flat feature table as input.
DataRobot adds value over native Databricks AutoML in model governance, monitoring, and compliance documentation. But the core limitation is the same: it optimizes over a pre-engineered feature table. Cross-table patterns that were not manually encoded as features are invisible to DataRobot.
Option 6: H2O Sparkling Water
H2O Sparkling Water runs H2O's algorithms directly on Spark clusters. This gives you distributed training at scale without moving data out of Databricks. The integration is mature and well-tested.
Like every other option except Kumo, H2O requires a flat feature table. You write PySpark to join and aggregate your Delta tables, then H2O trains models on the result. The feature engineering bottleneck remains fully manual.
Option 7: Feature Store + AutoML
Databricks Feature Store (now part of Unity Catalog) lets you define, compute, and serve features as managed tables. Combined with AutoML, this is the most "Databricks-native" approach to production ML.
It is also the most complex. You define feature tables, write compute functions, schedule refresh jobs, manage point-in-time correctness, handle feature serving, and then feed the feature table to AutoML. This is the right approach for organizations with dedicated ML platform teams and dozens of models in production. For teams trying to get their first prediction live, it is months of infrastructure work before the first model trains.
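Point-in-time correctness is the subtlest item on that list. The rule is simple to state and easy to get wrong: for every training label, use the newest feature snapshot at or before the label's timestamp, never a later one. A minimal sketch with invented data:

```python
from bisect import bisect_right

# Minimal point-in-time lookup sketch (invented data). Using any snapshot
# newer than the label timestamp would leak the future into training.
feature_snapshots = [  # (as_of_ts, value), sorted by timestamp
    ("2026-01-01", 3),
    ("2026-02-01", 5),
    ("2026-03-01", 9),
]

def point_in_time_lookup(label_ts: str):
    """Return the feature value as of label_ts, or None if no snapshot exists yet."""
    timestamps = [ts for ts, _ in feature_snapshots]
    i = bisect_right(timestamps, label_ts)
    return feature_snapshots[i - 1][1] if i else None

print(point_in_time_lookup("2026-02-15"))  # 5    (the 2026-02-01 snapshot)
print(point_in_time_lookup("2025-12-31"))  # None (no snapshot yet)
```

Feature Store manages this bookkeeping for you, but someone still has to define every feature, write its compute function, and keep the refresh jobs healthy.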
The real question: who builds the features?
Every approach in this guide answers a slightly different question. But they all come back to the same bottleneck: converting your multi-table Delta Lake data into a flat feature table that a model can consume.
| Approach | Who writes the feature code? | Feature engineering effort |
|---|---|---|
| Kumo.ai | Nobody (foundation model reads raw tables) | Zero |
| Databricks AutoML | Your data scientists | Full manual effort |
| Genie Code | AI generates code, humans review | Reduced but not eliminated |
| MLflow + custom | Your ML engineers | Full manual effort |
| DataRobot | Your data scientists | Full manual effort |
| H2O Sparkling Water | Your ML engineers | Full manual effort |
| Feature Store + AutoML | Your ML platform team | Full manual effort (most structured) |
Kumo is the only option where nobody writes feature engineering code. The foundation model discovers predictive patterns directly from your relational Delta tables.
How Kumo reads your lakehouse differently
To understand why Kumo eliminates the feature engineering step, consider what the other tools see versus what Kumo sees when pointed at the same Unity Catalog tables.
| Delta table | What AutoML/MLflow/DataRobot see | What Kumo's foundation model sees |
|---|---|---|
| customers | Source table for manual joins | Entity nodes with temporal attributes |
| orders | Source table for aggregation SQL | Event nodes linked to customers and products |
| products | Source for one-hot encoding | Attribute nodes with category relationships |
| support_tickets | Source for count/recency features | Signal nodes with temporal patterns |
| Relationships between tables | Invisible (lost in flattening) | Graph edges preserving full relational structure |
Every other tool requires you to flatten the relational structure into a single table, losing cross-table patterns in the process. Kumo preserves the full relational structure as a temporal graph.
When to use each option
The right choice depends on your team, your data, and your timeline:
- Kumo.ai Lakehouse App: You have multi-table Delta data and want predictions without building feature pipelines. You want your first prediction in minutes, not months. Your team's time is better spent on business problems than feature engineering.
- Databricks AutoML: You already have a flat feature table or single-table data. You want a quick baseline model with minimal setup. Your data does not require multi-table joins.
- Genie Code: You want AI assistance writing notebook code. Your team is comfortable reviewing generated code. You want to accelerate existing notebook-based workflows.
- MLflow + custom: You have a strong ML team that wants full control. You need custom model architectures or domain-specific feature engineering. You already have feature pipelines in production.
- DataRobot: You need enterprise governance and compliance documentation on top of AutoML. Your organization has regulatory requirements for model explainability.
- H2O Sparkling Water: You need distributed training at scale on Spark. Your team has H2O expertise.
- Feature Store + AutoML: You have a dedicated ML platform team, dozens of models in production, and the resources to build and maintain feature infrastructure.
PQL Query
```
PREDICT fraud_flag FOR EACH unity_catalog.payments.transactions.txn_id WHERE transactions.timestamp > '2026-03-01'
```
Fraud detection on Delta tables with a single PQL query. Kumo's foundation model reads transactions, accounts, merchants, and device tables from Unity Catalog, discovers cross-table anomaly patterns, and returns fraud probabilities. No Spark feature pipeline required.
Output
| txn_id | fraud_probability | risk_tier | tables_used |
|---|---|---|---|
| T-88201 | 0.94 | Critical | transactions, accounts, merchants, devices |
| T-88202 | 0.07 | Low | transactions, accounts |
| T-88203 | 0.71 | High | transactions, accounts, merchants |
| T-88204 | 0.02 | Low | transactions, accounts |
The bottom line
Databricks has built the best data lakehouse platform in the industry. But a data platform is not a prediction platform. Adding ML predictions still requires choosing who builds the features and how models get trained.
Six of the seven options on this page require you to solve the feature engineering problem yourself (manually, with AI code generation, or through Feature Store infrastructure). One option eliminates it entirely by reading your relational Delta tables as they are.
If your team has been spending weeks or months building feature pipelines before any model trains, the issue is not which AutoML tool you use on the flat table at the end. The issue is that you are building the flat table at all.