Google published a paper in 2015 titled "Hidden Technical Debt in Machine Learning Systems." The central finding: only 5% of the code in a production ML system is the model itself. The other 95% is data extraction, feature engineering, feature validation, training infrastructure, model serving, monitoring, and the glue that holds it all together.
Nine years later, this ratio has not changed. Industry surveys from Anaconda, MLOps Community, and Gartner consistently show that data science teams spend 40-60% of their time on pipeline maintenance. Not building new models. Not improving existing ones. Maintaining the infrastructure that exists solely to transform relational data into flat tables and keep those transformations running.
This is the structural problem. And it is solvable.
production_ml_pipeline — stages and effort
| Stage | Tools Required | Team | Typical Duration | Annual Maintenance |
|---|---|---|---|---|
| 1. Data Extraction | Airflow, Prefect, dbt | Data Engineer | 4-8 weeks | $50K-100K |
| 2. Feature Engineering | SQL, Python, Spark | Data Scientist | 6-12 weeks | $150K-300K |
| 3. Feature Storage | Tecton, Feast, Redis | ML Engineer | 2-4 weeks | $100K-300K |
| 4. Model Training | XGBoost, PyTorch, MLflow | Data Scientist | 2-4 weeks | $50K-100K |
| 5. Model Validation | Custom scripts, fairness tools | Data Scientist | 1-2 weeks | $20K-50K |
| 6. Deployment | Docker, K8s, SageMaker | ML Engineer | 2-4 weeks | $50K-150K |
| 7. Monitoring | Evidently, Arize, WhyLabs | ML Engineer | 1-2 weeks | $50K-150K |
| 8. Retraining | Airflow + stages 1-6 | Full team | Ongoing | $200K-500K |
Highlighted: feature engineering and feature storage account for 60-80% of total pipeline effort and maintenance cost.
Anatomy of a production ML pipeline
A typical ML pipeline for a prediction task on enterprise data has six to eight stages; the version walked through below has eight. Each stage has its own tools, failure modes, and maintenance requirements.
Stage 1: Data extraction
Pull data from source systems: the data warehouse, transactional databases, event streams, CRM, and third-party APIs. This requires scheduling (Airflow, Prefect, Dagster), connection management, schema monitoring, and incremental load logic. When a source schema changes (and it always changes), the extraction pipeline breaks.
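The incremental-load logic at the heart of Stage 1 can be sketched in a few lines. This is a minimal watermark-based extract with in-memory lists standing in for database tables; in production an Airflow or Prefect task would wrap the same idea around real connections, and the schema changes mentioned above would break the row shape this code assumes.

```python
from datetime import datetime, timezone

def incremental_extract(source_rows, watermark):
    """Return rows newer than the watermark, plus the advanced watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Toy source table (hypothetical rows, not from any real system).
source = [
    {"id": 1, "updated_at": datetime(2025, 3, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2025, 3, 2, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2025, 3, 3, tzinfo=timezone.utc)},
]

# First run picks up everything past the initial watermark.
rows, wm = incremental_extract(source, datetime(2025, 3, 1, tzinfo=timezone.utc))
# Second run with the advanced watermark loads nothing new.
rows2, _ = incremental_extract(source, wm)
```

Every scheduled run must persist the watermark somewhere durable; losing it means re-extracting everything or silently skipping rows, which is one of the failure modes teams end up maintaining.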
Stage 2: Feature engineering
Transform raw data into features. This is the longest stage: writing SQL joins across 5-15 tables, computing aggregations (sum, count, average, max, min) over multiple time windows (7, 30, 60, 90 days), creating derived features (ratios, rates, deltas), and encoding categorical variables. A Stanford study measured this at 12.3 hours and 878 lines of code per prediction task for experienced data scientists.
Here is what that looks like in practice. A telecom company wants to predict churn. The raw data lives in three tables:
subscribers
| subscriber_id | plan | tenure_months | monthly_charge |
|---|---|---|---|
| S-4001 | Unlimited | 34 | $89 |
| S-4002 | Basic | 8 | $45 |
| S-4003 | Family | 22 | $120 |
usage_events
| event_id | subscriber_id | event_type | timestamp | data_mb |
|---|---|---|---|---|
| E-001 | S-4001 | Data | 2025-03-01 08:14 | 142 |
| E-002 | S-4001 | Voice | 2025-03-01 09:30 | — |
| E-003 | S-4002 | Data | 2025-03-01 11:45 | 8 |
| E-004 | S-4002 | Data | 2025-02-15 14:20 | 12 |
| E-005 | S-4003 | Data | 2025-03-02 10:00 | 340 |
S-4002 has only 2 usage events in 2 weeks, with minimal data consumption. S-4001 uses data and voice on the same day. These temporal patterns matter.
support_interactions
| ticket_id | subscriber_id | category | date | resolution |
|---|---|---|---|---|
| T-101 | S-4002 | Coverage complaint | 2025-02-20 | Unresolved |
| T-102 | S-4002 | Cancel request | 2025-03-01 | Pending |
| T-103 | S-4003 | Billing question | 2025-02-15 | Resolved |
S-4002 escalated from a coverage complaint to a cancellation request in 9 days. This sequence is the strongest churn signal in the data.
Stage 2 flattens these three tables into a single row per subscriber:
flat_feature_table (what the model sees after 12.3 hours of engineering)
| subscriber_id | usage_events_30d | avg_data_mb | ticket_count | tenure | churned |
|---|---|---|---|---|---|
| S-4001 | 2 | 142 | 0 | 34 | ? |
| S-4002 | 2 | 10 | 2 | 8 | ? |
| S-4003 | 1 | 340 | 1 | 22 | ? |
The flattened table compresses ticket categories into a count. S-4002's escalation from complaint to cancellation is invisible. S-4003's resolved billing question looks identical to an unresolved cancellation.
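The flattening above can be sketched in plain Python. This reproduces the flat_feature_table from the raw rows shown earlier; the as-of date of 2025-03-02 is an assumption, and a production version would be SQL over millions of rows rather than dicts over five.

```python
from datetime import date, timedelta
from collections import defaultdict

AS_OF = date(2025, 3, 2)   # assumed prediction cutoff
WINDOW = timedelta(days=30)

# (subscriber_id, event_type, date, data_mb) from usage_events above.
usage_events = [
    ("S-4001", "Data",  date(2025, 3, 1), 142),
    ("S-4001", "Voice", date(2025, 3, 1), None),
    ("S-4002", "Data",  date(2025, 3, 1), 8),
    ("S-4002", "Data",  date(2025, 2, 15), 12),
    ("S-4003", "Data",  date(2025, 3, 2), 340),
]
tickets = ["S-4002", "S-4002", "S-4003"]  # category and date already discarded

agg = defaultdict(lambda: {"usage_events_30d": 0, "data_mb": [], "ticket_count": 0})
for sub, etype, ts, mb in usage_events:
    if AS_OF - WINDOW <= ts <= AS_OF:
        agg[sub]["usage_events_30d"] += 1
        if mb is not None:
            agg[sub]["data_mb"].append(mb)
for sub in tickets:
    agg[sub]["ticket_count"] += 1

flat = {
    sub: {
        "usage_events_30d": f["usage_events_30d"],
        "avg_data_mb": sum(f["data_mb"]) / len(f["data_mb"]) if f["data_mb"] else 0,
        "ticket_count": f["ticket_count"],
    }
    for sub, f in agg.items()
}
```

Note where the information loss happens: the `tickets` list enters the computation already stripped of category, date, and resolution, so no downstream step can recover the complaint-to-cancellation sequence.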
Stage 3: Feature storage and serving
Store computed features for training and serve them at prediction time. Feature stores (Tecton, Feast, Hopsworks) manage this but add infrastructure complexity: online stores for low-latency serving, offline stores for training, materialization jobs to keep them in sync, and TTL policies to manage stale data.
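The freshness bookkeeping that makes this stage hard can be shown with a toy online store. This is a conceptual sketch of the TTL logic that Feast or Tecton automate, not their actual APIs; all names here are illustrative.

```python
import time

class OnlineStore:
    """Toy online feature store with a TTL on every row."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._rows = {}  # entity key -> (features, written_at)

    def materialize(self, key, features, now=None):
        """Write features computed offline into the online store."""
        self._rows[key] = (features, now if now is not None else time.time())

    def get(self, key, now=None):
        """Return features if fresh; None if missing or past TTL."""
        now = now if now is not None else time.time()
        row = self._rows.get(key)
        if row is None or now - row[1] > self.ttl:
            return None  # serving stale features would skew predictions
        return row[0]

store = OnlineStore(ttl_seconds=3600)
store.materialize("S-4002", {"usage_events_30d": 2}, now=0)
fresh = store.get("S-4002", now=1800)  # within TTL: served
stale = store.get("S-4002", now=7200)  # past TTL: None
```

Everything around this sketch is what a real deployment adds: the offline store, the scheduled materialization jobs that call `materialize`, and the alerting when `get` starts returning None in production.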
Stage 4: Model training
Train the model on the feature table. This includes hyperparameter tuning, cross-validation, experiment tracking (MLflow, Weights & Biases), and compute management (GPU allocation, distributed training). This stage is what most people think of as "ML," but it represents only 10-15% of the total pipeline effort.
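As a contrast with the stages above, the core mechanics here are simple and well understood. A k-fold split, for example, is a few lines of index arithmetic; the tooling mentioned above wraps logic of roughly this shape.

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

splits = list(kfold_indices(10, 5))
```

Each of the five test folds covers a disjoint fifth of the ten rows, and every row appears in exactly one test fold.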
Stage 5: Model validation
Validate the model against held-out data, check for data leakage, test fairness across protected groups, compare against the current production model, and generate documentation for model governance. At regulated companies (finance, healthcare), this stage alone can take weeks.
Stage 6: Deployment
Serve the model in production: containerize (Docker), deploy (Kubernetes, SageMaker, Vertex AI), set up auto-scaling, configure A/B testing, and integrate with the application layer. The model is now live, but the work is not done.
Stage 7: Monitoring
Monitor for data drift (input distributions changing), model decay (accuracy degrading over time), feature freshness (stale data in the feature store), and infrastructure health (latency, errors). Set up alerting for each.
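Data drift detection often comes down to comparing the live input distribution against the training distribution. A common metric is the Population Stability Index, sketched here from scratch for a feature bounded in [0, 1]; tools like Evidently compute this kind of statistic per feature automatically.

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between two samples of a bounded feature.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    def histogram(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        # Tiny floor avoids log(0) for empty bins.
        return [max(c / len(xs), 1e-6) for c in counts]
    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]                   # training distribution
shifted = [min(i / 100 + 0.3, 0.99) for i in range(100)]   # drifted live inputs
drift = psi(baseline, shifted)
```

Comparing a distribution against itself yields a PSI of zero; the 0.3 shift above pushes the score well past the 0.25 alarm threshold.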
Stage 8: Retraining
When monitoring detects decay, retrain the model. This triggers stages 1-6 again: re-extract data, re-compute features, retrain, revalidate, redeploy. Most organizations retrain monthly or quarterly. Some retrain weekly. Each retraining cycle requires human oversight.
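The trigger that kicks off this cycle is usually a small policy function like the following sketch; the thresholds are illustrative, and the drift score could come from any of the monitoring tools named in Stage 7.

```python
def needs_retrain(live_accuracy, baseline_accuracy,
                  tolerance=0.05, drift_score=0.0, drift_limit=0.25):
    """Decide whether to kick off stages 1-6 again.
    Retrain when accuracy decays past the tolerance or inputs drift."""
    decayed = live_accuracy < baseline_accuracy - tolerance
    drifted = drift_score > drift_limit
    return decayed or drifted
```

The function itself is trivial; the expensive part is what returning True sets in motion, because the entire extraction-to-deployment pipeline re-runs behind it.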
Where the complexity actually lives
Not all pipeline stages are equally complex. The distribution of effort is highly skewed.
High-complexity stages (85% of effort)
- Feature engineering: 878 lines of SQL/Python per task
- Feature storage: Online/offline sync, materialization jobs
- Data extraction: Schema monitoring, incremental loads
- Monitoring: Drift detection, freshness checks, alerting
- Retraining: Full pipeline re-execution monthly/quarterly
Low-complexity stages (15% of effort)
- Model training: 50-100 lines, well-tooled with AutoML
- Validation: Standardized metrics, automated testing
- Deployment: Containerized, one-click with modern platforms
- Hyperparameter tuning: Automated with Optuna/Ray Tune
- Experiment tracking: Mature tools (MLflow, W&B)
The pattern is clear: the hard stages are the ones that bridge the gap between relational data and flat features. These are data extraction, feature engineering, feature storage, and the monitoring and retraining that keep them running. The easy stages operate on the flat feature table after it exists: training, tuning, and validation.
The ML industry has spent a decade building better tools for the easy stages (AutoML, experiment tracking, model serving platforms) while leaving the hard stages manual. This is like optimizing the last mile of a marathon while ignoring the first 25 miles.
pipeline_cost_comparison — traditional vs foundation model
| Cost Category | Traditional (10 use cases) | Foundation Model (10 use cases) |
|---|---|---|
| Data Science Team | $2M-4M/year | $200K-400K/year |
| Infrastructure & Tooling | $1M-3M/year | $100K-300K/year |
| Feature Store Licensing | $100K-300K/year | $0 |
| Monitoring & MLOps | $200K-500K/year | Included |
| Annual Maintenance | $3M-10M/year | $50K-100K/year |
| Total (Year 1) | $5M-20M | $500K-1.5M |
| Total (3-Year) | $16M-50M | $1.5M-4.5M |
Foundation model approach reduces total 3-year cost by 75-90% by eliminating feature pipelines and per-use-case engineering.
What a simplified pipeline looks like
If you eliminate the gap between relational data and flat features, four of the eight stages disappear entirely.
Data extraction: eliminated. The model connects directly to the relational database. No extraction pipeline, no schema monitoring, no incremental load logic.
Feature engineering: eliminated. The model learns directly from raw relational data. No SQL joins, no aggregation queries, no time-window features.
Feature storage: eliminated. There are no computed features to store, serve, or keep fresh. The model reads the data at prediction time.
Retraining: eliminated (for zero-shot) or simplified (for fine-tuning). A foundation model that is pre-trained on relational patterns does not need task-specific training for many use cases. For cases where fine-tuning improves accuracy, the retraining cycle is minutes, not weeks.
What remains: connect the database, write a predictive query, validate the output, deploy the predictions. Four stages that take hours instead of months.
The foundation model approach
KumoRFM implements this simplified pipeline. The model is pre-trained on billions of relational patterns across thousands of databases. It has already learned the universal patterns that predict outcomes in relational data: recency, frequency, temporal dynamics, graph topology, cross-table signal propagation.
The production workflow is:
1. Connect. Point KumoRFM at your Snowflake, Databricks, BigQuery, or PostgreSQL database. The model reads the schema and maps tables and relationships automatically.
2. Query. Write a one-line predictive query:
PQL Query
PREDICT churn_90d FOR EACH customers.customer_id
This single query replaces stages 2-4 of the traditional pipeline: feature engineering, feature storage, and model training. The foundation model handles all three internally.
Output
| customer_id | churn_90d | confidence |
|---|---|---|
| C-10042 | 0.82 | 0.93 |
| C-10043 | 0.14 | 0.91 |
| C-10044 | 0.67 | 0.88 |
| C-10045 | 0.03 | 0.96 |
3. Deploy. Predictions are available via API or written back to your data warehouse. No model serving infrastructure. No containerization. No Kubernetes.
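The three-step workflow above can be sketched schematically. The stub client below stands in for the real SDK purely to show the shape of the calls; `connect`, `predict`, and the canned response are hypothetical names, not the actual KumoRFM API.

```python
class StubRFMClient:
    """Stand-in for a foundation-model client. Names are hypothetical."""

    def connect(self, dsn):
        # A real client would read the schema here and map table
        # relationships automatically; the stub just records the DSN.
        self.dsn = dsn
        return self

    def predict(self, pql):
        # A real client would run the foundation model against the live
        # database; the stub returns a canned row to show the shape.
        assert pql.startswith("PREDICT")
        return [{"customer_id": "C-10042", "churn_90d": 0.82, "confidence": 0.93}]

client = StubRFMClient().connect("postgresql://warehouse/prod")
rows = client.predict("PREDICT churn_90d FOR EACH customers.customer_id")
```

The point of the sketch is what is absent: no feature computation, no feature store lookup, and no model artifact between the query and the scored rows.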
The entire pipeline, from connected database to production predictions, is measured in minutes. There is no feature pipeline to maintain. No retraining schedule to manage. No drift to monitor in a feature store that does not exist. When the underlying data changes, the model adapts automatically because it reads the data at prediction time.
What this means for ML teams
The implication is not that ML teams become unnecessary. It is that they spend their time differently. Instead of 60% on pipeline maintenance and 40% on new work, the ratio flips.
More predictions, faster. A team that previously shipped 3-4 new models per year (each taking 3-6 months to build and deploy) can now ship 50+ predictions per quarter. Each new prediction question is a query, not a project.
Higher-value work. Data scientists spend time on business problem framing, result interpretation, and stakeholder communication instead of writing SQL joins and debugging feature pipelines.
Lower infrastructure cost. Eliminating feature stores, training clusters, orchestration systems, and monitoring infrastructure reduces cloud spend by 40-70% for prediction workloads.
The complexity of ML pipelines is not inherent to machine learning. It is an artifact of the gap between relational data and flat-table models. Close the gap with a model that learns directly from relational data, and 90% of the pipeline evaporates. What remains is the part that creates value: asking the right questions and acting on the answers.