Google published a paper in 2015 titled "Hidden Technical Debt in Machine Learning Systems." The central finding: only 5% of the code in a production ML system is the model itself. The other 95% is data extraction, feature engineering, feature validation, training infrastructure, model serving, monitoring, and the glue that holds it all together.
Nine years later, this ratio has not changed. Industry surveys from Anaconda, MLOps Community, and Gartner consistently show that data science teams spend 40-60% of their time on pipeline maintenance. Not building new models. Not improving existing ones. Maintaining the infrastructure that exists solely to transform relational data into flat tables and keep those transformations running.
This is the structural problem. And it is solvable.
production_ml_pipeline — stages and effort
| Stage | Tools Required | Team | Typical Duration | Annual Maintenance |
|---|---|---|---|---|
| 1. Data Extraction | Airflow, Prefect, dbt | Data Engineer | 4-8 weeks | $50K-100K |
| 2. Feature Engineering | SQL, Python, Spark | Data Scientist | 6-12 weeks | $150K-300K |
| 3. Feature Storage | Tecton, Feast, Redis | ML Engineer | 2-4 weeks | $100K-300K |
| 4. Model Training | XGBoost, PyTorch, MLflow | Data Scientist | 2-4 weeks | $50K-100K |
| 5. Model Validation | Custom scripts, fairness tools | Data Scientist | 1-2 weeks | $20K-50K |
| 6. Deployment | Docker, K8s, SageMaker | ML Engineer | 2-4 weeks | $50K-150K |
| 7. Monitoring | Evidently, Arize, WhyLabs | ML Engineer | 1-2 weeks | $50K-150K |
| 8. Retraining | Airflow + stages 1-6 | Full team | Ongoing | $200K-500K |
Highlighted: feature engineering and feature storage account for 60-80% of total pipeline effort and maintenance cost.
Anatomy of a production ML pipeline
A typical ML pipeline for a prediction task on enterprise data has six to eight stages; the version walked through below has eight. Each stage has its own tools, failure modes, and maintenance requirements.
Stage 1: Data extraction
Pull data from source systems: the data warehouse, transactional databases, event streams, CRM, and third-party APIs. This requires scheduling (Airflow, Prefect, Dagster), connection management, schema monitoring, and incremental load logic. When a source schema changes (and it always changes), the extraction pipeline breaks.
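The incremental-load logic at the heart of Stage 1 can be sketched in a few lines. This is a minimal watermark-based extract with in-memory lists standing in for database tables; in production an Airflow or Prefect task would wrap the same idea around real connections, and the schema changes mentioned above would break the row shape this code assumes.

```python
from datetime import datetime, timezone

def incremental_extract(source_rows, watermark):
    """Return rows newer than the watermark, plus the advanced watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Toy source table (hypothetical rows, not from any real system).
source = [
    {"id": 1, "updated_at": datetime(2025, 3, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2025, 3, 2, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2025, 3, 3, tzinfo=timezone.utc)},
]

# First run picks up everything past the initial watermark.
rows, wm = incremental_extract(source, datetime(2025, 3, 1, tzinfo=timezone.utc))
# Second run with the advanced watermark loads nothing new.
rows2, _ = incremental_extract(source, wm)
```

Every scheduled run must persist the watermark somewhere durable; losing it means re-extracting everything or silently skipping rows, which is one of the failure modes teams end up maintaining.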
Stage 2: Feature engineering
Transform raw data into features. This is the longest stage: writing SQL joins across 5-15 tables, computing aggregations (sum, count, average, max, min) over multiple time windows (7, 30, 60, 90 days), creating derived features (ratios, rates, deltas), and encoding categorical variables. A Stanford study measured this at 12.3 hours and 878 lines of code per prediction task for experienced data scientists.
Here is what that looks like in practice. A telecom company wants to predict churn. The raw data lives in three tables:
subscribers
| subscriber_id | plan | tenure_months | monthly_charge |
|---|---|---|---|
| S-4001 | Unlimited | 34 | $89 |
| S-4002 | Basic | 8 | $45 |
| S-4003 | Family | 22 | $120 |
usage_events
| event_id | subscriber_id | event_type | timestamp | data_mb |
|---|---|---|---|---|
| E-001 | S-4001 | Data | 2025-03-01 08:14 | 142 |
| E-002 | S-4001 | Voice | 2025-03-01 09:30 | — |
| E-003 | S-4002 | Data | 2025-03-01 11:45 | 8 |
| E-004 | S-4002 | Data | 2025-02-15 14:20 | 12 |
| E-005 | S-4003 | Data | 2025-03-02 10:00 | 340 |
S-4002 has only 2 usage events in 2 weeks, with minimal data consumption. S-4001 uses data and voice on the same day. These temporal patterns matter.
support_interactions
| ticket_id | subscriber_id | category | date | resolution |
|---|---|---|---|---|
| T-101 | S-4002 | Coverage complaint | 2025-02-20 | Unresolved |
| T-102 | S-4002 | Cancel request | 2025-03-01 | Pending |
| T-103 | S-4003 | Billing question | 2025-02-15 | Resolved |
S-4002 escalated from a coverage complaint to a cancellation request in 9 days. This sequence is the strongest churn signal in the data.
Stage 2 flattens these three tables into a single row per subscriber:
flat_feature_table (what the model sees after 12.3 hours of engineering)
| subscriber_id | usage_events_30d | avg_data_mb | ticket_count | tenure | churned |
|---|---|---|---|---|---|
| S-4001 | 2 | 142 | 0 | 34 | ? |
| S-4002 | 2 | 10 | 2 | 8 | ? |
| S-4003 | 1 | 340 | 1 | 22 | ? |
The flattened table compresses ticket categories into a count. S-4002's escalation from complaint to cancellation is invisible. S-4003's resolved billing question looks identical to an unresolved cancellation.
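The flattening above can be sketched in plain Python. This reproduces the flat_feature_table from the raw rows shown earlier; the as-of date of 2025-03-02 is an assumption, and a production version would be SQL over millions of rows rather than dicts over five.

```python
from datetime import date, timedelta
from collections import defaultdict

AS_OF = date(2025, 3, 2)   # assumed prediction cutoff
WINDOW = timedelta(days=30)

# (subscriber_id, event_type, date, data_mb) from usage_events above.
usage_events = [
    ("S-4001", "Data",  date(2025, 3, 1), 142),
    ("S-4001", "Voice", date(2025, 3, 1), None),
    ("S-4002", "Data",  date(2025, 3, 1), 8),
    ("S-4002", "Data",  date(2025, 2, 15), 12),
    ("S-4003", "Data",  date(2025, 3, 2), 340),
]
tickets = ["S-4002", "S-4002", "S-4003"]  # category and date already discarded

agg = defaultdict(lambda: {"usage_events_30d": 0, "data_mb": [], "ticket_count": 0})
for sub, etype, ts, mb in usage_events:
    if AS_OF - WINDOW <= ts <= AS_OF:
        agg[sub]["usage_events_30d"] += 1
        if mb is not None:
            agg[sub]["data_mb"].append(mb)
for sub in tickets:
    agg[sub]["ticket_count"] += 1

flat = {
    sub: {
        "usage_events_30d": f["usage_events_30d"],
        "avg_data_mb": sum(f["data_mb"]) / len(f["data_mb"]) if f["data_mb"] else 0,
        "ticket_count": f["ticket_count"],
    }
    for sub, f in agg.items()
}
```

Note where the information loss happens: the `tickets` list enters the computation already stripped of category, date, and resolution, so no downstream step can recover the complaint-to-cancellation sequence.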
Stage 3: Feature storage and serving
Store computed features for training and serve them at prediction time. Feature stores (Tecton, Feast, Hopsworks) manage this but add infrastructure complexity: online stores for low-latency serving, offline stores for training, materialization jobs to keep them in sync, and TTL policies to manage stale data.
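The freshness bookkeeping that makes this stage hard can be shown with a toy online store. This is a conceptual sketch of the TTL logic that Feast or Tecton automate, not their actual APIs; all names here are illustrative.

```python
import time

class OnlineStore:
    """Toy online feature store with a TTL on every row."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._rows = {}  # entity key -> (features, written_at)

    def materialize(self, key, features, now=None):
        """Write features computed offline into the online store."""
        self._rows[key] = (features, now if now is not None else time.time())

    def get(self, key, now=None):
        """Return features if fresh; None if missing or past TTL."""
        now = now if now is not None else time.time()
        row = self._rows.get(key)
        if row is None or now - row[1] > self.ttl:
            return None  # serving stale features would skew predictions
        return row[0]

store = OnlineStore(ttl_seconds=3600)
store.materialize("S-4002", {"usage_events_30d": 2}, now=0)
fresh = store.get("S-4002", now=1800)  # within TTL: served
stale = store.get("S-4002", now=7200)  # past TTL: None
```

Everything around this sketch is what a real deployment adds: the offline store, the scheduled materialization jobs that call `materialize`, and the alerting when `get` starts returning None in production.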
Stage 4: Model training
Train the model on the feature table. This includes hyperparameter tuning, cross-validation, experiment tracking (MLflow, Weights & Biases), and compute management (GPU allocation, distributed training). This stage is what most people think of as "ML," but it represents only 10-15% of the total pipeline effort.
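As a contrast with the stages above, the core mechanics here are simple and well understood. A k-fold split, for example, is a few lines of index arithmetic; the tooling mentioned above wraps logic of roughly this shape.

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

splits = list(kfold_indices(10, 5))
```

Each of the five test folds covers a disjoint fifth of the ten rows, and every row appears in exactly one test fold.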
Stage 5: Model validation
Validate the model against held-out data, check for data leakage, test fairness across protected groups, compare against the current production model, and generate documentation for model governance. At regulated companies (finance, healthcare), this stage alone can take weeks.
Stage 6: Deployment
Serve the model in production: containerize (Docker), deploy (Kubernetes, SageMaker, Vertex AI), set up auto-scaling, configure A/B testing, and integrate with the application layer. The model is now live, but the work is not done.
Stage 7: Monitoring
Monitor for data drift (input distributions changing), model decay (accuracy degrading over time), feature freshness (stale data in the feature store), and infrastructure health (latency, errors). Set up alerting for each.
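Data drift detection often comes down to comparing the live input distribution against the training distribution. A common metric is the Population Stability Index, sketched here from scratch for a feature bounded in [0, 1]; tools like Evidently compute this kind of statistic per feature automatically.

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between two samples of a bounded feature.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    def histogram(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        # Tiny floor avoids log(0) for empty bins.
        return [max(c / len(xs), 1e-6) for c in counts]
    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]                   # training distribution
shifted = [min(i / 100 + 0.3, 0.99) for i in range(100)]   # drifted live inputs
drift = psi(baseline, shifted)
```

Comparing a distribution against itself yields a PSI of zero; the 0.3 shift above pushes the score well past the 0.25 alarm threshold.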
Stage 8: Retraining
When monitoring detects decay, retrain the model. This triggers stages 1-6 again: re-extract data, re-compute features, retrain, revalidate, redeploy. Most organizations retrain monthly or quarterly. Some retrain weekly. Each retraining cycle requires human oversight.
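The trigger that kicks off this cycle is usually a small policy function like the following sketch; the thresholds are illustrative, and the drift score could come from any of the monitoring tools named in Stage 7.

```python
def needs_retrain(live_accuracy, baseline_accuracy,
                  tolerance=0.05, drift_score=0.0, drift_limit=0.25):
    """Decide whether to kick off stages 1-6 again.
    Retrain when accuracy decays past the tolerance or inputs drift."""
    decayed = live_accuracy < baseline_accuracy - tolerance
    drifted = drift_score > drift_limit
    return decayed or drifted
```

The function itself is trivial; the expensive part is what returning True sets in motion, because the entire extraction-to-deployment pipeline re-runs behind it.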
Where the complexity actually lives
Not all pipeline stages are equally complex. The distribution of effort is highly skewed.
High-complexity stages (85% of effort)
- Feature engineering: 878 lines of SQL/Python per task
- Feature storage: Online/offline sync, materialization jobs
- Data extraction: Schema monitoring, incremental loads
- Monitoring: Drift detection, freshness checks, alerting
- Retraining: Full pipeline re-execution monthly/quarterly
Low-complexity stages (15% of effort)
- Model training: 50-100 lines, well-tooled with AutoML
- Validation: Standardized metrics, automated testing
- Deployment: Containerized, one-click with modern platforms
- Hyperparameter tuning: Automated with Optuna/Ray Tune
- Experiment tracking: Mature tools (MLflow, W&B)
The pattern is clear: the hard stages are the ones that bridge the gap between relational data and flat features. These are data extraction, feature engineering, feature storage, and the monitoring and retraining that keep them running. The easy stages operate on the flat feature table after it exists: training, tuning, and validation.
The ML industry has spent a decade building better tools for the easy stages (AutoML, experiment tracking, model serving platforms) while leaving the hard stages manual. This is like optimizing the last mile of a marathon while ignoring the first 25 miles.
pipeline_cost_comparison — traditional vs foundation model
| Cost Category | Traditional (10 use cases) | Foundation Model (10 use cases) |
|---|---|---|
| Data Science Team | $2M-4M/year | $200K-400K/year |
| Infrastructure & Tooling | $1M-3M/year | $100K-300K/year |
| Feature Store Licensing | $100K-300K/year | $0 |
| Monitoring & MLOps | $200K-500K/year | Included |
| Annual Maintenance | $3M-10M/year | $50K-100K/year |
| Total (Year 1) | $5M-20M | $500K-1.5M |
| Total (3-Year) | $16M-50M | $1.5M-4.5M |
Foundation model approach reduces total 3-year cost by 75-90% by eliminating feature pipelines and per-use-case engineering.
What a simplified pipeline looks like
If you eliminate the gap between relational data and flat features, four of the eight stages disappear entirely.
Data extraction: eliminated. The model connects directly to the relational database. No extraction pipeline, no schema monitoring, no incremental load logic.
Feature engineering: eliminated. The model learns directly from raw relational data. No SQL joins, no aggregation queries, no time-window features.
Feature storage: eliminated. There are no computed features to store, serve, or keep fresh. The model reads the data at prediction time.
Retraining: eliminated (for zero-shot) or simplified (for fine-tuning). A foundation model that is pre-trained on relational patterns does not need task-specific training for many use cases. For cases where fine-tuning improves accuracy, the retraining cycle is minutes, not weeks.
What remains: connect the database, write a predictive query, validate the output, deploy the predictions. Four stages that take hours instead of months.
The foundation model approach
KumoRFM implements this simplified pipeline. The model is pre-trained on billions of relational patterns across thousands of databases. It has already learned the universal patterns that predict outcomes in relational data: recency, frequency, temporal dynamics, graph topology, cross-table signal propagation.
The production workflow is:
1. Connect. Point KumoRFM at your Snowflake, Databricks, BigQuery, or PostgreSQL database. The model reads the schema and maps tables and relationships automatically.
2. Query. Write a one-line predictive query:
PQL Query
PREDICT churn_90d FOR EACH customers.customer_id
This single query replaces stages 2-4 of the traditional pipeline: feature engineering, feature storage, and model training. The foundation model handles all three internally.
Output
| customer_id | churn_90d | confidence |
|---|---|---|
| C-10042 | 0.82 | 0.93 |
| C-10043 | 0.14 | 0.91 |
| C-10044 | 0.67 | 0.88 |
| C-10045 | 0.03 | 0.96 |
3. Deploy. Predictions are available via API or written back to your data warehouse. No model serving infrastructure. No containerization. No Kubernetes.
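The three-step workflow above can be sketched schematically. The stub client below stands in for the real SDK purely to show the shape of the calls; `connect`, `predict`, and the canned response are hypothetical names, not the actual KumoRFM API.

```python
class StubRFMClient:
    """Stand-in for a foundation-model client. Names are hypothetical."""

    def connect(self, dsn):
        # A real client would read the schema here and map table
        # relationships automatically; the stub just records the DSN.
        self.dsn = dsn
        return self

    def predict(self, pql):
        # A real client would run the foundation model against the live
        # database; the stub returns a canned row to show the shape.
        assert pql.startswith("PREDICT")
        return [{"customer_id": "C-10042", "churn_90d": 0.82, "confidence": 0.93}]

client = StubRFMClient().connect("postgresql://warehouse/prod")
rows = client.predict("PREDICT churn_90d FOR EACH customers.customer_id")
```

The point of the sketch is what is absent: no feature computation, no feature store lookup, and no model artifact between the query and the scored rows.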
The entire pipeline, from connected database to production predictions, is measured in minutes. There is no feature pipeline to maintain. No retraining schedule to manage. No drift to monitor in a feature store that does not exist. When the underlying data changes, the model adapts automatically because it reads the data at prediction time.
What this means for ML teams
The implication is not that ML teams become unnecessary. It is that they spend their time differently. Instead of 60% on pipeline maintenance and 40% on new work, the ratio flips.
More predictions, faster. A team that previously shipped 3-4 new models per year (each taking 3-6 months to build and deploy) can now ship 50+ predictions per quarter. Each new prediction question is a query, not a project.
Higher-value work. Data scientists spend time on business problem framing, result interpretation, and stakeholder communication instead of writing SQL joins and debugging feature pipelines.
Lower infrastructure cost. Eliminating feature stores, training clusters, orchestration systems, and monitoring infrastructure reduces cloud spend by 40-70% for prediction workloads.
The complexity of ML pipelines is not inherent to machine learning. It is an artifact of the gap between relational data and flat-table models. Close the gap with a model that learns directly from relational data, and 90% of the pipeline evaporates. What remains is the part that creates value: asking the right questions and acting on the answers.