In 2019, VentureBeat reported that 87% of data science projects never make it to production. In the years since, the industry has invested billions in MLOps platforms and feature stores, hired ML engineers, and deployed Kubernetes clusters. The needle has barely moved.
The problem was never the model. A competent data scientist can train a model that works in a Jupyter notebook in a day or two. The problem is everything that happens between "the model works in my notebook" and "the model runs in production serving 10 million predictions per day." That gap is 6 to 12 months of engineering. That gap is MLOps.
What MLOps actually involves
MLOps is the discipline of running machine learning in production. It sounds like DevOps for ML, and the analogy is correct in one important way: just as DevOps added enormous infrastructure complexity to software development, MLOps adds enormous infrastructure complexity to data science.
A typical production ML pipeline has 6 to 8 stages. Each one requires different tools, different expertise, and often a different team.
Stage 1: Data ingestion
Raw data lives in data warehouses (Snowflake, BigQuery, Redshift), operational databases (Postgres, MySQL), event streams (Kafka), and third-party APIs. An ingestion pipeline extracts data from these sources, handles schema changes, and lands it in a staging area. Tools: Airflow, dbt, Fivetran, custom Spark jobs.
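The main operational hazard at this stage is the silent upstream schema change. A minimal sketch of a schema-aware ingestion step, stdlib only; the column names are illustrative and the staging "area" is just a list, standing in for a real warehouse table:

```python
# Schema-aware ingestion sketch. Column names are illustrative,
# not taken from any real warehouse.

EXPECTED_COLUMNS = {"user_id", "amount", "created_at"}

def validate_schema(rows):
    """Fail fast when an upstream schema change would poison downstream stages."""
    for row in rows:
        extra = set(row) - EXPECTED_COLUMNS
        missing = EXPECTED_COLUMNS - set(row)
        if extra or missing:
            raise ValueError(f"schema drift: extra={extra}, missing={missing}")

def land(rows, staging):
    """Validate a batch, then append it to the staging area."""
    validate_schema(rows)
    staging.extend(rows)
    return len(rows)
```

Real pipelines do this with Airflow sensors or dbt tests, but the contract is the same: reject the batch before it lands, not after a model trains on it.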
Stage 2: Feature engineering
A data scientist writes SQL and Python to transform raw tables into a flat feature matrix. This stage alone consumes 80% of the total project time. A Stanford study measured it at 12.3 hours and 878 lines of code per prediction task. For a production model, multiply that by the number of iterations needed to reach acceptable accuracy.
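To make those line counts concrete, here is what a single engineered feature looks like in plain Python. The feature name `avg_payment_amount_30d` and the table shape are illustrative; real pipelines compute hundreds of these, usually in SQL:

```python
# One hand-rolled feature: average payment per user over a trailing
# 30-day window. Stage 2 produces code like this by the hundreds of lines.
from collections import defaultdict
from datetime import date, timedelta

def avg_payment_amount_30d(payments, as_of):
    """Average payment amount per user over the 30 days ending at as_of."""
    cutoff = as_of - timedelta(days=30)
    totals, counts = defaultdict(float), defaultdict(int)
    for p in payments:
        if cutoff < p["paid_at"] <= as_of:
            totals[p["user_id"]] += p["amount"]
            counts[p["user_id"]] += 1
    return {u: totals[u] / counts[u] for u in totals}
```

Every iteration on the model means revisiting code like this: new windows, new aggregations, new joins.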
Stage 3: Feature store
Features need to be computed consistently for training and serving. The feature store (Tecton, Feast, Hopsworks) manages this: storing historical feature values for training, serving fresh features at low latency for inference, and ensuring no data leakage between training and serving. Setting up a feature store is a platform engineering project that takes 2 to 6 months.
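The leakage guarantee is the subtle part: a training read must only see feature values that were known at the label's timestamp. A toy in-memory sketch with a hypothetical API (production stores like Feast and Tecton do far more, but this is the core invariant):

```python
# Toy feature store illustrating point-in-time correctness: reads at
# timestamp ts never return values written after ts, which is what
# prevents future information leaking into training data.
import bisect

class FeatureStore:
    def __init__(self):
        # (entity, feature) -> time-sorted list of (ts, value)
        self._log = {}

    def write(self, entity, feature, ts, value):
        log = self._log.setdefault((entity, feature), [])
        log.append((ts, value))
        log.sort()

    def read_as_of(self, entity, feature, ts):
        """Latest value written at or before ts; None if nothing existed yet."""
        log = self._log.get((entity, feature), [])
        i = bisect.bisect_right(log, (ts, float("inf")))
        return log[i - 1][1] if i else None
```

Serving uses the same `read_as_of` with `ts = now`, which is how training/serving consistency is achieved.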
Stage 4: Model training
Train the model on historical data. Tune hyperparameters. Run cross-validation. Compare model architectures. This is what most people think of as "doing ML." It is typically 10-15% of the total project time. Tools: PyTorch, TensorFlow, XGBoost, SageMaker Training, Vertex AI Training.
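The shape of this stage is a time-based split plus a model-selection loop. A dependency-free sketch where the "model" is a trivial score threshold standing in for an XGBoost or PyTorch candidate:

```python
# Stage 4 in miniature: split by time (never randomly, to avoid leakage),
# pick the best candidate on training data, report holdout performance.
# The threshold "model" is a stand-in for real architectures.

def time_split(rows, cutoff):
    train = [r for r in rows if r["ts"] < cutoff]
    holdout = [r for r in rows if r["ts"] >= cutoff]
    return train, holdout

def accuracy(threshold, rows):
    hits = sum((r["score"] >= threshold) == r["label"] for r in rows)
    return hits / len(rows)

def select_model(rows, cutoff, candidates):
    train, holdout = time_split(rows, cutoff)
    best = max(candidates, key=lambda t: accuracy(t, train))  # "training"
    return best, accuracy(best, holdout)                      # holdout check
```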
Stage 5: Model validation and registry
Before deployment, the model goes through validation: checking for bias, verifying performance on holdout data, running A/B tests against the current production model, getting sign-off from the model risk team. Approved models are versioned in a model registry (MLflow, Weights & Biases). In regulated industries, this stage can take weeks.
Stage 6: Serving infrastructure
The model needs to run somewhere that can handle production traffic. Batch predictions might run on Spark. Real-time predictions need a low-latency serving endpoint (SageMaker Endpoints, Vertex AI Prediction, Seldon, BentoML). The serving layer needs to fetch fresh features from the feature store, run inference, and return predictions within an SLA (often under 100ms).
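Stripped of the infrastructure, the serving path is three steps with a deadline. A sketch with stand-in feature-fetch and predict callables; the 100 ms SLA matches the figure above:

```python
# Serving-layer sketch: fetch fresh features, run inference, enforce the
# latency SLA. The feature source and model are stand-in callables.
import time

SLA_SECONDS = 0.100  # 100 ms, per the SLA mentioned in the text

def serve(user_id, fetch_features, predict):
    start = time.perf_counter()
    features = fetch_features(user_id)   # feature store lookup
    score = predict(features)            # model inference
    elapsed = time.perf_counter() - start
    if elapsed > SLA_SECONDS:
        raise TimeoutError(f"SLA breach: {elapsed * 1000:.1f} ms")
    return {"user_id": user_id, "score": score, "latency_s": elapsed}
```

In practice the feature fetch dominates the budget, which is why the serving layer and feature store are so tightly coupled.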
Stage 7: Monitoring
Models degrade over time as the data distribution shifts. A monitoring system tracks prediction quality, detects data drift and concept drift, and alerts the team when the model needs retraining. Tools: Evidently, WhyLabs, Arize, custom dashboards.
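A common drift metric is the population stability index (PSI), comparing a feature's training distribution to its live distribution. A sketch, with the usual rule-of-thumb thresholds (~0.1 warn, ~0.25 alert) noted as conventions rather than standards:

```python
# Data-drift check via population stability index (PSI) over fixed,
# half-open bins [lo, hi). Higher PSI means the live distribution has
# moved further from the training distribution.
import math

def psi(expected, actual, bins):
    def fractions(values):
        counts = [0] * (len(bins) - 1)
        for v in values:
            for i in range(len(bins) - 1):
                if bins[i] <= v < bins[i + 1]:
                    counts[i] += 1
                    break
        total = max(sum(counts), 1)
        return [max(c / total, 1e-6) for c in counts]  # avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Tools like Evidently and WhyLabs compute this (and richer statistics) per feature, per time window.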
Stage 8: Retraining
When model performance degrades, the entire pipeline runs again: ingest fresh data, recompute features, retrain the model, validate, deploy. Automated retraining pipelines are the holy grail of MLOps and the hardest thing to get right, because any change in the data schema or feature logic can break the pipeline silently.
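The trigger for that full re-run is usually a small policy function buried in an orchestrator. A sketch with illustrative thresholds:

```python
# Retraining-trigger sketch: the policy most teams end up encoding in an
# Airflow DAG or cron job. Threshold values are illustrative.

def should_retrain(holdout_accuracy, drift_score,
                   min_accuracy=0.80, max_drift=0.25):
    """Trigger the full pipeline when quality drops or the data shifts."""
    if holdout_accuracy < min_accuracy:
        return True, "accuracy below floor"
    if drift_score > max_drift:
        return True, "data drift above threshold"
    return False, "model healthy"
```

The function is trivial; what it triggers is not. Each `True` kicks off the entire 6-to-8-stage pipeline again.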
Here is what the pipeline looks like in practice. A fintech company tracks every stage of their churn prediction model.
pipeline_runs
| run_id | model | stage | started | duration | status |
|---|---|---|---|---|---|
| RUN-101 | churn_v3 | Data Ingestion | 2025-09-01 | 2h 14m | Success |
| RUN-101 | churn_v3 | Feature Engineering | 2025-09-01 | 18h 42m | Success |
| RUN-101 | churn_v3 | Feature Store Sync | 2025-09-02 | 3h 08m | Success |
| RUN-101 | churn_v3 | Model Training | 2025-09-02 | 1h 23m | Success |
| RUN-101 | churn_v3 | Validation | 2025-09-03 | 4d (waiting) | Pending |
| RUN-101 | churn_v3 | Deployment | --- | --- | Blocked |
Two things stand out: feature engineering took 18+ hours, the longest compute stage, and validation has been pending for 4 days waiting on model risk review, which blocks deployment.
pipeline_costs
| stage | compute_cost | labor_hours | labor_cost | total |
|---|---|---|---|---|
| Data Ingestion | $42 | 4h | $600 | $642 |
| Feature Engineering | $187 | 40h | $6,000 | $6,187 |
| Feature Store | $95 | 16h | $2,400 | $2,495 |
| Training | $124 | 8h | $1,200 | $1,324 |
| Validation | $0 | 24h | $3,600 | $3,600 |
| Deployment | $68 | 12h | $1,800 | $1,868 |
| **Total** | **$516** | **104h** | **$15,600** | **$16,116** |
Feature engineering dominates both compute and labor cost. The total pipeline cost for one model version: $16,116. For a team running 4-5 models, this repeats quarterly.
pipeline_failures (last 6 months)
| date | model | stage | failure_reason | time_lost |
|---|---|---|---|---|
| 2025-07-14 | churn_v2 | Feature Eng. | Schema change in payments table | 3 days |
| 2025-08-02 | fraud_v4 | Feature Store | Feast version incompatibility | 2 days |
| 2025-08-19 | ltv_v1 | Training | GPU OOM on new feature set | 1 day |
| 2025-09-05 | churn_v3 | Deployment | Serving latency exceeded SLA | 4 days |
An upstream schema change broke the feature pipeline for 3 days, and a serving latency issue blocked deployment for 4 days. Cascading failures like these are the norm.
Why the pipeline kills projects
The pipeline does not fail in one place. It fails everywhere, slowly.
The dependency chain
Each stage depends on the output of the previous stage. Here is a concrete trace of one schema change cascading through the pipeline.
cascade_impact: payments table renamed 'amount' to 'payment_amount'
| stage | impact | fix_required | time_to_fix |
|---|---|---|---|
| Feature Engineering | SQL query referencing payments.amount fails | Update 14 SQL queries | 4 hours |
| Feature Store | Feature 'avg_payment_amount_30d' stops computing | Update Feast definition + backfill | 6 hours |
| Model Training | Training job fails: expected column missing | Retrigger after feature store fix | 2 hours |
| Serving | Live predictions return stale features from cache | Flush cache + redeploy | 3 hours |
| Monitoring | Data drift alert fires on missing feature | Update monitoring config | 1 hour |
One column rename in one upstream table cascaded through 5 pipeline stages, requiring 16 hours of combined engineering time across 3 different teams. This is a routine schema change.
If the feature logic changes, the feature store needs updating. If the feature store schema changes, the serving layer breaks. As the trace above shows, a single upstream change can cascade through five stages of pipeline code. Teams that have been through this call it "pipeline debt."
The iteration penalty
When the model does not perform well enough, the data scientist goes back to Stage 2 and engineers more features. Each iteration takes days. After 3 to 5 iterations, the project is months old and has consumed significant engineering time. Many projects are cancelled at this point, not because the approach was wrong, but because the timeline exceeded the business window.
The last mile problem
A model that works in a notebook is not a model that works in production. The gap between "it works on my laptop" and "it serves predictions reliably at scale" is where most projects die. The model needs containerization, API wrapping, load balancing, failover handling, latency optimization, and integration with the feature store. This is pure infrastructure engineering with zero data science value-add.
Traditional MLOps
- 6-8 pipeline stages, each requiring different tools
- 5 teams with 5 different backlogs and priorities
- Feature engineering: 12.3 hours per task, repeated per iteration
- Feature store setup: 2-6 months of platform engineering
- Time to production: 6-12 months per model
- Retraining requires re-running the entire pipeline
Foundation model approach
- One inference call replaces 6 pipeline stages
- One interface (PQL) for any prediction task
- No feature engineering: model reads raw tables directly
- No feature store: no features to store
- Time to production: minutes (write a PQL query)
- No retraining: foundation model is pre-trained and updated centrally
The tools people buy (and the problem they do not solve)
The MLOps tool landscape is enormous. MLflow, Kubeflow, SageMaker, Vertex AI, Databricks ML, Tecton, Feast, Weights & Biases, Seldon, BentoML, Evidently, Arize. Each tool solves one stage of the pipeline. None of them eliminate the pipeline.
This is the core issue. The MLOps ecosystem optimizes a fundamentally over-engineered process instead of questioning whether the process should exist. Every dollar spent on a feature store is a dollar spent managing the output of feature engineering. If you eliminate feature engineering, the feature store is unnecessary. Every hour spent building retraining pipelines is an hour spent re-running a process that a foundation model makes irrelevant.
The MLOps industry is worth an estimated $2.4 billion in 2024, projected to reach $13.3 billion by 2030. Most of that spend is managing complexity that foundation models remove.
How foundation models collapse the stack
A relational foundation model like KumoRFM eliminates most of the MLOps pipeline by removing the stages that create the complexity.
Feature engineering: eliminated. The model reads raw relational tables directly. It represents the database as a temporal graph and learns predictive patterns from the graph structure. No SQL joins. No aggregations. No feature iteration cycles.
Feature store: eliminated. With no engineered features, there is nothing to store, version, or serve separately. The model computes everything it needs from raw data at inference time.
Training pipeline: eliminated (for most use cases). The model is pre-trained on billions of relational patterns across thousands of databases. For a new prediction task, you do not train a new model. You write a PQL query and run zero-shot inference. If you need higher accuracy on a specific task, fine-tuning takes minutes, not months.
Model registry: simplified. Instead of managing dozens of task-specific models, each with its own version history and dependencies, you have one foundation model. Version management happens at the platform level, not the project level.
Serving infrastructure: abstracted. The foundation model platform handles serving, scaling, and latency optimization. Your team writes PQL queries. The platform runs them.
PQL Query
PREDICT COUNT(sessions.*, 0, 30) = 0 FOR EACH users.user_id
This single line replaces the entire 6-stage pipeline shown above. No data ingestion pipeline, no feature engineering, no feature store, no training job, no model registry. The foundation model reads the raw tables and returns predictions.
Output
| user_id | churn_probability | time_to_predict |
|---|---|---|
| U-1001 | 0.73 | 0.8s |
| U-1002 | 0.12 | 0.8s |
| U-1003 | 0.91 | 0.8s |
| U-1004 | 0.44 | 0.8s |
PQL: the interface that replaces the pipeline
Predictive Query Language (PQL) is to ML what SQL was to data retrieval. SQL meant you no longer had to write custom code to read data from disk. PQL means you no longer have to build custom pipelines to generate predictions from data.
A PQL query looks like this: "For each customer, what is the probability of churn in the next 30 days?" The foundation model translates this into a graph traversal over the relational data, runs inference, and returns predictions with explanations. One line. One second. No pipeline.
The shift is from imperative (build a pipeline that produces predictions) to declarative (ask a question and get predictions). The same shift that SQL brought to data access in the 1970s.
What stays
Foundation models do not eliminate all operational concerns. Data quality still matters. Garbage in, garbage out, regardless of how sophisticated the model is. Access controls, data governance, and compliance requirements remain. Monitoring prediction quality over time is still necessary, though the mechanism changes: instead of monitoring model drift, you monitor data quality and prediction calibration.
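Calibration monitoring is straightforward to state: bucket predicted probabilities and compare each bucket's mean prediction to the observed outcome rate. A sketch (bucket count and the interpretation of "small gap" are judgment calls, not fixed standards):

```python
# Calibration check: a well-calibrated model's predicted probabilities
# should match observed outcome rates within each probability bucket.

def calibration_gaps(preds, outcomes, n_buckets=5):
    """Per-bucket |mean predicted probability - observed positive rate|."""
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)
        buckets[i].append((p, y))
    gaps = {}
    for i, b in enumerate(buckets):
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            gaps[i] = abs(mean_p - rate)
    return gaps
```

Growing gaps over time are the foundation-model analogue of the drift alert in the traditional pipeline.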
The difference is the scale of the operational burden. Instead of managing a 6-to-8-stage pipeline with 5 teams, you manage a data connection and a query interface. The operational complexity drops by an order of magnitude.
If your organization has spent the last three years building an MLOps platform, the investment was not wasted. The data engineering foundations (clean pipelines, governed data, reliable infrastructure) transfer directly. What changes is everything downstream of the raw data: the feature engineering, the feature store, the training pipeline, the model registry, the serving layer. All of that collapses into a single foundation model query.
The 87% of models that never reach production are not failing because data scientists cannot build good models. They are failing because the pipeline between "good model" and "production model" is too long, too fragile, and too expensive. Remove the pipeline, and the 87% starts to look very different.