The real reason feature engineering takes so long
If you have spent time in enterprise ML, you already know the statistic: feature engineering consumes roughly 80% of data science project time. But the usual explanation ("it is tedious") misses the structural reason it is so expensive.
The problem is relational data. A typical enterprise does not store customer behavior in one table. It stores it across 5-50 interconnected tables: customers, orders, products, interactions, support tickets, payments, subscriptions, events. Each table connects to others through foreign keys. The relationships between tables contain the most predictive signals.
But traditional ML models (XGBoost, LightGBM, random forests, neural networks) cannot read relational databases. They require a single flat table with one row per entity. So before you can train any model, you must collapse your entire relational database into that flat structure.
This is where the time goes. Not in model training. Not in hyperparameter tuning. In the flattening.
What flattening actually requires
Here is what a data scientist does for every prediction task on relational data:
- Write SQL joins across 5-15 tables with correct temporal constraints (no data leakage). For a churn prediction task, this means joining customers to orders to products to support tickets to payments, all filtered to the correct time windows. Easily 100-300 lines of SQL.
- Compute cross-table aggregations like avg_order_value_last_90d, support_tickets_last_30d, and product_return_rate_by_category. Each one is a hypothesis about what might matter. Each one requires careful implementation.
- Engineer temporal features across table boundaries: purchase frequency trends, support escalation patterns, engagement velocity changes. These require window functions spanning multiple joined tables.
- Iterate 3-4 times when the first model underperforms. Go back, hypothesize new features, implement them, retrain. Each cycle takes hours.
- Maintain the pipeline in production. When schemas change, when new data sources appear, when business logic shifts, the feature pipeline breaks and must be updated.
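The first two steps above can be sketched in miniature. The following is an illustrative, stdlib-only Python sketch (toy data, hypothetical column names, not a production pipeline) of a single hand-built cross-table feature, avg_order_value_last_90d, including the temporal cutoff that prevents leakage:

```python
from datetime import date, timedelta

# Toy "orders" rows; in practice this is the result of SQL joins
# across customers, orders, payments, and so on.
orders = [
    {"customer_id": "C-1", "order_date": date(2024, 11, 2), "value": 40.0},
    {"customer_id": "C-1", "order_date": date(2024, 12, 15), "value": 60.0},
    {"customer_id": "C-1", "order_date": date(2023, 1, 5), "value": 500.0},  # outside the window
    {"customer_id": "C-2", "order_date": date(2024, 12, 20), "value": 25.0},
]

def avg_order_value_last_90d(orders, cutoff):
    """One hand-built feature. The cutoff filter is the temporal
    constraint: only orders strictly before the prediction date count,
    otherwise the feature leaks future information."""
    window_start = cutoff - timedelta(days=90)
    totals = {}
    for o in orders:
        if window_start <= o["order_date"] < cutoff:
            cid = o["customer_id"]
            s, n = totals.get(cid, (0.0, 0))
            totals[cid] = (s + o["value"], n + 1)
    return {cid: s / n for cid, (s, n) in totals.items()}

features = avg_order_value_last_90d(orders, cutoff=date(2025, 1, 1))
print(features)  # {'C-1': 50.0, 'C-2': 25.0}
```

This is one feature, for one window, over one table pair. A real churn pipeline repeats this pattern dozens of times, each with its own cutoff logic, which is where the hundreds of lines of SQL come from.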
The deeper problem: you are exploring 4-17% of the feature space
Time is not the only cost. The bigger issue is coverage.
When a data scientist builds features, they start with hypotheses: "recency of last purchase probably matters," "support ticket count probably correlates with churn," "high-value customers probably behave differently." These are educated guesses. Good ones. But guesses.
The number of possible features from a relational database grows combinatorially. Consider just the aggregation options: for each pair of tables, you can compute count, sum, average, min, max, standard deviation, and trend across dozens of columns, over multiple time windows (7 days, 30 days, 90 days, 365 days), with various filters and groupings. Add multi-hop relationships (customer → orders → products → other customers who bought the same products → their churn rates), and the space becomes enormous.
A data scientist working 12.3 hours per task explores a tiny fraction of this space. Research on automated feature generation suggests that manual approaches typically cover only 4-17% of the feasible feature space. That means 83-96% of potentially predictive patterns are never tested.
Three approaches to the feature engineering problem
The industry has developed three distinct approaches, and the differences between them matter more than most comparisons acknowledge.
1. Manual feature engineering (XGBoost + hand-crafted features)
This is the traditional approach. A data scientist writes SQL, computes aggregations, builds a flat table, and trains a model (typically XGBoost or LightGBM). It works. It has worked for years. But it costs 12.3 hours and 878 lines of code per task, explores only a fraction of the feature space, and creates brittle pipelines that require ongoing maintenance.
- Best for: Teams with strong data science talent who need full control over every feature, or regulatory environments that require every feature to be explicitly defined and auditable.
- Watch out for: Only explores 4-17% of the possible feature space. Costs 12.3 hours per task. Creates brittle pipelines that break when schemas change. Does not scale beyond a handful of prediction tasks without a large team.
2. Automated feature engineering (Featuretools, DataRobot, H2O)
Tools like Featuretools use deep feature synthesis to automatically generate features from relational data. DataRobot and H2O Driverless AI automate single-table feature generation as part of their AutoML pipelines. These tools genuinely reduce the manual effort. Featuretools can generate hundreds of features from multiple tables in minutes instead of hours.
But here is the critical point: they still produce a flat table. They automate the flattening process. The output is still one row per entity with columns representing aggregated features. The model still trains on a single table. The relational structure is still lost.
- Best for: Teams that want to speed up existing workflows without changing their approach, or organizations already invested in an AutoML platform that need broader feature coverage than manual engineering provides.
- Watch out for: Still produces a flat table as output, so the relational structure is lost. Limited to predefined aggregation primitives. Cannot discover multi-hop relational patterns. Platform licensing adds $150K-$250K per year.
3. Eliminate feature engineering (KumoRFM)
KumoRFM is a relational foundation model. It does not generate features. It does not flatten tables. It reads raw relational tables connected by foreign keys and learns predictive patterns directly from the relational structure. The model ingests the tables as they exist in your data warehouse, preserves every relationship, and discovers patterns that span multiple tables and multiple hops.
This is not a faster version of feature engineering. It is a different approach entirely. No flat table is ever created. No features are ever enumerated. The model learns what matters from the raw data.
- Best for: Organizations with relational data (5-50 tables) where feature engineering is the bottleneck, teams that need to scale from 1 to 20+ prediction tasks without scaling headcount, and any situation where speed to production is a competitive advantage.
- Watch out for: Newer paradigm with less industry history than XGBoost-based workflows. If your data is genuinely single-table and already flat, the relational advantage is smaller.
The three approaches compared
| Dimension | Manual (XGBoost) | Automated (Featuretools/DataRobot) | Eliminated (KumoRFM) |
|---|---|---|---|
| Feature engineering effort | 12.3 hours + 878 lines of code per task | Minutes of configuration, automated generation | Zero. No features are created. |
| Data input | Hand-built flat table (SQL joins) | Relational tables (Featuretools) or flat table (DataRobot/H2O) | Raw relational tables connected by foreign keys |
| Feature space explored | 4-17% (manual hypothesis-driven) | Broader than manual, but limited to predefined primitives | Full relational structure. No enumeration needed. |
| Multi-hop patterns | Rarely. Too expensive to implement manually. | Limited. Depth restricted by computational cost. | Native. Model traverses full relational graph. |
| Output format | Flat table with one row per entity | Flat table with one row per entity | Predictions directly. No intermediate table. |
| Pipeline maintenance | High. Feature code breaks when schemas change. | Medium. Automated pipelines still need updates. | None. Model reads raw tables as they are. |
| Time to first prediction | Weeks (feature engineering + model training) | Days (setup + automated generation + training) | ~1 second (zero-shot) to minutes (fine-tuned) |
| RelBench AUROC | 62.44 | ~64-66 (AutoML + manual features) | 76.71 zero-shot, 81.14 fine-tuned |
The accuracy gap between the automated and eliminated approaches is more than 10 AUROC points. This gap comes from relational patterns that flat-table approaches cannot represent, regardless of how the features are generated.
Why automation is not enough
The distinction between automating and eliminating feature engineering is the most important point in this article, so let me be direct about it.
Featuretools, DataRobot, and H2O Driverless AI are real improvements over manual feature engineering. They reduce the time from hours to minutes. They generate more features than a human would think to test. They are legitimate tools that solve a real problem.
But they still flatten. And flattening is lossy. When you collapse a customer's order history into avg_order_value = $47.30 and order_count = 12, you lose the sequence. You lose the fact that order values have been declining for three months. You lose the fact that the last two orders were returns. You lose the fact that this customer's purchase pattern matches other customers who churned.
Automated tools generate more aggregations, but they are still aggregations. They describe the relational structure using summary statistics instead of preserving it.
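The lossiness is easy to demonstrate. A sketch with invented numbers: two customers with the identical average order value, one stable and one in steep decline. The summary statistic cannot tell them apart; the sequence can.

```python
from statistics import mean

# Invented order-value sequences, oldest to newest.
stable    = [47, 48, 46, 48, 47, 48]   # healthy customer
declining = [80, 70, 55, 40, 25, 14]   # likely churner

for name, seq in [("stable", stable), ("declining", declining)]:
    # The flat-table feature: both customers look identical.
    avg = mean(seq)
    # The sequence signal a temporal model can use.
    recent_trend = mean(seq[-3:]) - mean(seq[:3])
    print(name, round(avg, 1), round(recent_trend, 1))
# stable 47.3 0.7
# declining 47.3 -42.0
```

Both rows carry avg_order_value = 47.3 into the flat table. The trend only survives if someone thinks to engineer it, which is exactly the point.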
What flattening loses (churn prediction example)
| Signal | Available in flat table | Available in relational model |
|---|---|---|
| Average order value | Yes (single number: $47.30) | Yes, plus the full trajectory over time |
| Order value trending down | Only if someone engineers a trend feature | Yes, learned automatically from the sequence |
| Support tickets increasing while purchases decrease | Only if cross-table trend is manually computed | Yes, cross-table temporal pattern detected natively |
| Similar customers churned after same pattern | No. Requires cross-entity joins rarely attempted. | Yes. Multi-hop pattern: customer → products → other customers → outcomes |
| Product category engagement shifting | Only if category-level aggregations are built | Yes. Full product interaction history preserved. |
| Account-level multi-user behavior | Aggregated to single row. Individual patterns lost. | Each user's behavior preserved with account relationships. |
Automated feature engineering tools would generate the first two or three signals. The bottom three require multi-hop relational reasoning that flat-table approaches do not attempt.
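The multi-hop row can be made concrete. A stdlib-only sketch (toy data, hypothetical IDs) of the pattern customer → products → other customers who bought the same products → their churn outcomes, the kind of traversal a flat table cannot express without a self-join that is rarely attempted:

```python
# Toy relational data: (customer, product) purchases and known churn labels.
purchases = [
    ("C-1", "P-a"), ("C-1", "P-b"),
    ("C-2", "P-a"), ("C-3", "P-b"),
    ("C-4", "P-c"),
]
churned = {"C-2": True, "C-3": True, "C-4": False}

def cobuyer_churn_rate(customer):
    """Two-hop signal: churn rate among other customers who bought
    any of the same products. In a flat table this needs a join of
    the purchases table to itself, then to the labels."""
    my_products = {p for c, p in purchases if c == customer}
    cobuyers = {c for c, p in purchases if p in my_products and c != customer}
    if not cobuyers:
        return None
    return sum(churned.get(c, False) for c in cobuyers) / len(cobuyers)

print(cobuyer_churn_rate("C-1"))  # 1.0: both co-buyers of C-1's products churned
```

A relational model traverses this graph natively; a feature pipeline has to decide in advance that this particular path is worth materializing.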
The benchmark evidence
Two independent benchmarks quantify the difference between these approaches on real enterprise data.
SAP SALT enterprise benchmark
The SAP SALT benchmark tests prediction accuracy on production-quality enterprise databases with multiple related tables. Real business analysts and data scientists attempt the same prediction tasks.
SAP SALT benchmark results
| Approach | Accuracy | Feature engineering required |
|---|---|---|
| LLM + AutoML | 63% | Automated (LLM generates features, AutoML selects model) |
| PhD Data Scientist + XGBoost | 75% | Weeks of manual feature engineering by experts |
| KumoRFM (zero-shot) | 91% | None. Zero feature engineering. Zero training. |
The headline result: KumoRFM outperforms expert data scientists by 16 percentage points with zero feature engineering and zero training time. The LLM + AutoML approach, which represents automated feature engineering, scores lowest.
The 63% score for LLM + AutoML is particularly telling. This is the automated approach: a language model generates feature engineering code, an AutoML system selects and tunes the model. It should be faster and more consistent than manual work. But it scores 12 points lower than a PhD data scientist doing it by hand, because automation without understanding produces worse features, not better ones.
KumoRFM sidesteps the problem entirely. It does not try to generate better features. It reads the relational data directly. The 91% score represents what happens when you stop summarizing relational structure and start learning from it.
Stanford RelBench benchmark
RelBench provides a standardized evaluation across 7 databases, 30 prediction tasks, and 103 million rows. It was designed specifically to test ML approaches on relational data.
RelBench benchmark results
| Approach | AUROC | Feature engineering time | Lines of code |
|---|---|---|---|
| LightGBM + manual features | 62.44 | 12.3 hours per task | 878 |
| AutoML + manual features | ~64-66 | reduced time per task | 878 |
| KumoRFM zero-shot | 76.71 | ~1 second | 0 |
| KumoRFM fine-tuned | 81.14 | Minutes | 0 |
KumoRFM zero-shot outperforms the manual and AutoML approaches by more than 10 AUROC points, and fine-tuned KumoRFM reaches 81.14, with zero lines of feature engineering code in both cases.
The jump from 62.44 to ~64-66 is what AutoML buys you: better model selection on the same features. The jump from ~64-66 to 76.71 is what elimination buys you: patterns that exist in the relational structure but never made it into any flat table. That second gap is 5x larger than the first.
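The "5x" claim is simple arithmetic, taking the midpoint of the ~64-66 AutoML range as an assumption:

```python
manual = 62.44     # LightGBM + manual features (RelBench AUROC)
automl = 65.0      # assumed midpoint of the ~64-66 AutoML range
zero_shot = 76.71  # KumoRFM zero-shot

automation_gain = automl - manual       # what better model selection buys
elimination_gain = zero_shot - automl   # what removing the flat table buys
print(round(automation_gain, 2), round(elimination_gain, 2),
      round(elimination_gain / automation_gain, 1))
# 2.56 11.71 4.6
```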
What this looks like in practice
Traditional workflow (manual or automated)
- Identify prediction task (e.g., 90-day churn for enterprise accounts)
- Data scientist writes SQL joins across 5-15 tables (2-4 hours)
- Compute cross-table aggregations and temporal features (4-6 hours)
- Build flat feature table with one row per customer
- Train model (XGBoost/LightGBM or AutoML platform)
- Evaluate. Underperforming? Go back to step 2. Repeat 3-4 times.
- Deploy model + maintain feature pipeline ongoing
- Total: 2-6 weeks to first production prediction
KumoRFM workflow
- Connect Kumo to your data warehouse (one-time, 30 minutes)
- Write a PQL query: PREDICT churn_90d FOR EACH customer_id
- KumoRFM reads raw tables, discovers patterns, returns predictions
- No SQL joins. No aggregations. No flat table. No feature iteration.
- Time to first prediction: ~1 second (zero-shot)
- Fine-tune for task-specific accuracy: minutes, not weeks
- No feature pipeline to maintain. Ever.
- Total: minutes to first production prediction
PQL Query
PREDICT churn_90d FOR EACH customers.customer_id WHERE customers.segment = 'enterprise' AND customers.contract_value > 50000
This single PQL query replaces the entire feature engineering pipeline. No SQL joins across tables. No aggregation logic. No feature iteration cycles. KumoRFM reads the raw customers, orders, products, support_tickets, and payments tables directly and discovers the predictive patterns itself.
Output
| customer_id | churn_probability | top_signal |
|---|---|---|
| C-4401 | 0.87 | Declining order frequency + rising support escalations |
| C-4402 | 0.12 | Stable multi-department usage, recent contract expansion |
| C-4403 | 0.93 | Similar accounts churned after same engagement drop pattern |
| C-4404 | 0.08 | Increasing product adoption, 3 new integrations this month |
The cost of continuing to do feature engineering
The time cost is obvious. But the compounding costs are what make feature engineering truly expensive at scale.
Annual cost of feature engineering (20 prediction tasks)
| Cost dimension | Manual approach | Automated approach | Eliminated (KumoRFM) |
|---|---|---|---|
| Feature engineering labor | 246 hours ($61,500) | ~80 hours ($20,000) | 0 hours ($0) |
| Data science team for pipelines | 3-4 FTEs ($450K-$600K) | 2-3 FTEs ($300K-$450K) | 0.5 FTE ($75K) |
| Pipeline maintenance (annual) | 520 hours ($130K) | 260 hours ($65K) | 20 hours ($5K) |
| Platform/tool licensing | $0 (open-source models) | $150K-$250K (DataRobot/H2O) | $80K-$120K (Kumo) |
| Time to new prediction task | 2-6 weeks | 3-7 days | Minutes |
| Total annual cost | $650K-$800K | $535K-$785K | $160K-$200K |
Automation trims total cost only modestly; elimination cuts it by roughly 75%. The difference is that automation still requires a data science team for pipeline maintenance and feature iteration.
Notice that the automated approach is not dramatically cheaper than the manual approach. The tools cost $150K-$250K per year, and you still need 2-3 data scientists for the multi-table work that automation cannot handle. The savings are real but incremental.
Elimination is a step change. When there is no feature pipeline to build, maintain, or debug, the cost structure collapses. One ML engineer can operate 20 prediction tasks because the work is writing PQL queries, not maintaining SQL pipelines.
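As a sanity check on the table's line items, summing the manual and automated columns component by component (figures in $K per year, ranges taken at face value from the table):

```python
# Line items from the cost table above, in $K per year (20 tasks).
manual = {"labor": 61.5, "team": (450, 600), "maintenance": 130, "licensing": 0}
automated = {"labor": 20, "team": (300, 450), "maintenance": 65, "licensing": (150, 250)}

def total(column):
    """Sum single values and (low, high) ranges into a (low, high) total."""
    lo = hi = 0.0
    for v in column.values():
        a, b = v if isinstance(v, tuple) else (v, v)
        lo += a
        hi += b
    return lo, hi

print(total(manual))     # (641.5, 791.5): the table's $650K-$800K, rounded
print(total(automated))  # (535.0, 785.0): the table's $535K-$785K
```

The overlap between those two ranges is the quantitative version of the point above: automation changes who builds the flat table, not the cost structure around it.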
When each approach makes sense
To be direct about this: not every organization should switch to KumoRFM tomorrow.
- Manual feature engineering makes sense when your data is already in a single table, when you have a strong data science team that values full control, or when regulatory requirements demand that every feature be explicitly defined and auditable.
- Automated feature engineering (Featuretools, DataRobot) makes sense when you want to speed up existing workflows without changing your approach, when your team is already invested in an AutoML platform, or when you need the breadth of features that tools like Featuretools generate from relational data.
- Elimination (KumoRFM) makes sense when your data is relational (5-50 tables), when feature engineering is your bottleneck, when you need maximum accuracy on relational data, when you want to scale from 1 to 20+ prediction tasks without scaling your data science team, or when speed to production is a competitive advantage.