
Feature Engineering: Why It Takes 80% of Your Time and How to Skip It

Everyone knows feature engineering is a bottleneck. But few people can explain exactly why. Here's the structural reason it's so hard, what the alternatives actually look like, and why the problem is finally solvable.

TL;DR

  • Feature engineering consumes 80% of data science time. A Stanford study measured it: 12.3 hours and 878 lines of code per prediction task.
  • The bottleneck is not writing SQL. It is the combinatorial explosion of possible features across multiple tables, time windows, and aggregation functions. Humans explore less than 5% of the feature space.
  • Flattening relational data destroys three types of signal: multi-hop relationships (customer → product → category → other customers), temporal sequences (order acceleration vs deceleration), and graph topology (fraud ring structure).
  • AutoML, feature stores, and LLMs on tables all still require a flat feature table as input. They automate the last 20% of the pipeline.
  • Relational foundation models skip feature engineering entirely. KumoRFM reads raw relational tables, discovers cross-table patterns automatically, and delivers predictions in under 1 second with 1 line of PQL.

Here is a stat that should make every VP of Data Science uncomfortable. A Stanford-run study gave experienced data scientists a straightforward prediction task on a relational database. They had full access to the data, modern tools, and no time pressure. On average, each task took 12.3 hours and 878 lines of code. And 80% of that time was spent on one thing: feature engineering.

Not modeling. Not tuning hyperparameters. Not debugging infrastructure. Just figuring out which columns from which tables to combine, how to aggregate them, and what time windows to use. Over and over again, for every single prediction task.

This is not a tooling problem. It is a structural problem with how machine learning consumes data. And until you understand why, you will keep throwing people at it.

What feature engineering actually is

Feature engineering is the process of transforming raw data into inputs that a machine learning model can use. If you have a customers table and a transactions table and you want to predict churn, the model cannot just read both tables. It needs a single row per customer with columns like total_spend_last_90d, num_orders_last_30d, avg_order_value, days_since_last_order.

You are converting a rich, multi-table relational structure into a flat spreadsheet. One row per entity, every signal compressed into a number.
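As a minimal sketch of that flattening step, the snippet below compresses a toy orders list into one row of scalars per customer (values borrowed from the example later in the article; the reference date is an assumption). In practice this would be SQL against the warehouse, not Python dicts:

```python
from datetime import date
from collections import defaultdict

# Toy stand-in for the orders table; reference date is assumed for illustration.
orders = [
    {"customer_id": "C001", "amount": 89.00, "date": date(2025, 1, 5)},
    {"customer_id": "C001", "amount": 34.50, "date": date(2025, 2, 12)},
    {"customer_id": "C002", "amount": 89.00, "date": date(2024, 9, 14)},
]
today = date(2025, 3, 14)

by_customer = defaultdict(list)
for o in orders:
    by_customer[o["customer_id"]].append(o)

# Flattening: every customer's order history becomes one row of scalar features.
features = {}
for cid, rows in by_customer.items():
    amounts = [r["amount"] for r in rows]
    last_order = max(r["date"] for r in rows)
    features[cid] = {
        # orders strictly within the last 30 days
        "num_orders_last_30d": sum((today - r["date"]).days < 30 for r in rows),
        "avg_order_value": round(sum(amounts) / len(amounts), 2),
        "days_since_last_order": (today - last_order).days,
    }

print(features["C001"])
# {'num_orders_last_30d': 0, 'avg_order_value': 61.75, 'days_since_last_order': 30}
```

Everything the sequence of orders could tell you is now gone; only the three scalars survive.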

For simple cases, this is fine. Two tables, a handful of aggregations, a clear target variable. But enterprise databases are not simple. A typical production database has 10 to 50 interconnected tables. The RelBench benchmark, which is the standard for evaluating ML on relational databases, includes datasets ranging from 3 tables (Amazon product data) to 15 tables (clinical trial data) with up to 41 million rows.

Each join path is a decision. Each aggregation is a decision. Each time window is a decision. And every combination you do not try is a pattern you will never find.

A concrete example: follow the data

Let's make this tangible. Here is a simplified e-commerce database with three tables. The goal: predict which customers will churn in the next 30 days.

customers

| customer_id | name | signup_date | plan |
|---|---|---|---|
| C001 | Sarah Chen | 2024-01-15 | Premium |
| C002 | James Wilson | 2023-06-20 | Basic |
| C003 | Maria Lopez | 2024-03-01 | Premium |
| C004 | David Kim | 2023-11-10 | Basic |

orders

| order_id | customer_id | product_id | amount | date |
|---|---|---|---|---|
| O101 | C001 | P10 | $89.00 | 2025-01-05 |
| O102 | C001 | P22 | $34.50 | 2025-02-12 |
| O103 | C002 | P10 | $89.00 | 2024-09-14 |
| O104 | C003 | P15 | $120.00 | 2025-02-28 |
| O105 | C003 | P22 | $34.50 | 2025-03-01 |
| O106 | C003 | P10 | $89.00 | 2025-03-10 |
| O107 | C004 | P15 | $120.00 | 2024-12-01 |

Notice: C002 hasn't ordered in 6 months. C003 has 3 orders in 10 days. These temporal patterns matter.

support_tickets

| ticket_id | customer_id | category | date | resolved |
|---|---|---|---|---|
| T201 | C002 | Billing | 2024-10-03 | Yes |
| T202 | C002 | Cancellation | 2025-01-15 | No |
| T203 | C004 | Product Issue | 2025-01-20 | Yes |
| T204 | C004 | Product Issue | 2025-02-05 | No |

Highlighted rows: unresolved tickets are strong churn signals, especially 'Cancellation' type.

What feature engineering does to this data

To use XGBoost, you need to flatten these three tables into a single row per customer. A data scientist might write 50-100 lines of SQL to produce this:

feature_table (flattened)

| customer_id | order_count_30d | avg_order_value | days_since_last_order | ticket_count | plan |
|---|---|---|---|---|---|
| C001 | 0 | $61.75 | 30 | 0 | Premium |
| C002 | 0 | $89.00 | 178 | 2 | Basic |
| C003 | 3 | $81.17 | 4 | 0 | Premium |
| C004 | 0 | $120.00 | 103 | 2 | Basic |

This is what XGBoost sees. Notice what's missing.

Looks reasonable. But look at what got lost:

  • C002 filed a “Cancellation” ticket that is still unresolved. The flat table shows “ticket_count = 2” but does not distinguish between a billing question and an active cancellation request. That distinction is the difference between a routine inquiry and imminent churn.
  • C004's tickets are accelerating. Two product issues in 16 days, the second one unresolved. The flat table shows “ticket_count = 2” identical to C002. But the temporal pattern (accelerating complaints) carries different signal than a billing question followed by a cancellation.
  • C003's order burst. Three orders in 10 days. Is this a power user or someone making returns? The flat table says “order_count_30d = 3” with no sequence information.

What a foundation model sees instead

KumoRFM does not flatten this data. It represents it as a graph: customers are nodes, orders are nodes, support tickets are nodes. Foreign keys (customer_id) become edges. The model traverses the full graph structure, attending to ticket categories, temporal sequences, and cross-table patterns simultaneously.

PQL Query

PREDICT COUNT(orders.*, 0, 30, days) = 0
FOR EACH customers.customer_id

This single line replaces the entire feature engineering pipeline. The model automatically discovers which cross-table patterns predict churn.

Output

| customer_id | churn_probability | top_signal |
|---|---|---|
| C001 | 0.23 | Order recency declining |
| C002 | 0.91 | Unresolved cancellation ticket |
| C003 | 0.08 | High recent engagement |
| C004 | 0.67 | Accelerating support tickets |

The model correctly identifies C002 as highest risk (unresolved cancellation) and C004 as elevated risk (accelerating complaints). These signals existed in the raw data. Feature engineering destroyed them by compressing ticket category and temporal sequence into a single count.

Why it takes so long

The time cost of feature engineering is not about writing SQL. Any data scientist can write a JOIN. The cost comes from three structural problems that get exponentially worse as your database gets more complex.

1. The combinatorial explosion

Consider a database with 5 tables connected by foreign keys. For each prediction target, you need to decide: which tables to join, which columns to aggregate, what aggregation function to use (sum, count, average, max, min, distinct count), and what time window to apply (7 days, 30 days, 90 days, all time).

feature_space_explosion — 5-table e-commerce database

| Dimension | Options | Count |
|---|---|---|
| Tables to join (from customers) | orders, products, reviews, support, sessions | 5 |
| Join depth (hops) | 1-hop, 2-hop, 3-hop, 4-hop | 4 |
| Numeric columns per table | amount, quantity, rating, duration, etc. | ~8 avg |
| Aggregation functions | SUM, COUNT, AVG, MAX, MIN, DISTINCT_COUNT | 6 |
| Time windows | 7d, 30d, 60d, 90d, 180d, all-time | 6 |
| Total possible 1-hop features | 5 tables × 8 cols × 6 aggs × 6 windows | 1,440 |
| Total possible 2-hop features | 1,440 × 5 second-hop tables × 6 aggs | 43,200 |
| Total feature space (all hops) | 1-hop + 2-hop + 3-hop + 4-hop | ~1.2 million |

A human team explores 50 to 200 features out of 1.2 million possible combinations. That is less than 0.02% of the feature space. The multi-hop features (where the strongest signals often hide) are almost never explored.

They test the features they do build, discard the ones that do not help, engineer new ones based on domain intuition, and iterate. The Stanford study found that data scientists wrote an average of 878 lines of SQL and Python per task, with a standard deviation of 77 lines.
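The multipliers in the table above can be checked with a few lines of arithmetic. The schema sizes (5 tables, ~8 numeric columns, 6 aggregation functions, 6 time windows) follow the example; the 2-hop count assumes each 1-hop feature can be re-aggregated through each second-hop table:

```python
# Back-of-envelope reproduction of the feature-space counts.
# Schema sizes follow the 5-table example; real databases vary.
tables, cols, aggs, windows = 5, 8, 6, 6

one_hop = tables * cols * aggs * windows  # every (table, column, agg, window) combo
two_hop = one_hop * tables * aggs         # each 1-hop feature re-aggregated
                                          # through one of 5 second-hop tables

print(one_hop)   # 1440
print(two_hop)   # 43200
```

Each additional hop multiplies the count again, which is how the full space reaches seven figures while a human shortlist stays in the hundreds.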

2. Multi-hop relationships

The most predictive patterns often span multiple tables. A customer's churn risk might depend not just on their own purchase history, but on the return rates of the products they bought, the satisfaction scores of the brands they prefer, and the churn behavior of customers who bought similar products.

That is a 4-hop path: customer → orders → products → orders (of other customers) → customer status. No one writes this feature. Not because it is hard to write the SQL, but because no one thinks to look for it. The signal is invisible until you traverse the graph.
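The traversal above can be sketched in a few lines. The tiny tables and the `cochurn_rate` helper are hypothetical illustrations (a production version would be a chain of SQL joins), but they show the shape of a feature almost no one writes by hand:

```python
# Hypothetical mini-tables: (order_id, customer_id, product_id).
orders = [
    ("O1", "C001", "P10"),
    ("O2", "C002", "P10"),
    ("O3", "C003", "P22"),
]
churned = {"C002"}  # illustrative label set

def cochurn_rate(customer_id):
    """Fraction of 'neighbor' customers (bought the same products) who churned.
    This is the customer -> orders -> products -> orders -> customers path."""
    products = {p for _, c, p in orders if c == customer_id}
    neighbors = {c for _, c, p in orders if p in products and c != customer_id}
    return len(neighbors & churned) / len(neighbors) if neighbors else 0.0

print(cochurn_rate("C001"))  # C001's only product neighbor is C002, who churned
```

Writing this as SQL is mechanical; knowing to compute it at all is the hard part.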

3. Temporal patterns that aggregates destroy

A flat feature table might say "this user placed 5 orders in the last 30 days." But it cannot tell you whether those orders were evenly spaced, all in the first week, or accelerating in frequency. It cannot tell you that the user's order value has been declining by 10% each week. It cannot tell you that their support tickets spiked right after a specific product purchase.

These temporal sequences carry strong predictive signal. Aggregating them into counts and averages destroys the pattern.
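To make the loss concrete, here is a minimal sketch (dates are made up) of two users whose 30-day order counts are identical but whose sequences are not:

```python
from datetime import date

# Two users, each with exactly 3 orders in the last 30 days:
# indistinguishable under a COUNT aggregate.
steady = [date(2025, 3, 1), date(2025, 3, 11), date(2025, 3, 21)]
burst = [date(2025, 3, 19), date(2025, 3, 20), date(2025, 3, 21)]

def gaps(dates):
    """Days between consecutive orders -- the signal a COUNT throws away."""
    return [(b - a).days for a, b in zip(dates, dates[1:])]

print(len(steady), len(burst))  # 3 3 -> the flat feature is identical
print(gaps(steady))             # [10, 10] -> evenly spaced, routine usage
print(gaps(burst))              # [1, 1]   -> clustered burst, different story
```

A model that only ever sees `order_count_30d = 3` cannot tell these two users apart.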

What people have tried

The industry has spent a decade trying to fix feature engineering without rethinking the underlying approach.

Feature stores

Tecton, Feast, Hopsworks. These solve the operational problem (serving features at low latency in production) but not the creation problem. Someone still has to decide what features to build. The store manages the output of feature engineering. It does not replace the process.

AutoML

DataRobot, H2O, Google AutoML. These automate model selection and hyperparameter tuning, which saves time at the modeling stage. But they still require a flat feature table as input. AutoML automates the last 20% of the pipeline. The 80% that is feature engineering remains manual.

Automated feature generation

Featuretools, tsfresh, and similar libraries generate features programmatically by applying all possible aggregations across join paths. This is closer to the right idea, but it creates a different problem: feature explosion. You get thousands of features, most of which are noise, and you need another round of feature selection to prune them down. The generated features are also limited to pre-programmed aggregation patterns. They cannot learn new types of relationships.
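Mechanically, this style of generation is a Cartesian product over columns, aggregations, and windows. The sketch below (the names are illustrative, not the Featuretools API) shows how quickly it explodes even for a single table:

```python
from itertools import product

# Illustrative schema fragment: one table's numeric columns,
# plus the aggregations and windows a generator would apply.
columns = ["amount", "quantity", "rating"]
aggs = ["sum", "count", "avg", "max", "min"]
windows = ["7d", "30d", "90d", "all"]

# Exhaustive generation: one feature name per (column, agg, window) combo.
generated = [f"{agg}_{col}_{win}" for col, agg, win in product(columns, aggs, windows)]

print(len(generated))  # 60 features from just 3 columns -- most will be noise
print(generated[0])    # sum_amount_7d
```

Multiply by tables and join paths and you are back to the feature-explosion problem, now on the selection side instead of the creation side.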

LLMs on tables

Serializing tables as JSON or CSV and feeding them to a large language model. The RelBench benchmark tested this directly using Llama 3.2 3B. The result: 68.06 AUROC on classification tasks, compared to 75.83 for a graph neural network that learns directly from the relational structure. LLMs were built to predict the next token, not to find patterns in numerical relational data.

Traditional approach

  • Data scientist writes SQL joins across 5-15 tables
  • Engineers 100-500 aggregate features manually
  • Trains model on flat feature table
  • Repeats from scratch for every new question
  • 12.3 hours, 878 lines of code per task

Foundation model approach

  • Connect your relational database directly
  • Model learns patterns from raw table structure
  • Multi-hop, temporal, and graph patterns captured automatically
  • Same model handles any prediction task
  • 1 second, 1 line of PQL

Why this problem is finally solvable

The breakthrough is not a better way to generate features. It is skipping feature engineering entirely by building models that learn directly from relational data.

Relational Deep Learning, published at ICML 2024 by researchers at Stanford and Kumo.ai, showed that you can represent a relational database as a temporal heterogeneous graph (rows become nodes, foreign keys become edges) and train a graph neural network directly on that structure. The GNN learns which cross-table patterns are predictive without any human specifying features.
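A minimal sketch of the rows-become-nodes, foreign-keys-become-edges construction, using a few made-up rows from the article's schema (the tuple layout is illustrative, not KumoRFM's internal representation):

```python
# Hypothetical rows: customers plus child rows carrying a customer_id foreign key.
customers = ["C001", "C002"]
orders = [("O101", "C001"), ("O103", "C002")]  # (order_id, customer_id FK)
tickets = [("T202", "C002")]                   # (ticket_id, customer_id FK)

# Every row becomes a typed node in a heterogeneous graph.
nodes = (
    [("customer", c) for c in customers]
    + [("order", o) for o, _ in orders]
    + [("ticket", t) for t, _ in tickets]
)

# Every foreign-key reference becomes an edge between the two rows it links.
edges = (
    [(("order", o), ("customer", c)) for o, c in orders]
    + [(("ticket", t), ("customer", c)) for t, c in tickets]
)

print(len(nodes), len(edges))  # 5 3
```

No feature is computed anywhere in this construction: the GNN attends over the edges directly, which is why ticket categories and order sequences survive intact.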

On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), this approach outperformed manual feature engineering on 11 of 12 classification tasks, despite the data scientists having unlimited time and full domain knowledge.

KumoRFM takes this further. It is a foundation model pre-trained on billions of relational patterns across thousands of diverse databases. Like GPT for text, it has already learned the universal patterns that recur across relational data: recency, frequency, temporal dynamics, graph topology, cross-table propagation. At inference time, you point it at your database and ask a question. No training, no feature engineering, no pipeline.

The time to first prediction drops from 12.3 hours to under 1 second. The lines of code drop from 878 to 1. And the accuracy is higher, because the model explores the full feature space that humans cannot enumerate.

What this means in practice

DoorDash used this approach and saw a 1.8% engagement lift across 30 million users. Databricks saw a 5.4x conversion lift. Snowflake saw a 3.2x expansion revenue lift.

These are not toy benchmarks. These are production deployments on enterprise-scale relational databases, running predictions that would have taken data science teams months to build manually.

The shift is not from manual to automated feature engineering. It is from feature engineering to no feature engineering. The model does not generate features and then train on them. It learns directly from the relational structure.

If your data science team spends most of their time writing SQL joins and aggregate queries, that time is not adding value. It is bridging a gap between how data is stored (relational tables) and how models consume data (flat tables). Close that gap, and the entire ML pipeline collapses from months to minutes.

Frequently asked questions

What is feature engineering in machine learning?

Feature engineering is the process of transforming raw data from one or more tables into a flat table of numerical features that a machine learning model can consume. It involves writing SQL joins, computing aggregations (counts, sums, averages over time windows), and encoding categorical variables. For enterprise data stored across multiple relational tables, this process typically consumes 80% of a data science team's time.

Why does feature engineering take so long?

Three structural reasons: (1) the combinatorial explosion of possible features across multiple tables, aggregation functions, and time windows; (2) multi-hop relationships that span 3-4 tables and are invisible without graph traversal; and (3) temporal patterns that are destroyed by aggregation. A Stanford study measured this at 12.3 hours and 878 lines of code per prediction task for experienced data scientists.

Can AutoML replace feature engineering?

No. AutoML tools like DataRobot and H2O automate model selection and hyperparameter tuning, but they still require a pre-engineered flat feature table as input. They automate the last 20% of the pipeline while the 80% that is feature engineering remains manual.

What is the alternative to feature engineering?

Relational foundation models like KumoRFM learn directly from raw relational data without requiring any feature engineering. They represent the database as a temporal graph and use graph transformers to automatically discover predictive patterns across tables, including multi-hop relationships and temporal sequences that humans would never enumerate manually.

How does a foundation model approach compare to manual feature engineering on accuracy?

On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), KumoRFM zero-shot outperforms manual feature engineering by a Stanford-trained data scientist on 11 of 12 classification tasks, with an average AUROC of 76.71 vs 75.83 for the supervised GNN baseline and 62.44 for LightGBM with manual features. Fine-tuning pushes this to 81.14 AUROC.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.