
Feature Engineering: Why It Takes 80% of Your Time and How to Skip It

Everyone knows feature engineering is a bottleneck. But few people can explain exactly why. Here's the structural reason it's so hard, what the alternatives actually look like, and why the problem is finally solvable.

TL;DR

  • Feature engineering consumes 80% of data science time. A Stanford study measured it: 12.3 hours and 878 lines of code per prediction task.
  • The bottleneck is not writing SQL. It is the combinatorial explosion of possible features across multiple tables, time windows, and aggregation functions. Humans explore less than 5% of the feature space.
  • Flattening relational data destroys three types of signal: multi-hop relationships (customer → product → category → other customers), temporal sequences (order acceleration vs deceleration), and graph topology (fraud ring structure).
  • AutoML, feature stores, and LLMs on tables all still require a flat feature table as input. They automate the last 20% of the pipeline.
  • Relational foundation models skip feature engineering entirely. KumoRFM reads raw relational tables, discovers cross-table patterns automatically, and delivers predictions in under 1 second with 1 line of PQL.

Here is a stat that should make every VP of Data Science uncomfortable. A Stanford-run study gave experienced data scientists a straightforward prediction task on a relational database. They had full access to the data, modern tools, and no time pressure. On average, each task took 12.3 hours and 878 lines of code. And 80% of that time was spent on one thing: feature engineering.

Not modeling. Not tuning hyperparameters. Not debugging infrastructure. Just figuring out which columns from which tables to combine, how to aggregate them, and what time windows to use. Over and over again, for every single prediction task.

This is not a tooling problem. It is a structural problem with how machine learning consumes data. And until you understand why, you will keep throwing people at it.

What feature engineering actually is

Feature engineering is the process of transforming raw data into inputs that a machine learning model can use. If you have a customers table and a transactions table and you want to predict churn, the model cannot just read both tables. It needs a single row per customer with columns like total_spend_last_90d, num_orders_last_30d, avg_order_value, days_since_last_order.

You are converting a rich, multi-table relational structure into a flat spreadsheet. One row per entity, every signal compressed into a number.
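As a minimal sketch of that flattening step, the snippet below compresses a toy orders list into one row of scalars per customer (values borrowed from the example later in the article; the reference date is an assumption). In practice this would be SQL against the warehouse, not Python dicts:

```python
from datetime import date
from collections import defaultdict

# Toy stand-in for the orders table; reference date is assumed for illustration.
orders = [
    {"customer_id": "C001", "amount": 89.00, "date": date(2025, 1, 5)},
    {"customer_id": "C001", "amount": 34.50, "date": date(2025, 2, 12)},
    {"customer_id": "C002", "amount": 89.00, "date": date(2024, 9, 14)},
]
today = date(2025, 3, 14)

by_customer = defaultdict(list)
for o in orders:
    by_customer[o["customer_id"]].append(o)

# Flattening: every customer's order history becomes one row of scalar features.
features = {}
for cid, rows in by_customer.items():
    amounts = [r["amount"] for r in rows]
    last_order = max(r["date"] for r in rows)
    features[cid] = {
        # orders strictly within the last 30 days
        "num_orders_last_30d": sum((today - r["date"]).days < 30 for r in rows),
        "avg_order_value": round(sum(amounts) / len(amounts), 2),
        "days_since_last_order": (today - last_order).days,
    }

print(features["C001"])
# {'num_orders_last_30d': 0, 'avg_order_value': 61.75, 'days_since_last_order': 30}
```

Everything the sequence of orders could tell you is now gone; only the three scalars survive.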

For simple cases, this is fine. Two tables, a handful of aggregations, a clear target variable. But enterprise databases are not simple. A typical production database has 10 to 50 interconnected tables. The RelBench benchmark, which is the standard for evaluating ML on relational databases, includes datasets ranging from 3 tables (Amazon product data) to 15 tables (clinical trial data) with up to 41 million rows.

Each join path is a decision. Each aggregation is a decision. Each time window is a decision. And every combination you do not try is a pattern you will never find.

A concrete example: follow the data

Let's make this tangible. Here is a simplified e-commerce database with three tables. The goal: predict which customers will churn in the next 30 days.

customers

| customer_id | name | signup_date | plan |
|---|---|---|---|
| C001 | Sarah Chen | 2024-01-15 | Premium |
| C002 | James Wilson | 2023-06-20 | Basic |
| C003 | Maria Lopez | 2024-03-01 | Premium |
| C004 | David Kim | 2023-11-10 | Basic |

orders

| order_id | customer_id | product_id | amount | date |
|---|---|---|---|---|
| O101 | C001 | P10 | $89.00 | 2025-01-05 |
| O102 | C001 | P22 | $34.50 | 2025-02-12 |
| O103 | C002 | P10 | $89.00 | 2024-09-14 |
| O104 | C003 | P15 | $120.00 | 2025-02-28 |
| O105 | C003 | P22 | $34.50 | 2025-03-01 |
| O106 | C003 | P10 | $89.00 | 2025-03-10 |
| O107 | C004 | P15 | $120.00 | 2024-12-01 |

Notice: C002 hasn't ordered in 6 months. C003 has 3 orders in 10 days. These temporal patterns matter.

support_tickets

| ticket_id | customer_id | category | date | resolved |
|---|---|---|---|---|
| T201 | C002 | Billing | 2024-10-03 | Yes |
| T202 | C002 | Cancellation | 2025-01-15 | No |
| T203 | C004 | Product Issue | 2025-01-20 | Yes |
| T204 | C004 | Product Issue | 2025-02-05 | No |

Highlighted rows: unresolved tickets are strong churn signals, especially 'Cancellation' type.

What feature engineering does to this data

To use XGBoost, you need to flatten these three tables into a single row per customer. A data scientist might write 50-100 lines of SQL to produce this:

feature_table (flattened)

| customer_id | order_count_30d | avg_order_value | days_since_last_order | ticket_count | plan |
|---|---|---|---|---|---|
| C001 | 0 | $61.75 | 30 | 0 | Premium |
| C002 | 0 | $89.00 | 178 | 2 | Basic |
| C003 | 3 | $81.17 | 4 | 0 | Premium |
| C004 | 0 | $120.00 | 103 | 2 | Basic |

This is what XGBoost sees. Notice what's missing.

Looks reasonable. But look at what got lost:

  • C002 filed a “Cancellation” ticket that is still unresolved. The flat table shows “ticket_count = 2” but does not distinguish between a billing question and an active cancellation request. That distinction is the difference between a routine inquiry and imminent churn.
  • C004's tickets are accelerating. Two product issues in 16 days, the second one unresolved. The flat table shows “ticket_count = 2” identical to C002. But the temporal pattern (accelerating complaints) carries different signal than a billing question followed by a cancellation.
  • C003's order burst. Three orders in 10 days. Is this a power user or someone making returns? The flat table says “order_count_30d = 3” with no sequence information.

What a foundation model sees instead

KumoRFM does not flatten this data. It represents it as a graph: customers are nodes, orders are nodes, support tickets are nodes. Foreign keys (customer_id) become edges. The model traverses the full graph structure, attending to ticket categories, temporal sequences, and cross-table patterns simultaneously.

PQL Query

PREDICT COUNT(orders.*, 0, 30, days) = 0
FOR EACH customers.customer_id

This single line replaces the entire feature engineering pipeline. The model automatically discovers which cross-table patterns predict churn.

Output

| customer_id | churn_probability | top_signal |
|---|---|---|
| C001 | 0.23 | Order recency declining |
| C002 | 0.91 | Unresolved cancellation ticket |
| C003 | 0.08 | High recent engagement |
| C004 | 0.67 | Accelerating support tickets |

The model correctly identifies C002 as highest risk (unresolved cancellation) and C004 as elevated risk (accelerating complaints). These signals existed in the raw data. Feature engineering destroyed them by compressing ticket category and temporal sequence into a single count.

Why it takes so long

The time cost of feature engineering is not about writing SQL. Any data scientist can write a JOIN. The cost comes from three structural problems that get exponentially worse as your database gets more complex.

1. The combinatorial explosion

Consider a database with 5 tables connected by foreign keys. For each prediction target, you need to decide: which tables to join, which columns to aggregate, what aggregation function to use (sum, count, average, max, min, distinct count), and what time window to apply (7 days, 30 days, 90 days, all time).

feature_space_explosion — 5-table e-commerce database

| Dimension | Options | Count |
|---|---|---|
| Tables to join (from customers) | orders, products, reviews, support, sessions | 5 |
| Join depth (hops) | 1-hop, 2-hop, 3-hop, 4-hop | 4 |
| Numeric columns per table | amount, quantity, rating, duration, etc. | ~8 avg |
| Aggregation functions | SUM, COUNT, AVG, MAX, MIN, DISTINCT_COUNT | 6 |
| Time windows | 7d, 30d, 60d, 90d, 180d, all-time | 6 |
| Total possible 1-hop features | 5 tables × 8 cols × 6 aggs × 6 windows | 1,440 |
| Total possible 2-hop features | 1,440 × 5 second-hop tables × 6 aggs | 43,200 |
| Total feature space (all hops) | 1-hop + 2-hop + 3-hop + 4-hop | ~1.2 million |

A human team explores 50 to 200 features out of 1.2 million possible combinations. That is less than 0.02% of the feature space. The multi-hop features (where the strongest signals often hide) are almost never explored.

They test the features they do build, discard the ones that do not help, engineer new ones based on domain intuition, and iterate. The Stanford study found that data scientists wrote an average of 878 lines of SQL and Python per task, with a standard deviation of 77 lines.
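The multipliers in the table above can be checked with a few lines of arithmetic. The schema sizes (5 tables, ~8 numeric columns, 6 aggregation functions, 6 time windows) follow the example; the 2-hop count assumes each 1-hop feature can be re-aggregated through each second-hop table:

```python
# Back-of-envelope reproduction of the feature-space counts.
# Schema sizes follow the 5-table example; real databases vary.
tables, cols, aggs, windows = 5, 8, 6, 6

one_hop = tables * cols * aggs * windows  # every (table, column, agg, window) combo
two_hop = one_hop * tables * aggs         # each 1-hop feature re-aggregated
                                          # through one of 5 second-hop tables

print(one_hop)   # 1440
print(two_hop)   # 43200
```

Each additional hop multiplies the count again, which is how the full space reaches seven figures while a human shortlist stays in the hundreds.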

2. Multi-hop relationships

The most predictive patterns often span multiple tables. A customer's churn risk might depend not just on their own purchase history, but on the return rates of the products they bought, the satisfaction scores of the brands they prefer, and the churn behavior of customers who bought similar products.

That is a 4-hop path: customer → orders → products → orders (of other customers) → customer status. No one writes this feature. Not because it is hard to write the SQL, but because no one thinks to look for it. The signal is invisible until you traverse the graph.
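The traversal above can be sketched in a few lines. The tiny tables and the `cochurn_rate` helper are hypothetical illustrations (a production version would be a chain of SQL joins), but they show the shape of a feature almost no one writes by hand:

```python
# Hypothetical mini-tables: (order_id, customer_id, product_id).
orders = [
    ("O1", "C001", "P10"),
    ("O2", "C002", "P10"),
    ("O3", "C003", "P22"),
]
churned = {"C002"}  # illustrative label set

def cochurn_rate(customer_id):
    """Fraction of 'neighbor' customers (bought the same products) who churned.
    This is the customer -> orders -> products -> orders -> customers path."""
    products = {p for _, c, p in orders if c == customer_id}
    neighbors = {c for _, c, p in orders if p in products and c != customer_id}
    return len(neighbors & churned) / len(neighbors) if neighbors else 0.0

print(cochurn_rate("C001"))  # C001's only product neighbor is C002, who churned
```

Writing this as SQL is mechanical; knowing to compute it at all is the hard part.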

3. Temporal patterns that aggregates destroy

A flat feature table might say "this user placed 5 orders in the last 30 days." But it cannot tell you whether those orders were evenly spaced, all in the first week, or accelerating in frequency. It cannot tell you that the user's order value has been declining by 10% each week. It cannot tell you that their support tickets spiked right after a specific product purchase.

These temporal sequences carry strong predictive signal. Aggregating them into counts and averages destroys the pattern.
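To make the loss concrete, here is a minimal sketch (dates are made up) of two users whose 30-day order counts are identical but whose sequences are not:

```python
from datetime import date

# Two users, each with exactly 3 orders in the last 30 days:
# indistinguishable under a COUNT aggregate.
steady = [date(2025, 3, 1), date(2025, 3, 11), date(2025, 3, 21)]
burst = [date(2025, 3, 19), date(2025, 3, 20), date(2025, 3, 21)]

def gaps(dates):
    """Days between consecutive orders -- the signal a COUNT throws away."""
    return [(b - a).days for a, b in zip(dates, dates[1:])]

print(len(steady), len(burst))  # 3 3 -> the flat feature is identical
print(gaps(steady))             # [10, 10] -> evenly spaced, routine usage
print(gaps(burst))              # [1, 1]   -> clustered burst, different story
```

A model that only ever sees `order_count_30d = 3` cannot tell these two users apart.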

What people have tried

The industry has spent a decade trying to fix feature engineering without rethinking the underlying approach.

Feature stores

Tecton, Feast, Hopsworks. These solve the operational problem (serving features at low latency in production) but not the creation problem. Someone still has to decide what features to build. The store manages the output of feature engineering. It does not replace the process.

AutoML

DataRobot, H2O, Google AutoML. These automate model selection and hyperparameter tuning, which saves time at the modeling stage. But they still require a flat feature table as input. AutoML automates the last 20% of the pipeline. The 80% that is feature engineering remains manual.

Automated feature generation

Featuretools, tsfresh, and similar libraries generate features programmatically by applying all possible aggregations across join paths. This is closer to the right idea, but it creates a different problem: feature explosion. You get thousands of features, most of which are noise, and you need another round of feature selection to prune them down. The generated features are also limited to pre-programmed aggregation patterns. They cannot learn new types of relationships.
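Mechanically, this style of generation is a Cartesian product over columns, aggregations, and windows. The sketch below (the names are illustrative, not the Featuretools API) shows how quickly it explodes even for a single table:

```python
from itertools import product

# Illustrative schema fragment: one table's numeric columns,
# plus the aggregations and windows a generator would apply.
columns = ["amount", "quantity", "rating"]
aggs = ["sum", "count", "avg", "max", "min"]
windows = ["7d", "30d", "90d", "all"]

# Exhaustive generation: one feature name per (column, agg, window) combo.
generated = [f"{agg}_{col}_{win}" for col, agg, win in product(columns, aggs, windows)]

print(len(generated))  # 60 features from just 3 columns -- most will be noise
print(generated[0])    # sum_amount_7d
```

Multiply by tables and join paths and you are back to the feature-explosion problem, now on the selection side instead of the creation side.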

LLMs on tables

Serializing tables as JSON or CSV and feeding them to a large language model. The RelBench benchmark tested this directly using Llama 3.2 3B. The result: 68.06 AUROC on classification tasks, compared to 75.83 for a graph neural network that learns directly from the relational structure. LLMs were built to predict the next token, not to find patterns in numerical relational data.

Traditional approach

  • Data scientist writes SQL joins across 5-15 tables
  • Engineers 100-500 aggregate features manually
  • Trains model on flat feature table
  • Repeats from scratch for every new question
  • 12.3 hours, 878 lines of code per task

Foundation model approach

  • Connect your relational database directly
  • Model learns patterns from raw table structure
  • Multi-hop, temporal, and graph patterns captured automatically
  • Same model handles any prediction task
  • 1 second, 1 line of PQL

Why this problem is finally solvable

The breakthrough is not a better way to generate features. It is skipping feature engineering entirely by building models that learn directly from relational data.

Relational Deep Learning, published at ICML 2024 by researchers at Stanford and Kumo.ai, showed that you can represent a relational database as a temporal heterogeneous graph (rows become nodes, foreign keys become edges) and train a graph neural network directly on that structure. The GNN learns which cross-table patterns are predictive without any human specifying features.
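A minimal sketch of the rows-become-nodes, foreign-keys-become-edges construction, using a few made-up rows from the article's schema (the tuple layout is illustrative, not KumoRFM's internal representation):

```python
# Hypothetical rows: customers plus child rows carrying a customer_id foreign key.
customers = ["C001", "C002"]
orders = [("O101", "C001"), ("O103", "C002")]  # (order_id, customer_id FK)
tickets = [("T202", "C002")]                   # (ticket_id, customer_id FK)

# Every row becomes a typed node in a heterogeneous graph.
nodes = (
    [("customer", c) for c in customers]
    + [("order", o) for o, _ in orders]
    + [("ticket", t) for t, _ in tickets]
)

# Every foreign-key reference becomes an edge between the two rows it links.
edges = (
    [(("order", o), ("customer", c)) for o, c in orders]
    + [(("ticket", t), ("customer", c)) for t, c in tickets]
)

print(len(nodes), len(edges))  # 5 3
```

No feature is computed anywhere in this construction: the GNN attends over the edges directly, which is why ticket categories and order sequences survive intact.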

On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), this approach outperformed manual feature engineering on 11 of 12 classification tasks, despite the data scientists having unlimited time and full domain knowledge.

KumoRFM takes this further. It is a foundation model pre-trained on billions of relational patterns across thousands of diverse databases. Like GPT for text, it has already learned the universal patterns that recur across relational data: recency, frequency, temporal dynamics, graph topology, cross-table propagation. At inference time, you point it at your database and ask a question. No training, no feature engineering, no pipeline.

The time to first prediction drops from 12.3 hours to under 1 second. The lines of code drop from 878 to 1. And the accuracy is higher, because the model explores the full feature space that humans cannot enumerate.

What this means in practice

DoorDash used this approach and saw a 1.8% engagement lift across 30 million users. Databricks saw a 5.4x conversion lift. Snowflake saw a 3.2x expansion revenue lift.

These are not toy benchmarks. These are production deployments on enterprise-scale relational databases, running predictions that would have taken data science teams months to build manually.

The shift is not from manual to automated feature engineering. It is from feature engineering to no feature engineering. The model does not generate features and then train on them. It learns directly from the relational structure.

If your data science team spends most of their time writing SQL joins and aggregate queries, that time is not adding value. It is bridging a gap between how data is stored (relational tables) and how models consume data (flat tables). Close that gap, and the entire ML pipeline collapses from months to minutes.

Frequently asked questions

What is feature engineering in machine learning?

Feature engineering is the process of transforming raw data from one or more tables into a flat table of numerical features that a machine learning model can consume. It involves writing SQL joins, computing aggregations (counts, sums, averages over time windows), and encoding categorical variables. For enterprise data stored across multiple relational tables, this process typically consumes 80% of a data science team's time.

Why does feature engineering take so long?

Three structural reasons: (1) the combinatorial explosion of possible features across multiple tables, aggregation functions, and time windows; (2) multi-hop relationships that span 3-4 tables and are invisible without graph traversal; and (3) temporal patterns that are destroyed by aggregation. A Stanford study measured this at 12.3 hours and 878 lines of code per prediction task for experienced data scientists.

Can AutoML replace feature engineering?

No. AutoML tools like DataRobot and H2O automate model selection and hyperparameter tuning, but they still require a pre-engineered flat feature table as input. They automate the last 20% of the pipeline while the 80% that is feature engineering remains manual.

What is the alternative to feature engineering?

Relational foundation models like KumoRFM learn directly from raw relational data without requiring any feature engineering. They represent the database as a temporal graph and use graph transformers to automatically discover predictive patterns across tables, including multi-hop relationships and temporal sequences that humans would never enumerate manually.

How does a foundation model approach compare to manual feature engineering on accuracy?

On the RelBench benchmark (7 databases, 30 tasks, 103 million rows), KumoRFM zero-shot outperforms manual feature engineering by a Stanford-trained data scientist on 11 of 12 classification tasks, with an average AUROC of 76.71 vs 75.83 for the supervised GNN baseline and 62.44 for LightGBM with manual features. Fine-tuning pushes this to 81.14 AUROC.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.