
Credit Risk Modeling with ML: Why Relational Data Changes the Game

The credit scorecard was invented in the 1950s. It uses the same 10-20 variables today. Meanwhile, the data available to assess credit risk has grown by orders of magnitude. The gap between what scorecards use and what exists in the data is where risk hides.

TL;DR

  • Traditional credit scorecards use 10-20 variables and misclassify 15-25% of borrowers. The data available to assess credit risk has grown by orders of magnitude, but scorecards use none of it.
  • Transaction-level anomalies (cash advance spikes, payment timing drift, spending composition shifts) precede default events by 3-6 months. Scorecards detect deterioration 1-2 months before default.
  • Credit risk is inherently relational. A borrower whose employer's payroll account shows reduced activity faces elevated risk regardless of personal payment history. Scorecards treat each borrower in isolation.
  • ML models reduce default prediction error by 20-30% compared to logistic regression scorecards. Relational models add another 5-10% by incorporating counterparty risk and network contagion effects.
  • For a bank with $50B in consumer credit exposure, a 20% reduction in default prediction error translates to tens of millions in reduced credit losses annually. KumoRFM delivers this from a single PQL query.

A traditional credit scorecard uses 10 to 20 variables: outstanding balance, payment history, credit utilization, number of accounts, account age, recent inquiries. These variables are derived from credit bureau data and compressed into a single score. The model is logistic regression. The relationships are assumed to be linear. The scorecard is recalibrated annually.

This framework was built for an era when a credit bureau file was the only data available. Today, a bank has access to transaction-level data, account behavior, payment velocity, counterparty information, employment data, and real-time financial flows. The scorecard uses none of it.

The result: traditional models misclassify 15-25% of borrowers. Some are scored too high and default unexpectedly. Others are scored too low and are denied credit they would have repaid. Both errors have direct financial consequences.

borrower_scorecard_view

borrower_id | FICO | utilization | payment_history | accounts | score_risk
B-3301 | 742 | 38% | 12/12 on-time | 4 | Low
B-3302 | 718 | 52% | 11/12 on-time | 6 | Medium
B-3303 | 695 | 67% | 10/12 on-time | 3 | Medium
B-3304 | 731 | 41% | 12/12 on-time | 5 | Low
B-3305 | 709 | 45% | 12/12 on-time | 4 | Low-Medium

Five borrowers scored by a traditional 20-variable scorecard. B-3301 and B-3304 both appear low risk. But the scorecard cannot see what is happening inside their transaction accounts.

transaction_behavior (last 90 days)

borrower_id | spend_trend | cash_advances | payment_timing | paycheck_status | merchant_shift
B-3301 | Stable | 0 | Day 5 of cycle | Regular bi-weekly | None
B-3302 | +35% in 30 days | 3 ($4,200) | Day 5 → Day 22 drift | Missed last deposit | Discretionary → essentials
B-3303 | Declining | 0 | Day 10 stable | Regular weekly | None
B-3304 | Stable | 0 | Day 8 of cycle | Regular bi-weekly | None
B-3305 | +60% in 14 days | 5 ($8,900) | Day 3 → Day 28 drift | Irregular since Feb | Restaurants → cash advances

B-3302 and B-3305 show classic pre-default behavioral patterns: cash advance spikes, payment timing drift, and spending composition shifts. The scorecard rates both as low-medium risk.

What scorecards miss

The limitations of traditional credit scorecards are not in the math. Logistic regression is a fine algorithm. The limitations are in the data it can consume and the patterns it can represent.

Transaction-level behavior

A scorecard sees "credit utilization: 45%." It does not see that utilization spiked from 20% to 45% in two weeks, driven by a series of cash advances rather than retail purchases. It does not see that the borrower's paycheck deposits stopped three weeks ago. It does not see that spending shifted from groceries and utilities to cash advances and overdraft-protected transfers.

Transaction-level data contains behavioral signals that aggregates destroy. The velocity, composition, and timing of transactions are strong predictors of financial stress. A borrower making minimum payments on the due date every month for 12 months has a different risk profile than one making the same minimum payments but progressively later in the grace period. The scorecard sees both as "12 on-time payments."

Counterparty and network risk

Credit risk is not independent. A borrower whose primary employer is experiencing financial distress has elevated risk, regardless of their personal payment history. A borrower whose major business clients are defaulting on their own obligations faces cascading risk. A small business whose suppliers are experiencing liquidity problems may face inventory disruptions that affect revenue.

These are relational patterns: the borrower's risk depends on the risk of entities they are connected to. Scorecards treat each borrower in isolation. Relational models propagate risk through the network.

Temporal dynamics

A borrower with a 720 FICO score and deteriorating transaction patterns has a different risk profile than a borrower with a 720 FICO score and stable patterns. The score is a point-in-time snapshot. It does not capture the trajectory. By the time the deterioration shows up in the scorecard variables (missed payments, increased utilization), the risk event may already be underway.

How ML improves credit risk models

ML-based credit risk models address the limitations of scorecards in three ways: more variables, non-linear patterns, and transaction-level granularity.

Gradient boosted trees on engineered features

The most common ML approach today: data scientists extract hundreds of features from transaction data, account history, and bureau files, then train XGBoost or LightGBM. This typically reduces default prediction error by 20-30% compared to logistic regression scorecards, according to multiple published studies, including research from the Bank of England and the European Central Bank.

The limitation is the feature engineering bottleneck. Someone has to decide which transaction aggregates to compute, which time windows to use, and which cross-table features to construct. For a bank with transaction data, account data, customer data, product data, and external data, the possible feature space is enormous. Data science teams typically build 200-500 features and iterate for 3-6 months before production deployment.
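The bottleneck is easy to make concrete. Here is a minimal sketch of the kind of hand-built aggregates such a pipeline produces; the records, field names, and window choices are all illustrative, not any bank's actual schema:

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical transaction records: (date, amount, category)
txns = [
    (date(2024, 1, 3), 120.0, "groceries"),
    (date(2024, 1, 15), 500.0, "cash_advance"),
    (date(2024, 2, 2), 95.0, "groceries"),
    (date(2024, 2, 20), 800.0, "cash_advance"),
]

def window_features(txns, as_of, days):
    """Hand-engineered aggregates over one trailing window -- the kind of
    feature a scorecard-era pipeline defines one at a time."""
    start = as_of - timedelta(days=days)
    w = [t for t in txns if start <= t[0] <= as_of]
    amounts = [a for _, a, _ in w]
    cash = [a for _, a, c in w if c == "cash_advance"]
    return {
        f"txn_count_{days}d": len(w),
        f"total_spend_{days}d": sum(amounts),
        f"avg_txn_{days}d": mean(amounts) if amounts else 0.0,
        f"cash_advance_ratio_{days}d": sum(cash) / sum(amounts) if amounts else 0.0,
    }

# Every window length, category split, and ratio is a separate design
# decision; multiplied across dozens of tables, this is how teams end up
# hand-maintaining the 200-500 features mentioned above.
feats = {}
for days in (30, 60, 90):
    feats.update(window_features(txns, date(2024, 2, 28), days))
```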

Deep learning on transaction sequences

Some banks use LSTMs or Transformers on raw transaction sequences. Instead of aggregating transactions into features, the model reads the full sequence: amount, merchant category, timestamp, channel. It learns temporal patterns that aggregation destroys: spending velocity changes, category shifts, and payment timing drift.

This approach typically adds a further 5-10% accuracy improvement over feature-based ML on transaction data alone. But it only sees one data source. Account relationships, counterparty risk, and cross-product behavior are outside its view.

Traditional scorecard

  • 10-20 variables from credit bureau data
  • Logistic regression with assumed linear relationships
  • Point-in-time snapshot, no temporal dynamics
  • Each borrower treated as an independent entity
  • Annual recalibration cycle

Relational ML model

  • Hundreds of signals from transactions, accounts, and network
  • Non-linear patterns and interaction effects captured automatically
  • Temporal sequences reveal deterioration 3-6 months early
  • Counterparty and network risk propagated through the graph
  • Continuous learning from new transaction data

The relational approach to credit risk

Relational deep learning, published at ICML 2024, showed that representing a relational database as a temporal graph enables ML models to learn directly from multi-table data without feature engineering. For credit risk, this means the model sees the borrower not as a row of aggregated features but as a node in a graph connected to their transactions, accounts, counterparties, products, and historical events.

The graph neural network propagates information along these connections. A borrower's risk assessment incorporates the financial health of their employers, the behavior of their co-borrowers, the performance of similar borrowers who share transaction patterns, and the temporal trajectory of their own financial behavior. All of this happens automatically, without a data scientist specifying which features to extract.

What the model discovers

When trained on a bank's full relational data, graph models discover credit risk signals that no scorecard captures.

Spending composition shifts. A borrower whose transaction mix shifts from discretionary spending (restaurants, travel) to essential spending (groceries, utilities) over a 4-week period has elevated risk, even if total spend is unchanged. The model learns this from the transaction-merchant category graph.

spending_composition: Borrower B-3305

week | restaurants | travel | groceries | cash_advances | total_spend
Week 1 | $420 | $280 | $180 | $0 | $880
Week 2 | $310 | $0 | $240 | $500 | $1,050
Week 3 | $85 | $0 | $290 | $1,800 | $2,175
Week 4 | $0 | $0 | $310 | $3,600 | $3,910

Discretionary spending (restaurants, travel) collapsed from $700 to $0 over four weeks, while cash advances spiked from $0 to $3,600. Total spend actually increased, so utilization-based scorecards read this as stable. The composition shift is invisible to them.
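Once the categories are visible, the composition shift reduces to simple arithmetic. A sketch using the B-3305 figures from the spending_composition table (the category split into "discretionary" vs. everything else is an assumption for illustration):

```python
# Weekly spend for B-3305, taken from the spending_composition table:
# (restaurants, travel, groceries, cash_advances)
weeks = [
    (420, 280, 180, 0),
    (310, 0, 240, 500),
    (85, 0, 290, 1800),
    (0, 0, 310, 3600),
]

def discretionary_share(week):
    """Share of total spend in discretionary categories (restaurants +
    travel). A flat scorecard sees only the total."""
    restaurants, travel, groceries, cash = week
    total = restaurants + travel + groceries + cash
    return (restaurants + travel) / total

shares = [round(discretionary_share(w), 2) for w in weeks]
totals = [sum(w) for w in weeks]
# Total spend rises week over week, so a utilization-style view looks
# stable-to-growing, while the discretionary share collapses toward zero.
```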

flat_scorecard_view (what the scorecard sees)

borrower_id | avg_monthly_spend | utilization | on_time_payments | scorecard_risk
B-3305 | $2,004 | 45% | 12/12 | Low-Medium
B-3301 | $1,890 | 38% | 12/12 | Low

B-3305 and B-3301 look similar in the scorecard. Both have on-time payments and moderate utilization. The scorecard has no column for 'cash advance ratio' or 'spending category trajectory'. B-3305 is 4 weeks from default.

Payment timing drift. A borrower who pays on day 5 of the billing cycle for 8 months and then gradually shifts to day 25 is exhibiting a pattern that precedes missed payments. The temporal model captures this drift; a scorecard sees "on-time" until the first late payment.

payment_timing: Borrower B-3302

month | payment_day | grace_period_end | days_remaining | scorecard_status
Sep | Day 5 | Day 28 | 23 days | On-time
Oct | Day 8 | Day 28 | 20 days | On-time
Nov | Day 14 | Day 28 | 14 days | On-time
Dec | Day 19 | Day 28 | 9 days | On-time
Jan | Day 24 | Day 28 | 4 days | On-time
Feb | Day 31 | Day 28 | -3 days | LATE

Payment day drifted from day 5 to day 31 over six months. The scorecard recorded 'on-time' for five months because the grace period was never exceeded. By the time the scorecard detects the first late payment, the borrower has been deteriorating for half a year.
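Kept as a sequence rather than collapsed to an on-time flag, the drift is an ordinary trend. A least-squares slope over B-3302's six payment days from the payment_timing table:

```python
def drift_slope(payment_days):
    """Ordinary least-squares slope of payment day against month index:
    days of drift per month."""
    n = len(payment_days)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(payment_days) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, payment_days))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Payment days from the payment_timing table (Sep through Feb).
days = [5, 8, 14, 19, 24, 31]
slope = drift_slope(days)
# Roughly +5 days of drift per month: at day 24 against a day-28 grace
# period, the first late payment is predictable months in advance.
```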

Network contagion. When a borrower's primary employer's payroll account shows reduced activity, employees at that company face elevated risk. The model propagates this signal through the employer-employee-account graph before individual borrower behavior changes.

employer_payroll_health

employer | payroll_deposits_q3 | payroll_deposits_q4 | change | employees_affected
TechStartup Inc | 24 (bi-weekly) | 18 (irregular) | -25% | B-3302, B-3307, B-3312
MegaCorp LLC | 24 (bi-weekly) | 24 (bi-weekly) | Stable | B-3301, B-3304

TechStartup Inc's payroll deposits dropped 25% and became irregular, a sign of financial distress. B-3302 is employed there. The relational model propagates this risk signal through the employer-employee graph. The scorecard knows nothing about the employer's financial health.
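A hand-written rule captures a crude version of what the model learns automatically. Using the employer_payroll_health figures above; the 15% drop threshold is an illustrative assumption, and a real model would learn this signal rather than hard-code it:

```python
# Payroll deposit counts per quarter, from employer_payroll_health.
employers = {
    "TechStartup Inc": {"q3": 24, "q4": 18,
                        "employees": ["B-3302", "B-3307", "B-3312"]},
    "MegaCorp LLC": {"q3": 24, "q4": 24,
                     "employees": ["B-3301", "B-3304"]},
}

def flag_contagion(employers, drop_threshold=0.15):
    """Flag every employee of an employer whose payroll deposit count
    dropped quarter over quarter by more than the threshold."""
    flagged = set()
    for name, e in employers.items():
        drop = (e["q3"] - e["q4"]) / e["q3"]
        if drop > drop_threshold:
            flagged.update(e["employees"])
    return flagged

at_risk = flag_contagion(employers)
# B-3302 is flagged through its employer before its own payments slip.
```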

model_comparison (90-day default prediction)

borrower_id | scorecard_PD | ML_PD | actual_outcome | early_warning
B-3301 | 2.1% | 1.8% | No default | —
B-3302 | 8.4% | 34.2% | Default (Day 68) | ML flagged 52 days early
B-3303 | 12.7% | 9.1% | No default | —
B-3304 | 2.3% | 2.0% | No default | —
B-3305 | 5.9% | 41.8% | Default (Day 45) | ML flagged 38 days early

The scorecard gave B-3302 an 8.4% PD and B-3305 a 5.9% PD. Both defaulted. The relational ML model saw the transaction-behavior deterioration and flagged them at 34.2% and 41.8%, respectively.

PQL Query

PREDICT default_90d
FOR EACH borrowers.borrower_id
WHERE borrowers.account_status = 'active'

Predict 90-day probability of default for every active borrower. The model incorporates transaction-level behavior (spending velocity, cash advance frequency, payment timing drift), counterparty health (employer payroll activity), and network risk (co-borrower and guarantor default rates).

Output

borrower_id | PD_90d | risk_tier | top_signal | early_warning_days
B-3305 | 41.8% | High | Cash advance spike + paycheck irregular | 38
B-3302 | 34.2% | High | Payment drift + spending shift | 52
B-3303 | 9.1% | Medium | Utilization elevated but stable | —
B-3301 | 1.8% | Low | All indicators stable | —
B-3304 | 2.0% | Low | All indicators stable | —

Regulatory considerations

Model risk management guidance (SR 11-7, SS1/23) requires that models be validated, documented, and governed regardless of methodology. ML models are held to the same standards as scorecards. The additional requirements for ML are explainability and fairness testing.

In practice, many banks use a dual approach: ML models for risk screening, portfolio monitoring, and early warning systems, where the regulatory bar for explainability is lower; and traditional scorecards for final credit decisioning, where explainability requirements are highest. The ML model identifies borrowers whose risk profile is changing. The scorecard makes the final approve/decline decision.

This is evolving. The OCC's 2023 guidance explicitly acknowledges that ML models can improve risk management and does not prohibit their use in decisioning, provided adequate governance is in place. The EU AI Act classifies credit scoring as "high risk" but does not prohibit ML, requiring instead transparency, human oversight, and bias testing.

The foundation model path

KumoRFM brings the foundation model approach to credit risk. The model is pre-trained on relational patterns across thousands of databases, including financial transaction patterns, temporal behavioral dynamics, and network effects. It has already learned the universal signals of financial stress: spending composition changes, payment timing drift, utilization velocity, and counterparty risk propagation.

A bank connects its relational database and writes a predictive query:

PREDICT default_90d FOR borrowers

The model returns a probability of default for every borrower, incorporating the full relational context. No feature engineering, no 6-month development cycle, no annual recalibration schedule. The model updates as new data arrives, capturing deteriorating patterns in real time.

For a bank with $50 billion in consumer credit exposure, a 20% reduction in default prediction error translates to tens of millions in reduced credit losses annually. The cost of achieving this with traditional ML is a team of data scientists working for months. The cost with a foundation model is a database connection and a query.
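The order of magnitude is simple arithmetic. Every input below other than the $50B book is an illustrative assumption, and a 20% reduction in prediction error does not translate one-for-one into avoided losses, so a conservative pass-through is assumed:

```python
exposure = 50e9        # consumer credit book, from the text
default_rate = 0.02    # assumed annual default rate
lgd = 0.40             # assumed loss given default

# Baseline expected annual credit losses: EAD x PD x LGD at book level.
annual_losses = exposure * default_rate * lgd  # $400M

# Assume better ranking lets the bank avoid or mitigate 10% of those
# losses through earlier intervention (an assumed pass-through rate).
avoided = annual_losses * 0.10
# avoided is on the order of tens of millions of dollars per year.
```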

Credit risk modeling has been constrained by 1950s data and 1990s methods. The data has caught up. The methods are catching up now. The banks that use the full relational signal will price risk more accurately, detect deterioration earlier, and extend credit more broadly and more safely than those still relying on 20-variable scorecards.

Frequently asked questions

What is credit risk modeling?

Credit risk modeling estimates the probability that a borrower will fail to meet their financial obligations. It encompasses three components: probability of default (PD), loss given default (LGD), and exposure at default (EAD). Traditional models use logistic regression scorecards with 10-20 variables derived from credit bureau data. ML-based models use hundreds or thousands of features from transaction data, account behavior, and relational patterns to improve prediction accuracy.
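The three components combine multiplicatively: expected loss EL = PD × LGD × EAD. A one-line worked example with illustrative numbers:

```python
def expected_loss(pd, lgd, ead):
    """Expected loss for a single exposure: probability of default times
    loss given default times exposure at default."""
    return pd * lgd * ead

# Illustrative inputs: 5% PD, 40% LGD, $10,000 exposure at default.
el = expected_loss(0.05, 0.40, 10_000)  # $200 expected loss
```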

Why are traditional credit scorecards limited?

Traditional scorecards use a small set of aggregated variables (outstanding balance, payment history, credit utilization, account age) and assume linear relationships. They cannot capture temporal dynamics (a borrower whose payment behavior is deteriorating week over week), relational patterns (borrowers whose counterparties are defaulting), or interaction effects between variables. They were designed for interpretability and regulatory compliance in an era before ML, not for maximum predictive accuracy.

How does ML improve credit risk prediction?

ML models improve credit risk prediction in three ways: (1) they use hundreds of variables instead of 10-20, including transaction-level data that scorecards aggregate away; (2) they capture non-linear relationships and interaction effects between variables; (3) they can incorporate alternative data sources like payment patterns on utilities, rent, and subscriptions. Published studies show ML models reduce default prediction error by roughly 20-30% compared to traditional scorecards.

What is the role of relational data in credit risk?

Credit risk is inherently relational. A borrower's default probability depends not just on their own behavior but on the health of their counterparties, employers, and financial network. Transaction patterns reveal spending discipline better than aggregated balances. Account relationships show whether a borrower is centralizing or diversifying their financial exposure. Relational ML captures these multi-table patterns that traditional models cannot see.

How do regulators view ML-based credit risk models?

Regulatory acceptance of ML in credit risk is evolving. The OCC, Fed, and FDIC have issued guidance acknowledging that ML can improve risk management while emphasizing the need for explainability, fairness testing, and model governance. SR 11-7 requires model validation regardless of methodology. The practical path is using ML for risk screening and monitoring while maintaining interpretable models for final decisioning where required.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.