
Credit Risk Modeling with ML: Why Relational Data Changes the Game

The credit scorecard was invented in the 1950s. It uses the same 10-20 variables today. Meanwhile, the data available to assess credit risk has grown by orders of magnitude. The gap between what scorecards use and what exists in the data is where risk hides.

TL;DR

  • Traditional credit scorecards use 10-20 variables and misclassify 15-25% of borrowers. The data available to assess credit risk has grown by orders of magnitude, but scorecards use none of it.
  • Transaction-level anomalies (cash advance spikes, payment timing drift, spending composition shifts) precede default events by 3-6 months. Scorecards detect deterioration 1-2 months before default.
  • Credit risk is inherently relational. A borrower whose employer's payroll account shows reduced activity faces elevated risk regardless of personal payment history. Scorecards treat each borrower in isolation.
  • ML models reduce default prediction error by 20-30% compared to logistic regression scorecards. Relational models add another 5-10% by incorporating counterparty risk and network contagion effects.
  • For a bank with $50B in consumer credit exposure, a 20% reduction in default prediction error translates to tens of millions in reduced credit losses annually. KumoRFM delivers this from a single PQL query.

A traditional credit scorecard uses 10 to 20 variables: outstanding balance, payment history, credit utilization, number of accounts, account age, recent inquiries. These variables are derived from credit bureau data and compressed into a single score. The model is logistic regression. The relationships are assumed to be linear. The scorecard is recalibrated annually.

This framework was built for an era when a credit bureau file was the only data available. Today, a bank has access to transaction-level data, account behavior, payment velocity, counterparty information, employment data, and real-time financial flows. The scorecard uses none of it.

The result: traditional models misclassify 15-25% of borrowers. Some are scored too high and default unexpectedly. Others are scored too low and are denied credit they would have repaid. Both errors have direct financial consequences.

borrower_scorecard_view

borrower_id | FICO | utilization | payment_history | accounts | score_risk
B-3301 | 742 | 38% | 12/12 on-time | 4 | Low
B-3302 | 718 | 52% | 11/12 on-time | 6 | Medium
B-3303 | 695 | 67% | 10/12 on-time | 3 | Medium
B-3304 | 731 | 41% | 12/12 on-time | 5 | Low
B-3305 | 709 | 45% | 12/12 on-time | 4 | Low-Medium

Five borrowers scored by a traditional 20-variable scorecard. B-3301 and B-3304 both appear low risk. But the scorecard cannot see what is happening inside their transaction accounts.

transaction_behavior (last 90 days)

borrower_id | spend_trend | cash_advances | payment_timing | paycheck_status | merchant_shift
B-3301 | Stable | 0 | Day 5 of cycle | Regular bi-weekly | None
B-3302 | +35% in 30 days | 3 ($4,200) | Day 5 → Day 22 drift | Missed last deposit | Discretionary → essentials
B-3303 | Declining | 0 | Day 10 stable | Regular weekly | None
B-3304 | Stable | 0 | Day 8 of cycle | Regular bi-weekly | None
B-3305 | +60% in 14 days | 5 ($8,900) | Day 3 → Day 28 drift | Irregular since Feb | Restaurants → cash advances

B-3302 and B-3305 show classic pre-default behavioral patterns: cash advance spikes, payment timing drift, and spending composition shifts. The scorecard rates both as low-medium risk.

What scorecards miss

The limitations of traditional credit scorecards are not in the math. Logistic regression is a fine algorithm. The limitations are in the data it can consume and the patterns it can represent.

Transaction-level behavior

A scorecard sees "credit utilization: 45%." It does not see that utilization spiked from 20% to 45% in two weeks, driven by a series of cash advances rather than retail purchases. It does not see that the borrower's paycheck deposits stopped three weeks ago. It does not see that spending shifted from groceries and utilities to cash advances and overdraft-protected transfers.

Transaction-level data contains behavioral signals that aggregates destroy. The velocity, composition, and timing of transactions are strong predictors of financial stress. A borrower making minimum payments on the due date every month for 12 months has a different risk profile than one making the same minimum payments but progressively later in the grace period. The scorecard sees both as "12 on-time payments."

Counterparty and network risk

Credit risk is not independent. A borrower whose primary employer is experiencing financial distress has elevated risk, regardless of their personal payment history. A borrower whose major business clients are defaulting on their own obligations faces cascading risk. A small business whose suppliers are experiencing liquidity problems may face inventory disruptions that affect revenue.

These are relational patterns: the borrower's risk depends on the risk of entities they are connected to. Scorecards treat each borrower in isolation. Relational models propagate risk through the network.

Temporal dynamics

A borrower with a 720 FICO score and deteriorating transaction patterns has a different risk profile than a borrower with a 720 FICO score and stable patterns. The score is a point-in-time snapshot. It does not capture the trajectory. By the time the deterioration shows up in the scorecard variables (missed payments, increased utilization), the risk event may already be underway.

How ML improves credit risk models

ML-based credit risk models address the limitations of scorecards in three ways: more variables, non-linear patterns, and transaction-level granularity.

Gradient boosted trees on engineered features

The most common ML approach today: data scientists extract hundreds of features from transaction data, account history, and bureau files, then train XGBoost or LightGBM. This typically reduces default prediction error by 20-30% compared to logistic regression scorecards, according to multiple published studies, including research from the Bank of England and the European Central Bank.

The limitation is the feature engineering bottleneck. Someone has to decide which transaction aggregates to compute, which time windows to use, and which cross-table features to construct. For a bank with transaction data, account data, customer data, product data, and external data, the possible feature space is enormous. Data science teams typically build 200-500 features and iterate for 3-6 months before production deployment.
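The bottleneck is easy to make concrete. Here is a minimal sketch of the kind of hand-built aggregates such a pipeline produces; the records, field names, and window choices are all illustrative, not any bank's actual schema:

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical transaction records: (date, amount, category)
txns = [
    (date(2024, 1, 3), 120.0, "groceries"),
    (date(2024, 1, 15), 500.0, "cash_advance"),
    (date(2024, 2, 2), 95.0, "groceries"),
    (date(2024, 2, 20), 800.0, "cash_advance"),
]

def window_features(txns, as_of, days):
    """Hand-engineered aggregates over one trailing window -- the kind of
    feature a scorecard-era pipeline defines one at a time."""
    start = as_of - timedelta(days=days)
    w = [t for t in txns if start <= t[0] <= as_of]
    amounts = [a for _, a, _ in w]
    cash = [a for _, a, c in w if c == "cash_advance"]
    return {
        f"txn_count_{days}d": len(w),
        f"total_spend_{days}d": sum(amounts),
        f"avg_txn_{days}d": mean(amounts) if amounts else 0.0,
        f"cash_advance_ratio_{days}d": sum(cash) / sum(amounts) if amounts else 0.0,
    }

# Every window length, category split, and ratio is a separate design
# decision; multiplied across dozens of tables, this is how teams end up
# hand-maintaining the 200-500 features mentioned above.
feats = {}
for days in (30, 60, 90):
    feats.update(window_features(txns, date(2024, 2, 28), days))
```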

Deep learning on transaction sequences

Some banks use LSTMs or Transformers on raw transaction sequences. Instead of aggregating transactions into features, the model reads the full sequence: amount, merchant category, timestamp, channel. It learns temporal patterns that aggregation destroys: spending velocity changes, category shifts, and payment timing drift.

This approach typically adds a further 5-10% accuracy improvement over feature-based ML on transaction data alone. But it only sees one data source. Account relationships, counterparty risk, and cross-product behavior are outside its view.

Traditional scorecard

  • 10-20 variables from credit bureau data
  • Logistic regression with assumed linear relationships
  • Point-in-time snapshot, no temporal dynamics
  • Each borrower treated as an independent entity
  • Annual recalibration cycle

Relational ML model

  • Hundreds of signals from transactions, accounts, and network
  • Non-linear patterns and interaction effects captured automatically
  • Temporal sequences reveal deterioration 3-6 months early
  • Counterparty and network risk propagated through the graph
  • Continuous learning from new transaction data

The relational approach to credit risk

Relational deep learning, published at ICML 2024, showed that representing a relational database as a temporal graph enables ML models to learn directly from multi-table data without feature engineering. For credit risk, this means the model sees the borrower not as a row of aggregated features but as a node in a graph connected to their transactions, accounts, counterparties, products, and historical events.

The graph neural network propagates information along these connections. A borrower's risk assessment incorporates the financial health of their employers, the behavior of their co-borrowers, the performance of similar borrowers who share transaction patterns, and the temporal trajectory of their own financial behavior. All of this happens automatically, without a data scientist specifying which features to extract.

What the model discovers

When trained on a bank's full relational data, graph models discover credit risk signals that no scorecard captures.

Spending composition shifts. A borrower whose transaction mix shifts from discretionary spending (restaurants, travel) to essential spending (groceries, utilities) over a 4-week period has elevated risk, even if total spend is unchanged. The model learns this from the transaction-merchant category graph.

spending_composition: Borrower B-3305

week | restaurants | travel | groceries | cash_advances | total_spend
Week 1 | $420 | $280 | $180 | $0 | $880
Week 2 | $310 | $0 | $240 | $500 | $1,050
Week 3 | $85 | $0 | $290 | $1,800 | $2,175
Week 4 | $0 | $0 | $310 | $3,600 | $3,910

Discretionary spending (restaurants, travel) collapsed from $700 to $0 over four weeks, while cash advances spiked from $0 to $3,600. Total spend actually increased, so utilization-based scorecards read this as stable. The composition shift is invisible to them.
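Once the categories are visible, the composition shift reduces to simple arithmetic. A sketch using the B-3305 figures from the spending_composition table (the category split into "discretionary" vs. everything else is an assumption for illustration):

```python
# Weekly spend for B-3305, taken from the spending_composition table:
# (restaurants, travel, groceries, cash_advances)
weeks = [
    (420, 280, 180, 0),
    (310, 0, 240, 500),
    (85, 0, 290, 1800),
    (0, 0, 310, 3600),
]

def discretionary_share(week):
    """Share of total spend in discretionary categories (restaurants +
    travel). A flat scorecard sees only the total."""
    restaurants, travel, groceries, cash = week
    total = restaurants + travel + groceries + cash
    return (restaurants + travel) / total

shares = [round(discretionary_share(w), 2) for w in weeks]
totals = [sum(w) for w in weeks]
# Total spend rises week over week, so a utilization-style view looks
# stable-to-growing, while the discretionary share collapses toward zero.
```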

flat_scorecard_view (what the scorecard sees)

borrower_id | avg_monthly_spend | utilization | on_time_payments | scorecard_risk
B-3305 | $2,004 | 45% | 12/12 | Low-Medium
B-3301 | $1,890 | 38% | 12/12 | Low

B-3305 and B-3301 look similar in the scorecard. Both have on-time payments and moderate utilization. The scorecard has no column for 'cash advance ratio' or 'spending category trajectory'. B-3305 is 4 weeks from default.

Payment timing drift. A borrower who pays on day 5 of the billing cycle for 8 months and then gradually shifts to day 25 is exhibiting a pattern that precedes missed payments. The temporal model captures this drift; a scorecard sees "on-time" until the first late payment.

payment_timing: Borrower B-3302

month | payment_day | grace_period_end | days_remaining | scorecard_status
Sep | Day 5 | Day 28 | 23 days | On-time
Oct | Day 8 | Day 28 | 20 days | On-time
Nov | Day 14 | Day 28 | 14 days | On-time
Dec | Day 19 | Day 28 | 9 days | On-time
Jan | Day 24 | Day 28 | 4 days | On-time
Feb | Day 31 | Day 28 | -3 days | LATE

Payment day drifted from day 5 to day 31 over six months. The scorecard recorded 'on-time' for five months because the grace period was never exceeded. By the time the scorecard detects the first late payment, the borrower has been deteriorating for half a year.
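Kept as a sequence rather than collapsed to an on-time flag, the drift is an ordinary trend. A least-squares slope over B-3302's six payment days from the payment_timing table:

```python
def drift_slope(payment_days):
    """Ordinary least-squares slope of payment day against month index:
    days of drift per month."""
    n = len(payment_days)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(payment_days) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, payment_days))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Payment days from the payment_timing table (Sep through Feb).
days = [5, 8, 14, 19, 24, 31]
slope = drift_slope(days)
# Roughly +5 days of drift per month: at day 24 against a day-28 grace
# period, the first late payment is predictable months in advance.
```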

Network contagion. When a borrower's primary employer's payroll account shows reduced activity, employees at that company face elevated risk. The model propagates this signal through the employer-employee-account graph before individual borrower behavior changes.

employer_payroll_health

employer | payroll_deposits_q3 | payroll_deposits_q4 | change | employees_affected
TechStartup Inc | 24 (bi-weekly) | 18 (irregular) | -25% | B-3302, B-3307, B-3312
MegaCorp LLC | 24 (bi-weekly) | 24 (bi-weekly) | Stable | B-3301, B-3304

TechStartup Inc's payroll deposits dropped 25% and became irregular, a sign of financial distress. B-3302 is employed there. The relational model propagates this risk signal through the employer-employee graph. The scorecard knows nothing about the employer's financial health.
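A hand-written rule captures a crude version of what the model learns automatically. Using the employer_payroll_health figures above; the 15% drop threshold is an illustrative assumption, and a real model would learn this signal rather than hard-code it:

```python
# Payroll deposit counts per quarter, from employer_payroll_health.
employers = {
    "TechStartup Inc": {"q3": 24, "q4": 18,
                        "employees": ["B-3302", "B-3307", "B-3312"]},
    "MegaCorp LLC": {"q3": 24, "q4": 24,
                     "employees": ["B-3301", "B-3304"]},
}

def flag_contagion(employers, drop_threshold=0.15):
    """Flag every employee of an employer whose payroll deposit count
    dropped quarter over quarter by more than the threshold."""
    flagged = set()
    for name, e in employers.items():
        drop = (e["q3"] - e["q4"]) / e["q3"]
        if drop > drop_threshold:
            flagged.update(e["employees"])
    return flagged

at_risk = flag_contagion(employers)
# B-3302 is flagged through its employer before its own payments slip.
```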

model_comparison (90-day default prediction)

borrower_id | scorecard_PD | ML_PD | actual_outcome | early_warning
B-3301 | 2.1% | 1.8% | No default | —
B-3302 | 8.4% | 34.2% | Default (Day 68) | ML flagged 52 days early
B-3303 | 12.7% | 9.1% | No default | —
B-3304 | 2.3% | 2.0% | No default | —
B-3305 | 5.9% | 41.8% | Default (Day 45) | ML flagged 38 days early

The scorecard gave B-3302 an 8.4% PD and B-3305 a 5.9% PD. Both defaulted. The relational ML model saw the transaction-behavior deterioration and flagged them at 34.2% and 41.8%, respectively.

PQL Query

PREDICT default_90d
FOR EACH borrowers.borrower_id
WHERE borrowers.account_status = 'active'

Predict 90-day probability of default for every active borrower. The model incorporates transaction-level behavior (spending velocity, cash advance frequency, payment timing drift), counterparty health (employer payroll activity), and network risk (co-borrower and guarantor default rates).

Output

borrower_id | PD_90d | risk_tier | top_signal | early_warning_days
B-3305 | 41.8% | High | Cash advance spike + paycheck irregular | 38
B-3302 | 34.2% | High | Payment drift + spending shift | 52
B-3303 | 9.1% | Medium | Utilization elevated but stable | —
B-3301 | 1.8% | Low | All indicators stable | —
B-3304 | 2.0% | Low | All indicators stable | —

Regulatory considerations

Model risk management guidance (SR 11-7, SS1/23) requires that models be validated, documented, and governed regardless of methodology. ML models are held to the same standards as scorecards. The additional requirements for ML are explainability and fairness testing.

In practice, many banks use a dual approach: ML models for risk screening, portfolio monitoring, and early warning systems, where the regulatory bar for explainability is lower; and traditional scorecards for final credit decisioning, where explainability requirements are highest. The ML model identifies borrowers whose risk profile is changing. The scorecard makes the final approve/decline decision.

This is evolving. The OCC's 2023 guidance explicitly acknowledges that ML models can improve risk management and does not prohibit their use in decisioning, provided adequate governance is in place. The EU AI Act classifies credit scoring as "high risk" but does not prohibit ML, requiring instead transparency, human oversight, and bias testing.

The foundation model path

KumoRFM brings the foundation model approach to credit risk. The model is pre-trained on relational patterns across thousands of databases, including financial transaction patterns, temporal behavioral dynamics, and network effects. It has already learned the universal signals of financial stress: spending composition changes, payment timing drift, utilization velocity, and counterparty risk propagation.

A bank connects its relational database and writes a predictive query:

PREDICT default_90d FOR borrowers

The model returns a probability of default for every borrower, incorporating the full relational context. No feature engineering, no 6-month development cycle, no annual recalibration schedule. The model updates as new data arrives, capturing deteriorating patterns in real time.

For a bank with $50 billion in consumer credit exposure, a 20% reduction in default prediction error translates to tens of millions in reduced credit losses annually. The cost of achieving this with traditional ML is a team of data scientists working for months. The cost with a foundation model is a database connection and a query.
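The order of magnitude is simple arithmetic. Every input below other than the $50B book is an illustrative assumption, and a 20% reduction in prediction error does not translate one-for-one into avoided losses, so a conservative pass-through is assumed:

```python
exposure = 50e9        # consumer credit book, from the text
default_rate = 0.02    # assumed annual default rate
lgd = 0.40             # assumed loss given default

# Baseline expected annual credit losses: EAD x PD x LGD at book level.
annual_losses = exposure * default_rate * lgd  # $400M

# Assume better ranking lets the bank avoid or mitigate 10% of those
# losses through earlier intervention (an assumed pass-through rate).
avoided = annual_losses * 0.10
# avoided is on the order of tens of millions of dollars per year.
```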

Credit risk modeling has been constrained by 1950s data and 1990s methods. The data has caught up. The methods are catching up now. The banks that use the full relational signal will price risk more accurately, detect deterioration earlier, and extend credit more broadly and more safely than those still relying on 20-variable scorecards.

Frequently asked questions

What is credit risk modeling?

Credit risk modeling estimates the probability that a borrower will fail to meet their financial obligations. It encompasses three components: probability of default (PD), loss given default (LGD), and exposure at default (EAD). Traditional models use logistic regression scorecards with 10-20 variables derived from credit bureau data. ML-based models use hundreds or thousands of features from transaction data, account behavior, and relational patterns to improve prediction accuracy.
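The three components combine multiplicatively: expected loss EL = PD × LGD × EAD. A one-line worked example with illustrative numbers:

```python
def expected_loss(pd, lgd, ead):
    """Expected loss for a single exposure: probability of default times
    loss given default times exposure at default."""
    return pd * lgd * ead

# Illustrative inputs: 5% PD, 40% LGD, $10,000 exposure at default.
el = expected_loss(0.05, 0.40, 10_000)  # $200 expected loss
```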

Why are traditional credit scorecards limited?

Traditional scorecards use a small set of aggregated variables (outstanding balance, payment history, credit utilization, account age) and assume linear relationships. They cannot capture temporal dynamics (a borrower whose payment behavior is deteriorating week over week), relational patterns (borrowers whose counterparties are defaulting), or interaction effects between variables. They were designed for interpretability and regulatory compliance in an era before ML, not for maximum predictive accuracy.

How does ML improve credit risk prediction?

ML models improve credit risk prediction in three ways: (1) they use hundreds of variables instead of 10-20, including transaction-level data that scorecards aggregate away; (2) they capture non-linear relationships and interaction effects between variables; (3) they can incorporate alternative data sources like payment patterns on utilities, rent, and subscriptions. Published studies show ML models reduce default prediction error by roughly 20-30% compared to traditional scorecards.

What is the role of relational data in credit risk?

Credit risk is inherently relational. A borrower's default probability depends not just on their own behavior but on the health of their counterparties, employers, and financial network. Transaction patterns reveal spending discipline better than aggregated balances. Account relationships show whether a borrower is centralizing or diversifying their financial exposure. Relational ML captures these multi-table patterns that traditional models cannot see.

How do regulators view ML-based credit risk models?

Regulatory acceptance of ML in credit risk is evolving. The OCC, Fed, and FDIC have issued guidance acknowledging that ML can improve risk management while emphasizing the need for explainability, fairness testing, and model governance. SR 11-7 requires model validation regardless of methodology. The practical path is using ML for risk screening and monitoring while maintaining interpretable models for final decisioning where required.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.