Every B2B company has a lead scoring system. Almost none of them work well. A 2024 Gartner survey found that only 25% of sales teams trust the scores their marketing ops team produces. The rest either ignore them entirely or use them as one input among many gut-feel signals.
The problem is not execution. The problem is architecture. Manual point systems assign static weights to observable behaviors: +10 for visiting the pricing page, +5 for opening an email, +20 for having a VP title. These rules capture the signals that are obvious to a human sitting in a conference room. They miss everything else.
And everything else is where conversion actually lives.
crm_leads
| lead_id | company | title | source | point_score | status |
|---|---|---|---|---|---|
| L-1001 | Acme Corp | VP Engineering | Webinar | 72 | MQL |
| L-1002 | TechFlow Inc | Data Scientist | Organic | 31 | Open |
| L-1003 | GlobalBank | CTO | Referral | 85 | MQL |
| L-1004 | RetailMax | Dir. Analytics | Paid Ad | 68 | MQL |
| L-1005 | HealthStar | ML Engineer | Content DL | 44 | Open |
Point scores rank L-1003 highest. But the scoring system cannot see what their accounts and contacts are actually doing.
crm_activities (last 30 days)
| lead_id | activity | date | channel | account_contacts_active |
|---|---|---|---|---|
| L-1001 | Pricing page x3 | 2025-03-01 | Web | 1 of 1 |
| L-1002 | Demo request, case study DL, pricing page, API docs | 2025-02-15 to 03-10 | Web + Email | 4 of 6 |
| L-1003 | Opened 1 email | 2025-02-20 | Email | 1 of 3 |
| L-1004 | Clicked 2 ads | 2025-03-08 | Paid | 1 of 1 |
| L-1005 | GitHub repo starred, docs visited x12, API trial signup | 2025-02-28 to 03-12 | Web + GitHub | 3 of 4 |
Highlighted: L-1002 has multi-threaded account engagement with a buying-stage content sequence. L-1005 shows deep technical evaluation. Neither scores high on points.
How manual lead scoring works (and where it breaks)
A typical manual scoring model has 10 to 15 rules. They fall into two categories: demographic fit (title, company size, industry) and behavioral engagement (page views, email opens, form fills). Each action gets a point value. Points accumulate. When a lead crosses a threshold, it becomes an MQL and gets handed to sales.
This approach has three structural problems.
1. Static weights ignore context
Visiting the pricing page is worth +10 points whether it happens on day 1 of a buyer journey or day 90. But the predictive meaning is completely different. A pricing page visit after three product demos and a technical review signals imminent purchase intent. The same visit from a first-time visitor signals curiosity. The point system treats them identically.
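The rule table described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual implementation, and the point values are invented for the example:

```python
# A static point model is just a table of (predicate, points) rules.
# Point values are illustrative, not from any real system.
RULES = [
    (lambda e: e.get("page") == "pricing", 10),
    (lambda e: e.get("action") == "email_open", 5),
    (lambda e: "VP" in e.get("title", ""), 20),
]

def score(events):
    """Sum static points over a lead's events -- no notion of order or timing."""
    return sum(pts for e in events for pred, pts in RULES if pred(e))

# The same pricing-page visit scores identically on day 1 and day 90:
day1_visitor = [{"page": "pricing"}]
day90_buyer = [{"page": "pricing"}]
print(score(day1_visitor), score(day90_buyer))  # 10 10 -- context is invisible
```

Nothing in the scoring function can see when an event happened or what preceded it, which is exactly the structural gap.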
2. Single-contact scoring misses account dynamics
B2B purchases are made by buying committees, not individuals. A CEB study found the average B2B deal involves 6.8 decision makers. When three people from the same account visit your site in the same week, that is a far stronger signal than one person visiting three times. Manual scoring systems sum individual contact scores. They do not model the account-level engagement pattern.
3. Point systems cannot learn
When the market shifts, when your product changes, when a new competitor enters, the scoring rules stay the same until someone manually updates them. At most companies, that update happens quarterly. In practice, many scoring models go 12-18 months between meaningful revisions.
What ML-based scoring actually looks at
When you train an ML model on the full relational CRM, it discovers patterns that no human would write as a scoring rule. Here are five real categories of signals that ML models find in CRM data.
Multi-threaded account engagement
The model learns that accounts where 3 or more contacts from different departments engage within a 14-day window convert at 4.2x the rate of single-contact engagement. This is not a single feature. It is a pattern across the contacts table, the activities table, and the accounts table, linked by foreign keys.
account_contacts (TechFlow Inc — L-1002's account)
| contact_id | name | department | title | activity_last_14d |
|---|---|---|---|---|
| CT-201 | Sam Rivera | Engineering | Data Scientist | API docs x4, demo request |
| CT-202 | Jordan Lee | Product | VP Product | Case study DL, pricing page |
| CT-203 | Taylor Kim | Engineering | ML Engineer | GitHub repo, docs x8 |
| CT-204 | Alex Chen | Finance | Dir. Procurement | Pricing page x2 |
4 contacts from 3 departments engaged in 14 days. This is a buying committee in motion. The point system sees L-1002 as a single low-scoring lead because Sam Rivera has no VP title.
flat_lead_table (what the point system sees)
| lead_id | title | company_size | email_opens | page_views | point_score |
|---|---|---|---|---|---|
| L-1002 | Data Scientist | 200 | 3 | 6 | 31 |
| L-1003 | CTO | 5,000 | 1 | 0 | 85 |
L-1002 scores 31 because 'Data Scientist' gets fewer title points than 'CTO'. The flat table has no column for 'number of distinct departments engaged at the account.' The 4-person buying committee is invisible.
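The missing column is easy to state but spans two tables. A sketch of the account-level feature, using the illustrative TechFlow contacts above:

```python
# Contacts for one account (TechFlow Inc), mirroring the table above.
contacts = [
    {"contact_id": "CT-201", "department": "Engineering", "active_last_14d": True},
    {"contact_id": "CT-202", "department": "Product",     "active_last_14d": True},
    {"contact_id": "CT-203", "department": "Engineering", "active_last_14d": True},
    {"contact_id": "CT-204", "department": "Finance",     "active_last_14d": True},
]

def account_engagement(contacts):
    """Account-level signals a flat lead table never carries:
    how many contacts engaged, and from how many distinct departments."""
    active = [c for c in contacts if c["active_last_14d"]]
    return {
        "active_contacts": len(active),
        "distinct_departments": len({c["department"] for c in active}),
    }

print(account_engagement(contacts))
# {'active_contacts': 4, 'distinct_departments': 3}
```

Computing this requires joining contacts to their account, which is precisely the step a single-lead point model never performs.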
Activity sequence patterns
The order of engagement matters more than the volume. A sequence of blog post, then case study, then pricing page, then demo request has a different conversion probability than demo request, then blog post, then silence. ML models trained on temporal activity data capture these sequences. Point systems cannot.
activity_sequence: Lead L-1002 (buying-stage sequence)
| date | activity | content_type | stage_signal |
|---|---|---|---|
| Feb 15 | Blog: 'ML for relational data' | Education | Awareness |
| Feb 20 | Case study: 'DoorDash 1.8% lift' | Validation | Consideration |
| Feb 28 | Pricing page (2 visits) | Commercial | Evaluation |
| Mar 5 | API documentation (12 pages) | Technical | Technical eval |
| Mar 10 | Demo request form | Conversion | Decision |
A textbook buying sequence: awareness, validation, evaluation, technical review, conversion intent. The order tells the story.
activity_sequence: Lead L-1004 (stalled)
| date | activity | content_type | stage_signal |
|---|---|---|---|
| Mar 8 | Clicked paid ad | Ad | Awareness |
| Mar 8 | Clicked second paid ad | Ad | Awareness |
| — | (silence) | — | — |
Two ad clicks on the same day, then nothing. No progression through content stages. Point system gave L-1004 a score of 68 (paid ads get high points). ML sees no buying sequence.
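One way to make the contrast concrete: check whether a lead's activities advance through funnel stages in order. A simplified sketch, with stage labels taken from the tables above:

```python
# Ordered funnel stages; a buying sequence advances monotonically through them.
STAGES = ["Awareness", "Consideration", "Evaluation", "Technical eval", "Decision"]
RANK = {s: i for i, s in enumerate(STAGES)}

def stage_progression(stage_labels):
    """Count distinct stages reached in forward order.
    Activity volume alone cannot fake forward progression."""
    best, reached = -1, 0
    for stage in stage_labels:  # assumed already sorted by date
        if RANK[stage] > best:
            best = RANK[stage]
            reached += 1
    return reached

l_1002 = ["Awareness", "Consideration", "Evaluation", "Technical eval", "Decision"]
l_1004 = ["Awareness", "Awareness"]  # two ad clicks, same stage, then silence

print(stage_progression(l_1002))  # 5 -- full sequence, in order
print(stage_progression(l_1004))  # 1 -- no progression
```

An ML model learns these sequence effects directly from timestamps rather than from hand-labeled stages, but the intuition is the same: order carries the signal.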
Account similarity to past wins
The model computes similarity not just on firmographic attributes but on the full relational profile: what products were discussed, what objections were raised, what the engagement cadence looked like, how many stakeholders were involved, and how the deal timeline compared to the average. Accounts that resemble past closed-won deals across these dimensions score higher, even if their point-system scores are average.
Negative signals and disengagement patterns
A lead who was highly engaged two months ago and has gone silent is not the same as a lead who was never engaged. The decay pattern carries signal. ML models learn that a specific drop in email open rates combined with no meeting activity for 21 days predicts a 73% probability of deal loss. Point systems only add. They do not model the trajectory.
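A point total only rises; a trajectory feature compares where a lead was with where it is now. A sketch, with an illustrative silence threshold rather than the thresholds any particular model learned:

```python
from datetime import date

def disengagement_flag(activity_dates, today, window=21):
    """Separate 'was engaged, went silent' from 'never engaged'.
    The 21-day window is illustrative; a trained model learns it."""
    if not activity_dates:
        return "never_engaged"
    days_silent = (today - max(activity_dates)).days
    if days_silent >= window:
        return "disengaging"  # had momentum, lost it -- the decay carries signal
    return "active"

today = date(2025, 3, 15)
print(disengagement_flag([date(2025, 1, 5), date(2025, 1, 20)], today))  # disengaging
print(disengagement_flag([date(2025, 3, 12)], today))                    # active
print(disengagement_flag([], today))                                     # never_engaged
```

A point system scores the first and third leads similarly (both low); the trajectory view treats them as entirely different situations.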
Cross-object relationships
Leads from accounts that previously purchased a related product convert at higher rates. Leads referred by existing customers close faster. Leads whose companies share board members with current customers have shorter sales cycles. These patterns span 3-4 tables in the CRM. They are invisible to a system that only looks at the lead record.
Manual point system
- 10-15 static rules based on obvious behaviors
- Single-contact scoring ignores buying committee
- No temporal awareness: day 1 visit = day 90 visit
- Cannot learn from outcomes or adapt to market shifts
- Uses 2-3 CRM tables out of 8-12 available
ML on relational CRM data
- Discovers thousands of patterns across all CRM tables
- Models account-level engagement across contacts
- Captures activity sequences and timing patterns
- Continuously learns from conversion outcomes
- Finds multi-hop signals: lead to account to product to similar accounts
point_score_vs_ml_score
| lead_id | point_score | point_rank | ML_score | ML_rank | actual_outcome |
|---|---|---|---|---|---|
| L-1003 | 85 | #1 | 0.18 | #5 | No reply |
| L-1001 | 72 | #2 | 0.41 | #3 | Lost |
| L-1004 | 68 | #3 | 0.33 | #4 | Nurture |
| L-1005 | 44 | #4 | 0.87 | #1 | Closed Won ($92K) |
| L-1002 | 31 | #5 | 0.79 | #2 | Closed Won ($210K) |
Highlighted: the two deals that closed were ranked #4 and #5 by point scoring. ML ranked them #1 and #2 based on multi-threaded engagement and buying-stage content sequence.
PQL Query
```
PREDICT conversion FOR EACH leads.lead_id WHERE leads.status != 'Closed'
```
One line replaces the entire point-scoring system. The model considers account-level engagement, activity sequences, firmographic similarity to past wins, and temporal patterns across all CRM tables.
Output
| lead_id | conversion_prob | top_signal | recommended_action |
|---|---|---|---|
| L-1005 | 0.87 | Multi-contact technical eval (3 of 4) | Route to SE for demo |
| L-1002 | 0.79 | Buying-stage content sequence | Schedule exec call |
| L-1001 | 0.41 | Pricing intent but single-thread | Add contacts to nurture |
| L-1004 | 0.33 | Ad-driven, no product engagement | Content nurture |
| L-1003 | 0.18 | Single email open, no follow-up | Deprioritize |
The first-generation ML approach (and its limits)
Most companies that move beyond manual scoring adopt what we call first-generation ML: extract features from the CRM, flatten them into a table, and train XGBoost or a logistic regression model.
This is better than manual scoring. Forrester found that companies using predictive lead scoring see 30% higher win rates and 25% shorter sales cycles. But it still requires a data team to engineer features manually. Someone has to decide to compute "number of contacts at the account who opened an email in the last 14 days" and write the SQL to produce it.
The feature engineering takes 3-6 months for an initial deployment. It requires ongoing maintenance as the CRM schema evolves, new custom objects are added, and data quality issues surface. A Stanford study measured this at 12.3 hours per prediction task for experienced data scientists. For a lead scoring model with multiple segments and regular retraining, the total investment is substantial.
More importantly, the flat-table approach destroys the relational structure. When you aggregate "number of activities in last 30 days," you lose the sequence. When you compute "average deal size for the account," you lose the trajectory. The features are summaries. The signal is in the details.
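The loss is easy to demonstrate: two leads with opposite trajectories can produce identical aggregate features. A minimal sketch with invented activity logs:

```python
# Two activity logs with the same events in opposite order.
buying  = ["blog", "case_study", "pricing", "demo_request"]  # textbook sequence
stalled = ["demo_request", "pricing", "case_study", "blog"]  # same events, reversed

def flat_features(log):
    """Typical flat-table aggregation: counts only, order discarded."""
    return {"n_activities": len(log), "n_pricing": log.count("pricing")}

print(flat_features(buying) == flat_features(stalled))  # True -- sequence signal lost
```

Any model trained on `flat_features` output is structurally incapable of separating these two leads, no matter how good the model is.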
How relational ML changes lead scoring
Relational deep learning, published at ICML 2024, showed that you can represent a relational database as a temporal heterogeneous graph. Rows become nodes. Foreign keys become edges. Timestamps create a temporal ordering. A graph neural network learns directly from this structure.
For lead scoring, this means the model sees the full CRM as a connected graph. A lead is a node connected to an account node, which is connected to contact nodes, activity nodes, opportunity nodes, and product nodes. The model propagates information along these connections, learning which patterns across the full graph predict conversion.
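A minimal sketch of that mapping, with illustrative table and column names: each row becomes a typed node, each foreign key becomes an edge, and timestamps supply the temporal ordering.

```python
# Tiny relational snapshot (illustrative rows, not a real schema).
accounts   = [{"account_id": "A-1"}]
leads      = [{"lead_id": "L-1002", "account_id": "A-1"}]
contacts   = [{"contact_id": "CT-201", "account_id": "A-1"},
              {"contact_id": "CT-202", "account_id": "A-1"}]
activities = [{"activity_id": "AC-9", "contact_id": "CT-201", "ts": "2025-03-05"}]

nodes, edges = {}, []
for row in accounts:
    nodes[("account", row["account_id"])] = row
for row in leads:
    nodes[("lead", row["lead_id"])] = row
    edges.append((("lead", row["lead_id"]), ("account", row["account_id"])))
for row in contacts:
    nodes[("contact", row["contact_id"])] = row
    edges.append((("contact", row["contact_id"]), ("account", row["account_id"])))
for row in activities:
    nodes[("activity", row["activity_id"])] = row  # ts gives temporal order
    edges.append((("activity", row["activity_id"]), ("contact", row["contact_id"])))

# A GNN propagates information along these edges, so the lead node "sees"
# every contact and activity two hops away on the same account.
print(len(nodes), len(edges))  # 5 nodes, 4 edges
```

The construction itself is mechanical; the learning happens in the message passing over this graph, not in hand-picked features.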
The result is a scoring model that captures multi-threaded engagement, temporal sequences, account similarity, and cross-object relationships without any manual feature engineering. No one has to decide which features to compute. The model discovers them.
What this looks like with KumoRFM
KumoRFM is a foundation model pre-trained on billions of relational patterns across thousands of databases. For lead scoring, you connect your CRM database and write a predictive query:
```
PREDICT conversion FOR leads
```
The model returns a conversion probability for every lead, based on the full relational context of your CRM. No feature engineering, no model training, no pipeline. The time from connected database to production scores is measured in minutes, not months.
Because KumoRFM has been pre-trained on diverse relational datasets, it already understands the universal patterns in CRM data: recency effects, engagement velocity, account-level dynamics, and temporal decay. It applies these learned patterns to your specific data without requiring your historical outcomes to build a model from scratch.
Measuring the impact
The business case for ML lead scoring is straightforward. Better scoring means sales spends more time on leads that will convert and less time on leads that will not.
Consider a B2B SaaS company with 10,000 MQLs per quarter. With manual scoring, the sales team accepts 40% and converts 8% of those. That is 320 deals from 10,000 leads. With ML scoring that is 15-40% more accurate, the same team closes 368 to 448 deals. At an average deal size of $50,000, that is $2.4M to $6.4M in incremental quarterly revenue from the same lead volume and sales team.
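The arithmetic behind those figures, spelled out:

```python
mqls = 10_000                     # MQLs per quarter
accepted = round(mqls * 0.40)     # 4,000 sales-accepted leads
baseline = round(accepted * 0.08) # 320 deals per quarter with manual scoring
low  = round(baseline * 1.15)     # 368 deals at +15% accuracy
high = round(baseline * 1.40)     # 448 deals at +40% accuracy
deal_size = 50_000

print(baseline, low, high)                 # 320 368 448
print((low - baseline) * deal_size)        # 2400000 -- $2.4M incremental
print((high - baseline) * deal_size)       # 6400000 -- $6.4M incremental
```

The gain comes entirely from reallocating the same sales capacity toward leads that actually convert.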
The cost side matters too. First-generation ML scoring requires a data science team to build and maintain the pipeline: 3-6 months for initial deployment, ongoing feature engineering as the CRM evolves, regular retraining, and monitoring for drift. A foundation model approach eliminates this infrastructure entirely. The model updates as your data changes. No pipeline to maintain.
If your sales team is telling you they do not trust the scores, the answer is not to tweak the point values. The answer is to replace the point system with a model that can actually see the patterns that predict conversion. Those patterns live in the relationships between your CRM tables. A system that flattens those relationships into points will always miss them.