Reddit is one of the most complex recommendation environments on the internet. Billions of posts across millions of subreddits, hundreds of millions of users with constantly shifting interests, and a community-driven structure where context matters as much as content. A post that thrives in r/MachineLearning might be irrelevant in r/datascience, despite overlapping audiences.
For years, content platforms like Reddit improved their recommendations incrementally: adding new engagement signals, tuning collaborative filtering models, engineering features one at a time. Each iteration cycle took months and yielded small accuracy gains. Then a graph-based approach compressed 4-5 years of that iterative improvement into 2 months.
This article explains why. Not by speculating about Reddit's internal systems, but by analyzing what makes content recommendation fundamentally a graph problem - and why relational deep learning discovers patterns that flat-table approaches structurally cannot.
The headline result: SAP SALT benchmark
The SAP SALT benchmark is an enterprise-grade evaluation in which real business analysts and data scientists attempt prediction tasks on SAP enterprise data: production-quality databases with multiple related tables. It measures how accurately different approaches predict real business outcomes on that data.
| Approach | Accuracy | What it means |
|---|---|---|
| LLM + AutoML | 63% | Language model generates features, AutoML selects model |
| PhD Data Scientist + XGBoost | 75% | Expert spends weeks hand-crafting features, tunes XGBoost |
| KumoRFM (zero-shot) | 91% | No feature engineering, no training, reads relational tables directly |
SAP SALT benchmark: KumoRFM outperforms expert data scientists by 16 percentage points and LLM+AutoML by 28 percentage points on real enterprise prediction tasks.
KumoRFM scores 91% where PhD-level data scientists with weeks of feature engineering and hand-tuned XGBoost score 75%. The 16 percentage point gap is the value of reading relational data natively instead of flattening it into a single table.
The content recommendation challenge
To recommend content effectively on a platform like Reddit, a system needs to understand multiple interacting signals simultaneously:
- User interests - inferred from upvotes, comments, subscriptions, time spent reading, and what users choose to skip.
- Subreddit relationships - similar communities share overlapping user bases (r/MachineLearning and r/datascience), topical connections (r/cooking and r/MealPrepSunday), or cultural similarities.
- Post quality signals - karma, comment depth, upvote-to-view ratio, whether comments are substantive or shallow.
- Temporal freshness - a breaking news post matters now; a tutorial is relevant for months. Different content types have different decay curves.
- Cross-community interests - a user active in r/Python and r/datascience might enjoy a post in r/MLOps that they have never visited.
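The temporal-freshness point above can be made concrete with per-content-type decay curves. A minimal sketch, assuming exponential decay; the 6-hour and 30-day half-lives are illustrative values, not Reddit's actual parameters:

```python
import math

def freshness(age_hours: float, half_life_hours: float) -> float:
    """Exponential decay: the score halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)

# Hypothetical half-lives for two content types.
NEWS_HALF_LIFE = 6.0        # breaking news fades within hours
TUTORIAL_HALF_LIFE = 720.0  # a tutorial stays relevant for weeks

# After 24 hours, a news post has lost most of its freshness
# (0.5**4 = 0.0625), while a tutorial is nearly as fresh as when posted.
news_24h = freshness(24, NEWS_HALF_LIFE)
tutorial_24h = freshness(24, TUTORIAL_HALF_LIFE)
```

A single global decay rate would either bury tutorials too early or keep stale news alive; content-type-specific curves are one of the many features a flat-table pipeline must hand-engineer.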
Each of these signals lives in a different table. Users in one. Posts in another. Subreddits in a third. Comments, votes, and subscriptions each in their own tables. The relationships between these entities - who posted where, who commented on what, which subreddits share members - are where the predictive signal lives.
Why collaborative filtering plateaus for content platforms
Traditional recommendation systems treat the problem as a user-item interaction matrix. User A upvoted posts 1, 3, and 7. User B upvoted posts 1, 3, and 9. Therefore User A might like post 9. This is collaborative filtering, and it works reasonably well for simple product recommendations.
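The toy example in the paragraph above fits in a few lines. This is a minimal user-based collaborative filtering sketch using Jaccard similarity over upvote sets (one common similarity choice among several; the users and post IDs are from the example, not real data):

```python
# User A upvoted posts 1, 3, 7; User B upvoted posts 1, 3, 9.
upvotes = {
    "A": {1, 3, 7},
    "B": {1, 3, 9},
}

def jaccard(a: set, b: set) -> float:
    """Overlap between two users' upvote sets."""
    return len(a & b) / len(a | b)

def recommend(user: str) -> list:
    """Rank posts the user hasn't seen by the similarity
    of the other users who upvoted them."""
    scores = {}
    for other, items in upvotes.items():
        if other == user:
            continue
        sim = jaccard(upvotes[user], items)
        for post in items - upvotes[user]:
            scores[post] = scores.get(post, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

# A and B share 2 of their 4 distinct posts (similarity 0.5),
# so post 9 is recommended to A.
```

Note what the function never sees: which subreddits the posts live in, who commented on them, or how the users' interests have shifted. Everything outside the interaction matrix is invisible.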
But for content platforms with rich relational structure, collaborative filtering hits fundamental limits:
| Signal | What CF sees | What is lost |
|---|---|---|
| Community structure | User upvoted Post X | Post X is in a subreddit cluster (r/ML, r/datascience, r/AI) with 60% user overlap |
| Comment depth as engagement | User interacted with Post Y | Post Y generated 200+ deep comment threads - a quality signal visible only in the comment graph |
| User interest evolution | User's recent upvotes | User shifted from r/learnpython to r/MachineLearning to r/MLOps over 6 months - a trajectory, not a snapshot |
| Cross-modality patterns | User reads text posts | User who reads ML text posts in r/MachineLearning may engage with ML video tutorials in r/learnmachinelearning |
| Cold-start subreddits | No data (new community) | New subreddit r/LLMOps is topically similar to r/MLOps and r/LangChain - inferrable from description, creator history, and early subscribers |
Collaborative filtering sees the user-item interaction matrix. Everything in the third column - community structure, engagement quality, interest trajectories, cross-modality patterns, and cold-start inference - requires reading the full relational graph.
Each of these blind spots can be addressed individually by engineering features: compute subreddit similarity scores, build comment depth metrics, create user interest trajectory features. But each feature takes weeks to design, implement, validate, and deploy. After 4-5 years of this iterative process, a mature recommendation system has hundreds of hand-crafted features - and still misses interaction patterns that were never explicitly engineered.
The graph approach to content recommendations
Users, posts, subreddits, and comments naturally form a massive heterogeneous graph. Each entity type is a different kind of node. Each relationship - upvotes, subscriptions, authorship, comment replies - is a different kind of edge.
| Node type | Examples | Key attributes |
|---|---|---|
| User | Hundreds of millions | Account age, karma, activity pattern, subscriptions |
| Post | Billions | Title, content type (text/image/video/link), karma, timestamp |
| Subreddit | Millions | Topic, subscriber count, activity level, rules, related communities |
| Comment | Tens of billions | Text, depth in thread, karma, timestamp, parent comment |
Each entity type becomes a node in the graph. The relationships between them - upvotes, posts, subscriptions, replies - become edges. This is the natural structure of the data.
A graph neural network processes this structure by passing messages along edges. Information about a subreddit's user base flows to the posts within it. Information about comment quality flows up to the post. Information about a user's subscription patterns flows to their activity predictions. Each message-passing layer lets information travel one hop further through the graph.
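A minimal sketch of that message-passing step, on a hypothetical three-edge graph with made-up feature vectors. Real GNNs apply learned transformations to the messages; plain averaging is used here only to show how information moves one hop per layer:

```python
from collections import defaultdict

# Tiny typed graph: two hypothetical users, a post, and a subreddit,
# each carrying a 2-d feature vector. Edges are (source, destination).
features = {
    "user:alice":  [1.0, 0.0],
    "user:bob":    [0.0, 1.0],
    "post:p1":     [0.5, 0.5],
    "sub:r/mlops": [0.0, 0.0],
}
edges = [
    ("user:alice", "sub:r/mlops"),  # subscription
    ("user:bob", "sub:r/mlops"),    # subscription
    ("post:p1", "sub:r/mlops"),     # post belongs to subreddit
]

def message_pass(feats, edges):
    """One layer: each node averages its in-neighbors' features
    together with its own. Stacking layers lets information
    travel one hop further per layer."""
    incoming = defaultdict(list)
    for src, dst in edges:
        incoming[dst].append(feats[src])
    out = {}
    for node, vec in feats.items():
        msgs = incoming[node] + [vec]
        out[node] = [sum(dim) / len(msgs) for dim in zip(*msgs)]
    return out

layer1 = message_pass(features, edges)
# The subreddit node now blends its subscribers' and posts' features:
# mean of [1,0], [0,1], [0.5,0.5], [0,0] = [0.375, 0.375].
```

After one layer the subreddit summarizes its members; after a second, a post inside it would inherit that summary, which is exactly the "subreddit's user base flows to its posts" behavior described above.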
After multiple layers, the GNN has learned representations that encode multi-hop patterns:
- Community overlap patterns. Users who subscribe to r/MachineLearning and r/statistics have different content preferences than users who subscribe to r/MachineLearning and r/startups - even though both groups subscribe to the same subreddit. The GNN captures this distinction through the subscription edges.
- Content quality propagation. A post's quality is not just its karma score. It is the depth and substance of its comment threads, the reputation of its commenters, and the engagement patterns of similar posts in related subreddits. These signals propagate through upvote and comment edges.
- User interest evolution. By preserving temporal information on edges, the GNN learns trajectories: a user moving from beginner to advanced topics, shifting from one domain to another, increasing or decreasing engagement over time.
- Cold-start inference. A new subreddit with 50 subscribers is connected to the graph through those subscribers' other activity. If its early members are all active in data engineering communities, the GNN infers what content belongs there - without waiting for thousands of interactions.
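The cold-start inference in the last point can be sketched as simple co-membership counting. The subreddit names and memberships below are invented for illustration, and a real GNN would learn this from embeddings rather than raw counts, but the underlying signal is the same:

```python
from collections import Counter

# Hypothetical other memberships of a new subreddit's first subscribers.
member_subs = {
    "u1": ["r/dataengineering", "r/MLOps"],
    "u2": ["r/dataengineering", "r/Python"],
    "u3": ["r/dataengineering", "r/kubernetes"],
}

def infer_topic_neighbors(member_subs, top_n=2):
    """Rank existing communities by how many early members
    of the new subreddit they share."""
    counts = Counter(s for subs in member_subs.values() for s in subs)
    return [sub for sub, _ in counts.most_common(top_n)]

neighbors = infer_topic_neighbors(member_subs)
# r/dataengineering dominates the early members' other subscriptions,
# so the new community is likely data-engineering adjacent.
```

With only three subscribers and zero posts, the new community already has a usable position in the graph - no thousands of interactions required.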
4-5 years of improvement in 2 months
The key result: what took 4-5 years of iterative collaborative filtering improvement was achieved in 2 months with relational deep learning. This is not about a better algorithm marginally outperforming the old one. It is about a fundamentally different representation of the data.
Years of manual feature engineering - computing subreddit similarity matrices, building user interest decay functions, engineering comment quality scores, creating cross-community affinity features - were replaced by a model that reads the raw relational structure and discovers these patterns automatically.
The GNN did not just replicate the hand-crafted features. It discovered patterns that years of feature engineering had missed: interaction effects between community structure and content type, temporal patterns in cross-community migration, and engagement signals that only become visible when you model the full graph.
Collaborative Filtering (4-5 years iterative)
- Flattens data to user-item interaction matrix
- Each new signal requires manual feature engineering
- Misses community structure and cross-entity patterns
- Cold-start requires separate heuristic systems
- Improvement cycle: months per incremental gain
GNN-Based Recommendations (2 months)
- Reads the full heterogeneous graph directly
- Discovers signals automatically from relational structure
- Captures community overlap, comment quality, interest evolution
- Cold-start handled through graph connectivity
- Discovers patterns that years of feature engineering missed
RelBench recommendation benchmarks
The RelBench benchmark provides an independent measure of how different approaches perform on recommendation tasks across real-world relational databases. The results quantify the gap between flat-table approaches and graph-based methods:
| Approach | MAP@K | Approach type | What it reads |
|---|---|---|---|
| LightGBM | 1.79 | Tabular ML + manual features | Flat feature table |
| GraphSAGE | 1.85 | Basic GNN | Graph structure (limited message passing) |
| KumoRFM | 7.29 | Foundation model for relational data | Full heterogeneous temporal graph |
KumoRFM achieves roughly 4x the MAP@K of both the tabular and basic GNN approaches on recommendation tasks. The gap comes from reading the full relational structure with a pre-trained foundation model, not just from applying a GNN architecture.
The 4x improvement is not from a better algorithm on the same data. It is from a better representation of the data. LightGBM sees a flat feature table. GraphSAGE sees graph structure but with limited expressiveness. KumoRFM reads the full heterogeneous temporal graph with a model pre-trained on thousands of diverse relational databases.
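For readers unfamiliar with the metric in the table: MAP@K rewards ranking relevant items near the top of each user's list. A self-contained sketch of how it is computed, with made-up users, posts, and relevance labels:

```python
def average_precision_at_k(ranked, relevant, k):
    """AP@K: precision at each rank where a relevant item appears,
    averaged over min(number of relevant items, k)."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)  # precision at this rank
    return score / min(len(relevant), k)

def map_at_k(rankings, relevants, k):
    """Mean of AP@K across users."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)

# Hypothetical top-3 rankings for two users.
rankings = [["p1", "p2", "p3"], ["p4", "p5", "p6"]]
relevants = [{"p1", "p3"}, {"p6"}]
score = map_at_k(rankings, relevants, 3)  # (5/6 + 1/3) / 2 = 7/12
```

Because every hit pushed lower in the list drags down precision at that rank, a 4x MAP@K gap means the model is consistently surfacing relevant content far earlier, not just finding slightly more of it.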
What each approach captures
| Signal type | Collaborative filtering | GNN-based recommendations |
|---|---|---|
| Direct user-item interactions | Yes (upvotes, clicks) | Yes (plus context from the full graph) |
| Community structure | No (requires manual clustering) | Yes (learned from subscription and activity edges) |
| Content quality (beyond karma) | No (requires engineered features) | Yes (propagated from comment depth and engagement patterns) |
| User interest evolution | Limited (recent window only) | Yes (temporal edges preserve full trajectory) |
| Cross-community discovery | No (limited to co-occurrence) | Yes (multi-hop paths through shared users and topics) |
| Cold-start entities | No (no interaction history) | Yes (inferred from graph connectivity) |
| Cross-modality preferences | No (separate models per content type) | Yes (content type is a node attribute, not a silo) |
| Multi-hop patterns | No (pairwise only) | Yes (user → subreddit → similar subreddit → trending post) |
Collaborative filtering captures direct user-item interactions. GNN-based recommendations capture everything else: the relational structure that determines why a user will engage with content they have never seen.
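The multi-hop row in the table can be made concrete with a breadth-first search on a toy graph. Every node and edge here is invented for illustration; the point is that a path exists from a user to content they have never touched, via a subreddit and a shared user:

```python
from collections import deque

# Toy undirected-ish adjacency: a user, two subreddits linked by a
# shared subscriber, and a trending post in the second subreddit.
graph = {
    "user:alice": ["sub:r/MLOps"],
    "sub:r/MLOps": ["user:alice", "user:bob"],
    "user:bob": ["sub:r/MLOps", "sub:r/LLMOps"],
    "sub:r/LLMOps": ["user:bob", "post:trending"],
    "post:trending": ["sub:r/LLMOps"],
}

def shortest_path(start, goal):
    """BFS: return the chain of hops connecting two nodes, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path("user:alice", "post:trending")
# alice -> r/MLOps -> bob -> r/LLMOps -> trending post: four hops.
```

Pairwise co-occurrence methods never traverse this chain; a GNN with enough message-passing layers propagates signal along it automatically.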
Building recommendations with PQL
With a relational foundation model, building a content recommendation system does not require months of feature engineering and model iteration. It requires describing what you want to predict.
PQL Query
PREDICT engagement FOR EACH users.user_id, posts.post_id WHERE posts.created_at > CURRENT_DATE - INTERVAL '7 days'
One query replaces the entire recommendation pipeline: user profiling, content scoring, community analysis, and ranking. The foundation model reads raw relational tables - users, posts, subreddits, comments, votes - and discovers which content each user will engage with.
Output
| user_id | post_id | engagement_score | primary_signal |
|---|---|---|---|
| U-44201 | P-891034 | 0.92 | Community overlap + interest trajectory |
| U-44201 | P-891107 | 0.87 | Cross-community topic match |
| U-44201 | P-892441 | 0.71 | High comment quality in related subreddit |
| U-44201 | P-890022 | 0.13 | Low community relevance |
Why this matters beyond Reddit
Reddit's experience is a case study in a general pattern. Any platform with rich relational structure - users, items, categories, interactions, temporal dynamics - faces the same fundamental choice: flatten the data into feature tables and iterate for years, or model the relational graph directly and discover patterns in weeks.
E-commerce platforms have customers, products, categories, reviews, and browsing sessions. Streaming services have viewers, content, genres, ratings, and watch patterns. Social networks have users, posts, connections, groups, and engagement events. In every case, the predictive signal lives in the relationships between entities, not in any single flat table.
The 4-5 years vs 2 months result is not specific to Reddit. It is specific to the gap between manually engineering relational patterns into flat features and automatically learning them from the graph structure. That gap exists anywhere relational data powers recommendations.