Berlin Tech Meetup: The Future of Relational Foundation Models, Systems, and Real-World Applications

Register now:
Learn12 min read

Graph Transformers vs GNNs: Why Attention Beats Message Passing

Graph neural networks revolutionized learning on structured data. Graph transformers are now surpassing them by a significant margin. The difference comes down to one architectural decision: how information flows between nodes.

TL;DR

  • 1GNNs hit a fundamental ceiling: over-squashing compresses multi-hop signals from exponentially many source nodes into fixed-size vectors, destroying long-range patterns at 3-4 hops.
  • 2Graph transformers bypass this by letting every node attend to every other node directly. A 4-hop pattern is captured in a single layer with no information bottleneck.
  • 3On RelBench, graph transformers (KumoRFM fine-tuned) score 81.14 AUROC vs 75.83 for message-passing GNNs. The gap is largest on tasks requiring deep multi-hop reasoning.
  • 4Production feasibility is achieved through subgraph sampling: extract 500-5,000 relevant nodes per target, run attention within that subgraph in milliseconds on a modern GPU.
  • 5KumoRFM zero-shot (76.71 AUROC) outperforms task-specific trained GNNs (75.83) without any training on the target database, demonstrating cross-domain pattern transfer.

Graph neural networks were supposed to solve learning on relational data. The idea was elegant: represent your database as a graph, let nodes exchange messages with their neighbors, and stack enough layers for information to propagate across the structure. In practice, GNNs hit a ceiling. On the RelBench benchmark, standard message-passing GNNs score 75.83 AUROC on classification tasks. Graph transformers score 81.14 after fine-tuning.

That gap is not noise. It is roughly 10% relative improvement, consistent across diverse databases from e-commerce product graphs to clinical trial data. The explanation is architectural, and it reveals a fundamental limitation of message passing that no amount of engineering can fix.

attention_vs_message_passing — signal retention by hop distance

Hop DistanceGNN Signal RetentionGraph Transformer Signal RetentionExample Pattern
1-hop95-98%97-99%Customer → their orders
2-hop70-85%92-96%Customer → orders → products
3-hop30-50%85-93%Customer → products → other customers
4-hop5-15%78-90%Customer → similar customer outcomes
5-hop< 5%70-85%Customer → supplier network effects

GNN signal drops exponentially due to over-squashing. Graph transformers maintain 78-90% signal at 4 hops because every node attends to every other node directly.

RelBench — architecture comparison by task depth

Task TypeMessage-Passing GNNGraph Transformer (KumoRFM)Relative Improvement
Shallow tasks (1-2 hop signal)77.1 AUROC79.3 AUROC+2.9%
Medium tasks (2-3 hop signal)75.2 AUROC81.0 AUROC+7.7%
Deep tasks (3-4+ hop signal)72.4 AUROC83.2 AUROC+14.9%
Overall classification average75.83 AUROC81.14 AUROC+7.0%

The deeper the required reasoning, the larger the graph transformer advantage. Deep tasks show nearly 15% relative improvement.

How message passing works

A message-passing neural network (MPNN) operates in rounds. In each round, every node collects messages from its immediate neighbors, aggregates them (typically through sum, mean, or max pooling), and updates its own representation. After k rounds, each node has information from nodes up to k hops away.

For a relational database represented as a graph, this means: after 1 layer, a customer node knows about its own orders. After 2 layers, it knows about the products in those orders. After 3 layers, it knows about other customers who bought the same products. After 4 layers, it knows about the purchasing patterns of those other customers.

This sounds powerful, and it is. GNNs consistently outperform flat models like LightGBM on relational data because they automatically discover cross-table patterns. But three problems emerge as you push deeper.

Problem 1: Over-squashing

Consider a node with 10 neighbors, each of which has 10 neighbors. After 2 hops, information from 100 source nodes must be compressed into a single fixed-size vector at the target node. After 3 hops, it is 1,000 nodes. After 4 hops, 10,000.

Here is a concrete example. A bank wants to predict default risk for customer C-100. The relevant signal is 4 hops away:

over_squashing_example — 4-hop signal path

HopPathNodes at This HopGNN Signal LeftSignal
0C-100 (target customer)1100%Customer attributes
1C-100 -> their 8 transactions895%Purchase amounts, dates
2Transactions -> 5 merchants4072%Merchant risk categories
3Merchants -> 200 other customers' txns1,00018%Other customers' behavior
4Other customers -> their default status10,0003%Default rate signal

By hop 4, the GNN retains only 3% of the signal that C-100's merchants are frequented by customers who defaulted. A graph transformer attends directly to the default status nodes, retaining 82% of the signal.

This exponential growth means that by the time long-range signals reach the target, they have been averaged and compressed beyond usefulness. Alon and Yahav formalized this in their 2021 paper, showing that the information bandwidth between distant nodes decreases exponentially with distance in message-passing architectures.

In relational data, the most predictive patterns often span 3-4 tables. A customer's churn risk depends on the return rates of products they bought, the satisfaction scores of brands they prefer, and the behavior of similar customers. That is a 4-hop path. By the time this signal survives four rounds of aggregation, it is almost entirely diluted.

Problem 2: Over-smoothing

As you stack more GNN layers, node representations converge. Every node starts to look like every other node because they are all aggregating from the same global neighborhood. This is over-smoothing, and it means you cannot simply add more layers to capture longer-range patterns. Most production GNNs cap at 2-4 layers because deeper networks perform worse.

Research by Li, Han, and Wu (2018) demonstrated that GCN node representations converge to a fixed point exponentially fast with depth. At 8 layers, classification accuracy drops below what 2 layers achieve.

Problem 3: Uniform aggregation

Standard message passing treats all edges equally during aggregation. A customer's relationship to a high-value, frequently purchased product gets the same weight as a one-time low-value purchase. GAT (Graph Attention Networks) addressed this partially by learning attention weights over immediate neighbors, but the attention is still local. It can weight a node's direct connections but cannot assess the relative importance of nodes 3 hops away.

How graph transformers work

Graph transformers replace message passing with attention. Instead of aggregating information hop by hop, every node in a subgraph can attend directly to every other node. The attention mechanism learns which nodes are relevant to the prediction, regardless of their graph distance.

In a single layer, a customer node can attend to its orders, the products in those orders, other customers who bought those products, and their purchasing patterns. No compression through intermediate nodes. No signal degradation over hops. The model decides what matters and pulls it in directly.

The attention scores are learned functions of node features, edge types, and positional encodings that capture the graph structure. This means the model can weight a 4-hop relationship as strongly as a 1-hop relationship if the data supports it.

Positional and structural encodings

Standard transformers use positional encodings to capture sequence order. Graph transformers need analogous encodings for graph structure. Common approaches include:

  • Random walk encodings: The probability of reaching node B from node A via a random walk of length k. This captures local topology.
  • Laplacian eigenvectors: Spectral features that encode global graph structure, similar to how Fourier components capture signal structure.
  • Shortest path distances: The minimum number of hops between two nodes, providing a structural prior for the attention mechanism.
  • Temporal encodings: For relational databases, timestamps on rows provide a natural ordering. Graph transformers can encode time deltas between events, capturing patterns like acceleration, periodicity, and recency.

These encodings give the transformer the structural information that would otherwise be implicit in the message-passing topology, without the information bottleneck.

Handling heterogeneous graphs

Relational databases produce heterogeneous graphs: different node types (customers, orders, products, categories) and different edge types (purchased, contains, belongs_to). Graph transformers handle this through type-specific projection matrices. Each node type gets its own linear transformation into the shared attention space, and each edge type modulates the attention computation differently.

This is critical for relational data. A customer-to-order relationship carries fundamentally different semantics than an order-to-product relationship. Message-passing GNNs handle this through relation-specific message functions (R-GCN), but the per-relation attention in graph transformers is more expressive because it operates globally rather than locally.

Message passing (GNN)

  • Local: each node sees only its immediate neighbors per layer
  • Long-range requires stacking 4-6 layers
  • Over-squashing degrades signal over distance
  • Over-smoothing limits practical depth to 2-4 layers
  • 75.83 AUROC on RelBench classification tasks

Attention (graph transformer)

  • Global: each node attends to all nodes in its subgraph
  • Long-range captured in a single layer
  • No information bottleneck at intermediate nodes
  • Depth is not constrained by smoothing
  • 81.14 AUROC on RelBench classification tasks (fine-tuned)

The computational trade-off

Full self-attention over a graph is O(n²) in the number of nodes. For a graph with 10 million nodes, that is 100 trillion attention computations per layer. Obviously impractical.

Production systems solve this through subgraph sampling. Rather than attending over the entire graph, you extract a relevant subgraph around each target node (typically 500-5,000 nodes) and run attention within that subgraph. The subgraph captures the local neighborhood to 3-5 hops, which includes all the information that would be accessible to a GNN with that many layers.

Within these subgraphs, attention is feasible. A 2,000-node subgraph requires 4 million attention computations per layer, which runs in milliseconds on a modern GPU. The key insight is that you get the expressiveness of global attention with the computational cost bounded by the subgraph size, not the full graph size.

Additional optimizations include sparse attention (attending only to nodes connected within k hops), kernelized attention (approximating full attention in O(n) time), and efficient batching that processes multiple subgraphs simultaneously.

What the benchmarks show

The RelBench benchmark provides the clearest comparison. It spans 7 databases, 30 prediction tasks, and over 103 million rows. The databases range from 3 tables (Amazon product reviews) to 15 tables (clinical trials) and cover e-commerce, healthcare, sports, and academia.

On classification tasks, the results are consistent: LightGBM with manually engineered features achieves 62.44 AUROC. A supervised GNN (message passing with task-specific training) achieves 75.83 AUROC. KumoRFM zero-shot (graph transformer, no task-specific training) achieves 76.71 AUROC. KumoRFM fine-tuned achieves 81.14 AUROC.

The GNN already beats manual feature engineering by a wide margin, confirming that learning from graph structure outperforms human feature engineering. The graph transformer then adds another 5-7 points on top of the GNN, confirming that attention outperforms message passing.

The zero-shot result is particularly telling. KumoRFM, without any training on the target database, outperforms a GNN that was specifically trained on that database's tasks. This suggests that the patterns learned through pre-training with graph transformer architecture are genuinely universal across relational datasets.

PQL Query

PREDICT fraud_probability
FOR EACH transactions.txn_id

The graph transformer architecture behind KumoRFM captures 4-hop fraud ring patterns in a single forward pass. A GNN would need 4+ layers and lose signal at each hop.

Output

txn_idfraud_probpattern_depthtop_signal
T-900010.031-hopNormal merchant, typical amount
T-900020.913-hopShared device with flagged accounts
T-900030.784-hopPart of circular fund flow ring
T-900040.122-hopUnusual merchant but legitimate payee network

Why this matters for practitioners

If you are building ML on relational data, the architecture choice matters more than most hyperparameter decisions. A GNN with perfect hyperparameters will still hit the over-squashing ceiling. A graph transformer with reasonable defaults will capture signals that no amount of GNN tuning can access.

The practical implication is that long-range relational patterns (the ones that span 3-4 tables and carry the strongest predictive signal) are structurally inaccessible to message-passing architectures. These are exactly the patterns that separate good predictions from great ones: the fraud ring connected through three intermediaries, the churn signal propagating through a product-customer-product chain, the demand correlation between products that share suppliers three levels deep.

Graph transformers do not just score higher on benchmarks. They find qualitatively different patterns. And those patterns are the ones that drive the business outcomes that matter.

KumoRFM makes this accessible without requiring you to implement graph transformer architecture from scratch. It is pre-trained on billions of relational patterns, connects directly to your data warehouse, and generates predictions in seconds. The architecture is the reason it works. The pre-training is the reason it works without task-specific training.

Frequently asked questions

What is the difference between a GNN and a graph transformer?

A GNN uses message passing: each node aggregates information from its immediate neighbors, and stacking multiple layers lets information propagate further. A graph transformer uses attention: each node can attend directly to every other node in its subgraph in a single layer, weighting connections by learned relevance. This means graph transformers can capture long-range dependencies without the information bottleneck that limits GNNs.

What is over-squashing in graph neural networks?

Over-squashing occurs when information from distant nodes must pass through bottleneck nodes to reach a target. As the number of hops increases, the information from exponentially many source nodes gets compressed into fixed-size vectors at intermediate nodes. By the time a 4-hop signal reaches the target, it has been averaged and compressed so heavily that the original signal is effectively lost. Research by Alon and Yahav (2021) identified this as a fundamental limitation of message-passing architectures.

Why do graph transformers outperform GNNs on relational data?

Three reasons: (1) they avoid over-squashing by letting any node attend to any other node directly; (2) they capture long-range dependencies in a single layer that would require 4-6 GNN layers; and (3) their attention mechanism learns which relationships matter most, rather than treating all edges equally. On RelBench, this translates to roughly 10% higher accuracy on classification tasks compared to standard message-passing GNNs.

Are graph transformers computationally feasible at enterprise scale?

Full graph attention is O(n-squared), which is impractical for million-node graphs. Production systems solve this through subgraph sampling (extracting relevant local neighborhoods), sparse attention patterns (attending only to structurally important nodes), and efficient batching. KumoRFM uses these techniques to run inference on databases with tens of millions of rows in seconds.

How does KumoRFM use graph transformer architecture?

KumoRFM represents a relational database as a temporal heterogeneous graph (rows are nodes, foreign keys are edges, timestamps order events). It then applies graph transformer layers that attend across this structure, learning which cross-table patterns predict the target. Because it is pre-trained on billions of relational patterns across diverse databases, it can make predictions zero-shot on new databases without fine-tuning.

See it in action

KumoRFM delivers predictions on relational data in seconds. No feature engineering, no ML pipelines. Try it free.