Betweenness Centrality

Introduction

In the previous lesson, we used PageRank to find influential papers in our citation network.

Now we’ll explore a different question: Which papers serve as bridges between different research areas?

Betweenness Centrality identifies these connectors—nodes that link different parts of the network together.

What you’ll learn

By the end of this lesson, you’ll be able to:

  • Explain what Betweenness Centrality measures and how it differs from PageRank

  • Run Betweenness Centrality using the Python GDS client

  • Compare influence (PageRank) with connectivity (Betweenness) using visualisation

  • Identify papers that bridge multiple research areas

What Betweenness measures

Betweenness Centrality measures how often a node lies on the shortest paths between other nodes in the network.

Betweenness Centrality concept

High Betweenness means high connectivity

If a node has high Betweenness, it means many shortest paths pass through it.

This makes the node a bottleneck or bridge in the network. If you removed it, many nodes would become harder to reach from each other.

Network diagram with a highlighted node acting as a bridge between two clusters.

Betweenness in a social network

Think about a social network for a moment.

A person with high PageRank is like a celebrity—everyone knows who they are.

A person with high Betweenness is like a connector—they introduce people from different social circles to each other.

Celebrity versus connector in a network

Both types matter

Both roles are valuable, but for different reasons.

The celebrity has influence. The connector enables information flow across the network.

In our citation network, we want to find the papers that play the connector role.


How Betweenness is calculated

For each node in the graph, the algorithm:

  1. Finds all shortest paths between every pair of other nodes

  2. Counts how many of those paths pass through the node

  3. For each pair, divides that count by the total number of shortest paths between the pair, then sums these fractions over all pairs (the formula below states this precisely)
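
Putting the three steps together gives the standard (unnormalised) Betweenness formula, where σ_st is the number of shortest paths from s to t and σ_st(v) is the number of those paths that pass through v. GDS writes these raw sums by default, which is why the scores you will see later are large integers rather than values between 0 and 1.

latex
Betweenness Centrality formula (unnormalised)
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}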

Betweenness calculation steps

Computational cost

Betweenness is more computationally expensive than PageRank because it needs to compute shortest paths between many pairs of nodes.

For the Cora dataset with 2,708 nodes, this completes quickly. For larger graphs, you may need to use sampling (we’ll cover this later).

Setup: Retrieve the projection

First, we retrieve the graph projection we created earlier.

python
Retrieving the existing projection
G = gds.graph.get("cora-graph")

print(f"Graph '{G.name()}' loaded:")
print(f"  {G.node_count():,} nodes")
print(f"  {G.relationship_count():,} relationships")

Running Betweenness Centrality

Now we can run Betweenness Centrality and write the results to the database.

python
Running Betweenness with write mode
bc_result = gds.betweenness.write(  # (1)
    G,
    writeProperty='betweenness'  # (2)
)

print(f"Computed Betweenness for {bc_result['nodePropertiesWritten']:,} papers")
  1. Unlike PageRank, Betweenness has no damping factor or iteration parameters — it computes exact results by default

  2. The property name where each node’s Betweenness score will be stored

Inspecting the distribution

GDS automatically provides distribution statistics in the result object.
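
The write call returns these statistics as a pandas Series, so you can display the whole result or pull out individual fields by name. A small sketch using the bc_result from the previous step; the full output is shown below:

python
Inspecting the write result
display(bc_result)  # full result Series, shown below

# Individual fields can be accessed by name
print(bc_result['centralityDistribution']['max'])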

centralityDistribution    {'max': 850,664, 'mean': 6,053, 'min': 0, 'p50': 942, 'p75': 4,729, 'p90': 12,956, 'p95': 23,651, 'p99': 72,896, 'p999': 324,664}
computeMillis             180
configuration             {'concurrency': 4, 'writeConcurrency': 4, 'writeProperty': 'betweenness', 'nodeLabels': ['*'], 'relationshipTypes': ['*'], 'logProgress': True, 'sudo': False, 'writeToResultStore': False}
nodePropertiesWritten     2,708
postProcessingMillis      23
preProcessingMillis       0
writeMillis               9

Key parameters

Here are the main parameters you can configure when running Betweenness Centrality:

Parameter        Default       Description
writeProperty    (required)    Property name for storing results
samplingSize     all nodes     Number of source nodes for approximation
samplingSeed     random        Seed for reproducible sampling

Exact versus approximate

By default, Betweenness computes exact results using all nodes as sources.

Flowchart showing decision process for exact vs. approximate Betweenness calculations.

For large graphs, you can use sampling to get approximate results much faster. We’ll look at this option later in the lesson.

Finding bridge papers

With Betweenness scores written to the database, we can query for the top bridge papers.

python
Querying top bridge papers
q_top_betweenness = """
    MATCH (p:Paper)
    WHERE p.betweenness IS NOT NULL
    RETURN
        p.paper_Id AS paperId,
        p.subject AS subject,
        p.betweenness AS betweenness,
        p.pageRank AS pageRank  // (1)
    ORDER BY p.betweenness DESC
    LIMIT 10
"""

df_bridges = gds.run_cypher(q_top_betweenness)  # (2)
display(df_bridges)
  1. Including PageRank alongside Betweenness lets us compare influence vs connectivity for each paper

  2. gds.run_cypher() returns results as a pandas DataFrame for easy analysis

Interpreting the results

Notice how the top papers by Betweenness are often different from the top papers by PageRank.

High Betweenness papers act as connectors between research communities. These are often methodological papers whose techniques apply across multiple domains.

Interpreting Betweenness scores

Betweenness scores are relative to your specific graph:

  • A score of zero means no shortest paths pass through this node (often leaf nodes); we count these in the check below

  • Low scores indicate regular nodes with few paths passing through them

  • High scores indicate bridges or bottlenecks

You should only compare scores within the same graph. A score of 500 means very different things in different networks.
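
As a quick sanity check on the first point, you can count how many papers ended up with a score of exactly zero. A minimal query sketch, following the same pattern as the earlier queries:

python
Counting papers with zero Betweenness
q_zero = """
    MATCH (p:Paper)
    WHERE p.betweenness = 0
    RETURN count(p) AS zero_score_papers
"""
display(gds.run_cypher(q_zero))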

Comparing PageRank and Betweenness

Let’s visualise the relationship between influence (PageRank) and connectivity (Betweenness).

This helps us understand the different types of important papers in the network.

python
Querying papers with both metrics
q_both_metrics = """
    MATCH (p:Paper)
    WHERE p.pageRank IS NOT NULL
      AND p.betweenness IS NOT NULL  // (1)
    RETURN p.pageRank AS pageRank,
           p.betweenness AS betweenness,
           p.subject AS subject
"""

df_metrics = gds.run_cypher(q_both_metrics)
  1. Filtering for nodes with both metrics ensures we only compare papers that have been scored by both algorithms

Creating a scatter plot

Now we can create a scatter plot to see how the two metrics relate.

python
Creating a scatter plot of PageRank vs Betweenness
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(df_metrics['pageRank'],
            df_metrics['betweenness'],
            alpha=0.5, s=10)  # (1)
plt.xlabel('PageRank (Influence)')
plt.ylabel('Betweenness Centrality (Connectivity)')  # (2)
plt.title('PageRank vs Betweenness in Citation Network')
plt.grid(True, alpha=0.3)
plt.show()
  1. Low alpha and small point size prevent overplotting when thousands of papers overlap

  2. Labeling axes with the metric interpretation (not just the name) helps readers understand the plot
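
Because the Betweenness distribution is heavily skewed (the statistics above showed a median under 1,000 but a maximum over 800,000), most points bunch up near the x-axis. An optional variant of the same plot uses a symlog-scaled y-axis so that zero-Betweenness papers remain visible:

python
Optional: scatter plot with a log-scaled Betweenness axis
plt.figure(figsize=(10, 6))
plt.scatter(df_metrics['pageRank'],
            df_metrics['betweenness'],
            alpha=0.5, s=10)
plt.yscale('symlog')  # symlog handles the many zero-Betweenness papers
plt.xlabel('PageRank (Influence)')
plt.ylabel('Betweenness Centrality (Connectivity, symlog scale)')
plt.title('PageRank vs Betweenness in Citation Network')
plt.grid(True, alpha=0.3)
plt.show()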

Bottom-left: Regular papers

Most papers cluster in the bottom-left quadrant. These have low influence and low connectivity.

This is expected—the majority of papers in any network are not exceptional by either measure.

Top-right: Superstar papers

Papers in the top-right quadrant are both influential AND highly connected.

These are rare papers that are widely cited and also serve as bridges between research areas.

Top-left: Bridge papers

Papers in the top-left quadrant have high connectivity but only moderate influence; in this particular network, no papers fall here.

When they do appear, these are true bridge papers: they connect different research areas even though they may not be the most famous works.

Bottom-right: Influential specialists

Papers in the bottom-right quadrant are influential but not strongly connecting different areas.

These are important papers within their specific field, but they don’t bridge to other research communities.
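
If you want to make the quadrants concrete rather than eyeballing the plot, you can classify papers in pandas. The 90th-percentile cut-offs below are arbitrary assumptions; adjust them to your own graph's score distribution:

python
Classifying papers into quadrants (illustrative thresholds)
pr_cut = df_metrics['pageRank'].quantile(0.90)      # assumed "high influence" cut-off
bc_cut = df_metrics['betweenness'].quantile(0.90)   # assumed "high connectivity" cut-off

high_pr = df_metrics['pageRank'] >= pr_cut
high_bc = df_metrics['betweenness'] >= bc_cut

df_metrics['quadrant'] = 'regular'
df_metrics.loc[high_pr & high_bc, 'quadrant'] = 'superstar'
df_metrics.loc[~high_pr & high_bc, 'quadrant'] = 'bridge'
df_metrics.loc[high_pr & ~high_bc, 'quadrant'] = 'influential specialist'

print(df_metrics['quadrant'].value_counts())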

Combining algorithms reveals richer insights

Different algorithms reveal different types of importance in your network.

Using PageRank alone, you would miss the bridge papers. Using Betweenness alone, you would miss the influential specialists.

Combining them gives you a more complete picture.

Cross-subject bridge analysis

Let’s find out which research subjects produce the most bridge papers.

python
Analysing bridge papers by subject
q_subject_bridges = """
    MATCH (p:Paper)
    WHERE p.betweenness > 1000  // (1)
    RETURN
        p.subject AS subject,
        count(p) AS num_bridge_papers,
        avg(p.betweenness) AS avg_betweenness,
        avg(p.pageRank) AS avg_pageRank  // (2)
    ORDER BY num_bridge_papers DESC
"""

df_subject_bridges = gds.run_cypher(q_subject_bridges)
display(df_subject_bridges)
  1. Threshold of 1000 selects papers with meaningful bridge roles; adjust based on your graph’s score distribution

  2. Including PageRank alongside Betweenness shows whether bridge subjects are also influential

Interpreting subject analysis

Subjects with many high-Betweenness papers tend to be methodological fields that influence multiple domains.

These are the research areas where techniques and ideas flow outward to other communities.

Table 1. Bridge Papers by Research Subject
subject                   num_bridge_papers    avg_betweenness    avg_pageRank
Neural_Networks           401                  11,854             1.24
Theory                    206                  12,227             1.37
Probabilistic_Methods     205                  13,763             1.31
Genetic_Algorithms        188                  14,365             1.45
Case_Based                159                  9,407              1.24
Reinforcement_Learning    108                  13,188             1.44
Rule_Learning             73                   6,838              1.22

Finding specific cross-subject bridges

We can also find papers that specifically cite across multiple different research areas.

These are the truly interdisciplinary works.

Querying cross-subject bridges

This query finds papers with high Betweenness that cite at least three different subjects.

python
Finding papers that bridge multiple research areas
q_cross_subject = """
    MATCH (p:Paper)-[:CITES]->(cited:Paper)
    WHERE p.betweenness > 500
      AND p.subject <> cited.subject  // (1)
    WITH p,
         collect(DISTINCT cited.subject) AS cited_subjects,
         count(DISTINCT cited.subject) AS num_subjects_cited
    WHERE num_subjects_cited >= 3  // (2)
    RETURN
        p.paper_Id AS paperId,
        p.subject AS subject,
        p.betweenness AS betweenness,
        p.pageRank AS pageRank,
        num_subjects_cited,
        cited_subjects  // (3)
    ORDER BY num_subjects_cited DESC, p.betweenness DESC
    LIMIT 10
"""

df_cross_subject = gds.run_cypher(q_cross_subject)
display(df_cross_subject)
  1. Filters to cross-subject citations only, ignoring within-field references

  2. Requires at least 3 different subjects cited — true interdisciplinary breadth

  3. Returns the list of cited subjects so you can see exactly which fields each paper bridges

Truly interdisciplinary papers

These papers have both structural importance (high Betweenness) and actually cite work from three or more different fields.

They serve as knowledge transfer hubs, synthesising ideas from across the research landscape.

Table 2. Papers Bridging Multiple Research Areas
paperId    subject            betweenness    pageRank    num_subjects_cited    cited_subjects
28254      Neural_Networks    8,692          1.75        4                     Case_Based, Reinforcement_Learning, Theory, Probabilistic_Methods
299197     Neural_Networks    82,818         1.08        3                     Genetic_Algorithms, Reinforcement_Learning, Probabilistic_Methods
35797      Neural_Networks    78,154         2.90        3                     Probabilistic_Methods, Theory, Rule_Learning
46431      Neural_Networks    30,303         0.95        3                     Case_Based, Theory, Genetic_Algorithms
1104449    Neural_Networks    18,376         0.81        3                     Theory, Case_Based, Probabilistic_Methods
…          …                  …              …           …                     …

Sampling for large graphs

For large graphs with millions of nodes, exact Betweenness computation can be slow.

You can use sampling to get approximate results much faster.

Using the samplingSize parameter

The samplingSize parameter limits how many source nodes are used in the calculation.

python
Running Betweenness with sampling
bc_sampled_result = gds.betweenness.write(
    G,
    writeProperty='betweenness_sampled',
    samplingSize=500,  # (1)
    samplingSeed=42  # (2)
)

print(f"Computed sampled Betweenness for {bc_sampled_result['nodePropertiesWritten']:,} papers")
  1. Uses only 500 source nodes instead of all 2,708 — trades accuracy for speed on large graphs

  2. Fixed seed ensures reproducible results across runs

Comparing exact and sampled results

You can compare exact and sampled results to see how close the approximation is.

python
Comparing exact vs sampled Betweenness
q_compare = """
    MATCH (p:Paper)
    WHERE p.betweenness IS NOT NULL
      AND p.betweenness_sampled IS NOT NULL  // (1)
    RETURN p.paper_Id AS paperId,
           p.betweenness AS exact,
           p.betweenness_sampled AS sampled
    ORDER BY p.betweenness DESC  // (2)
    LIMIT 10
"""

df_compare = gds.run_cypher(q_compare)
display(df_compare)
  1. Both properties must exist to compare — only papers scored by both runs are included

  2. Ordering by exact scores lets you check whether the top bridges are preserved by sampling
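
Beyond eyeballing the top ten, a rank correlation across all papers gives a single-number summary of how well sampling preserves the ordering. A sketch using pandas, following the same query pattern as above:

python
Rank correlation between exact and sampled scores
q_all_scores = """
    MATCH (p:Paper)
    WHERE p.betweenness IS NOT NULL
      AND p.betweenness_sampled IS NOT NULL
    RETURN p.betweenness AS exact,
           p.betweenness_sampled AS sampled
"""
df_all_scores = gds.run_cypher(q_all_scores)

# Spearman compares rankings rather than raw values, which is what matters here
print(df_all_scores['exact'].corr(df_all_scores['sampled'], method='spearman'))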

Compare sampled vs unsampled results

Table 3. Top 10 papers: Exact vs Sampled Betweenness
paperId    exact      sampled
35         850,663    135,989
3229       460,318    84,396
4330       324,663    67,198
1365       310,253    69,272
6213       278,922    57,619
887        247,254    54,440
1272       230,426    35,196
910        193,210    34,932
3231       182,882    32,799
6214       149,405    29,118

When to use sampling

Consider using sampling when:

  • Your graph has millions of nodes

  • Exact computation takes too long for your use case

  • Approximate results are acceptable for your analysis

For the Cora dataset with 2,708 nodes, exact computation is fast enough that sampling isn’t necessary.

On larger graphs, sampling can significantly reduce computation time while still identifying the most important bridge nodes.

Performance considerations

Betweenness is more expensive than PageRank for a few reasons:

  • It computes shortest paths between many node pairs

  • Memory usage increases with the level of parallelism

  • Undirected graphs are more expensive than directed graphs

Handling memory issues

If you run into memory problems on large graphs, you can:

  • Reduce the concurrency parameter to use fewer parallel threads

  • Use samplingSize to compute approximate results

  • Consider whether you need exact Betweenness or if another centrality measure would suffice
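
GDS can also estimate memory requirements before you commit to a run. A minimal sketch, assuming the same projection G; the estimate endpoint mirrors the algorithm call, though the exact fields in the returned result may vary by GDS version:

python
Estimating memory before running Betweenness
estimate = gds.betweenness.write.estimate(
    G,
    writeProperty='betweenness'
)

# requiredMemory summarises the estimated memory range for this run
print(estimate['requiredMemory'])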

Disconnected graphs

Betweenness only counts paths that actually exist in the graph.

If your graph has isolated components, nodes may have zero Betweenness even if they are central within their own component.

Disconnected graph with isolated components

Betweenness relativity

A Betweenness score of 5000 means nothing without context.

You should only compare scores within the same graph.

The same score in a different network would have a completely different meaning.

Illustration showing importance of context in Betweenness score interpretation.

Common pitfall: Expecting similarity to PageRank

High PageRank does not imply high Betweenness, and vice versa.

Comparison chart illustrating differences between PageRank and Betweenness Centrality.

They measure fundamentally different things: Influence versus connectivity.

Summary

Betweenness Centrality identifies bridges and bottlenecks in your network:

  • It measures how often a node lies on shortest paths between other nodes

  • High-Betweenness nodes are connectors that link different parts of the network

  • It complements PageRank by revealing a different type of importance: Connectivity rather than influence

  • For large graphs, use the samplingSize parameter to get approximate results faster

  • Always compare scores within the same graph, never across different networks

You’ve now used both PageRank and Betweenness to analyse the citation network, finding both influential papers and bridge papers.

Next: Use Louvain community detection to discover research clusters in the network.
