Louvain Community Detection

Introduction

You’ve already used Louvain in the fraud module to detect fraud rings by optimizing modularity.

Graph showing papers as nodes

Louvain on Citations

Now we’ll apply the same algorithm to the citation network to discover research communities—groups of papers that cite each other more than they cite papers outside the group.

The question we’ll explore: do citation-based communities align with official subject labels?

What you’ll learn

By the end of this lesson, you’ll be able to:

  • Run Louvain community detection to find research communities

  • Analyze how detected communities relate to official subject labels

  • Combine community membership with centrality metrics for richer analysis

  • Interpret community statistics to identify influential and bridging communities

Setup: Retrieve the projection

First, we retrieve the graph projection we’ve been working with.

python
Retrieving the existing projection
G = gds.graph.get("cora-graph")

print(f"Graph '{G.name()}' ready with {G.node_count():,} nodes")

Running Louvain community detection

Now we run Louvain and write the community assignments to the database.

python
Running Louvain with write mode
louvain_result = gds.louvain.write(
    G,
    writeProperty='louvainCommunity',  # (1)
    maxLevels=10,  # (2)
    maxIterations=10
)

print(f"Detected {louvain_result['communityCount']} communities")
print(f"Modularity score: {louvain_result['modularity']:.4f}")  # (3)
  1. Each node gets a community ID stored in this property

  2. maxLevels controls the depth of hierarchical merging; Louvain recursively coarsens the graph

  3. Modularity ranges from -0.5 to 1.0; values above 0.3 indicate meaningful community structure
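Louvain optimizes modularity directly, so it helps to see the formula once in code. Below is a minimal sketch (plain Python on a toy six-node graph, not the GDS implementation) of undirected modularity Q = Σ_c [L_c/m − (k_c/2m)²], where L_c is the number of edges inside community c, k_c is the total degree of its nodes, and m is the total edge count:

```python
def modularity(edges, communities):
    """Q = sum over communities c of (L_c / m - (k_c / (2m))**2)."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for nodes in communities:
        nodes = set(nodes)
        # edges with both endpoints inside the community
        l_c = sum(1 for u, v in edges if u in nodes and v in nodes)
        # total degree of the community's nodes
        k_c = sum(degree[n] for n in nodes)
        q += l_c / m - (k_c / (2 * m)) ** 2
    return q

# Two triangles joined by a single bridge edge (2-3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good_split = [{0, 1, 2}, {3, 4, 5}]
print(round(modularity(edges, good_split), 4))  # 0.3571
```

Splitting at the bridge scores Q ≈ 0.36, comfortably above the ~0.3 threshold mentioned above; a partition that mixes the two triangles drives Q negative.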

Understanding the results

The key metrics from our Louvain run:

| Metric | Value |
| --- | --- |
| modularity | 0.82 (strong community structure) |
| communityCount | 104 communities detected |
| ranLevels | 4 hierarchical levels |
| mean community size | 26 papers |

A modularity of 0.82 indicates well-defined communities with dense internal connections.

Communities versus subject labels

The Cora dataset has official subject labels like Neural Networks, Theory, and Reinforcement Learning.

Louvain finds communities based purely on citation patterns, without knowing about these labels.

An interesting question is whether the detected communities align with the official labels.

What do these communities mean?

We found 104 communities, but what do they represent?

The Cora dataset has official subject labels assigned by humans. Comparing Louvain’s communities to these labels tells us whether citation behavior follows topical boundaries.

If communities align with subjects, papers mostly cite within their field. If they don’t, citation patterns reveal structure that formal labels miss.

Comparing communities with subjects

Let’s see which subjects appear in each community.

python
Analyzing community composition
q_community_subjects = """
    MATCH (p:Paper)
    WHERE p.louvainCommunity IS NOT NULL
    WITH p.louvainCommunity AS community,
         p.subject AS subject,
         count(*) AS count  // (1)
    RETURN community, subject, count
    ORDER BY community, count DESC  // (2)
    LIMIT 20
"""

df_comm_subj = gds.run_cypher(q_community_subjects)
pp(df_comm_subj.head(20))
  1. Counts papers per subject within each community to see the dominant topic

  2. Ordering by count DESC within each community puts the dominant subject first
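One way to quantify "dominated by a single subject" is a purity score per community: the dominant subject's share of that community's papers. A hedged pandas sketch, using only the six rows shown in Table 1 below (the query is LIMIT 20, so these totals are partial, not full-dataset counts):

```python
import pandas as pd

# Rows copied from Table 1 (two subjects per community; partial totals)
df = pd.DataFrame({
    "community": [289, 289, 326, 326, 622, 622],
    "subject": ["Case_Based", "Rule_Learning", "Probabilistic_Methods",
                "Neural_Networks", "Neural_Networks", "Probabilistic_Methods"],
    "count": [111, 4, 88, 17, 126, 14],
})

# Total papers per community (within these rows)
totals = df.groupby("community")["count"].sum()
# Row with the largest count in each community = dominant subject
dominant = df.loc[df.groupby("community")["count"].idxmax()].set_index("community")
purity = (dominant["count"] / totals).round(2)
print(purity)
```

On these rows, community 622 comes out at 0.90 purity (126 of 140 papers), matching the Neural_Networks share quoted in the lesson.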

Interpreting the composition

Most communities are dominated by a single subject:

Table 1. Sample Community Composition

| community | subject | count |
| --- | --- | --- |
| 289 | Case_Based | 111 |
| 289 | Rule_Learning | 4 |
| 326 | Probabilistic_Methods | 88 |
| 326 | Neural_Networks | 17 |
| 622 | Neural_Networks | 126 |
| 622 | Probabilistic_Methods | 14 |

Each community has a clear dominant subject, with minor contributions from related fields.

What the composition reveals

The large communities (289, 326, 622) each have a dominant subject:

  • Community 289: 93% Case_Based (111 of 119 papers)

  • Community 326: 75% Probabilistic_Methods (88 of 117 papers)

  • Community 622: 90% Neural_Networks (126 of 140 papers)

Smaller communities (379, 592, 617) are pure—containing only a single subject.

This suggests citation patterns mostly follow subject boundaries, with papers primarily citing within their field.

Community statistics with centrality metrics

We can enrich our analysis by including the centrality metrics we computed earlier.

python
Community statistics with PageRank and Betweenness
q_community_stats = """
    MATCH (p:Paper)
    WHERE p.louvainCommunity IS NOT NULL
    WITH p.louvainCommunity AS community,
         collect(DISTINCT p.subject) AS subjects,
         count(*) AS size,
         avg(p.pageRank) AS avg_pageRank,  // (1)
         avg(p.betweenness) AS avg_betweenness
    RETURN community,
           size,
           size(subjects) AS num_subjects,  // (2)
           subjects,
           avg_pageRank,
           avg_betweenness
    ORDER BY size DESC
    LIMIT 10
"""

df_community_stats = gds.run_cypher(q_community_stats)
pp(df_community_stats)
  1. Aggregating centrality metrics per community reveals which communities are influential vs which are bridges

  2. size(subjects) counts how many distinct subjects appear — a measure of community diversity
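To see what this Cypher aggregation computes, here is the equivalent pandas groupby on a tiny hypothetical per-paper table (made-up community IDs and values, mirroring the node properties the query reads):

```python
import pandas as pd

# Hypothetical per-paper rows: one row per Paper node
papers = pd.DataFrame({
    "community":   [686, 686, 686, 745, 745],
    "subject":     ["Theory", "Case_Based", "Theory",
                    "Neural_Networks", "Neural_Networks"],
    "pageRank":    [1.2, 0.8, 1.0, 1.5, 0.9],
    "betweenness": [100.0, 50.0, 0.0, 400.0, 200.0],
})

# One row per community, like the WITH ... RETURN aggregation
stats = papers.groupby("community").agg(
    size=("subject", "size"),              # count(*)
    num_subjects=("subject", "nunique"),   # size(collect(DISTINCT subject))
    avg_pageRank=("pageRank", "mean"),     # avg(p.pageRank)
    avg_betweenness=("betweenness", "mean"),
).sort_values("size", ascending=False)     # ORDER BY size DESC
print(stats)
```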

Reading the community statistics

Communities with high average PageRank contain influential papers.

Communities with high average Betweenness serve as bridges between other communities.

Communities spanning many subjects are interdisciplinary research areas.

Top 8 Communities by Size

Table 2. Top 8 Communities by Size

| community | size | num_subjects | subjects | avg_pageRank | avg_betweenness |
| --- | --- | --- | --- | --- | --- |
| 1136 | 329 | 6 | Genetic_Algorithms, Reinforcement_Learning, Neural_Networks, … | 1.02 | 6,292 |
| 2018 | 195 | 6 | Reinforcement_Learning, Probabilistic_Methods, … | 1.07 | 9,363 |
| 686 | 182 | 6 | Case_Based, Theory, Rule_Learning, Neural_Networks, … | 1.01 | 8,293 |
| 795 | 177 | 7 | Rule_Learning, Theory, Case_Based, Reinforcement_Learning, … | 0.98 | 5,858 |
| 745 | 172 | 5 | Genetic_Algorithms, Neural_Networks, Reinforcement_Learning, … | 1.02 | 9,522 |
| 622 | 152 | 5 | Neural_Networks, Probabilistic_Methods, Theory, … | 0.99 | 8,128 |
| 640 | 145 | 5 | Theory, Reinforcement_Learning, Genetic_Algorithms, … | 1.01 | 5,426 |
| 2614 | 127 | 6 | Rule_Learning, Theory, Neural_Networks, Case_Based, … | 1.01 | 5,554 |

Identifying key communities

Looking at the statistics, we can identify:

  • Most influential community: 2018 has the highest avg_pageRank (1.07)

  • Key bridging community: 745 has the highest avg_betweenness (9,522)

  • Most interdisciplinary: 795 spans 7 subjects

These communities would be good starting points for understanding cross-disciplinary research.
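As a sanity check, the picks above can be reproduced mechanically with `idxmax` over the Table 2 values (copied here by hand):

```python
import pandas as pd

# Numbers transcribed from Table 2 (top 5 communities)
stats = pd.DataFrame({
    "community": [1136, 2018, 686, 795, 745],
    "avg_pageRank": [1.02, 1.07, 1.01, 0.98, 1.02],
    "avg_betweenness": [6292, 9363, 8293, 5858, 9522],
}).set_index("community")

most_influential = stats["avg_pageRank"].idxmax()   # highest avg PageRank
top_bridge = stats["avg_betweenness"].idxmax()      # highest avg betweenness
print(most_influential, top_bridge)
```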

Visualizing community sizes

A bar chart helps us see the distribution of community sizes.

python
Creating a bar chart of community sizes
import matplotlib.pyplot as plt

q_community_sizes = """
    MATCH (p:Paper)
    WHERE p.louvainCommunity IS NOT NULL
    RETURN p.louvainCommunity AS community,
           count(*) AS size
    ORDER BY size DESC
    LIMIT 10
"""

df_sizes = gds.run_cypher(q_community_sizes)

plt.bar(df_sizes['community'].astype(str), df_sizes['size'])  # (1)
plt.xlabel('Community ID')
plt.ylabel('Number of Papers')
plt.title('Top 10 Research Communities by Size')
plt.show()
  1. Converting community IDs to strings ensures they display as categorical labels, not numeric axis values

Community sizes chart

Bar chart showing top 10 communities by size

Combining all three perspectives

You now have three complementary views of each paper:

  • PageRank tells you how influential a paper is

  • Betweenness tells you how much it connects different areas

  • Community tells you which cluster it belongs to

Combining these perspectives reveals patterns that no single metric could show alone.

Multi-metric analysis

With all three metrics, you can answer questions like:

  • Which paper is most influential in each community?

  • Which papers bridge between communities?

  • How do community-level averages compare across the network?

Multi-metric analysis example

Find the most influential paper in each of the top 5 communities:

python
Finding top papers per community
q = """
    MATCH (p:Paper)
    WHERE p.louvainCommunity IN [1136, 2018, 686, 795, 745]  // (1)
    WITH p.louvainCommunity AS community, p
    ORDER BY p.pageRank DESC
    WITH community, collect(p)[0] AS topPaper  // (2)
    RETURN community, topPaper.paper_Id, topPaper.pageRank
"""
gds.run_cypher(q)
  1. Filters to the top 5 communities identified in our earlier analysis

  2. Collects papers ordered by PageRank and takes the first (highest) from each community

This combines community membership with PageRank to answer a real question.
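The `collect(p)[0]` trick is simply "first row per group after sorting". The same idea client-side, with hypothetical paper IDs and PageRank scores (not real Cora data):

```python
# Hypothetical (paper_id, community, pageRank) rows
papers = [
    ("p1", 1136, 2.4), ("p2", 1136, 5.1),
    ("p3", 2018, 3.3), ("p4", 2018, 1.0),
]

# Sort by PageRank descending, then keep the first paper seen per community
top = {}
for paper_id, community, pagerank in sorted(papers, key=lambda r: -r[2]):
    top.setdefault(community, (paper_id, pagerank))
print(top)  # {1136: ('p2', 5.1), 2018: ('p3', 3.3)}
```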

Summary

In this lesson, you:

  • Ran Louvain to detect 104 research communities from citation patterns

  • Compared communities to official subject labels and found strong alignment

  • Combined community membership with PageRank and Betweenness for richer analysis

  • Identified influential and bridging communities using multi-metric analysis

Next: scale features for ML, use FastRP to create node embeddings, and explore node similarity for recommendations.
