Introduction
You’ve already used Louvain in the fraud module to detect fraud rings by optimizing modularity.
Louvain on Citations
Now we’ll apply the same algorithm to the citation network to discover research communities—groups of papers that cite each other more than they cite papers outside the group.
The question we’ll explore: do citation-based communities align with official subject labels?
What you’ll learn
By the end of this lesson, you’ll be able to:
- Run Louvain community detection to find research communities
- Analyze how detected communities relate to official subject labels
- Combine community membership with centrality metrics for richer analysis
- Interpret community statistics to identify influential and bridging communities
Setup: Retrieve the projection
First, we retrieve the graph projection we’ve been working with.
G = gds.graph.get("cora-graph")
print(f"Graph '{G.name()}' ready with {G.node_count():,} nodes")
Running Louvain community detection
Now we run Louvain and write the community assignments to the database.
louvain_result = gds.louvain.write(
G,
writeProperty='louvainCommunity', # (1)
maxLevels=10, # (2)
maxIterations=10
)
print(f"Detected {louvain_result['communityCount']} communities")
print(f"Modularity score: {louvain_result['modularity']:.4f}") # (3)-
Each node gets a community ID stored in this property
-
maxLevelscontrols the depth of hierarchical merging; Louvain recursively coarsens the graph -
Modularity ranges from -0.5 to 1.0; values above 0.3 indicate meaningful community structure
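If you only want to check the community structure before writing anything back, the GDS client also offers a stats mode for Louvain. A minimal sketch (same parameters, no writeProperty):

stats = gds.louvain.stats(G, maxLevels=10, maxIterations=10)  # computes metrics without writing to the database
print(f"Preview: {stats['communityCount']} communities, modularity {stats['modularity']:.4f}")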
Understanding the results
The key metrics from our Louvain run:
| Metric | Value |
|---|---|
| modularity | 0.82 (strong community structure) |
| communityCount | 104 communities detected |
| ranLevels | 4 hierarchical levels |
| mean community size | 26 papers |
A modularity of 0.82 indicates well-defined communities with dense internal connections.
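As a quick sanity check, the mean community size in the table follows from the other numbers: the standard Cora graph has 2,708 papers, and 2,708 / 104 ≈ 26.

mean_size = G.node_count() / louvain_result['communityCount']  # roughly 26 for 2,708 papers and 104 communities
print(f"Mean community size: {mean_size:.1f} papers")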
Communities versus subject labels
The Cora dataset has official subject labels like Neural Networks, Theory, and Reinforcement Learning.
Louvain finds communities based purely on citation patterns, without knowing about these labels.
An interesting question is whether the detected communities align with the official labels.
What do these communities mean?
We found 104 communities, but what do they represent?
The Cora dataset has official subject labels assigned by humans. Comparing Louvain’s communities to these labels tells us whether citation behavior follows topical boundaries.
If communities align with subjects, papers mostly cite within their field. If they don’t, citation patterns reveal structure that formal labels miss.
Comparing communities with subjects
Let’s see which subjects appear in each community.
q_community_subjects = """
MATCH (p:Paper)
WHERE p.louvainCommunity IS NOT NULL
WITH p.louvainCommunity AS community,
p.subject AS subject,
count(*) AS count // (1)
RETURN community, subject, count
ORDER BY community, count DESC // (2)
LIMIT 20
"""
df_comm_subj = gds.run_cypher(q_community_subjects)
pp(df_comm_subj.head(20))
1. Counts papers per subject within each community to see the dominant topic
2. Ordering by count DESC within each community puts the dominant subject first
Interpreting the composition
Most communities are dominated by a single subject:
| community | subject | count |
|---|---|---|
| 289 | Case_Based | 111 |
| 289 | Rule_Learning | 4 |
| 326 | Probabilistic_Methods | 88 |
| 326 | Neural_Networks | 17 |
| 622 | Neural_Networks | 126 |
| 622 | Probabilistic_Methods | 14 |
Each community has a clear dominant subject, with minor contributions from related fields.
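That dominance can be quantified as the share of the top subject within each community. A small pandas sketch (remove the LIMIT 20 in the query above if you want complete counts for every community):

# Fraction of each community's papers that belong to its dominant subject
dominant_share = (df_comm_subj
                  .groupby('community')['count']
                  .apply(lambda s: s.max() / s.sum()))
pp(dominant_share.sort_values(ascending=False))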
What the composition reveals
The large communities (289, 326, 622) each have a dominant subject:
- Community 289: 93% Case_Based (111 of 119 papers)
- Community 326: 75% Probabilistic_Methods (88 of 117 papers)
- Community 622: 90% Neural_Networks (126 of 140 papers)
Smaller communities (379, 592, 617) are pure—containing only a single subject.
This suggests citation patterns mostly follow subject boundaries, with papers primarily citing within their field.
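If you want a single number for this alignment, normalized mutual information between community IDs and subject labels is one option. A sketch using scikit-learn (not otherwise required in this lesson):

from sklearn.metrics import normalized_mutual_info_score

q_labels = """
MATCH (p:Paper)
WHERE p.louvainCommunity IS NOT NULL
RETURN p.louvainCommunity AS community, p.subject AS subject
"""
df_labels = gds.run_cypher(q_labels)

# 0 means no relationship between communities and subjects; 1 means perfect agreement
nmi = normalized_mutual_info_score(df_labels['subject'], df_labels['community'])
print(f"NMI between communities and subjects: {nmi:.2f}")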
Community statistics with centrality metrics
We can enrich our analysis by including the centrality metrics we computed earlier.
q_community_stats = """
MATCH (p:Paper)
WHERE p.louvainCommunity IS NOT NULL
WITH p.louvainCommunity AS community,
collect(DISTINCT p.subject) AS subjects,
count(*) AS size,
avg(p.pageRank) AS avg_pageRank, // (1)
avg(p.betweenness) AS avg_betweenness
RETURN community,
size,
size(subjects) AS num_subjects, // (2)
subjects,
avg_pageRank,
avg_betweenness
ORDER BY size DESC
LIMIT 10
"""
df_community_stats = gds.run_cypher(q_community_stats)
pp(df_community_stats)
1. Aggregating centrality metrics per community reveals which communities are influential and which act as bridges
2. size(subjects) counts how many distinct subjects appear, a measure of community diversity
Reading the community statistics
Communities with high average PageRank contain influential papers.
Communities with high average Betweenness serve as bridges between other communities.
Communities spanning many subjects are interdisciplinary research areas.
Top 8 Communities by Size
| community | size | num_subjects | subjects | avg_pageRank | avg_betweenness |
|---|---|---|---|---|---|
| 1136 | 329 | 6 | Genetic_Algorithms, Reinforcement_Learning, Neural_Networks, … | 1.02 | 6,292 |
| 2018 | 195 | 6 | Reinforcement_Learning, Probabilistic_Methods, … | 1.07 | 9,363 |
| 686 | 182 | 6 | Case_Based, Theory, Rule_Learning, Neural_Networks, … | 1.01 | 8,293 |
| 795 | 177 | 7 | Rule_Learning, Theory, Case_Based, Reinforcement_Learning, … | 0.98 | 5,858 |
| 745 | 172 | 5 | Genetic_Algorithms, Neural_Networks, Reinforcement_Learning, … | 1.02 | 9,522 |
| 622 | 152 | 5 | Neural_Networks, Probabilistic_Methods, Theory, … | 0.99 | 8,128 |
| 640 | 145 | 5 | Theory, Reinforcement_Learning, Genetic_Algorithms, … | 1.01 | 5,426 |
| 2614 | 127 | 6 | Rule_Learning, Theory, Neural_Networks, Case_Based, … | 1.01 | 5,554 |
Identifying key communities
Looking at the statistics, we can identify:
- Most influential community: 2018 has the highest avg_pageRank (1.07)
- Key bridging community: 745 has the highest avg_betweenness (9,522)
- Most interdisciplinary: 795 spans 7 subjects
These communities would be good starting points for understanding cross-disciplinary research.
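These highlights can also be read straight out of df_community_stats, which covers the ten largest communities returned by the query above:

# Standout communities among the ten largest
top_influence = df_community_stats.loc[df_community_stats['avg_pageRank'].idxmax()]
top_bridge = df_community_stats.loc[df_community_stats['avg_betweenness'].idxmax()]
most_diverse = df_community_stats.loc[df_community_stats['num_subjects'].idxmax()]

print(f"Most influential:       community {top_influence['community']}")
print(f"Key bridging:           community {top_bridge['community']}")
print(f"Most interdisciplinary: community {most_diverse['community']}")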
Visualizing community sizes
A bar chart helps us see the distribution of community sizes.
import matplotlib.pyplot as plt
q_community_sizes = """
MATCH (p:Paper)
WHERE p.louvainCommunity IS NOT NULL
RETURN p.louvainCommunity AS community,
count(*) AS size
ORDER BY size DESC
LIMIT 10
"""
df_sizes = gds.run_cypher(q_community_sizes)
plt.bar(df_sizes['community'].astype(str), df_sizes['size']) # (1)
plt.xlabel('Community ID')
plt.ylabel('Number of Papers')
plt.title('Top 10 Research Communities by Size')
plt.show()
1. Converting community IDs to strings ensures they display as categorical labels, not numeric axis values
Community sizes chart
Combining all three perspectives
You now have three complementary views of each paper:
- PageRank tells you how influential a paper is
- Betweenness tells you how much it connects different areas
- Community tells you which cluster it belongs to
Combining these perspectives reveals patterns that no single metric could show alone.
Multi-metric analysis
With all three metrics, you can answer questions like:
- Which paper is most influential in each community?
- Which papers bridge between communities?
- How do community-level averages compare across the network?
Multi-metric analysis example
Find the most influential paper in each of the top 5 communities:
q = """
MATCH (p:Paper)
WHERE p.louvainCommunity IN [1136, 2018, 686, 795, 745] // (1)
WITH p.louvainCommunity AS community, p
ORDER BY p.pageRank DESC
WITH community, collect(p)[0] AS topPaper // (2)
RETURN community, topPaper.paper_Id, topPaper.pageRank
"""
gds.run_cypher(q)
1. Filters to the top 5 communities identified in our earlier analysis
2. Collects papers ordered by PageRank and takes the first (highest) from each community
This combines community membership with PageRank to answer a real question.
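The second question, which papers bridge between communities, can be answered in the same way. A sketch, assuming the citation relationship in the database is named CITES (adjust the type to match your schema):

q_bridges = """
MATCH (p:Paper)-[:CITES]->(q:Paper)
WHERE p.louvainCommunity IS NOT NULL
  AND q.louvainCommunity IS NOT NULL
  AND p.louvainCommunity <> q.louvainCommunity
RETURN p.paper_Id AS paper,
       p.betweenness AS betweenness,
       count(DISTINCT q.louvainCommunity) AS communitiesCited  // how many other communities this paper cites into
ORDER BY communitiesCited DESC, betweenness DESC
LIMIT 10
"""
gds.run_cypher(q_bridges)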
Summary
In this lesson, you:
- Ran Louvain to detect 104 research communities from citation patterns
- Compared communities to official subject labels and found strong alignment
- Combined community membership with PageRank and Betweenness for richer analysis
- Identified influential and bridging communities using multi-metric analysis
Next: scale features for machine learning, create node embeddings with FastRP, and explore node similarity for recommendations.