Introduction
You’ve learned the Python GDS workflow using the Movies dataset. Now you’ll apply those skills to a real-world research problem: analyzing academic citation networks.
What You’ll Learn
By the end of this lesson, you’ll be able to:
- Explain what citation networks reveal about scientific research
- Load the Cora citation dataset into Neo4j
- Create a graph projection configured for citation analysis
- Describe the structure of the Cora dataset
From Movies to Research
In the previous lessons, you learned:
- How to connect to Neo4j with the Python client
- How to create graph projections with properties
- How to run algorithms in different modes
- How to chain algorithms together
Now you’ll use these skills on a new domain.
The Citation Network Problem
Academic research builds on previous work. Citation networks reveal:
- Influential papers: Which research shaped entire fields?
- Bridge papers: Which works connect different research areas?
- Research communities: Which papers form natural clusters?
The Cora Dataset
A classic benchmark dataset for graph machine learning:
- 2,708 academic papers
- 10,556 citation relationships
- 7 subject areas
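A quick sanity check you can do from these numbers alone (a minimal sketch; the counts come straight from the stats above):

```python
# Average number of citation edges per paper, from the dataset stats.
papers = 2708
citations = 10556

avg_citations = citations / papers
print(f"{avg_citations:.2f} citation edges per paper on average")
```

With roughly four citation edges per paper, Cora is a sparse graph, which is typical of citation networks.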
Subject Areas
- Neural Networks
- Reinforcement Learning
- Theory
- Genetic Algorithms
- Case-Based Reasoning
- Probabilistic Methods
- Rule Learning
Graph Structure
| Element | Details |
|---|---|
| Nodes | 2,708 `Paper` nodes |
| Relationships | 10,556 `CITES` relationships |
| Features | 1,433-dimensional vector representing words used in each paper |
What You’ll Accomplish
Over the next notebooks, you’ll:
- Load the Cora citation dataset
- Project the citation network into GDS
- Run PageRank to find influential papers
- Run Betweenness Centrality to find bridge papers
- Detect communities to find research clusters
- Combine results for comprehensive analysis
Loading the Dataset
```python
node_load_q = """
LOAD CSV WITH HEADERS FROM
'https://raw.githubusercontent.com/.../node_list.csv' AS row
MERGE (p:Paper {paper_Id: toInteger(row.id)})
SET p.subject = row.subject,
    p.features = apoc.convert.fromJsonList(row.features) // (1)
RETURN count(p) AS papers_loaded
"""
result = gds.run_cypher(node_load_q)  # (2)
print(f"Loaded {result['papers_loaded'][0]} papers")
```

1. `apoc.convert.fromJsonList()` parses the 1,433-dimensional feature vector from its JSON string representation.
2. Storing the Cypher in a variable and passing it to `gds.run_cypher()` keeps complex queries readable.
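The feature parsing in callout (1) can be pictured in plain Python: `apoc.convert.fromJsonList` is essentially JSON parsing of the array string stored in the CSV. The row value below is a truncated, hypothetical stand-in:

```python
import json

# The CSV stores each paper's word vector as a JSON array string.
# Real Cora vectors have 1,433 entries; this row is a truncated stand-in.
row_features = "[0, 1, 0, 1]"

# What apoc.convert.fromJsonList does server-side:
features = json.loads(row_features)
print(features)       # [0, 1, 0, 1]
print(len(features))  # 4 here; 1433 for a real row
```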
Loading Relationships
```python
edge_load_q = """
LOAD CSV WITH HEADERS FROM
'https://raw.githubusercontent.com/.../edge_list.csv' AS row
MATCH (source:Paper {paper_Id: toInteger(row.source)}) // (1)
MATCH (target:Paper {paper_Id: toInteger(row.target)})
MERGE (source)-[r:CITES]->(target) // (2)
RETURN count(r) AS citations_loaded
"""
result = gds.run_cypher(edge_load_q)
print(f"Loaded {result['citations_loaded'][0]} citations")
```

1. `MATCH` both source and target Paper nodes first — they must already exist from the previous load step.
2. `CITES` is directed: Paper A → Paper B means A references B in its bibliography.
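`MERGE` in the query above is idempotent: re-running the load will not create duplicate `CITES` relationships. A plain-Python analogue of that behavior, using a set and hypothetical edge-list rows:

```python
# Directed edge set: (source, target) means source cites target.
edges = set()

# Hypothetical edge_list rows, including a duplicate.
rows = [("100", "200"), ("100", "200"), ("200", "300")]

for source, target in rows:
    # Like MERGE on (source)-[:CITES]->(target): adding twice is a no-op.
    edges.add((int(source), int(target)))

print(len(edges))  # 2 distinct citations
```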
Verify Your Data
After loading, verify the dataset:
```python
df = gds.run_cypher("""
MATCH (p:Paper)
RETURN p.subject AS subject, count(*) AS count
ORDER BY count DESC
""")
display(df)
```

You should see 7 subjects with papers distributed across them.
Node and Relationship Counts
```python
df = gds.run_cypher("""
MATCH (p:Paper)
WITH count(p) AS papers
MATCH ()-[r:CITES]->()
RETURN papers, count(r) AS citations
""")
print(df)
```

You should see 2,708 papers and 10,556 citations.
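One detail worth noting before projecting: the load step stored `subject` as a string, but GDS projections only carry numeric node properties, so a numeric `subjectClass` must be derived first. A minimal sketch of one way to do that (the subject strings and their ordering are assumptions; adjust them to match the CSV, and run the query with `gds.run_cypher` as in the steps above):

```python
# Map each subject string to an integer code (0-6). The exact strings
# and their order here are assumptions, not canonical values.
subjects = [
    "Neural Networks", "Reinforcement Learning", "Theory",
    "Genetic Algorithms", "Case-Based Reasoning",
    "Probabilistic Methods", "Rule Learning",
]
subject_to_class = {name: i for i, name in enumerate(subjects)}

# Cypher that writes the numeric code onto each Paper node,
# looking up the code in the mapping passed as a parameter.
set_class_q = """
MATCH (p:Paper)
SET p.subjectClass = $mapping[p.subject]
"""
# Execute with: gds.run_cypher(set_class_q, params={"mapping": subject_to_class})
```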
Creating the Projection
```python
G, result = gds.graph.project(
    "cora-graph",
    {
        "Paper": {
            "properties": ["subjectClass"]  # (1)
        }
    },
    {
        "CITES": {
            "orientation": "UNDIRECTED"  # (2)
        }
    }
)
```

1. `subjectClass` is projected so algorithms can use subject information — useful for validating community detection later.
2. `UNDIRECTED` treats each citation as bidirectional — required for algorithms like Louvain and many centrality measures.
Inspecting the Projection
```python
print(f"Projected graph: {G.name()}")
print(f"  Nodes: {G.node_count():,}")
print(f"  Relationships: {G.relationship_count():,}")  # (1)
print(f"  Memory usage: {G.memory_usage()}")
print(f"  Density: {G.density():.6f}")  # (2)
```

1. With `UNDIRECTED` orientation, the relationship count doubles (each edge is stored in both directions).
2. Density measures how connected the graph is — citation networks are typically sparse (low density).
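Both callouts can be checked by hand, assuming the counts the lesson reports and the usual directed-density definition (relationships divided by the n·(n−1) possible ordered node pairs):

```python
papers = 2708
citations = 10556

# (1) UNDIRECTED orientation stores each citation in both directions.
projected_rels = citations * 2
print(projected_rels)  # 21112

# (2) Density relative to all ordered node pairs: Cora is very sparse.
density = projected_rels / (papers * (papers - 1))
print(f"{density:.6f}")  # 0.002880
```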
Summary
You’re now ready to analyze citation networks:
- Dataset: 2,708 papers, 10,556 citations, 7 research areas
- Goal: Find influential papers, bridge papers, and research communities
- Tools: The same Python GDS skills you’ve already learned
Next: Run PageRank to find influential papers.