Citation Networks

Introduction

You’ve learned the Python GDS workflow using the Movies dataset. Now you’ll apply those skills to a real-world research problem: analyzing academic citation networks.

What You’ll Learn

By the end of this lesson, you’ll be able to:

  • Explain what citation networks reveal about scientific research

  • Load the Cora citation dataset into Neo4j

  • Create a graph projection configured for citation analysis

  • Describe the structure of the Cora dataset

From Movies to Research

In the previous lessons, you learned:

  • How to connect to Neo4j with the Python client

  • How to create graph projections with properties

  • How to run algorithms in different modes

  • How to chain algorithms together

Now you’ll use these skills on a new domain.

The Citation Network Problem

Academic research builds on previous work. Citation networks reveal:

  • Influential papers: Which research shaped entire fields?

  • Bridge papers: Which works connect different research areas?

  • Research communities: Which papers form natural clusters?

Citation network diagram with highlighted influential and bridge papers.

The Cora Dataset

A classic benchmark dataset for graph machine learning:

  • 2,708 academic papers

  • 10,556 citation relationships

  • 7 subject areas

Subject Areas

  • Neural Networks

  • Reinforcement Learning

  • Theory

  • Genetic Algorithms

  • Case-Based Reasoning

  • Probabilistic Methods

  • Rule Learning

Graph Structure

  • Nodes: Paper, with properties subject, features, and subjectClass

  • Relationships: CITES (directed: Paper A → Paper B means A cites B)

  • Features: a 1,433-dimensional vector representing the words used in each paper
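The feature vectors are simple word-presence indicators. A toy sketch of the idea, using a made-up five-word vocabulary (the real dataset uses 1,433 dictionary words):

```python
# Toy illustration of Cora-style word-presence features
# (this five-word vocabulary is made up; Cora's has 1,433 entries)
vocabulary = ["network", "learning", "theory", "genetic", "bayesian"]

def to_feature_vector(words, vocab=vocabulary):
    """Return 1 for each vocabulary word that appears in the paper, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]

vec = to_feature_vector(["learning", "bayesian", "markov"])
print(vec)  # -> [0, 1, 0, 0, 1]
```

Words outside the vocabulary (like "markov" above) are simply ignored.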

What You’ll Accomplish

Over the next notebooks, you’ll:

  1. Load the Cora citation dataset

  2. Project the citation network into GDS

  3. Run PageRank to find influential papers

  4. Run Betweenness Centrality to find bridge papers

  5. Detect communities to find research clusters

  6. Combine results for comprehensive analysis

Loading the Dataset
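Before loading, it's common practice to create a uniqueness constraint so the MERGE on paper_Id is fast and duplicate-safe. A sketch (the constraint name is illustrative):

```python
# Hedged sketch: a uniqueness constraint backs the MERGE on paper_Id
# (the constraint name "paper_id_unique" is illustrative)
constraint_q = """
CREATE CONSTRAINT paper_id_unique IF NOT EXISTS
FOR (p:Paper) REQUIRE p.paper_Id IS UNIQUE
"""
# Run against your database before the load below:
# gds.run_cypher(constraint_q)
```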

```python
# Loading paper nodes
node_load_q = """
LOAD CSV WITH HEADERS FROM
  'https://raw.githubusercontent.com/.../node_list.csv' AS row
MERGE (p:Paper {paper_Id: toInteger(row.id)})
SET p.subject = row.subject,
    p.features = apoc.convert.fromJsonList(row.features) // (1)
RETURN count(p) AS papers_loaded
"""

result = gds.run_cypher(node_load_q)  # (2)
print(f"Loaded {result['papers_loaded'][0]} papers")
```
  1. apoc.convert.fromJsonList() parses the 1,433-dimensional feature vector from its JSON string representation

  2. Cypher stored in a variable and passed to gds.run_cypher() — keeps complex queries readable
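The load above sets the string subject property, but the projection later uses a numeric subjectClass. One way to derive it (the subject strings and their ordering here are assumptions; any stable subject-to-integer mapping works, and the exact strings in your CSV may differ):

```python
# Map each subject string to an integer class for use as a node property
# (the ordering and exact strings here are assumptions)
subjects = [
    "Neural_Networks", "Reinforcement_Learning", "Theory",
    "Genetic_Algorithms", "Case_Based_Reasoning",
    "Probabilistic_Methods", "Rule_Learning",
]
subject_to_class = {name: i for i, name in enumerate(subjects)}

# Cypher supports dynamic map access, so the whole mapping
# can be passed as a single parameter:
set_class_q = """
MATCH (p:Paper)
SET p.subjectClass = $mapping[p.subject]
RETURN count(p) AS papers_updated
"""
# gds.run_cypher(set_class_q, {"mapping": subject_to_class})
```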

Loading Relationships

```python
# Loading citation relationships
edge_load_q = """
LOAD CSV WITH HEADERS FROM
  'https://raw.githubusercontent.com/.../edge_list.csv' AS row
MATCH (source:Paper {paper_Id: toInteger(row.source)}) // (1)
MATCH (target:Paper {paper_Id: toInteger(row.target)})
MERGE (source)-[r:CITES]->(target) // (2)
RETURN count(r) AS citations_loaded
"""

result = gds.run_cypher(edge_load_q)
print(f"Loaded {result['citations_loaded'][0]} citations")
```
  1. MATCH both source and target Paper nodes first — they must already exist from the previous load step

  2. CITES is directed: Paper A → Paper B means A references B in its bibliography

Verify Your Data

After loading, verify the dataset:

```python
# Checking papers by subject
df = gds.run_cypher("""
    MATCH (p:Paper)
    RETURN p.subject AS subject, count(*) AS count
    ORDER BY count DESC
""")
display(df)
```

You should see 7 subjects with papers distributed across them.

Node and Relationship Counts

```python
# Verifying counts
df = gds.run_cypher("""
    MATCH (p:Paper)
    WITH count(p) AS papers
    MATCH ()-[r:CITES]->()
    RETURN papers, count(r) AS citations
""")
print(df)
```

You should see: 2,708 papers and 10,556 citations.

Creating the Projection

```python
# Projecting the citation network
G, result = gds.graph.project(
    "cora-graph",
    {
        "Paper": {
            "properties": ["subjectClass"]  # (1)
        }
    },
    {
        "CITES": {
            "orientation": "UNDIRECTED"  # (2)
        }
    }
)
```
  1. subjectClass is projected so algorithms can use subject information — useful for validating community detection later

  2. UNDIRECTED treats each citation as bidirectional — required for algorithms like Louvain and many centrality measures

Inspecting the Projection

```python
# Checking projection details
print(f"Projected graph: {G.name()}")
print(f"  Nodes: {G.node_count():,}")
print(f"  Relationships: {G.relationship_count():,}")  # (1)
print(f"  Memory usage: {G.memory_usage()}")
print(f"  Density: {G.density():.6f}")  # (2)
```
  1. With UNDIRECTED orientation, the relationship count doubles (each edge is stored in both directions)

  2. Density measures how connected the graph is — citation networks are typically sparse (low density)
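The reported density can be sanity-checked by hand: GDS computes it as the relationship count divided by the number of possible ordered pairs of distinct nodes, n × (n − 1). Using the known Cora counts and the doubled undirected relationship count:

```python
# Sanity-check the reported density from the known counts
nodes = 2_708
relationships = 2 * 10_556            # UNDIRECTED doubles each citation
possible_pairs = nodes * (nodes - 1)  # ordered pairs of distinct nodes
density = relationships / possible_pairs
print(f"{density:.6f}")  # roughly 0.0029 -- very sparse, as expected
```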

Summary

You’re now ready to analyze citation networks:

  • Dataset: 2,708 papers, 10,556 citations, 7 research areas

  • Goal: Find influential papers, bridge papers, and research communities

  • Tools: The same Python GDS skills you’ve already learned

Next: Run PageRank to find influential papers.
