Introduction
You’ve learned the Python GDS workflow using the Movies dataset. Now you’ll apply those skills to a real-world research problem: analyzing academic citation networks.
What You’ll Learn
By the end of this lesson, you’ll be able to:
- Explain what citation networks reveal about scientific research
- Load the Cora citation dataset into Neo4j
- Create a graph projection configured for citation analysis
- Describe the structure of the Cora dataset
From Movies to Research
In the previous lessons, you learned:
- How to connect to Neo4j with the Python client
- How to create graph projections with properties
- How to run algorithms in different modes
- How to chain algorithms together
Now you’ll use these skills on a new domain.
The Citation Network Problem
Academic research builds on previous work. Citation networks reveal:
- Influential papers: Which research shaped entire fields?
- Bridge papers: Which works connect different research areas?
- Research communities: Which papers form natural clusters?
The Cora Dataset
A classic benchmark dataset for graph machine learning:
- 2,708 academic papers
- 10,556 citation relationships
- 7 subject areas
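A quick sanity check you can do from these numbers alone (a minimal sketch; the counts come straight from the stats above):

```python
# Average number of citation edges per paper, from the dataset stats.
papers = 2708
citations = 10556

avg_citations = citations / papers
print(f"{avg_citations:.2f} citation edges per paper on average")
```

With roughly four citation edges per paper, Cora is a sparse graph, which is typical of citation networks.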
Subject Areas
- Neural Networks
- Reinforcement Learning
- Theory
- Genetic Algorithms
- Case-Based Reasoning
- Probabilistic Methods
- Rule Learning
Graph Structure
| Element | Details |
|---|---|
| Nodes | 2,708 `Paper` nodes |
| Relationships | 10,556 `CITES` relationships |
| Features | 1,433-dimensional vector representing words used in each paper |
What You’ll Accomplish
Over the next notebooks, you’ll:
- Load the Cora citation dataset
- Project the citation network into GDS
- Run PageRank to find influential papers
- Run Betweenness Centrality to find bridge papers
- Detect communities to find research clusters
- Combine results for comprehensive analysis
Loading the Dataset
```python
node_load_q = """
LOAD CSV WITH HEADERS FROM
'https://raw.githubusercontent.com/.../node_list.csv' AS row
MERGE (p:Paper {paper_Id: toInteger(row.id)})
SET p.subject = row.subject,
    p.features = apoc.convert.fromJsonList(row.features) // (1)
RETURN count(p) AS papers_loaded
"""
result = gds.run_cypher(node_load_q)  # (2)
print(f"Loaded {result['papers_loaded'][0]} papers")
```

1. `apoc.convert.fromJsonList()` parses the 1,433-dimensional feature vector from its JSON string representation.
2. Storing the Cypher in a variable and passing it to `gds.run_cypher()` keeps complex queries readable.
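The feature parsing in callout (1) can be pictured in plain Python: `apoc.convert.fromJsonList` is essentially JSON parsing of the array string stored in the CSV. The row value below is a truncated, hypothetical stand-in:

```python
import json

# The CSV stores each paper's word vector as a JSON array string.
# Real Cora vectors have 1,433 entries; this row is a truncated stand-in.
row_features = "[0, 1, 0, 1]"

# What apoc.convert.fromJsonList does server-side:
features = json.loads(row_features)
print(features)       # [0, 1, 0, 1]
print(len(features))  # 4 here; 1433 for a real row
```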
Loading Relationships
```python
edge_load_q = """
LOAD CSV WITH HEADERS FROM
'https://raw.githubusercontent.com/.../edge_list.csv' AS row
MATCH (source:Paper {paper_Id: toInteger(row.source)}) // (1)
MATCH (target:Paper {paper_Id: toInteger(row.target)})
MERGE (source)-[r:CITES]->(target) // (2)
RETURN count(r) AS citations_loaded
"""
result = gds.run_cypher(edge_load_q)
print(f"Loaded {result['citations_loaded'][0]} citations")
```

1. `MATCH` both source and target Paper nodes first — they must already exist from the previous load step.
2. `CITES` is directed: Paper A → Paper B means A references B in its bibliography.
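`MERGE` in the query above is idempotent: re-running the load will not create duplicate `CITES` relationships. A plain-Python analogue of that behavior, using a set and hypothetical edge-list rows:

```python
# Directed edge set: (source, target) means source cites target.
edges = set()

# Hypothetical edge_list rows, including a duplicate.
rows = [("100", "200"), ("100", "200"), ("200", "300")]

for source, target in rows:
    # Like MERGE on (source)-[:CITES]->(target): adding twice is a no-op.
    edges.add((int(source), int(target)))

print(len(edges))  # 2 distinct citations
```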
Verify Your Data
After loading, verify the dataset:
```python
df = gds.run_cypher("""
MATCH (p:Paper)
RETURN p.subject AS subject, count(*) AS count
ORDER BY count DESC
""")
display(df)
```

You should see 7 subjects with papers distributed across them.
Node and Relationship Counts
```python
df = gds.run_cypher("""
MATCH (p:Paper)
WITH count(p) AS papers
MATCH ()-[r:CITES]->()
RETURN papers, count(r) AS citations
""")
print(df)
```

You should see 2,708 papers and 10,556 citations.
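One detail worth noting before projecting: the load step stored `subject` as a string, but GDS projections only carry numeric node properties, so a numeric `subjectClass` must be derived first. A minimal sketch of one way to do that (the subject strings and their ordering are assumptions; adjust them to match the CSV, and run the query with `gds.run_cypher` as in the steps above):

```python
# Map each subject string to an integer code (0-6). The exact strings
# and their order here are assumptions, not canonical values.
subjects = [
    "Neural Networks", "Reinforcement Learning", "Theory",
    "Genetic Algorithms", "Case-Based Reasoning",
    "Probabilistic Methods", "Rule Learning",
]
subject_to_class = {name: i for i, name in enumerate(subjects)}

# Cypher that writes the numeric code onto each Paper node,
# looking up the code in the mapping passed as a parameter.
set_class_q = """
MATCH (p:Paper)
SET p.subjectClass = $mapping[p.subject]
"""
# Execute with: gds.run_cypher(set_class_q, params={"mapping": subject_to_class})
```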
Creating the Projection
```python
G, result = gds.graph.project(
    "cora-graph",
    {
        "Paper": {
            "properties": ["subjectClass"]  # (1)
        }
    },
    {
        "CITES": {
            "orientation": "UNDIRECTED"  # (2)
        }
    }
)
```

1. `subjectClass` is projected so algorithms can use subject information — useful for validating community detection later.
2. `UNDIRECTED` treats each citation as bidirectional — required for algorithms like Louvain and many centrality measures.
Inspecting the Projection
```python
print(f"Projected graph: {G.name()}")
print(f"  Nodes: {G.node_count():,}")
print(f"  Relationships: {G.relationship_count():,}")  # (1)
print(f"  Memory usage: {G.memory_usage()}")
print(f"  Density: {G.density():.6f}")  # (2)
```

1. With `UNDIRECTED` orientation, the relationship count doubles (each edge is stored in both directions).
2. Density measures how connected the graph is — citation networks are typically sparse (low density).
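Both callouts can be checked by hand, assuming the counts the lesson reports and the usual directed-density definition (relationships divided by the n·(n−1) possible ordered node pairs):

```python
papers = 2708
citations = 10556

# (1) UNDIRECTED orientation stores each citation in both directions.
projected_rels = citations * 2
print(projected_rels)  # 21112

# (2) Density relative to all ordered node pairs: Cora is very sparse.
density = projected_rels / (papers * (papers - 1))
print(f"{density:.6f}")  # 0.002880
```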
Summary
You’re now ready to analyze citation networks:
- Dataset: 2,708 papers, 10,556 citations, 7 research areas
- Goal: Find influential papers, bridge papers, and research communities
- Tools: The same Python GDS skills you’ve already learned
Next: Run PageRank to find influential papers.