Import to Neo4j

Your graph awaits

The normalized CSV files are ready. This lesson walks through creating a Neo4j Aura Free instance, connecting to it from Python, and importing the data as a graph.

Open 2.9_import_to_neo4j.ipynb in your Codespace to follow along.

What you’ll learn

By the end of this lesson, you’ll be able to:

  • Create a Neo4j Aura Free instance

  • Connect to it from Python using the Neo4j driver

  • Create constraints for correct MERGE behavior

  • Import node and relationship CSVs in batches

  • Query the resulting metadata graph

Create an Aura Free instance

Go to console.neo4j.io and sign up or sign in. Click Create Instance.

Select AuraDB Free

Select AuraDB Free from the tier options. No credit card required.

Save your credentials

A modal will appear with your username and generated password. Click Download and continue to save the credentials file.

Keep this file safe — you’ll need the URI, username, and password in the next step.

Wait for the instance to start

The instance status will change from Creating to Running. This usually takes under a minute.

Add credentials to your environment

Add the connection details to a .env file in the project root:

text
NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-generated-password

The .env file is in .gitignore — your credentials won’t be committed.
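As a rough illustration of what loading this file involves, the sketch below parses KEY=VALUE lines with the standard library only. The helper names parse_env_text and load_env_file are hypothetical, and the course notebook may well rely on a package such as python-dotenv instead.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Read a .env file and copy its values into os.environ (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_text(handle.read())
    for key, value in values.items():
        # Variables already set in the real environment take precedence
        os.environ.setdefault(key, value)
    return values


# Placeholder values copied from the example above, not real credentials
sample = """
NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-generated-password
"""
settings = parse_env_text(sample)
print(settings["NEO4J_USERNAME"])
```

Calling load_env_file() once at the top of a notebook would make the three values available through os.environ.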

Connect from Python

In your notebook, run the setup cells to install the driver and connect.

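python
Connecting to Aura
import os

from neo4j import GraphDatabase

# Assumes the .env values are already loaded into the environment
URI = os.environ["NEO4J_URI"]
USERNAME = os.environ["NEO4J_USERNAME"]
PASSWORD = os.environ["NEO4J_PASSWORD"]

driver = GraphDatabase.driver(
    URI, auth=(USERNAME, PASSWORD)
)
driver.verify_connectivity()  # (1)

  1. verify_connectivity() confirms that the driver can reach the instance and authenticate. If it fails, check your URI and password in the .env file.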

Constraints

Constraints let MERGE find existing nodes quickly and guarantee that duplicates cannot be created. Each constraint adds a uniqueness requirement and a backing index on the constrained property.

In your notebook, run the constraints cell. The four constraints match on doc_id for emails and _norm properties for users, mailboxes, and domains.
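The four constraints could look roughly like the statements generated below. This is a sketch, not the notebook's actual cell: doc_id and name_norm appear in this lesson, while address_norm, the Domain property name, and the constraint names are assumptions made for illustration.

```python
# Label -> property that uniquely identifies a node.
# doc_id and name_norm come from the lesson text; address_norm and the
# Domain property name are assumed for this example.
UNIQUE_KEYS = {
    "Email": "doc_id",
    "User": "name_norm",
    "Mailbox": "address_norm",
    "Domain": "name_norm",
}


def constraint_statement(label, prop):
    """Build an idempotent uniqueness constraint statement for one label."""
    name = f"{label.lower()}_{prop}_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


statements = [constraint_statement(label, prop) for label, prop in UNIQUE_KEYS.items()]
for statement in statements:
    print(statement)
    # With a live connection, each one would be sent to the database,
    # for example: driver.execute_query(statement)
```

IF NOT EXISTS makes the cell safe to re-run, since an already existing constraint is left alone rather than raising an error.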

Import

Each CSV is imported in batches: a list of rows is passed to the query as a parameter and expanded with UNWIND. Nodes are merged on the normalized value, and the raw value is stored as a property via ON CREATE SET.

In your notebook, run the import cells in order: emails, senders, recipients, CC, USED, HAS_MAILBOX.
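The batching pattern can be sketched as follows. The query follows the MERGE shape described in this lesson, but the helper names, the batch size, and the sample rows are made up for illustration, and a small stand-in class replaces the real driver so the example runs without a database.

```python
import csv
import io

# Assumed query shape, based on the MERGE pattern described in this lesson
USER_QUERY = """
UNWIND $rows AS row
MERGE (u:User {name_norm: row.name_norm})
ON CREATE SET u.name = row.name
"""


def batches(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def import_csv(driver, handle, query, batch_size=1000):
    """Send the rows of an open CSV file to Neo4j, one query per batch."""
    rows = list(csv.DictReader(handle))
    for batch in batches(rows, batch_size):
        # Keyword arguments to execute_query become Cypher parameters ($rows)
        driver.execute_query(query, rows=batch)
    return len(rows)


class RecordingDriver:
    """Stand-in for a real driver; it only records the batches it receives."""

    def __init__(self):
        self.calls = []

    def execute_query(self, query, **parameters):
        self.calls.append(parameters["rows"])


# Made-up sample rows, not taken from the course data
data = io.StringIO(
    "name,name_norm\n"
    "Alice Smith,alice smith\n"
    "SMITH ALICE,alice smith\n"
    "Bob Jones,bob jones\n"
)
fake = RecordingDriver()
total = import_csv(fake, data, USER_QUERY, batch_size=2)
print(total, [len(batch) for batch in fake.calls])  # prints: 3 [2, 1]
```

Sending a list of rows per query keeps the number of round trips low compared with one query per row, which matters when the database is a remote Aura instance.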

Verify

Run the verification cell to check node and relationship counts. You should see:

  • ~4,900 Email nodes

  • ~5,400+ User nodes

  • ~5,500+ Mailbox nodes

  • ~1,000+ Domain nodes

  • SENT, RECEIVED, CC_ON, USED, and HAS_MAILBOX relationships
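A count check along these lines is one way to do the verification. The function name count_nodes is hypothetical, and the stand-in driver returns canned numbers that are not the real dataset counts; with a live connection, the driver object from the connection step would be passed in instead.

```python
LABELS = ["Email", "User", "Mailbox", "Domain"]


def count_nodes(driver, labels):
    """Return {label: node count} using one small query per label."""
    counts = {}
    for label in labels:
        # Labels cannot be passed as query parameters, so only trusted,
        # hard-coded label names should be interpolated like this
        records, _, _ = driver.execute_query(
            f"MATCH (n:{label}) RETURN count(n) AS count"
        )
        counts[label] = records[0]["count"]
    return counts


class FakeDriver:
    """Returns canned counts so the sketch runs without a database."""

    def __init__(self, canned):
        self.canned = canned

    def execute_query(self, query, **parameters):
        # Pull the label back out of "MATCH (n:Label) ..."
        label = query.split(":")[1].split(")")[0]
        # Mimic the (records, summary, keys) shape of the real result
        return [{"count": self.canned[label]}], None, ["count"]


# Canned numbers for illustration only
fake = FakeDriver({"Email": 3, "User": 2, "Mailbox": 2, "Domain": 1})
counts = count_nodes(fake, LABELS)
print(counts)
```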

Investigate

The graph is live. These queries demonstrate what the metadata model makes possible.

cypher
Top senders
MATCH (m:Mailbox)-[:SENT]->(e:Email)
RETURN m.address AS sender,
       count(e) AS emails
ORDER BY emails DESC LIMIT 10

cypher
Cross-domain communication
MATCH (d1:Domain)-[:HAS_MAILBOX]->
      (m1:Mailbox)-[:SENT]->(e:Email)
      <-[:RECEIVED]-(m2:Mailbox)
      <-[:HAS_MAILBOX]-(d2:Domain)
WHERE d1 <> d2
RETURN d1.name AS from_domain,
       d2.name AS to_domain,
       count(e) AS emails
ORDER BY emails DESC LIMIT 10

What the metadata graph can answer

Adapt the schema to your own data

The schema here (Email, User, Mailbox, Domain) is designed for the Enron email corpus. If your data is different (legal documents, customer support tickets, research papers), your node labels and relationships will be different too. The import pattern stays the same: define constraints on unique identifiers, MERGE on normalized values, and store raw values as properties. Design your schema around the questions you want to answer, then adapt the Cypher queries accordingly.

The graph captures who sent what to whom and when:

  • Who are the most connected people in the network?

  • Which domains communicate most with Enron?

  • Who bridges between different groups?

  • What’s the communication pattern around a specific date or event?

What it can’t answer: what they talked about. The body text is still unstructured — names, organizations, topics, and locations mentioned in the content haven’t been extracted. That’s the next course, Entity Extraction: Communication Networks.

Check your understanding

Why constraints matter

What happens if you run MERGE without a uniqueness constraint on the property you’re matching?

  • ❏ Nothing — MERGE works the same either way

  • ❏ Neo4j raises an error and refuses to import

  • ✓ MERGE still works, but slowly and unsafely: with no index it scans every node with that label for each row, and nothing stops duplicates from being created (for example by concurrent writes)

  • ❏ MERGE automatically creates an index

Hint

MERGE needs to find existing nodes efficiently. What does a uniqueness constraint provide besides preventing duplicates?

Solution

A uniqueness constraint creates a backing index and enforces uniqueness. Without it, MERGE still matches existing nodes, but it has to scan all nodes with that label for every row, which becomes very slow on large datasets, and the database no longer guarantees that duplicates cannot appear. Create the constraints before importing any data.

Merging on normalized values

The import uses MERGE (u:User {name_norm: row.name_norm}) with ON CREATE SET u.name = row.name. What does this achieve?

  • ❏ It stores only the normalized name on the node

  • ❏ It creates a new node for every row in the CSV

  • ✓ It creates one node per unique normalized name, with the raw name stored as a property for traceability

  • ❏ It updates the raw name on every merge, overwriting previous values

Hint

What does ON CREATE SET do — does it run every time, or only when a new node is created?

Solution

MERGE matches on name_norm. If a node with that normalized name already exists, it’s reused. If not, a new node is created and ON CREATE SET stores the raw name. Subsequent merges with the same name_norm but different raw names reuse the existing node without overwriting — the first raw value is preserved.
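The same behavior can be mimicked in plain Python with a dictionary keyed on the normalized name. This is only an in-memory analogy for MERGE with ON CREATE SET, using made-up rows; it does not talk to Neo4j.

```python
def merge_user(nodes, row):
    """Mimic MERGE on name_norm with ON CREATE SET for the raw name."""
    key = row["name_norm"]
    if key not in nodes:
        # ON CREATE SET: runs only when the node did not exist yet
        nodes[key] = {"name_norm": key, "name": row["name"]}
    # On a match, the existing node is returned unchanged
    return nodes[key]


# Made-up rows: two raw spellings normalize to the same value
rows = [
    {"name": "Alice Smith", "name_norm": "alice smith"},
    {"name": "SMITH, ALICE", "name_norm": "alice smith"},
    {"name": "Bob Jones", "name_norm": "bob jones"},
]

nodes = {}
for row in rows:
    merge_user(nodes, row)

print(len(nodes))                    # prints: 2
print(nodes["alice smith"]["name"])  # prints: Alice Smith
```

Three rows produce two nodes, and the first raw spelling is the one that survives.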

Summary

  • Created an Aura Free instance at console.neo4j.io — no credit card required

  • Connected via the Python driver using credentials stored in .env

  • Constraints on doc_id and the _norm properties ensure MERGE matches correctly

  • Each CSV imported in batches using UNWIND — nodes merged on normalized values, raw values stored as properties

  • The graph contains Email, User, Mailbox, and Domain nodes connected by SENT, RECEIVED, CC_ON, USED, and HAS_MAILBOX relationships

  • Metadata queries traverse relationships that would require complex joins in a flat database

Course complete. The next course, Entity Extraction: Communication Networks, extracts entities and topics from the document body text.

Companion notebook: 2.9_import_to_neo4j.ipynb
