The normalized CSV files are ready. This lesson walks through creating a Neo4j Aura Free instance, connecting to it from Python, and importing the data as a graph.
Open 2.9_import_to_neo4j.ipynb in your Codespace to follow along.
What you’ll learn
By the end of this lesson, you’ll be able to:
Create a Neo4j Aura Free instance
Connect to it from Python using the Neo4j driver
Create uniqueness constraints so that MERGE stays fast and duplicate-free
Import node and relationship CSVs in batches
Query the resulting metadata graph
Create an Aura Free instance
Go to console.neo4j.io and sign up or sign in. Click Create Instance.
Select AuraDB Free
Select AuraDB Free from the tier options. No credit card required.
Save your credentials
A modal will appear with your username and generated password. Click Download and continue to save the credentials file.
Keep this file safe — you’ll need the URI, username, and password in the next step.
Wait for the instance to start
The instance status will change from Creating to Running. This usually takes under a minute.
Add credentials to your environment
Add the connection details (URI, username, and password) to a .env file in the project root.
Then connect from Python with the Neo4j driver. Calling verify_connectivity() on the driver confirms that it can reach the instance and authenticate. If it fails, check the URI and password in the .env file.
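The connection step can be sketched as follows. This is a minimal sketch, not the notebook's actual cell: the key names NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD are assumptions (rename them to match your own .env), and parse_env is a hypothetical stdlib-only helper standing in for whatever loader the notebook uses.

```python
from pathlib import Path

# Assumed .env layout (key names are an assumption, values are placeholders):
#   NEO4J_URI=neo4j+s://<your-instance-id>.databases.neo4j.io
#   NEO4J_USERNAME=<username from the credentials file>
#   NEO4J_PASSWORD=<generated password>


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def connect(env_path=".env"):
    """Open a driver from the .env settings and check that it can log in."""
    # Imported here so parse_env stays usable without the driver installed.
    from neo4j import GraphDatabase  # pip install neo4j

    settings = parse_env(Path(env_path).read_text())
    driver = GraphDatabase.driver(
        settings["NEO4J_URI"],
        auth=(settings["NEO4J_USERNAME"], settings["NEO4J_PASSWORD"]),
    )
    driver.verify_connectivity()  # raises if the URI or credentials are wrong
    return driver
```

Calling connect() returns a driver you can reuse for the rest of the lesson. Call driver.close() when you are done.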
Constraints
Uniqueness constraints let MERGE find existing nodes quickly and guarantee that no duplicates slip in. Each constraint enforces uniqueness on one property and creates a backing index that MERGE uses for its lookups.
In your notebook, run the constraints cell. The four constraints cover doc_id for Email nodes and the _norm properties for User, Mailbox, and Domain nodes.
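As a rough illustration of what such a cell might contain, the sketch below builds one uniqueness constraint per label using Neo4j 5 syntax. The doc_id and name_norm properties come from this lesson. The Mailbox and Domain property names and the constraint names are assumptions, so check them against the notebook.

```python
# Label -> unique property. doc_id and name_norm appear in the lesson text;
# the Mailbox and Domain property names are assumptions.
UNIQUE_KEYS = {
    "Email": "doc_id",
    "User": "name_norm",
    "Mailbox": "address_norm",  # assumed property name
    "Domain": "name_norm",      # assumed property name
}


def constraint_statements(unique_keys=UNIQUE_KEYS):
    """Build one idempotent uniqueness constraint per label."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        for label, prop in unique_keys.items()
    ]


def create_constraints(driver):
    """Run every constraint statement against an open neo4j driver."""
    with driver.session() as session:
        for statement in constraint_statements():
            session.run(statement).consume()
```

IF NOT EXISTS makes the statements safe to re-run, which is convenient when a notebook cell is executed more than once.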
Import
Each CSV is imported in batches using UNWIND. Nodes MERGE on the normalized value. The raw value is stored as a property via ON CREATE SET.
In your notebook, run the import cells in order: emails, senders, recipients, CC, USED, HAS_MAILBOX.
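The batching pattern described above might look like the following sketch. The MERGE and ON CREATE SET shape follows the lesson. The helper names, the batch size of 1000, and the assumption that the CSV has name_norm and name columns are illustrative only.

```python
import csv
from itertools import islice

# Node import query; each $rows entry is one CSV row as a map.
MERGE_USERS = """
UNWIND $rows AS row
MERGE (u:User {name_norm: row.name_norm})
ON CREATE SET u.name = row.name
"""


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def import_csv(driver, csv_path, query, batch_size=1000):
    """Send a CSV to Neo4j, one batch per transaction; returns rows sent."""
    sent = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        with driver.session() as session:
            for batch in batched(csv.DictReader(handle), batch_size):
                session.run(query, rows=batch).consume()
                sent += len(batch)
    return sent
```

With an open driver, a call such as import_csv(driver, "users.csv", MERGE_USERS) would load one file (the filename is hypothetical). Sending a list of rows per query avoids one network round trip per row.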
Verify
Run the verification cell to check node and relationship counts. You should see:
~4,900 Email nodes
~5,400+ User nodes
~5,500+ Mailbox nodes
~1,000+ Domain nodes
SENT, RECEIVED, CC_ON, USED, and HAS_MAILBOX relationships
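A verification cell along these lines could produce those counts. This is a sketch under assumptions: the helper names are hypothetical, and the two Cypher queries are generic label and relationship-type counts rather than the notebook's own.

```python
NODE_COUNTS = "MATCH (n) UNWIND labels(n) AS label RETURN label, count(*) AS total"
REL_COUNTS = "MATCH ()-[r]->() RETURN type(r) AS label, count(*) AS total"


def to_counts(records):
    """Turn rows with 'label' and 'total' keys into a {label: total} dict."""
    return {row["label"]: row["total"] for row in records}


def graph_counts(driver):
    """Return (node counts by label, relationship counts by type)."""
    with driver.session() as session:
        nodes = to_counts(session.run(NODE_COUNTS).data())
        rels = to_counts(session.run(REL_COUNTS).data())
    return nodes, rels


def missing(counts, expected):
    """Names from `expected` that are absent or have a zero count."""
    return sorted(name for name in expected if counts.get(name, 0) == 0)
```

For example, missing(rels, ["SENT", "RECEIVED", "CC_ON", "USED", "HAS_MAILBOX"]) should return an empty list once every import cell has run.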
Investigate
The graph is live. These queries demonstrate what the metadata model makes possible.
Top senders
MATCH (m:Mailbox)-[:SENT]->(e:Email)
RETURN m.address AS sender,
count(e) AS emails
ORDER BY emails DESC LIMIT 10
Cross-domain communication
MATCH (d1:Domain)-[:HAS_MAILBOX]->
(m1:Mailbox)-[:SENT]->(e:Email)
<-[:RECEIVED]-(m2:Mailbox)
<-[:HAS_MAILBOX]-(d2:Domain)
WHERE d1 <> d2
RETURN d1.name AS from_domain,
d2.name AS to_domain,
count(e) AS emails
ORDER BY emails DESC LIMIT 10
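If you prefer to run these queries from the notebook rather than the Neo4j console, a small wrapper is enough. The sketch below reuses the top-senders query with the limit passed as a parameter. run_query and as_table are hypothetical helpers, not part of the Neo4j driver.

```python
TOP_SENDERS = """
MATCH (m:Mailbox)-[:SENT]->(e:Email)
RETURN m.address AS sender, count(e) AS emails
ORDER BY emails DESC LIMIT $limit
"""


def run_query(driver, query, **params):
    """Run a read query and return its rows as plain dicts."""
    with driver.session() as session:
        return session.run(query, **params).data()


def as_table(rows):
    """Render rows (dicts with identical keys) as aligned text columns."""
    if not rows:
        return "(no rows)"
    headers = list(rows[0])
    cells = [headers] + [[str(row[h]) for h in headers] for row in rows]
    widths = [max(len(line[i]) for line in cells) for i in range(len(headers))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in cells
    )
```

With an open driver, print(as_table(run_query(driver, TOP_SENDERS, limit=10))) prints the same result as the first query above.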
What the metadata graph can answer
Adapt the schema to your own data
The schema here (Email, User, Mailbox, Domain) is designed for the Enron email corpus. If your data is different (legal documents, customer support tickets, research papers), your node labels and relationships will be different too. The import pattern is the same: define constraints on unique identifiers, MERGE on normalized values, store raw values as properties. Design your schema around the questions you want to answer, then adapt the Cypher queries accordingly.
The graph captures who sent what to whom and when:
Who are the most connected people in the network?
Which domains communicate most with Enron?
Who bridges between different groups?
What’s the communication pattern around a specific date or event?
What it can’t answer: what they talked about. The body text is still unstructured — names, organizations, topics, and locations mentioned in the content haven’t been extracted. That’s the next course, Entity Extraction: Communication Networks.
Check your understanding
Why constraints matter
What happens if you run MERGE without a uniqueness constraint on the property you’re matching?
❏ Nothing — MERGE works the same either way
❏ Neo4j raises an error and refuses to import
✓ MERGE still matches, but every lookup scans all nodes with that label, so the import slows sharply as the graph grows, and nothing stops concurrent writes from creating duplicates
❏ MERGE automatically creates an index
Hint
MERGE needs to find existing nodes efficiently. What does a uniqueness constraint provide besides preventing duplicates?
Solution
A uniqueness constraint creates a backing index. Without it, MERGE still matches existing nodes, but it has to scan every node with that label for each row, which becomes very slow on large datasets. Nothing enforces uniqueness either, so concurrent transactions can create duplicates. Create the constraints before importing any data.
Merging on normalized values
The import uses MERGE (u:User {name_norm: row.name_norm}) with ON CREATE SET u.name = row.name. What does this achieve?
❏ It stores only the normalized name on the node
❏ It creates a new node for every row in the CSV
✓ It creates one node per unique normalized name, with the raw name stored as a property for traceability
❏ It updates the raw name on every merge, overwriting previous values
Hint
What does ON CREATE SET do — does it run every time, or only when a new node is created?
Solution
MERGE matches on name_norm. If a node with that normalized name already exists, it’s reused. If not, a new node is created and ON CREATE SET stores the raw name. Subsequent merges with the same name_norm but different raw names reuse the existing node without overwriting — the first raw value is preserved.
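The same behavior can be mimicked in plain Python. This is an analogy, not Neo4j code: a dict keyed by name_norm stands in for the graph, and the sample names are made up for illustration.

```python
def merge_user(nodes, row):
    """Mimic MERGE on name_norm with ON CREATE SET for the raw name."""
    key = row["name_norm"]
    if key not in nodes:
        # No match: create the node and apply ON CREATE SET.
        nodes[key] = {"name_norm": key, "name": row["name"]}
    # Match: the existing node is reused and its raw name is left alone.
    return nodes[key]


nodes = {}
for row in [
    {"name_norm": "jane doe", "name": "Jane Doe"},
    {"name_norm": "jane doe", "name": "DOE, JANE"},
]:
    merge_user(nodes, row)

print(nodes)
# {'jane doe': {'name_norm': 'jane doe', 'name': 'Jane Doe'}}
```

Two rows produce one node, and the raw name from the first row survives.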
Summary
Created an Aura Free instance at console.neo4j.io — no credit card required
Connected via the Python driver using credentials stored in .env
Uniqueness constraints on doc_id and the _norm properties keep MERGE lookups fast and prevent duplicates
Each CSV imported in batches using UNWIND — nodes merged on normalized values, raw values stored as properties
The graph contains Email, User, Mailbox, and Domain nodes connected by SENT, RECEIVED, CC_ON, USED, and HAS_MAILBOX relationships
Metadata queries traverse relationships that would require complex joins in a flat database