Handling Complex Data Scenarios

This lesson shows how to handle: NULL values, data type mismatches, self-referencing relationships, large datasets, duplicate data, and many-to-many with extra properties.

Running Cypher

The techniques in this lesson apply to Neo4j Aura. For Cypher-based solutions, use the Query tool in your AuraDB instance.

Handling NULL Values

Relational databases use NULL to represent missing or unknown values. When importing into Neo4j:

Properties with NULL values are not created - Neo4j does not store NULL properties
Use COALESCE in Cypher to provide default values if needed
Decide what NULL means — "Unknown" vs "not applicable" affects whether you omit, use a default, or use conditional logic

Example: Handling NULL Regions

In the Northwind database, the region column is often NULL for non-US customers:

sql

SELECT customer_id, company_name, region
FROM customers
WHERE region IS NULL;

When importing, choose one approach:

Skip the property - Let Neo4j omit it naturally (recommended if NULL means "not applicable")
Use a default value - Use COALESCE in Cypher to set region to "N/A" when NULL
Use conditional logic - Use CASE WHEN row.region IS NOT NULL THEN row.region END to omit the property when NULL

Handling Data Type Mismatches

Sometimes source data does not match expected types:

Mixed Data Types in Columns

If a column contains mixed types (e.g., numbers and text):

sql

-- Find non-numeric values in a supposedly numeric column
SELECT postal_code FROM customers
WHERE postal_code !~ '^[0-9]+$';

Fix before or during import:

Import as String, then convert with toInteger() or toFloat() in follow-up queries
Clean in SQL (e.g. remove non-numeric rows) before export
Use CASE WHEN to route values to different properties

Date Format Variations

Dates may be stored in different formats:

sql

-- Check date formats
SELECT DISTINCT TO_CHAR(order_date, 'YYYY-MM-DD') as formatted_date
FROM orders
LIMIT 10;

Convert to ISO format (YYYY-MM-DD) in SQL before export, or use datetime(row.order_date) in Cypher during import.

Handling Self-Referencing Relationships

The employees table has a self-referencing reports_to column for the management hierarchy.

Challenge: Order of Operations

When creating REPORTS_TO relationships:

All Employee nodes must exist first
Then create the relationships

This requires a two-pass import:

Pass 1: Create Employee nodes

cypher

LOAD CSV WITH HEADERS FROM 'file:///employees.csv' AS row
CREATE (e:Employee {
  employeeID: toInteger(row.employee_id),
  firstName: row.first_name,
  lastName: row.last_name
})

Pass 2: Create REPORTS_TO relationships

cypher

LOAD CSV WITH HEADERS FROM 'file:///employees.csv' AS row
MATCH (e:Employee {employeeID: toInteger(row.employee_id)})
MATCH (m:Employee {employeeID: toInteger(row.reports_to)})
CREATE (e)-[:REPORTS_TO]->(m)

Handling Large Datasets

For large imports, use batch processing:

Batch Processing

Instead of importing all data at once, process in batches:

cypher

:auto LOAD CSV WITH HEADERS FROM 'file:///large_file.csv' AS row
CALL {
  WITH row
  CREATE (n:Node {id: row.id, name: row.name})
} IN TRANSACTIONS OF 10000 ROWS

Periodic Commit (Legacy)

For older Neo4j versions:

cypher

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///large_file.csv' AS row
CREATE (n:Node {id: row.id})

Handling Duplicate Data

Relational data may contain duplicates that need handling:

Detecting Duplicates

sql

-- Find duplicate company names
SELECT company_name, COUNT(*)
FROM customers
GROUP BY company_name
HAVING COUNT(*) > 1;

Strategies for Duplicates

Merge duplicates - Combine into a single node
Keep all records - Use a unique identifier to distinguish
Clean before import - Remove duplicates in the source

Handling Many-to-Many with Additional Data

When junction tables have extra columns beyond foreign keys, choose:

Use relationship properties - For simple additional data
Create intermediate nodes - For complex data or when you need to query the junction independently

Example: Order Details with Complex Data

If order_details had shipment tracking, return status, and other complex data:

// Option 1: Relationship properties (simple case)
(Order)-[:CONTAINS {quantity: 10, unitPrice: 15.00}]->(Product)

// Option 2: Intermediate node (complex case)
(Order)-[:HAS_LINE_ITEM]->(OrderLine)-[:FOR_PRODUCT]->(Product)

Best Practices for Complex Data Import

Best practices from real-world import projects:

Bookmark the best practices checklist

Bookmark this section. These nine best practices serve as a reference checklist for any data import project you undertake in the future.

Best Practice 1: Create Constraints Before Import

Always create unique constraints before importing any data.

cypher

// Create all constraints first
CREATE CONSTRAINT customer_id IF NOT EXISTS
FOR (c:Customer) REQUIRE c.customerID IS UNIQUE

cypher

CREATE CONSTRAINT order_id IF NOT EXISTS
FOR (o:Order) REQUIRE o.orderID IS UNIQUE

cypher

CREATE CONSTRAINT product_id IF NOT EXISTS
FOR (p:Product) REQUIRE p.productID IS UNIQUE

Why this matters:

Prevents duplicate nodes during import
Enables efficient MERGE operations
Creates indexes automatically for faster lookups
Catches data quality issues early

Verify constraints before importing

Run SHOW CONSTRAINTS after creating them to verify they are in place before starting the import.

Best Practice 2: Import in the Correct Order

Follow this sequence for reliable imports:

Step	Action	Reason
1	Create constraints	Enable MERGE and prevent duplicates
2	Import independent nodes first	Categories, Suppliers, Shippers (no foreign keys to other entities)
3	Import dependent nodes	Products (needs Categories, Suppliers), Customers, Employees
4	Import transaction nodes	Orders (needs Customers, Employees, Shippers)
5	Create relationships	All referenced nodes must exist

Step

Action

Reason

Create constraints

Enable MERGE and prevent duplicates

Import independent nodes first

Categories, Suppliers, Shippers (no foreign keys to other entities)

Import dependent nodes

Products (needs Categories, Suppliers), Customers, Employees

Import transaction nodes

Orders (needs Customers, Employees, Shippers)

Create relationships

All referenced nodes must exist

Northwind Import Order:

// Step 2: Independent nodes
LOAD CSV ... // Categories
LOAD CSV ... // Suppliers
LOAD CSV ... // Shippers

// Step 3: Dependent nodes
LOAD CSV ... // Products (references Categories, Suppliers)
LOAD CSV ... // Customers
LOAD CSV ... // Employees

// Step 4: Transaction nodes
LOAD CSV ... // Orders (references Customers, Employees, Shippers)

// Step 5: Relationships
// PLACED, CONTAINS, IN_CATEGORY, SUPPLIES, REPORTS_TO, etc.

Best Practice 3: Use MERGE for Idempotent Imports

Make your imports re-runnable without creating duplicates:

cypher

// Bad: Creates duplicates on re-run
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
CREATE (c:Customer {customerID: row.customer_id, companyName: row.company_name})

cypher

// Good: Idempotent - safe to re-run
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
MERGE (c:Customer {customerID: row.customer_id})
SET c.companyName = row.company_name,
    c.contactName = row.contact_name,
    c.city = row.city,
    c.country = row.country

When to use MERGE and CREATE:

Command	Use When	Example
CREATE	First-time import with guaranteed unique data	Initial bulk load with clean data
MERGE	Re-runnable imports, incremental updates	Regular sync from source system
MERGE + ON CREATE/ON MATCH	Different behavior for new and existing	Update timestamps only on changes

Command

Use When

Example

CREATE

First-time import with guaranteed unique data

Initial bulk load with clean data

MERGE

Re-runnable imports, incremental updates

Regular sync from source system

MERGE + ON CREATE/ON MATCH

Different behavior for new and existing

Update timestamps only on changes

Best Practice 4: Handle NULL Values Explicitly

Do not let NULL values cause silent failures:

cypher

// Bad: May fail silently or create unexpected results
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
MERGE (c:Customer {customerID: row.customer_id})
SET c.region = row.region

cypher

// Good: Explicit NULL handling
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
MERGE (c:Customer {customerID: row.customer_id})
SET c.companyName = row.company_name,
    c.region = CASE WHEN row.region IS NOT NULL AND row.region <> ''
                    THEN row.region
                    ELSE null END

NULL Handling Strategies:

Strategy	When to Use	Implementation
Omit property	Value is truly optional	Use CASE to return null
Default value	Need consistent property for queries	Use COALESCE(row.field, 'Unknown')
Separate label	NULL indicates a category	Add label like :NoRegion

Strategy

When to Use

Implementation

Omit property

Value is truly optional

Use CASE to return null

Default value

Need consistent property for queries

Use COALESCE(row.field, 'Unknown')

Separate label

NULL indicates a category

Add label like :NoRegion

Best Practice 5: Validate Foreign Keys Before Creating Relationships

Avoid creating relationships to non-existent nodes:

cypher

// Bad: Silently fails if customer does not exist
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MATCH (c:Customer {customerID: row.customer_id})
MATCH (o:Order {orderID: toInteger(row.order_id)})
CREATE (c)-[:PLACED]->(o)

cypher

// Good: Log missing references
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
OPTIONAL MATCH (c:Customer {customerID: row.customer_id})
WITH row, c
WHERE c IS NULL
RETURN row.order_id AS orderWithMissingCustomer, row.customer_id AS missingCustomerID

Pre-import validation query:

cypher

// Check for orphan foreign keys before import
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
WITH row.customer_id AS custId
WHERE NOT EXISTS ((c:Customer {customerID: custId}))
RETURN DISTINCT custId AS missingCustomer

Best Practice 6: Use Transactions for Related Data

Keep related data consistent:

cypher

// Import order and its details in one transaction
:auto LOAD CSV WITH HEADERS FROM 'file:///orders_with_details.csv' AS row
CALL {
  WITH row
  MERGE (o:Order {orderID: toInteger(row.order_id)})
  SET o.orderDate = date(row.order_date)

  WITH o, row
  MATCH (p:Product {productID: toInteger(row.product_id)})
  MERGE (o)-[r:CONTAINS]->(p)
  SET r.quantity = toInteger(row.quantity),
      r.unitPrice = toFloat(row.unit_price)
} IN TRANSACTIONS OF 1000 ROWS

Best Practice 7: Log and Monitor Import Progress

Track what is happening during import:

cypher

// Before import: Record starting counts
MATCH (n)
RETURN labels(n)[0] AS label, COUNT(*) AS beforeCount

cypher

// After import: Compare counts
MATCH (n)
RETURN labels(n)[0] AS label, COUNT(*) AS afterCount

cypher

// Check for import issues
MATCH (o:Order)
WHERE NOT EXISTS ((:Customer)-[:PLACED]->(o))
RETURN COUNT(o) AS ordersWithoutCustomer

Create an import log:

cypher

// Create import metadata node
CREATE (i:ImportLog {
  timestamp: datetime(),
  source: 'northwind',
  customersImported: 91,
  ordersImported: 830,
  status: 'COMPLETED'
})

Best Practice 8: Clean Up After Failed Imports

If an import fails partway through:

cypher

// Remove partially imported data (be careful!)
// First, identify what was imported in the failed batch
MATCH (n)
WHERE n.importBatch = 'batch_2024_01_15'
DETACH DELETE n

cypher

// Or remove nodes without expected relationships
MATCH (o:Order)
WHERE NOT EXISTS ((:Customer)-[:PLACED]->(o))
DETACH DELETE o

Backup before running DELETE operations

Always backup your database before running DELETE operations, especially DETACH DELETE which removes nodes and all their relationships.

Best Practice 9: Document Your Import Process

Create a runbook for your import:

# Northwind Import Runbook

## Prerequisites

- [ ] Northwind database or CSV files available
- [ ] Neo4j database accessible
- [ ] CSV files exported to import directory

## Step 1: Create Constraints
Run: constraints.cypher
Expected: 7 constraints created

## Step 2: Import Categories
Run: import-categories.cypher
Expected: 8 Category nodes

## Step 3: Import Suppliers
Run: import-suppliers.cypher
Expected: 29 Supplier nodes

... (continue for all steps)

## Validation
Run: validate-counts.cypher
Run: validate-relationships.cypher
Run: validate-sample-data.cypher

Organizing Your Queries

In Neo4j Aura

In the Aura Query tool, create folders to organize the import queries from this lesson. The folder names reflect the execution order:

Folder: 01-Setup-Constraints - Save constraint creation queries (from Best Practice 1)
Folder: 02-Import-Independent-Nodes - Save queries for Categories, Suppliers, Shippers
Folder: 03-Import-Dependent-Nodes - Save queries for Products, Customers, Employees
Folder: 04-Import-Transaction-Nodes - Save queries for Orders
Folder: 05-Create-Relationships - Save relationship creation queries
Folder: 06-Validation - Save count verification and integrity check queries
Folder: Maintenance-Cleanup - Save cleanup queries for failed imports (from Best Practice 8)
Folder: Maintenance-Monitoring - Save progress monitoring queries (from Best Practice 7)

To create folders in Aura:

Open the Query tool
Click the + button next to "Saved Cypher"
Select New Folder
Name the folder and save queries inside it

In AuraDB, use the Query tool’s saved queries with descriptive names and prefixes (e.g., "01-Constraints", "02-Categories") to execute queries in the correct dependency order.

Best Practices Summary Checklist

Use this checklist for every import:

Constraints created before import
Import order follows dependency chain
Using MERGE for idempotent imports
NULL values handled explicitly
Foreign keys validated before relationship creation
Related data imported in transactions
Import progress logged and monitored
Cleanup plan ready for failed imports
Import process documented

Check Your Understanding

NULL in Neo4j imports

What happens when you import a property with a NULL value into Neo4j?

❏ Neo4j stores the property with a NULL value
❏ Neo4j throws an error and stops the import
✓ Neo4j does not create the property at all
❏ Neo4j converts NULL to an empty string

Hint

In Neo4j, a property that has no value is omitted from the node; there is no stored NULL. Relational databases store NULL explicitly.

Solution

When importing data with NULL values, Neo4j does not create the property at all.

Unlike relational databases that explicitly store NULL values, Neo4j simply omits properties that have no value. This means nodes of the same label can have different sets of properties.

If you need to distinguish between "unknown" and "not applicable," use a specific value such as "N/A" or "Unknown" instead of relying on NULL.

Import Best Practices

What is the correct order for importing relational data into Neo4j?

❏ Import all nodes first, then create all constraints, then create relationships
❏ Import relationships first, then nodes, then constraints
✓ Create constraints first, then import independent nodes, then dependent nodes, then relationships
❏ The order does not matter as long as all data is imported

Hint

Relationships connect two nodes, so those nodes must exist first. Constraints prevent duplicates and speed MERGE; the safe order is constraints, then independent nodes, then dependent nodes, then relationships.

Solution

The correct order is:

Create constraints first - Enables MERGE operations and prevents duplicates
Import independent nodes — Nodes with no foreign keys such as Categories, Suppliers, Shippers
Import dependent nodes — Nodes that reference other nodes such as Products, Customers, Employees, Orders
Create relationships - All referenced nodes must exist first

This order ensures: * No duplicate nodes are created * All foreign key references can be resolved * Relationships connect to existing nodes * Import is idempotent and can be re-run safely

Summary

In this lesson, you learned strategies for handling:

NULL values in source data
Data type mismatches
Self-referencing relationships
Large dataset imports
Duplicate data
Complex many-to-many relationships

These techniques address the complexities of production data imports that simplified examples often omit.

Importing Relational Data into Neo4j

Foundation

Designing the Graph Model

Importing Data

Validating and Querying

Handling Complex Data Scenarios

Handling NULL Values

Example: Handling NULL Regions

Handling Data Type Mismatches

Mixed Data Types in Columns

Date Format Variations

Handling Self-Referencing Relationships

Challenge: Order of Operations

Handling Large Datasets

Batch Processing

Periodic Commit (Legacy)

Handling Duplicate Data

Detecting Duplicates

Strategies for Duplicates

Handling Many-to-Many with Additional Data

Example: Order Details with Complex Data

Best Practices for Complex Data Import

Best Practice 1: Create Constraints Before Import

Best Practice 2: Import in the Correct Order

Best Practice 3: Use MERGE for Idempotent Imports

Best Practice 4: Handle NULL Values Explicitly

Best Practice 5: Validate Foreign Keys Before Creating Relationships

Best Practice 6: Use Transactions for Related Data

Best Practice 7: Log and Monitor Import Progress

Best Practice 8: Clean Up After Failed Imports

Best Practice 9: Document Your Import Process

Organizing Your Queries

In Neo4j Aura

Best Practices Summary Checklist

Check Your Understanding

NULL in Neo4j imports

Import Best Practices

Summary

Chatbot