Normalizing a property to a node

Introduction

In this lesson, you will learn how to refactor your graph by normalizing properties into separate nodes to reduce data duplication and improve data modeling consistency.

Traversing nodes and relationships is fast and efficient in Neo4j. Filtering properties by can be expensive (especially if they are not indexed).

The existence of a node

The existence of a node or relationship in the graph can also be used to indicate a specific state or condition, for example, an Order node with a SHIPPED relationship to an Address node indicates that the order has been shipped to that address.

mermaid

An order has been shipped

graph TD
    Order -->|<i>SHIPPED_TO<i>| Address
        
    Order(("<b>Order</b>"))
    
    Address(("<b><i>Address</i></b>"))

Employee role

An Employee node has a title property that indicates the employee’s role in the company.

cypher

Employee roles

MATCH (employee:Employee)
RETURN employee.firstName, employee.title

You can get a list of all the unique titles with the following query:

cypher

Unique employee titles

MATCH (employee:Employee)
RETURN DISTINCT employee.title

If you profile this query you will see that every Employee node is being scanned and the title property is accessed to determine if it is unique.

DISTINCT and performance

The use of DISTINCT is expensive as Neo4j needs to read all the values and determine the unique set.

Create Role nodes and HAS_ROLE relationships

To normalize this property, you can create a Role node for each unique title and connect the Employee nodes to their corresponding Role nodes with a HAS_ROLE relationship.

mermaid

graph TD
    Employee -->|HAS_ROLE| Role
    
    Employee(("<b>Employee</b>"))

    Role(("<b><i>Role</i></b>"))

Refactor the graph

The following Cypher query creates Role nodes and HAS_ROLE relationships based on the existing title property of Employee nodes:

cypher

Create Role nodes and HAS_ROLE relationships

MATCH (employee:Employee)                              // (1)
MERGE (role:Role {title: employee.title})              // (2)
MERGE (employee)-[r:HAS_ROLE]->(role)                  // (3)
REMOVE employee.title                                  // (4)
RETURN employee, r, role                               // (5)

Find all the employee nodes
Create a Role node for each title
Create a HAS_ROLE relationship between the Employee and Role nodes
Remove the title property from the Employee node
Return the employee, role, and relationship

Add constraint and index

The title property on the Role node is now a unique identifier for the role.

You can add a node key constraint to ensure data integrity and improve query performance:

cypher

Add uniqueness constraint and index on Role title

CREATE CONSTRAINT title_Role_key IF NOT EXISTS
FOR (r:Role) REQUIRE r.title IS NODE KEY

Indexes and constraints

Creating a constraint will also create an index on the same label and property.

Viewing constraints

You can view the constraints in your database with the following query:

cypher

SHOW CONSTRAINTS

You will see the title_Role_key constraint, along with other constraints that were created when the data was imported such as categoryID_Category_uniq.

Querying roles

Querying roles is now more efficient as you can use the Role label.

cypher

MATCH (r:Role)
RETURN r.title

Querying employees by role

Introducing a 1 step traversal to get the role of an employee is slightly more expensive, but it is more efficient to query for all employees with a specific role.

cypher

Query employees with a specific role

MATCH (r:Role {title: "Sales Representative"})<-[:HAS_ROLE]-(e:Employee)
RETURN r.title, e.firstName

Detach delete

You can use DETACH DELETE to delete a node and all its relationships in one step.

To remove the Role nodes and HAS_ROLE relationships and revert back to the original model, you could use the following query:

cypher

Remove Role nodes and HAS_ROLE relationships

MATCH (employee:Employee)-[:HAS_ROLE]->(role:Role)
SET employee.title = role.title
DETACH DELETE role

The title property is SET on the employee note and DETACH DELETE deletes the Role node and all HAS_ROLE relationships.

Addresses

The data model contains several addresses which are represented as properties on the Order, Customer, Employee, and Supplier nodes.

These addresses are not connected to other nodes in the graph and are not easily queryable or comparable.

The shipping address for an order is represented by the following properties on the Order node:

cypher

Order shipping addresses

MATCH (order:Order)
RETURN DISTINCT
  order.shipAddress,
  order.shipCity,
  order.shipRegion,
  order.shipPostalCode,
  order.shipCountry

Normalizing shipping addresses

The shipping address can be normalized to a separate Address node and connected to the Order node with a SHIPPED_TO relationship.

mermaid

graph TD
    Order -->|<i>SHIPPED_TO<i>| Address
        
    Order(("<b>Order</b>"))
    
    Address(("<b><i>Address</i></b>"))

Normalizing the shipping address would allow you to easily query for all orders shipped to an address or compare addresses to find orders shipped to the same or new location.

Creating shipping Address nodes

The following Cypher query normalizes the shipping address creating Address nodes and SHIPPED_TO relationships:

cypher

Normalize shipping address to an Address node

MATCH (order:Order)                                        // (1)
MERGE (address:Address {                                   // (2)
  address: order.shipAddress,
  city: order.shipCity,
  region: order.shipRegion,
  postalCode: order.shipPostalCode,
  country: order.shipCountry})
REMOVE                                                     // (3)
  order.shipAddress, order.shipCity,
  order.shipRegion, order.shipPostalCode,
  order.shipCountry
MERGE (order)-[s:SHIPPED_TO]->(address)                    // (4)
RETURN order, s, address                                   // (5)

Find all the Order nodes.
Create an Address node for each unique combination of address properties.
Remove the address properties from the Order node.
Create a SHIPPED_TO relationship between the Order and Address nodes.
Return the order, relationship, and address.

Merge on address properties

The merge is happening on all the address properties to ensure that you only create one Address node for each unique address.

Creating the address constraint

The combination of address properties can be used as a unique identifier for the Address node.

The following query creates a node key constraint on all the address properties:

cypher

CREATE CONSTRAINT Address_key IF NOT EXISTS
FOR (a:Address)
REQUIRE (a.address, a.city, a.region, a.postalCode, a.country)
IS NODE KEY

An index is also created on the address properties to improve query performance when searching for addresses.

Create constraints before refactoring

It is best practice, particularly when dealing with large datasets, to create constraints before refactoring your graph.

Creating the constraint early will ensure data integrity and the index will improve query performance during the refactoring process.

Creating address indexes

To further improve query performance on the Address nodes, you can create individual indexes on the properties that are commonly queried.

For example, if you frequently query for addresses by country, you can create an index on the country property:

cypher

Create Address country index

CREATE INDEX address_country
IF NOT EXISTS
FOR (a:Address) ON a.country

Viewing indexes

You can view indexes in your database with the following query:

cypher

View indexes

SHOW INDEXES

Querying addresses

You can use the Address nodes and SHIPPED_TO relationships to query for orders shipped to a specific country or to compare addresses:

cypher

Orders shipped to the USA

MATCH (order:Order)-[:SHIPPED_TO]->(address:Address {country: "USA"})
RETURN
  order.orderID,
  address.address,
  address.city,
  address.region,
  address.postalCode

Top shipped addresses

You can find the top shipped addresses by counting the number of orders shipped to each address:

cypher

Top shipped addresses

MATCH (order:Order)-[:SHIPPED_TO]->(address:Address)
RETURN
  address.address,
  address.city,
  address.region,
  address.postalCode,
  address.country,
  count(order) AS ordersShipped
ORDER BY ordersShipped DESC
LIMIT 10

Refactor all addresses

You challenge is to refactor all the addresses in the graph by normalizing the address properties on the Employee, Customer, and Supplier nodes to the Address nodes.

mermaid

Address nodes and relationships

graph TD
    Order -->|<i>SHIPPED_TO<i>| Address
    Supplier --> Address
    Customer --> Address
    Employee --> Address
        
    Order(("<b>Order</b>"))
       
    Supplier(("<b>Supplier</b>"))
    
    Customer(("<b>Customer</b>"))
    
    Employee(("<b>Employee</b>"))

    Address(("<b><i>Address</i></b>"))

You will need to:

Identify the address properties on the Employee, Customer, and Supplier nodes.
Define a suitable relationship types for the Address to the Employee, Customer, and Supplier nodes.
Create a query (or queries) to refactor the graph by creating Address nodes and relationships to the Employee, Customer, and Supplier nodes.
Create queries to analyze the new graph structure and answer questions such as - What orders are shipped to an address different to the customer’s known address?

Click to reveal the solution

The address properties on the Employee, Customer, and Supplier nodes are:
- address
- city
- region
- postalCode
- country

A suitable relationship type could be LOCATED_AT.

mermaid

graph TD
    Order -->|<i>SHIPPED_TO<i>| Address
    Supplier -->|<i>LOCATED_AT<i>| Address
    Customer -->|<i>LOCATED_AT<i>| Address
    Employee -->|<i>LOCATED_AT<i>| Address
        
    Order(("<b>Order</b>"))
       
    Supplier(("<b>Supplier</b>"))
    
    Customer(("<b>Customer</b>"))
    
    Employee(("<b>Employee</b>"))

    Address(("<b><i>Address</i></b>"))

The following query refactors the graph by creating Address nodes and LOCATED_AT relationships to the Employee, Customer, and Supplier nodes:

cypher

MATCH (n:Employee|Customer|Supplier)
MERGE (address:Address {
  address: n.address,
  city: n.city,
  region: n.region,
  postalCode: n.postalCode,
  country: n.country})
REMOVE
  n.address,
  n.city,
  n.region,
  n.postalCode,
  n.country
MERGE (n)-[l:LOCATED_AT]->(address)
RETURN n, l, address

Find orders shipped to an address different than the customer’s known address:

cypher

MATCH (c:Customer)-[:LOCATED_AT]-(customerAddress)
MATCH (c)-[:PURCHASED]->(o:Order)-[:SHIPPED_TO]->(shipAddress)
WHERE customerAddress <> shipAddress
RETURN o.orderID, customerAddress.address, shipAddress.address

Lesson Summary

In this lesson, you learned about normalizing properties to nodes to reduce data duplication and create a more maintainable graph structure.

In the next challenge, you will identify an opportunity to refactor the graph.

Neo4j Management, Optimization, and Refactoring Workshop

Setup Neo4j Aura

Aura Administration

Query Performance

Graph refactoring

Normalizing a property to a node

Introduction

The existence of a node

Employee role

Create Role nodes and HAS_ROLE relationships

Refactor the graph

Add constraint and index

Viewing constraints

Querying roles

Querying employees by role

Addresses

Normalizing shipping addresses

Creating shipping Address nodes

Creating the address constraint

Creating address indexes

Viewing indexes

Querying addresses

Top shipped addresses

Refactor all addresses

Lesson Summary

Chatbot