Normalizing a property to a node

Introduction

In this lesson, you will learn how to refactor your graph by normalizing properties into separate nodes to reduce data duplication and improve data modeling consistency.

Traversing nodes and relationships is fast and efficient in Neo4j. Filtering properties by can be expensive (especially if they are not indexed).

The existence of a node

The existence of a node or relationship in the graph can also be used to indicate a specific state or condition, for example, an Order node with a SHIPPED relationship to an Address node indicates that the order has been shipped to that address.

mermaid
An order has been shipped
graph TD
    Order -->|<i>SHIPPED_TO<i>| Address
        
    Order(("<b>Order</b>"))
    
    Address(("<b><i>Address</i></b>"))

Employee role

An Employee node has a title property that indicates the employee’s role in the company.

cypher
Employee roles
MATCH (employee:Employee)
RETURN employee.firstName, employee.title

You can get a list of all the unique titles with the following query:

cypher
Unique employee titles
MATCH (employee:Employee)
RETURN DISTINCT employee.title

If you profile this query you will see that every Employee node is being scanned and the title property is accessed to determine if it is unique.

DISTINCT and performance

The use of DISTINCT is expensive as Neo4j needs to read all the values and determine the unique set.

Create Role nodes and HAS_ROLE relationships

To normalize this property, you can create a Role node for each unique title and connect the Employee nodes to their corresponding Role nodes with a HAS_ROLE relationship.

mermaid
graph TD
    Employee -->|HAS_ROLE| Role
    
    Employee(("<b>Employee</b>"))

    Role(("<b><i>Role</i></b>"))

Refactor the graph

The following Cypher query creates Role nodes and HAS_ROLE relationships based on the existing title property of Employee nodes:

cypher
Create Role nodes and HAS_ROLE relationships
MATCH (employee:Employee)                              // (1)
MERGE (role:Role {title: employee.title})              // (2)
MERGE (employee)-[r:HAS_ROLE]->(role)                  // (3)
REMOVE employee.title                                  // (4)
RETURN employee, r, role                               // (5)
  1. Find all the employee nodes

  2. Create a Role node for each title

  3. Create a HAS_ROLE relationship between the Employee and Role nodes

  4. Remove the title property from the Employee node

  5. Return the employee, role, and relationship

Add constraint and index

The title property on the Role node is now a unique identifier for the role.

You can add a node key constraint to ensure data integrity and improve query performance:

cypher
Add uniqueness constraint and index on Role title
CREATE CONSTRAINT title_Role_key IF NOT EXISTS
FOR (r:Role) REQUIRE r.title IS NODE KEY

Indexes and constraints

Creating a constraint will also create an index on the same label and property.

Viewing constraints

You can view the constraints in your database with the following query:

cypher
SHOW CONSTRAINTS

You will see the title_Role_key constraint, along with other constraints that were created when the data was imported such as categoryID_Category_uniq.

Querying roles

Querying roles is now more efficient as you can use the Role label.

cypher
MATCH (r:Role)
RETURN r.title

Querying employees by role

Introducing a 1 step traversal to get the role of an employee is slightly more expensive, but it is more efficient to query for all employees with a specific role.

cypher
Query employees with a specific role
MATCH (r:Role {title: "Sales Representative"})<-[:HAS_ROLE]-(e:Employee)
RETURN r.title, e.firstName

Detach delete

You can use DETACH DELETE to delete a node and all its relationships in one step.

To remove the Role nodes and HAS_ROLE relationships and revert back to the original model, you could use the following query:

cypher
Remove Role nodes and HAS_ROLE relationships
MATCH (employee:Employee)-[:HAS_ROLE]->(role:Role)
SET employee.title = role.title
DETACH DELETE role

The title property is SET on the employee note and DETACH DELETE deletes the Role node and all HAS_ROLE relationships.

Addresses

The data model contains several addresses which are represented as properties on the Order, Customer, Employee, and Supplier nodes.

These addresses are not connected to other nodes in the graph and are not easily queryable or comparable.

The shipping address for an order is represented by the following properties on the Order node:

cypher
Order shipping addresses
MATCH (order:Order)
RETURN DISTINCT
  order.shipAddress,
  order.shipCity,
  order.shipRegion,
  order.shipPostalCode,
  order.shipCountry

Normalizing shipping addresses

The shipping address can be normalized to a separate Address node and connected to the Order node with a SHIPPED_TO relationship.

mermaid
graph TD
    Order -->|<i>SHIPPED_TO<i>| Address
        
    Order(("<b>Order</b>"))
    
    Address(("<b><i>Address</i></b>"))

Normalizing the shipping address would allow you to easily query for all orders shipped to an address or compare addresses to find orders shipped to the same or new location.

Creating shipping Address nodes

The following Cypher query normalizes the shipping address creating Address nodes and SHIPPED_TO relationships:

cypher
Normalize shipping address to an Address node
MATCH (order:Order)                                        // (1)
MERGE (address:Address {                                   // (2)
  address: order.shipAddress,
  city: order.shipCity,
  region: order.shipRegion,
  postalCode: order.shipPostalCode,
  country: order.shipCountry})
REMOVE                                                     // (3)
  order.shipAddress, order.shipCity,
  order.shipRegion, order.shipPostalCode,
  order.shipCountry
MERGE (order)-[s:SHIPPED_TO]->(address)                    // (4)
RETURN order, s, address                                   // (5)
  1. Find all the Order nodes.

  2. Create an Address node for each unique combination of address properties.

  3. Remove the address properties from the Order node.

  4. Create a SHIPPED_TO relationship between the Order and Address nodes.

  5. Return the order, relationship, and address.

Merge on address properties

The merge is happening on all the address properties to ensure that you only create one Address node for each unique address.

Creating the address constraint

The combination of address properties can be used as a unique identifier for the Address node.

The following query creates a node key constraint on all the address properties:

cypher
CREATE CONSTRAINT Address_key IF NOT EXISTS
FOR (a:Address)
REQUIRE (a.address, a.city, a.region, a.postalCode, a.country)
IS NODE KEY

An index is also created on the address properties to improve query performance when searching for addresses.

Create constraints before refactoring

It is best practice, particularly when dealing with large datasets, to create constraints before refactoring your graph.

Creating the constraint early will ensure data integrity and the index will improve query performance during the refactoring process.

Creating address indexes

To further improve query performance on the Address nodes, you can create individual indexes on the properties that are commonly queried.

For example, if you frequently query for addresses by country, you can create an index on the country property:

cypher
Create Address country index
CREATE INDEX address_country
IF NOT EXISTS
FOR (a:Address) ON a.country

Viewing indexes

You can view indexes in your database with the following query:

cypher
View indexes
SHOW INDEXES

Querying addresses

You can use the Address nodes and SHIPPED_TO relationships to query for orders shipped to a specific country or to compare addresses:

cypher
Orders shipped to the USA
MATCH (order:Order)-[:SHIPPED_TO]->(address:Address {country: "USA"})
RETURN
  order.orderID,
  address.address,
  address.city,
  address.region,
  address.postalCode

Top shipped addresses

You can find the top shipped addresses by counting the number of orders shipped to each address:

cypher
Top shipped addresses
MATCH (order:Order)-[:SHIPPED_TO]->(address:Address)
RETURN
  address.address,
  address.city,
  address.region,
  address.postalCode,
  address.country,
  count(order) AS ordersShipped
ORDER BY ordersShipped DESC
LIMIT 10

Refactor all addresses

You challenge is to refactor all the addresses in the graph by normalizing the address properties on the Employee, Customer, and Supplier nodes to the Address nodes.

mermaid
Address nodes and relationships
graph TD
    Order -->|<i>SHIPPED_TO<i>| Address
    Supplier --> Address
    Customer --> Address
    Employee --> Address
        
    Order(("<b>Order</b>"))
       
    Supplier(("<b>Supplier</b>"))
    
    Customer(("<b>Customer</b>"))
    
    Employee(("<b>Employee</b>"))

    Address(("<b><i>Address</i></b>"))

You will need to:

  1. Identify the address properties on the Employee, Customer, and Supplier nodes.

  2. Define a suitable relationship types for the Address to the Employee, Customer, and Supplier nodes.

  3. Create a query (or queries) to refactor the graph by creating Address nodes and relationships to the Employee, Customer, and Supplier nodes.

  4. Create queries to analyze the new graph structure and answer questions such as - What orders are shipped to an address different to the customer’s known address?

Click to reveal the solution
  1. The address properties on the Employee, Customer, and Supplier nodes are:

    • address

    • city

    • region

    • postalCode

    • country

  2. A suitable relationship type could be LOCATED_AT.

    mermaid
    graph TD
        Order -->|<i>SHIPPED_TO<i>| Address
        Supplier -->|<i>LOCATED_AT<i>| Address
        Customer -->|<i>LOCATED_AT<i>| Address
        Employee -->|<i>LOCATED_AT<i>| Address
            
        Order(("<b>Order</b>"))
           
        Supplier(("<b>Supplier</b>"))
        
        Customer(("<b>Customer</b>"))
        
        Employee(("<b>Employee</b>"))
    
        Address(("<b><i>Address</i></b>"))
  3. The following query refactors the graph by creating Address nodes and LOCATED_AT relationships to the Employee, Customer, and Supplier nodes:

    cypher
    MATCH (n:Employee|Customer|Supplier)
    MERGE (address:Address {
      address: n.address,
      city: n.city,
      region: n.region,
      postalCode: n.postalCode,
      country: n.country})
    REMOVE
      n.address,
      n.city,
      n.region,
      n.postalCode,
      n.country
    MERGE (n)-[l:LOCATED_AT]->(address)
    RETURN n, l, address
  4. Find orders shipped to an address different than the customer’s known address:

    cypher
    MATCH (c:Customer)-[:LOCATED_AT]-(customerAddress)
    MATCH (c)-[:PURCHASED]->(o:Order)-[:SHIPPED_TO]->(shipAddress)
    WHERE customerAddress <> shipAddress
    RETURN o.orderID, customerAddress.address, shipAddress.address

Lesson Summary

In this lesson, you learned about normalizing properties to nodes to reduce data duplication and create a more maintainable graph structure.

In the next challenge, you will identify an opportunity to refactor the graph.

Chatbot

How can I help you today?