Introduction
In this lesson, you will learn how to refactor your graph by normalizing properties into separate nodes to reduce data duplication and improve data modeling consistency.
Traversing nodes and relationships is fast and efficient in Neo4j. Filtering properties by can be expensive (especially if they are not indexed).
The existence of a node
The existence of a node or relationship in the graph can also be used to indicate a specific state or condition, for example, an Order node with a SHIPPED relationship to an Address node indicates that the order has been shipped to that address.
graph TD
Order -->|<i>SHIPPED_TO<i>| Address
Order(("<b>Order</b>"))
Address(("<b><i>Address</i></b>"))Employee role
An Employee node has a title property that indicates the employee’s role in the company.
MATCH (employee:Employee)
RETURN employee.firstName, employee.titleYou can get a list of all the unique titles with the following query:
MATCH (employee:Employee)
RETURN DISTINCT employee.titleIf you profile this query you will see that every Employee node is being scanned and the title property is accessed to determine if it is unique.
DISTINCT and performance
The use of DISTINCT is expensive as Neo4j needs to read all the values and determine the unique set.
Create Role nodes and HAS_ROLE relationships
To normalize this property, you can create a Role node for each unique title and connect the Employee nodes to their corresponding Role nodes with a HAS_ROLE relationship.
graph TD
Employee -->|HAS_ROLE| Role
Employee(("<b>Employee</b>"))
Role(("<b><i>Role</i></b>"))Refactor the graph
The following Cypher query creates Role nodes and HAS_ROLE relationships based on the existing title property of Employee nodes:
MATCH (employee:Employee) // (1)
MERGE (role:Role {title: employee.title}) // (2)
MERGE (employee)-[r:HAS_ROLE]->(role) // (3)
REMOVE employee.title // (4)
RETURN employee, r, role // (5)-
Find all the employee nodes
-
Create a Role node for each title
-
Create a HAS_ROLE relationship between the
EmployeeandRolenodes -
Remove the title property from the Employee node
-
Return the employee, role, and relationship
Add constraint and index
The title property on the Role node is now a unique identifier for the role.
You can add a node key constraint to ensure data integrity and improve query performance:
CREATE CONSTRAINT title_Role_key IF NOT EXISTS
FOR (r:Role) REQUIRE r.title IS NODE KEYIndexes and constraints
Creating a constraint will also create an index on the same label and property.
Viewing constraints
You can view the constraints in your database with the following query:
SHOW CONSTRAINTSYou will see the title_Role_key constraint, along with other constraints that were created when the data was imported such as categoryID_Category_uniq.
Querying roles
Querying roles is now more efficient as you can use the Role label.
MATCH (r:Role)
RETURN r.titleQuerying employees by role
Introducing a 1 step traversal to get the role of an employee is slightly more expensive, but it is more efficient to query for all employees with a specific role.
MATCH (r:Role {title: "Sales Representative"})<-[:HAS_ROLE]-(e:Employee)
RETURN r.title, e.firstNameDetach delete
You can use DETACH DELETE to delete a node and all its relationships in one step.
To remove the Role nodes and HAS_ROLE relationships and revert back to the original model, you could use the following query:
MATCH (employee:Employee)-[:HAS_ROLE]->(role:Role)
SET employee.title = role.title
DETACH DELETE roleThe title property is SET on the employee note and DETACH DELETE deletes the Role node and all HAS_ROLE relationships.
Addresses
The data model contains several addresses which are represented as properties on the Order, Customer, Employee, and Supplier nodes.
These addresses are not connected to other nodes in the graph and are not easily queryable or comparable.
The shipping address for an order is represented by the following properties on the Order node:
MATCH (order:Order)
RETURN DISTINCT
order.shipAddress,
order.shipCity,
order.shipRegion,
order.shipPostalCode,
order.shipCountryNormalizing shipping addresses
The shipping address can be normalized to a separate Address node and connected to the Order node with a SHIPPED_TO relationship.
graph TD
Order -->|<i>SHIPPED_TO<i>| Address
Order(("<b>Order</b>"))
Address(("<b><i>Address</i></b>"))Normalizing the shipping address would allow you to easily query for all orders shipped to an address or compare addresses to find orders shipped to the same or new location.
Creating shipping Address nodes
The following Cypher query normalizes the shipping address creating Address nodes and SHIPPED_TO relationships:
MATCH (order:Order) // (1)
MERGE (address:Address { // (2)
address: order.shipAddress,
city: order.shipCity,
region: order.shipRegion,
postalCode: order.shipPostalCode,
country: order.shipCountry})
REMOVE // (3)
order.shipAddress, order.shipCity,
order.shipRegion, order.shipPostalCode,
order.shipCountry
MERGE (order)-[s:SHIPPED_TO]->(address) // (4)
RETURN order, s, address // (5)-
Find all the
Ordernodes. -
Create an
Addressnode for each unique combination of address properties. -
Remove the address properties from the
Ordernode. -
Create a
SHIPPED_TOrelationship between theOrderandAddressnodes. -
Return the order, relationship, and address.
Merge on address properties
The merge is happening on all the address properties to ensure that you only create one Address node for each unique address.
Creating the address constraint
The combination of address properties can be used as a unique identifier for the Address node.
The following query creates a node key constraint on all the address properties:
CREATE CONSTRAINT Address_key IF NOT EXISTS
FOR (a:Address)
REQUIRE (a.address, a.city, a.region, a.postalCode, a.country)
IS NODE KEYAn index is also created on the address properties to improve query performance when searching for addresses.
Create constraints before refactoring
It is best practice, particularly when dealing with large datasets, to create constraints before refactoring your graph.
Creating the constraint early will ensure data integrity and the index will improve query performance during the refactoring process.
Creating address indexes
To further improve query performance on the Address nodes, you can create individual indexes on the properties that are commonly queried.
For example, if you frequently query for addresses by country, you can create an index on the country property:
CREATE INDEX address_country
IF NOT EXISTS
FOR (a:Address) ON a.countryViewing indexes
You can view indexes in your database with the following query:
SHOW INDEXESQuerying addresses
You can use the Address nodes and SHIPPED_TO relationships to query for orders shipped to a specific country or to compare addresses:
MATCH (order:Order)-[:SHIPPED_TO]->(address:Address {country: "USA"})
RETURN
order.orderID,
address.address,
address.city,
address.region,
address.postalCodeTop shipped addresses
You can find the top shipped addresses by counting the number of orders shipped to each address:
MATCH (order:Order)-[:SHIPPED_TO]->(address:Address)
RETURN
address.address,
address.city,
address.region,
address.postalCode,
address.country,
count(order) AS ordersShipped
ORDER BY ordersShipped DESC
LIMIT 10Refactor all addresses
You challenge is to refactor all the addresses in the graph by normalizing the address properties on the Employee, Customer, and Supplier nodes to the Address nodes.
graph TD
Order -->|<i>SHIPPED_TO<i>| Address
Supplier --> Address
Customer --> Address
Employee --> Address
Order(("<b>Order</b>"))
Supplier(("<b>Supplier</b>"))
Customer(("<b>Customer</b>"))
Employee(("<b>Employee</b>"))
Address(("<b><i>Address</i></b>"))You will need to:
-
Identify the address properties on the
Employee,Customer, andSuppliernodes. -
Define a suitable relationship types for the
Addressto theEmployee,Customer, andSuppliernodes. -
Create a query (or queries) to refactor the graph by creating
Addressnodes and relationships to theEmployee,Customer, andSuppliernodes. -
Create queries to analyze the new graph structure and answer questions such as -
What orders are shipped to an address different to the customer’s known address?
Click to reveal the solution
-
The address properties on the
Employee,Customer, andSuppliernodes are:-
address -
city -
region -
postalCode -
country
-
-
A suitable relationship type could be
LOCATED_AT.mermaidgraph TD Order -->|<i>SHIPPED_TO<i>| Address Supplier -->|<i>LOCATED_AT<i>| Address Customer -->|<i>LOCATED_AT<i>| Address Employee -->|<i>LOCATED_AT<i>| Address Order(("<b>Order</b>")) Supplier(("<b>Supplier</b>")) Customer(("<b>Customer</b>")) Employee(("<b>Employee</b>")) Address(("<b><i>Address</i></b>")) -
The following query refactors the graph by creating
Addressnodes andLOCATED_ATrelationships to theEmployee,Customer, andSuppliernodes:cypherMATCH (n:Employee|Customer|Supplier) MERGE (address:Address { address: n.address, city: n.city, region: n.region, postalCode: n.postalCode, country: n.country}) REMOVE n.address, n.city, n.region, n.postalCode, n.country MERGE (n)-[l:LOCATED_AT]->(address) RETURN n, l, address -
Find orders shipped to an address different than the customer’s known address:
cypherMATCH (c:Customer)-[:LOCATED_AT]-(customerAddress) MATCH (c)-[:PURCHASED]->(o:Order)-[:SHIPPED_TO]->(shipAddress) WHERE customerAddress <> shipAddress RETURN o.orderID, customerAddress.address, shipAddress.address
Lesson Summary
In this lesson, you learned about normalizing properties to nodes to reduce data duplication and create a more maintainable graph structure.
In the next challenge, you will identify an opportunity to refactor the graph.