Querying duplicate data
Here is our current instance model where each Movie node has a languages property:
For our latest use case:
Use case #11: What movies are available in a particular language?
This query finds all movies in Italian:
MATCH (m:Movie)
WHERE 'Italian' IN m.languages
RETURN m.title
What this query does is retrieve all Movie nodes and then test whether the languages property contains Italian. There are two issues with the data model, especially if the graph scales:
-
The name of the language is duplicated in many Movie nodes.
-
In order to perform the query, all Movie nodes must be retrieved.
A solution here is to model properties as nodes.
Refactor properties as nodes
Here are the steps we use to refactor:
-
We take the property values for each Movie node and create a Language node.
-
Then we create the IN_LANGUAGE relationship between that Movie node and the Language node.
-
Finally, we remove the languages property from the Movie node.
This is the code to refactor the graph to turn the property values into nodes:
MATCH (m:Movie)
UNWIND m.languages AS language
WITH language, collect(m) AS movies
MERGE (l:Language {name:language})
WITH l, movies
UNWIND movies AS m
WITH l,m
MERGE (m)-[:IN_LANGUAGE]->(l);
MATCH (m:Movie)
SET m.languages = null
This code iterates through all Movie nodes and creates a Language node for each language it finds and then creates the relationship between the Movie node and Language node using the IN_LANGUAGE relationship.
It uses the Cypher UNWIND
clause to separate each element of the languages property list into a separate row value that is processed later in the query.
This is what the instance model looks like after the refactoring:
There will only be one node with the language value of English and we remove the languages property from all Movie nodes. This eliminates a lot of duplication in the graph.
Check your understanding
Separating lists
What Cypher clause do you use to separate list elements?
-
❏
MATCH
-
❏
WITH
-
❏
RETURN
-
✓
UNWIND
Hint
This clause operates on a list to return a row value.
Solution
The clause used to separate a list into individual elements is called UNWIND
.
Summary
In this lesson, you learned how to create nodes from properties containing lists of values to eliminate duplication. In the next challenge, you perform a refactoring that creates Language nodes from the languages properties.