Data model for this course
This course uses the movie recommendations dataset as a starting point for your learning.
This is the same dataset that is used in the application development courses in GraphAcademy.
Here is the graph data model:
The node labels for the graph include:
- Person
- Actor
- Director
- Movie
- Genre
- User
The relationships for the graph include:
- ACTED_IN (with an optional role property)
- DIRECTED (with an optional role property)
- RATED (with rating and timestamp properties)
- IN_GENRE
Also notice that the nodes have a number of properties, along with the type of data that will be used for each property.
Step 1: Identify constraints
You use constraints to:
- Uniquely identify a node.
- Ensure a property exists for a node or relationship.
- Ensure a set of properties is unique and exists for a node (Node key).
Uniqueness constraints
You analyze the data requirements for the application and determine how each node will be uniquely identified. In our Movie graph, we will define uniqueness constraints for these node labels:
- Movie nodes use movieId.
- Person nodes use tmdbId.
- User nodes use userId.
- Genre nodes use name.
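The four uniqueness constraints listed above might be written as follows. This is a minimal sketch that assumes Neo4j 5 syntax. The constraint names are hypothetical, and any unique name would work.

```cypher
// Constraint names are illustrative assumptions, not part of the data model.
CREATE CONSTRAINT movie_movieId_unique IF NOT EXISTS
FOR (m:Movie) REQUIRE m.movieId IS UNIQUE;

CREATE CONSTRAINT person_tmdbId_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.tmdbId IS UNIQUE;

CREATE CONSTRAINT user_userId_unique IF NOT EXISTS
FOR (u:User) REQUIRE u.userId IS UNIQUE;

CREATE CONSTRAINT genre_name_unique IF NOT EXISTS
FOR (g:Genre) REQUIRE g.name IS UNIQUE;
```

IF NOT EXISTS makes each statement safe to re-run if the constraint is already in place.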
Existence constraints
Depending on how data is loaded or updated in the graph, you may also want to require that specific properties exist for nodes or relationships. These constraints are separate from the uniqueness constraints.
For example, you may want to enforce that every ACTED_IN relationship has a value for its role property, or that every Person node has a value for the name property.
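The two examples above could be expressed as the following existence constraints. This is a sketch that assumes Neo4j 5 syntax and Enterprise Edition, which property existence constraints require. The constraint names are hypothetical.

```cypher
// Every Person node must have a name (constraint name is an assumption).
CREATE CONSTRAINT person_name_exists IF NOT EXISTS
FOR (p:Person) REQUIRE p.name IS NOT NULL;

// Every ACTED_IN relationship must have a role (constraint name is an assumption).
CREATE CONSTRAINT acted_in_role_exists IF NOT EXISTS
FOR ()-[r:ACTED_IN]-() REQUIRE r.role IS NOT NULL;
```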
Node key constraints
In addition, there may be a combination of property values for a node that you want to ensure exist and are unique for every node with that label.
For example, there cannot be two Movie nodes in the graph that have the same title and year.
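A node key for the title and year combination might look like the sketch below. It assumes Neo4j 5 syntax, Enterprise Edition (required for node key constraints), and that Movie nodes store these values in properties named title and year. The constraint name is hypothetical.

```cypher
// Requires that both properties exist and that the pair is unique per Movie node.
// The property names title and year are assumed from the example above.
CREATE CONSTRAINT movie_title_year_key IF NOT EXISTS
FOR (m:Movie) REQUIRE (m.title, m.year) IS NODE KEY;
```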
Step 2: Create constraints
Next, you create the constraints identified in your analysis.
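After creating constraints, one way to confirm what exists in the graph is to list them. The following is a small sketch assuming Neo4j 5, where SHOW CONSTRAINTS supports a YIELD clause:

```cypher
// List each constraint with its type and the labels and properties it covers.
SHOW CONSTRAINTS
YIELD name, type, labelsOrTypes, properties;
```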
Step 3: Load the data
You typically load the data for your application and verify that all of it loaded correctly and adheres to the constraints you defined. If a constraint is violated, the Cypher load will fail.
A best practice is to always use MERGE for creating nodes and relationships. MERGE first does a lookup (using the index that backs the uniqueness constraint), then creates the node if it does not exist.
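As an illustration of this pattern, the sketch below merges on the uniquely constrained property only and sets other properties when the node is first created. The parameters $movieId and $title are hypothetical.

```cypher
// Look up the Movie by its unique identifier; create it only if it is missing.
MERGE (m:Movie {movieId: $movieId})
// Set additional properties only when the node was just created.
ON CREATE SET m.title = $title
RETURN m.movieId, m.title;
```

Merging on the identifier alone, rather than on every property, keeps the lookup on the constrained property and avoids creating a second node when a non-key value differs.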
You can use LOAD CSV to load data, or you can use the Neo4j Data Importer App. The Neo4j Data Importer App creates the uniqueness constraints for you.
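A LOAD CSV load that follows the MERGE practice might look like this sketch. The file name movies.csv and its column headers movieId and title are assumptions for illustration, not the course's actual data files.

```cypher
// Hypothetical file and column names; adjust to match the real CSV.
LOAD CSV WITH HEADERS FROM 'file:///movies.csv' AS row
// MERGE on the uniquely constrained property so reruns do not duplicate nodes.
MERGE (m:Movie {movieId: row.movieId})
ON CREATE SET m.title = row.title;
```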
Step 4: Identify indexes
Identifying the indexes for your graph depends on the most important use cases (queries) of your application.
For example, if this is an important query in your application:
// Find all movies for this actor
// aName is a parameter with a string value for an actor
MATCH (p:Person)-[:ACTED_IN]->(m)
WHERE p.name = $aName
RETURN m.title
The anchor of the query is the Person node with a specific name value. This query can benefit from a RANGE index on the name property.
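A RANGE index for this query could be created as sketched below, assuming Neo4j 5 syntax. The index name is hypothetical.

```cypher
// Index name is an illustrative assumption.
CREATE RANGE INDEX person_name_range IF NOT EXISTS
FOR (p:Person) ON (p.name);
```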
But if your important query is the following:
// Find all actors for a movie with this in the title
// titleSubString is a portion of the title as a string
MATCH (p)-[:ACTED_IN]->(m:Movie)
WHERE m.title CONTAINS $titleSubString
RETURN p.name, m.title
The anchor of the query is the title property of Movie nodes.
The test is CONTAINS. A RANGE index will help this query, but a TEXT index will perform better.
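A TEXT index on the title property could be created as in this sketch, assuming Neo4j 5 syntax. The index name is hypothetical.

```cypher
// Index name is an illustrative assumption.
// TEXT indexes only index string values.
CREATE TEXT INDEX movie_title_text IF NOT EXISTS
FOR (m:Movie) ON (m.title);
```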
Through the remainder of this course, you will have an opportunity to create and use constraints and indexes.
Step 5: Create indexes
After you have loaded the data and identified the indexes you will need, you create the indexes.
As you test your application, an important part is testing the performance of the queries. Use cases for the application may change, so identifying and creating indexes to improve query performance will be an ongoing process during the lifecycle of your application.
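One way to check whether a query uses an index is to profile it and inspect the execution plan. The sketch below reuses the actor query from Step 4. In the plan, an index-backed operator such as NodeIndexSeek, rather than NodeByLabelScan, suggests the index is being used.

```cypher
// Run the query and report the operators and database hits in its plan.
PROFILE
MATCH (p:Person)-[:ACTED_IN]->(m)
WHERE p.name = $aName
RETURN m.title;

// List the indexes that currently exist in the graph.
SHOW INDEXES;
```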
Check your understanding
1. What are the uniqueness constraints?
Refer to the data model shown at the beginning of this lesson.
Before we load the data into the graph, we need to add constraints for some node labels in the graph per our data model:
- The person data has a unique value for the tmdbId field.
- The movie data has a unique value for the movieId field.
- The user data has a field to uniquely identify itself. There may be reviewers with the same name.
- The genre data has a field to uniquely identify itself.
What properties do we define uniqueness constraints for? (select all that apply)
- ✓ Person.tmdbId
- ✓ Movie.movieId
- ❏ Actor.name
- ❏ Director.name
- ✓ User.userId
- ❏ Genre.genreId
- ✓ Genre.name
Hint
You do not define uniqueness constraints for the Actor and Director labels as these nodes are created as Person nodes and a node only needs one unique identifier.
This data model requires four constraints.
Solution
The correct answers are:
- Person.tmdbId
- Movie.movieId
- User.userId
- Genre.name
2. When to create indexes and constraints?
Suppose you are starting with an empty graph and you need to load millions of nodes and relationships into the graph. What is the best practice for when to create indexes and constraints in your graph? (select all that apply)
- ❏ Create all constraints and indexes before you load the data into the graph.
- ❏ Create all constraints and indexes after you load the data into the graph.
- ✓ Create the constraints before you load the data into the graph.
- ✓ Create the indexes after you load the data into the graph.
Hint
You want some checking of data and fast lookups during the loading of the data to prevent duplication of data. Write performance is diminished when indexes need to be maintained.
Solution
The correct answers are:
- Create the constraints before you load the data into the graph.
- Create the indexes after you load the data into the graph.
Summary
In this lesson, you learned how to identify the constraints and indexes you will need in your graph. In the next module, you will learn about creating and using constraints.