Patterns in the graph
The Neo4j graph engine is implemented to traverse relationships very quickly. As you become more experienced with Cypher queries, you will soon learn that there are multiple ways to write a query that returns the same results. The difference between such queries is typically their traversal performance. In this lesson, you will begin learning about graph traversal and query performance.
A pattern is a combination of nodes and relationships that is used to traverse the graph at runtime. You can write queries that test whether a pattern exists in the graph.
Here is an example:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
AND exists {(p)-[:DIRECTED]->(m)}
RETURN p.name, labels(p), m.title
This query:
- Retrieves the anchor of the query, the Tom Hanks :Person node.
- It then follows the :ACTED_IN relationship to a :Movie node.
- Then, for the Movie node and Person node, it tests whether these nodes are related by the DIRECTED relationship.
- If they are, then the row is returned.
This exists { } test is done for every Movie node related to Tom Hanks as an actor.
This query returns the single movie that Tom Hanks directed and acted in.
Profiling queries
You can use the PROFILE keyword to show the total number of rows retrieved from the graph in the query.
PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
AND exists {(p)-[:DIRECTED]->(m)}
RETURN m.title
In the profile, you can see that the initial row is retrieved, but then 38 rows are retrieved, one for each Movie that Tom Hanks acted in.
Then the test is done for the :DIRECTED relationship.
Here is a better way to write the same query. It is a query that you have seen before.
PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(p)
WHERE p.name = 'Tom Hanks'
RETURN m.title
The query:
- Retrieves the anchor (the Tom Hanks Person node).
- It then finds a Movie node that Tom Hanks is related to by the ACTED_IN relationship.
- It then traverses all DIRECTED relationships that point to the same Tom Hanks node.
This traversal is very efficient because the graph engine can take the [internal] relationship cardinalities into account. If you execute this query, it returns the same result as the previous query: the movie title Larry Crowne.
Notice, however, that this query is much more efficient. It retrieves one row, then two rows, which is much less data than the first query. Note that the performance of queries that use patterns depends on the data model for your graph and also on the number of nodes in the traversal.
The difference between using EXPLAIN and PROFILE is that EXPLAIN provides estimates of the query steps, whereas PROFILE provides the exact steps and number of rows retrieved for the query.
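As a minimal sketch of the difference, the query from this lesson can be prefixed with EXPLAIN instead of PROFILE. With EXPLAIN the query is planned but not executed, so no rows are returned and only the estimated plan is shown. The example assumes the same movie graph used throughout this lesson.

```cypher
// EXPLAIN shows the estimated execution plan only; the query itself is not run
EXPLAIN MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(p)
WHERE p.name = 'Tom Hanks'
RETURN m.title
```

Swapping EXPLAIN for PROFILE on the same query runs it and reports the actual rows and db hits for each step.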
Provided you are simply querying the graph and not updating anything, it is fine to execute the query multiple times using PROFILE.
In fact, as part of query tuning, you should execute the query at least twice, because the first execution involves the generation of the execution plan, which is then cached.
That is, the first PROFILE of a query will typically take longer than subsequent executions.
Query tuning is beyond the scope of this course, but it is important to profile your queries so that you can optimize the queries of your application. The metric that is typically a good measure of query performance is the number of db hits.
Finding non-patterns
You’ve just seen that matching a pattern and then testing for the existence of another pattern is not optimal in the graph used for this course.
This may or may not be the case, depending on your data model.
There is a scenario where using exists { } for a pattern is useful.
You use NOT exists { } to exclude patterns in the graph.
We want to find all the movies that Tom Hanks acted in, but did not direct.
Here is the best way to do this:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
AND NOT exists {(p)-[:DIRECTED]->(m)}
RETURN m.title
Here we want to exclude the :DIRECTED relationships to movies for Tom Hanks.
If you profile this query, you will find that it is not performant, but it is the only way to perform this query.
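To see this for yourself, a sketch of the profiled version is simply the same query with PROFILE prepended, run against the same movie graph assumed throughout this lesson:

```cypher
// Same exclusion query, prefixed with PROFILE to show rows and db hits per step
PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
AND NOT exists {(p)-[:DIRECTED]->(m)}
RETURN m.title
```

The existence test is evaluated for each Movie row reached through :ACTED_IN, so you can compare the rows and db hits here with those of the earlier profiled queries.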
Check your understanding
1. Testing if a pattern exists in the graph
We want to return the movies that Clint Eastwood acted in and directed.
How would you complete this query?
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = "Clint Eastwood"
/*select:AND exists {(p)-[:DIRECTED]->(m)}*/
RETURN m.title
- ✓ AND exists { (p)-[:DIRECTED]->(m) }
- ❏ AND NOT exists { (p)-[:DIRECTED]->(m) }
- ❏ AND p:Director
- ❏ AND p:Director AND p:Actor
Hint
Does the DIRECTED relationship exist between the Clint Eastwood node and the Movie node?
Solution
The correct answer is: AND exists { (p)-[:DIRECTED]->(m) }.
We need to check if the relationship exists, so NOT exists is incorrect.
Testing for the Director label, or for both the Director and Actor labels, is not sufficient because we are focussing on movies that Clint Eastwood both acted in and directed.
2. Query performance
What Cypher keyword helps you to understand the performance of a query when it runs?
- ❏ INSPECT
- ❏ MEASURE
- ❏ EXPLAIN
- ✓ PROFILE
Hint
You prepend your queries with this keyword and it shows things such as db hits when the query executes.
Solution
The correct answer is: PROFILE. It provides both the execution plan and the db hits when the query executes.
EXPLAIN only provides the query plan.
MEASURE and INSPECT are not valid Cypher keywords.
Summary
In this lesson, you began to learn about patterns in the graph and how to measure the performance of a query.
In the next challenge, you will write a query that uses exists { } to exclude part of the graph.