Query Patterns and Performance

Patterns in the graph

The Neo4j graph engine is implemented to traverse relationships very quickly. As you become more experienced with Cypher, you will learn that there are often multiple ways to write a query that returns the same results. The difference between such queries is typically their traversal performance. In this lesson, you will begin learning about graph traversal and query performance.

A pattern is a combination of nodes and relationships that is used to traverse the graph at runtime. You can write queries that test whether a pattern exists in the graph.

Here is an example:

cypher
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE  p.name = 'Tom Hanks'
AND exists {(p)-[:DIRECTED]->(m)}
RETURN p.name, labels(p), m.title

This query:

  1. Retrieves the anchor of the query, the Tom Hanks :Person node.

  2. Follows the :ACTED_IN relationships from that node to :Movie nodes.

  3. For each Movie node, tests whether it is also related to the Person node by a :DIRECTED relationship.

  4. Returns the row if the relationship exists.

This exists { } test is done for every Movie node related to Tom Hanks as an actor. This query returns the single movie that Tom Hanks directed and acted in.
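To get a feel for how many times that test runs, you could count the ACTED_IN matches on their own. This is only an illustrative sketch against the same graph used in this lesson; the alias numMovies is a name chosen here for illustration.

```cypher
// Count the Movie nodes reached from Tom Hanks through ACTED_IN.
// Each of these matches is a row that the exists { } test must check.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN count(m) AS numMovies
```

The count gives the number of candidate rows that are produced before the existence test filters them.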

Profiling queries

You can prefix a query with the PROFILE keyword to see the number of rows retrieved from the graph at each step of the query.

cypher
PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE  p.name = 'Tom Hanks'
AND exists {(p)-[:DIRECTED]->(m)}
RETURN m.title

In the profile, you can see that the initial row (the Tom Hanks node) is retrieved, and then 38 rows are retrieved, one for each Movie that Tom Hanks acted in. The test for the :DIRECTED relationship is then applied to each of those rows.

Here is a better way to write the same query. You have seen this query before.

cypher
PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(p)
WHERE  p.name = 'Tom Hanks'
RETURN  m.title

The query:

  1. Retrieves the anchor (the Tom Hanks Person node).

  2. Finds a Movie node that Tom Hanks is related to by the ACTED_IN relationship.

  3. Traverses the DIRECTED relationships on that Movie node that connect back to the same Tom Hanks node.

This traversal is very efficient because the graph engine can take the internal relationship cardinalities into account. If you execute this query, it returns the same result as the previous query: the movie title Larry Crowne.

Notice, however, that this query is much more efficient. It retrieves one row and then two rows, which is much less data than the first query. Note that the performance of queries that use patterns depends on the data model for your graph and on the number of nodes in the traversal.

The difference between EXPLAIN and PROFILE is that EXPLAIN provides estimates for the query steps without running the query, whereas PROFILE executes the query and reports the actual steps and number of rows retrieved. Provided you are simply querying the graph and not updating anything, it is fine to execute the query multiple times using PROFILE. In fact, as part of query tuning, you should execute the query at least twice, because the first execution involves generating the execution plan, which is then cached. That is, the first PROFILE of a query will be more expensive than subsequent executions.
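To compare the two keywords, the same query can be prefixed with EXPLAIN instead of PROFILE. A minimal sketch, assuming the same graph used throughout this lesson:

```cypher
// EXPLAIN produces the execution plan with estimated rows only;
// the query itself is not executed, so no data is returned.
EXPLAIN MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(p)
WHERE p.name = 'Tom Hanks'
RETURN m.title
```

Because EXPLAIN does not run the query, it can also be a safe first look at a statement that would update the graph.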

Query tuning is beyond the scope of this course, but it is important to profile your queries so that you can optimize the queries of your application. The metric that is typically a good measure of query performance is the number of db hits.

Finding non-patterns

You’ve just seen that, in the graph for this course, testing for the existence of a pattern with exists { } is less efficient than matching the pattern directly. This may or may not be the case for your graph, depending on your data model. There is, however, a scenario where using exists { } for a pattern is useful: you use NOT exists { } to exclude patterns in the graph.

We want to find all the movies that Tom Hanks acted in, but did not direct.

Here is the best way to do this:

cypher
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE  p.name = 'Tom Hanks'
AND NOT exists {(p)-[:DIRECTED]->(m)}
RETURN  m.title

Here we want to exclude the movies that Tom Hanks is related to by a :DIRECTED relationship. If you profile this query, you will find that it is not especially performant, but an existence test of this kind is needed to exclude the pattern.
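To see the cost for yourself, you can prefix the exclusion query with PROFILE. This is simply the query above with the keyword added, shown as a sketch; the row counts and db hits you see will depend on your graph.

```cypher
// Profile the exclusion query to inspect rows and db hits per step.
PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
AND NOT exists {(p)-[:DIRECTED]->(m)}
RETURN m.title
```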

Check your understanding

1. Testing if a pattern exists in the graph

We want to return the movies that Clint Eastwood acted in and directed.

How would you complete this query?

cypher
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE  p.name = "Clint Eastwood"
/* select one of the options below */
RETURN  m.title

  • AND exists { (p)-[:DIRECTED]->(m) }

  • AND NOT exists { (p)-[:DIRECTED]->(m) }

  • AND p:Director

  • AND p:Director AND p:Actor

Hint

Does the DIRECTED relationship exist between the Clint Eastwood node and the Movie node?

Solution

The correct answer is: AND exists { (p)-[:DIRECTED]->(m) }. We need to check that the relationship exists, so NOT exists is incorrect.

Testing for the Director label, or for both the Director and Actor labels, is not sufficient because labels describe the person, not the specific movies. We are focusing on the movies that Clint Eastwood both acted in and directed.

2. Query performance

What Cypher keyword helps you to understand the performance of a query when it runs?

  • INSPECT

  • MEASURE

  • EXPLAIN

  • PROFILE

Hint

You prepend your queries with this keyword and it shows things such as db hits when the query executes.

Solution

The correct answer is: PROFILE. It provides both the execution plan and the db hits when the query executes.

EXPLAIN only provides the query plan.

MEASURE and INSPECT are not valid Cypher keywords.

Summary

In this lesson, you began to learn about patterns in the graph and how to measure the performance of a query.

In the next challenge, you will write a query that uses exists { } to exclude part of the graph.
