What does pipelining mean?
You have learned that you can use the WITH
clause to redefine the scope for the query:
WITH 'Tom Hanks' AS theActor
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = theActor
WITH m LIMIT 5
// possibly do more with the five m nodes
RETURN m.title AS movies
This query limits the query to retrieve five nodes.
You can do something with the five nodes by adding another MATCH
clause:
WITH 'Tom Hanks' AS theActor
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = theActor
WITH m LIMIT 5
MATCH (d:Person)-[:DIRECTED]->(m)
RETURN d.name AS director,
m.title AS movies
The five Movie nodes are used for the second MATCH
clause and five rows are returned.
The WITH
clause enables us to pipeline the results of the first query into the second query.
With this simple query, m need not be redefined or scoped.
Next you will learn how pipelining is really useful.
Using WITH for aggregation
A very common use of WITH
is to aggregate data so that intermediate results can be used to create the final returned data or to pass the intermediate results to the
next part of the query.
Here is an example:
MATCH (:Movie {title: 'Toy Story'})-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(m)
WHERE m.imdbRating IS NOT NULL
WITH g.name AS genre,
count(m) AS moviesInCommon,
sum(m.imdbRating) AS total
RETURN genre, moviesInCommon,
total/moviesInCommon AS score
ORDER By score DESC
In this query we are counting the number of movies that share the same Genre node.
We use count() to count the number of rows and sum() to total the imdbRating for each movie for the Genre.
These values are calculated as part of the aggregation and then used to return the data.
In the WITH
clause we pass on only the values we need to return a row for the Genre.
Here is another example that shows aggregation and pipelining:
MATCH (u:User {name: "Misty Williams"})-[r:RATED]->(:Movie)
WITH u, avg(r.rating) AS average
MATCH (u)-[r:RATED]->(m:Movie)
WHERE r.rating > average
RETURN average , m.title AS movie,
r.rating as rating
ORDER BY rating DESC
For this query, we first calculate the average rating for all movies that Misty Williams rated.
Then we use this calculated value, average as a test for the second MATCH
.
Using WITH for collecting
Another common use for the WITH
clause is to collect results into a list that will be returned:
MATCH (m:Movie)--(a:Actor)
WHERE m.title CONTAINS 'New York'
WITH m, collect (a.name) AS actors,
count(*) AS numActors
RETURN m.title AS movieTitle, actors
ORDER BY numActors DESC
This query collects the names of actors who acted in the movies containing the string 'New York'. This aggregation collects the names and totals the number of actors.
Here is another example where we perform a 2-step aggregation for collecting a list of maps:
MATCH (m:Movie)<-[:ACTED_IN]-(a:Actor)
WHERE m.title CONTAINS 'New York'
WITH m, collect (a.name) AS actors,
count(*) AS numActors
ORDER BY numActors DESC
RETURN collect(m { .title, actors, numActors }) AS movies
Using LIMIT early
A best practice is to execute queries that minimize the number of rows processed in the query. One way to do that is to limit early in the query. This also helps in reducing the number of properties loaded from the database too early.
For example, this query limits during the aggregation:
PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.born.year = 1980
WITH p,
collect(m.title) AS movies LIMIT 3
RETURN p.name AS actor, movies
A better way to do this query is to limit early. This query will perform slightly better because LIMIT is moved up in the query:
PROFILE MATCH (p:Actor)
WHERE p.born.year = 1980
WITH p LIMIT 3
MATCH (p)-[:ACTED_IN]->(m:Movie)
WITH p, collect(m.title) AS movies
RETURN p.name AS actor, movies
Use DISTINCT when necessary
Here is an example of a query that retrieves three actors, then collects the names of the genres for the movies that actor acted in.
MATCH (p:Actor)
WHERE p.born.year = 1980
WITH p LIMIT 3
MATCH (p)-[:ACTED_IN]->(m:Movie)-[:IN_GENRE]->(g:Genre)
WITH p, collect(g.name) AS genres
RETURN p.name AS actor, genres
Notice that for this query, the collected genre names are repeated. You would want to ensure that they are not duplicated as follows:
MATCH (p:Actor)
WHERE p.born.year = 1980
WITH p LIMIT 3
MATCH (p)-[:ACTED_IN]->(m:Movie)-[:IN_GENRE]->(g:Genre)
WITH p, collect(DISTINCT g.name) AS genres
RETURN p.name AS actor, genres
Check your understanding
Limiting results
Here is a query to return the names of Directors who directed movies that Keanu Reeves acted in. How do you specify that you want to limit the number of rows returned to 3?
Once you have selected your option, click the Check Results query button to continue.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Keanu Reeves'
/*select:WITH m LIMIT 3*/
MATCH (d:Person)-[:DIRECTED]->(m)
RETURN collect(d.name) AS directors,
m.title AS movies
-
❏
WITH LIMIT 3
-
✓
WITH m LIMIT 3
-
❏
AND count(m) = 3
-
❏
LIMIT 3
Hint
The WITH
clause must carry forward the m variable and limit how many movie nodes are carried forward.
Solution
The correct answer is: WITH m LIMIT 3
The variable, m must be carried forward so it can be used in the RETURN clause.
2. Cypher for aggregating
What Cypher functions/keywords are used to aggregate intermediate results?
-
❏
LIMIT
-
✓
count()
-
✓
collect()
-
❏
DISTINCT
Hint
These two functions aggregate data in the query.
Solution
The correct answers are: count()
and collect()
LIMIT
is used to limit the number of rows processed, but it does not aggregate data.
DISTINCT
is used to ensure that a row or property value is not repeated in the result.
Summary
In this lesson, you learned how WITH
is used to create intermediate results that can be used for pipelining queries.
In the next challenge, you will write a query that uses WITH
to provide intermediate results for further query processing.