What can a list contain?
A list is an array that contains elements. The elements in a list need not all be of the same type. Here is an example:
MATCH (m:Movie)
RETURN [m.title, m.released, date().year - date(m.released).year + 1 ]This query returns one row for each movie in the graph. The data in each row is a list that contains the title of the movie, the year the movie was released, and the number of years ago the movie was released.
Most queries use collect() to create lists, but there are other functions you can use that also create lists, such as nodes().
You can use nodes() to return a list of nodes in a path.
The list you create can be something that you return. In this query we return the collection of movie nodes for each actor:
MATCH (a:Actor)--(m:Movie)
WITH a, collect (m) AS Movies
RETURN a AS Actor, Movies LIMIT 10If you view the results using table view, you will see that each movie node for an actor is shown as a list enclosed with '[…]'.
If you are just interested in returning the list of titles of movies, it is more efficient to extract the title property from the node, rather than returning the list of nodes as follows:
MATCH (a:Actor)--(m:Movie)
WITH a, collect (m.title) AS Movies
RETURN a.name AS Actor, Movies LIMIT 10How collect() works
collect() returns a list of elements. You can use collect() to create a list at any time during your query.
When you create a list during a query, aggregation occurs.
Grouping key
During aggregation, the graph engine typically groups data based upon some value in a row. Here is a query where each row is returned by the name of the actor and the aggregation is done for each actor. The grouping key is a.name. That is, the non-aggregated value is the grouping key.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN a.name AS Actor, collect(m.title) AS Movies LIMIT 100Here is another example:
MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name AS Actor, director.name AS Director,
count(m) AS Collaborations, collect(m.title) AS MoviesIn this example, both director.name and actor.name are grouping keys.
Using collect() examples
Here is an example where collect() is used to return aggregated data.
Execute this query to see the query plan:
PROFILE
MATCH (m:Movie {title:'Jupiter Ascending'})
MATCH (d:Person)-[:DIRECTED]->(m)
MATCH (a:Person)-[:ACTED_IN]->(m)
RETURN m.title AS Title, collect(DISTINCT a.name) AS Actors,
collect(DISTINCT d.name) AS DirectorsWhat you must remember is that each MATCH in a query returns rows that are operated on by the next step of the query.
In this query:
-
The first
MATCHreturns one row, for the movie, Jupiter Ascending. -
The second
MATCHuses the row from the previousMATCHand then returns two rows, one for each person who directed the movie. -
The third
MATCHuses the two rows from the previousMATCHand then returns eight rows. This is because for each row for the movie, Jupiter Ascending, there are four actors. Aggregation (OrderedAggregation step in the query plan) occurs during theRETURNclause. -
All names of actors are collected or aggregated, as well as all names of directors. When aggregation occurs, the graph engine selects a grouping key. In this example the movie title is the grouping key. All data collected into the Actors and Directors will be grouped by the movie title.
-
In the
RETURNwe also specifyDISTINCTso that we will not have duplicate names in the lists returned. If we had not specifiedDISTINCTfor the aggregation, we would have seen multiple actor names, one for each row in the thirdMATCHclause.
You can imagine that in a graph with many relationships, a query tuning goal should be to reduce the number of rows processed in the query (looking at the query plan). Even if you perform a single MATCH, you will see that the same query plan is used:
PROFILE
MATCH (d:Person)-[:DIRECTED]->(m:Movie {title:'Jupiter Ascending'})<-[:ACTED_IN]-(a:Person)
RETURN m.title AS Title, collect(DISTINCT a.name) AS Actors,
collect(DISTINCT d.name) AS DirectorsTo reduce the number of rows processed, we can move the aggregation up in the query as follows:
PROFILE
MATCH (m:Movie {title:'Jupiter Ascending'})
MATCH (d:Person)-[:DIRECTED]->(m)
WITH m, collect (d.name) AS Directors
MATCH (a:Person)-[:ACTED_IN]->(m)
RETURN m.title AS Title, collect(a.name) AS Actors, DirectorsAfter the second MATCH we collect the row into a list.
Notice that in the query plan the EagerAggregation operator executes.
We pass on the Directors list, along with the Movie node for the next MATCH.
Notice that in the query plan, only one row is passed into the third MATCH clause.
We no longer need to use DISTINCT because there is no longer duplicate data in the rows.
In your query tuning, you should strive to reduce the number of rows processed during a query.
Collecting nodes
You can also collect nodes so that you may use the list of nodes for later processing in the query. Here is a query where we create Language nodes from data in the Movie nodes:
MATCH (m:Movie)
UNWIND m.languages AS language
WITH language, collect(m) AS movies
MERGE (l:Language {name:language})
WITH l, movies
UNWIND movies AS m
WITH l,m
MERGE (l)<-[:SPEAKS]-(m)
// NOTE: to undo these merges:
// MATCH (x:Language) DETACH DELETE xHere you see in this code the following:
-
The first
MATCHretrieves all Movie nodes. -
Each value in the languages property is returned.
-
We then collect the movie nodes and use the value of language as the grouping key. That is all of the movies with a value of English will be collected into a list.
-
We then create the Language node using the language value.
-
With the language value and the list of Movie nodes for that language, we unwind the movie nodes into rows.
-
With each language and row for each Movie, we create the relationship between the Language node and the Movie node.
This code collects nodes using the language value so that each node can be used to create the relationship.
In this type of query, you typically use collect() to collect nodes and UNWIND lists so they can be processed later in the query.
Collecting relationships
Here is an example where we collect relationships, then unwind them to query and return the correct values:
MATCH (u:User {name: "Misty Williams"})-[x]->()
WITH collect(x) AS ratings
UNWIND ratings AS r
WITH r WHERE r.rating <= 1
RETURN r.rating AS Rating, endNode(r).title AS TitleNotice here that we use endNode() to return the node at the end of the relationship.
Collecting paths in a query
Here is an example where we perform a query and use the nodes() function to return the list of all nodes in a path:
// Movies in all paths that begin with Elvis and end with an Actor that are 4 hops away from Elvis
MATCH path = (p:Person {name: 'Elvis Presley'})-[*4]-(a:Actor)
WITH nodes(path) AS n
UNWIND n AS x
WITH x WHERE x:Movie
RETURN DISTINCT x.titleIn the above query, there is a mixture of Person and Movie nodes in the path.
We specify a criteria x:Movie that we only want to return Movie nodes.
collect() vs. subquery
A subquery is a another way to aggregate data for later processing in your queries. Suppose we have this query which returns a list of at most three actors for every movie:
PROFILE MATCH (m:Movie)<-[:ACTED_IN]-(p)
WITH m, collect(p.name) AS Actors
WHERE size(Actors) <= 3
RETURN m.title AS Movie, ActorsThis query returns 208 movies that have 3 or fewer actors.
The query can be rewritten as follows using a subquery:
PROFILE
CALL {
MATCH (m:Movie)<-[:ACTED_IN]-(p)
WITH m , collect(p.name) as Actors
WHERE size(Actors)<= 3
RETURN m.title as Movie, Actors
}
RETURN Movie, ActorsBoth queries perform the same.
Check your understanding
1. Aggregating data
Which Cypher function is most commonly used to aggregate data into a list?
-
❏
count() -
❏
aggregate() -
✓
collect() -
❏
gather()
Hint
This Cypher function is typically used to collect data as a list.
Solution
You use the collect() function to create a list. A list typically contains elements of the same type, but this is not a requirement.
2. Ordering results returned
Here is a query that returns values to the client for movies released in 2000 and 2001.
MATCH (m:Movie)
WHERE 2000 <= m.year <= 2001
MATCH (d:Person)-[:DIRECTED]->(m)
MATCH (a:Person)-[:ACTED_IN]->(m)
RETURN m.year AS Year, m.title AS Title, m.released AS ReleaseDate, collect(DISTINCT a.name) AS Actors,
collect(DISTINCT d.name) AS DirectorsThe results are returned so that all rows for the year 2000 are returned first, then all rows for the year 2001. When results are returned for aggregated results such as Actors and Directors, what is m.year considered?
-
❏ Sort key
-
✓ Grouping key
-
❏ Sort property
-
❏ Primary key
Hint
In a Cypher RETURN clause this key is used to sort rows by a common value.
Solution
The correct answer is Grouping key.
Summary
In this lesson, you reviewed and learned more about aggregating data in your Cypher queries.
In the next challenge, you will create a query to aggregate using collect().