Creating Lists

This video was recorded using an earlier version of the sandbox (Neo4j 4.x). You are using a Neo4j 5.x sandbox for this course. You will see some differences in the PROFILE results for db hits and elapsed time in Neo4j 5.4.

What can a list contain?

A list is an array that contains elements. The elements in a list need not all be of the same type. Here is an example:

cypher
MATCH (m:Movie)
RETURN [m.title, m.released, date().year - date(m.released).year + 1 ]

This query returns one row for each movie in the graph. The data in each row is a list that contains the title of the movie, the year the movie was released, and the number of years ago the movie was released.

Most queries use collect() to create lists, but there are other functions you can use that also create lists, such as nodes(). You can use nodes() to return a list of nodes in a path.

The list you create can be something that you return. In this query we return the collection of movie nodes for each actor:

cypher
MATCH (a:Actor)--(m:Movie)
WITH a, collect (m) AS Movies
RETURN a AS Actor, Movies LIMIT 10

If you view the results using table view, you will see that each movie node for an actor is shown as a list enclosed with '[…​]'.

If you are just interested in returning the list of titles of movies, it is more efficient to extract the title property from the node, rather than returning the list of nodes as follows:

cypher
MATCH (a:Actor)--(m:Movie)
WITH a, collect (m.title) AS Movies
RETURN a.name AS Actor, Movies LIMIT 10

How collect() works

collect() returns a list of elements. You can use collect() to create a list at any time during your query. When you create a list during a query, aggregation occurs.

Grouping key

During aggregation, the graph engine typically groups data based upon some value in a row. Here is a query where each row is returned by the name of the actor and the aggregation is done for each actor. The grouping key is a.name. That is, the non-aggregated value is the grouping key.

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN a.name AS Actor, collect(m.title) AS Movies LIMIT 100

Here is another example:

cypher
MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name AS Actor, director.name AS Director,
count(m) AS Collaborations, collect(m.title) AS Movies

In this example, both director.name and actor.name are grouping keys.

Using collect() examples

Here is an example where collect() is used to return aggregated data. Execute this query to see the query plan:

cypher
PROFILE
MATCH (m:Movie {title:'Jupiter Ascending'})
MATCH (d:Person)-[:DIRECTED]->(m)
MATCH (a:Person)-[:ACTED_IN]->(m)
RETURN m.title AS Title, collect(DISTINCT a.name) AS Actors,
collect(DISTINCT d.name) AS Directors

What you must remember is that each MATCH in a query returns rows that are operated on by the next step of the query.

In this query:

  1. The first MATCH returns one row, for the movie, Jupiter Ascending.

  2. The second MATCH uses the row from the previous MATCH and then returns two rows, one for each person who directed the movie.

  3. The third MATCH uses the two rows from the previous MATCH and then returns eight rows. This is because for each row for the movie, Jupiter Ascending, there are four actors. Aggregation (OrderedAggregation step in the query plan) occurs during the RETURN clause.

  4. All names of actors are collected or aggregated, as well as all names of directors. When aggregation occurs, the graph engine selects a grouping key. In this example the movie title is the grouping key. All data collected into the Actors and Directors will be grouped by the movie title.

  5. In the RETURN we also specify DISTINCT so that we will not have duplicate names in the lists returned. If we had not specified DISTINCT for the aggregation, we would have seen multiple actor names, one for each row in the third MATCH clause.

You can imagine that in a graph with many relationships, a query tuning goal should be to reduce the number of rows processed in the query (looking at the query plan). Even if you perform a single MATCH, you will see that the same query plan is used:

cypher
PROFILE
MATCH (d:Person)-[:DIRECTED]->(m:Movie {title:'Jupiter Ascending'})<-[:ACTED_IN]-(a:Person)
RETURN m.title AS Title, collect(DISTINCT a.name) AS Actors,
collect(DISTINCT d.name) AS Directors

To reduce the number of rows processed, we can move the aggregation up in the query as follows:

cypher
PROFILE
MATCH (m:Movie {title:'Jupiter Ascending'})
MATCH (d:Person)-[:DIRECTED]->(m)
WITH m, collect (d.name) AS Directors
MATCH (a:Person)-[:ACTED_IN]->(m)
RETURN m.title AS Title, collect(a.name) AS Actors, Directors

After the second MATCH we collect the row into a list. Notice that in the query plan the EagerAggregation operator executes. We pass on the Directors list, along with the Movie node for the next MATCH. Notice that in the query plan, only one row is passed into the third MATCH clause. We no longer need to use DISTINCT because there is no longer duplicate data in the rows.

In your query tuning, you should strive to reduce the number of rows processed during a query.

Collecting nodes

You can also collect nodes so that you may use the list of nodes for later processing in the query. Here is a query where we create Language nodes from data in the Movie nodes:

cypher
MATCH (m:Movie)
UNWIND m.languages AS language
WITH language, collect(m) AS movies
MERGE (l:Language {name:language})
WITH l, movies
UNWIND movies AS m
WITH l,m
MERGE (l)<-[:SPEAKS]-(m)
// NOTE: to undo these merges:
// MATCH (x:Language) DETACH DELETE x

Here you see in this code the following:

  1. The first MATCH retrieves all Movie nodes.

  2. Each value in the languages property is returned.

  3. We then collect the movie nodes and use the value of language as the grouping key. That is all of the movies with a value of English will be collected into a list.

  4. We then create the Language node using the language value.

  5. With the language value and the list of Movie nodes for that language, we unwind the movie nodes into rows.

  6. With each language and row for each Movie, we create the relationship between the Language node and the Movie node.

This code collects nodes using the language value so that each node can be used to create the relationship. In this type of query, you typically use collect() to collect nodes and UNWIND lists so they can be processed later in the query.

Collecting relationships

Here is an example where we collect relationships, then unwind them to query and return the correct values:

cypher
MATCH (u:User {name: "Misty Williams"})-[x]->()
WITH collect(x) AS ratings
UNWIND ratings AS r
WITH r WHERE r.rating <= 1
RETURN r.rating AS Rating, endNode(r).title AS Title

Notice here that we use endNode() to return the node at the end of the relationship.

Collecting paths in a query

Here is an example where we perform a query and use the nodes() function to return the list of all nodes in a path:

cypher
// Movies in all paths that begin with Elvis and end with an Actor that are 4 hops away from Elvis
MATCH path = (p:Person {name: 'Elvis Presley'})-[*4]-(a:Actor)
WITH nodes(path) AS n
UNWIND n AS x
WITH x WHERE x:Movie
RETURN DISTINCT x.title

In the above query, there is a mixture of Person and Movie nodes in the path. We specify a criteria x:Movie that we only want to return Movie nodes.

collect() vs. subquery

A subquery is a another way to aggregate data for later processing in your queries. Suppose we have this query which returns a list of at most three actors for every movie:

cypher
PROFILE MATCH (m:Movie)<-[:ACTED_IN]-(p)
WITH m, collect(p.name) AS Actors
WHERE size(Actors) <= 3
RETURN m.title AS Movie, Actors

This query returns 208 movies that have 3 or fewer actors.

The query can be rewritten as follows using a subquery:

cypher
PROFILE
CALL {
MATCH (m:Movie)<-[:ACTED_IN]-(p)
WITH  m , collect(p.name) as Actors
WHERE size(Actors)<= 3
RETURN  m.title as Movie, Actors
}
RETURN Movie, Actors

Both queries perform the same.

Check your understanding

1. Aggregating data

Which Cypher function is most commonly used to aggregate data into a list?

  • count()

  • aggregate()

  • collect()

  • gather()

Hint

This Cypher function is typically used to collect data as a list.

Solution

You use the collect() function to create a list. A list typically contains elements of the same type, but this is not a requirement.

2. Ordering results returned

Here is a query that returns values to the client for movies released in 2000 and 2001.

cypher
MATCH (m:Movie)
WHERE  2000 <= m.year <= 2001
MATCH (d:Person)-[:DIRECTED]->(m)
MATCH (a:Person)-[:ACTED_IN]->(m)
RETURN m.year AS Year, m.title AS Title, m.released  AS ReleaseDate, collect(DISTINCT a.name) AS Actors,
collect(DISTINCT d.name) AS Directors

The results are returned so that all rows for the year 2000 are returned first, then all rows for the year 2001. When results are returned for aggregated results such as Actors and Directors, what is m.year considered?

  • ❏ Sort key

  • ✓ Grouping key

  • ❏ Sort property

  • ❏ Primary key

Hint

In a Cypher RETURN clause this key is used to sort rows by a common value.

Solution

The correct answer is Grouping key.

Summary

In this lesson, you reviewed and learned more about aggregating data in your Cypher queries. In the next challenge, you will create a query to aggregate using collect().