Aggregation in Cypher
In this lesson you will learn how aggregation works in a Cypher statement and how to work with properties that contain multiple values.
Using count()
to aggregate data
Cypher has a count()
function that you can use to perform a count of nodes, relationships, paths, rows during query processing.
When you aggregate in a Cypher statement, this means that the query must process all patterns in the MATCH
clause to
complete the aggregation to either return results or perform the next part of the query.
Here is an example:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE a.name = 'Tom Hanks'
RETURN a.name AS actorName,
count(*) AS numMovies
This query returns the number of movies Tom Hanks acted in.
Here is another example:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
RETURN a.name AS actorName,
d.name AS directorName,
count(*) AS numMovies
ORDER BY numMovies DESC
In this query we can see that actors and directors worked together in the same movie and how many collaborations there were.
Eager aggregation
When you aggregate in a Cypher statement, the query must process all patterns in the MATCH
clause to complete the aggregation to return results.
This is called eager aggregation.
Here is an example:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
RETURN a.name, d.name,
count(*) AS numMovies
ORDER BY numMovies DESC
The query engine processes all nodes and relationships in the pattern so that it can perform a count of all movies for a particular actor/director pair in the graph. Then, it returns the results grouped by the name of the director and actor pair.
For this query:
-
Clint Eastwood acted in and directed 20 movies.
-
Woody Allen acted in and directed 19 movies.
-
Adam Sandler acted in the same 8 movies that Dennis Dugan directed.
Aggregation in Cypher is different from aggregation in SQL.
In Cypher, you need not specify a grouping key.
As soon as an aggregation function like count()
is used, all non-aggregated result columns become grouping keys.
The grouping is implicitly done, based upon the fields in the RETURN
clause.
If you specify count(n)
, the graph engine calculates the number of non-null occurrences of n.
If you specify count(*)
, the graph engine calculates the number of rows retrieved, including those with null values.
Lists in the graph
You have already seen data in the graph that is stored as lists:
MATCH (m:Movie)
RETURN m.languages AS languages,
m.countries AS countries
LIMIT 10
For this query, notice that these lists are shown in square brackets and each element of the list is separated by a comma. These lists contain a set of strings.
Returning a list
You can return a list by specifying the square brackets:
MATCH (p:Person)
RETURN p.name, [p.born, p.died] AS lifeTime
LIMIT 10
This query returns the name, and the lifeTime list that includes the born and died values for each person.
Using collect() to create a list
Cypher has a built-in aggregation function, collect()
that enables you to aggregate values into a list.
The value can be any expression, for instance a property value, a node, or result of a function or operation.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN a.name AS actor,
count(*) AS total,
collect(m.title) AS movies
ORDER BY total DESC LIMIT 10
This query returns a list of movie titles associated with each actor. The rows are presented such that actors with the greatest number of titles are returned first.
Eliminating duplication in lists
Just as you have learned to eliminate duplication in data returned using DISTINCT
, you can also eliminate duplication in a list.
Here is a query that has duplication in its result:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.year = 1920
RETURN collect(m.title) AS movies,
collect( a.name) AS actors
When this query executes, it returns duplicate titles in the Movies list.
You can eliminate duplication in the list as follows:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.year = 1920
RETURN collect( DISTINCT m.title) AS movies,
collect( a.name) AS actors
Collecting nodes
Rather than collecting the values of the title properties for movies, you can collect the nodes. For this simple query, it is the same visual result as returning m, but for more complex queries, you will learn later in this course that collecting nodes and using them for a later step of the query is useful.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name ='Tom Cruise'
RETURN collect(m) AS tomCruiseMovies
This query returns a list of all Movie nodes for Tom Cruise. In table view, you can see that the returned data is a list.
If you were to replace RETURN collect(m) AS `tomCruiseMovies
clause with RETURN m
, you would see that in table view the returned data is no longer a list.
Accessing elements of a list
You can access particular elements of the list using the [index-value]
notation where a list begins with index 0.
In this example we return the first cast member for each movie.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN m.title AS movie,
collect(a.name)[0] AS castMember,
size(collect(a.name)) as castSize
You can also return a slice of a collection.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN m.title AS movie,
collect(a.name)[2..] AS castMember,
size(collect(a.name)) as castSize
This query returns the second to the end of the list names of actors.
You can read about the many ways you can access and work with lists in Cypher such as using head(), tail(), etc. in the Neo4j Cypher Manual.
Other aggregating functions
There are more aggregating functions that may be useful for your application such as:
-
min()
-
max()
-
avg()
-
stddev()
-
sum()
These are described in the Aggregating Functions section of the Neo4j Cypher Manual.
count()
versus size()
You can either use count()
to count the number of rows, or alternatively, you can return the size of the collected results.
The size()
function returns the number of elements in a list.
Here we use size()
to return the number of elements in the list:
MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, director.name,
size(collect(m)) AS collaborations,
collect(m.title) AS movies
As you gain more experience with query tuning, you will learn that count()
may be more efficient because it gets its values for node counts or
relationships from a node from the internal count store of the graph.
If we profile this query (run it at least twice):
//Query 1
PROFILE MATCH (g:Genre)<-[:IN_GENRE]-(m)
RETURN g.name AS genre,
size(collect(m)) AS numMovies
ORDER BY size(collect(m)) DESC
It returns the same results (table view) as this query. This query uses count():
//Query 2
PROFILE MATCH (g:Genre)<-[:IN_GENRE]-(m:Movie)
RETURN g.name AS genre,
count(m) AS numMovies
ORDER BY count(m) DESC
The Query 1 has different performance characteristics. You should always test the performance of your queries before you deploy your application.
And you can answer the question, What genre of movies is the highest and how many movies are in that genre?
MATCH (g:Genre)<-[:IN_GENRE]-(m:Movie)
RETURN g.name AS genre,
count(m) AS numMovies
ORDER BY count(m) DESC LIMIT 1
List comprehension
You can create a list by evaluating an expression that tests for list inclusion.
Here is an example:
MATCH (m:Movie)
RETURN m.title as movie,
[x IN m.countries WHERE x CONTAINS 'USA' OR x CONTAINS 'Germany']
AS country LIMIT 500
Pattern comprehension
Pattern comprehension is a very powerful way to create lists without changing the cardinality of the query.
It behaves like an OPTIONAL MATCH
combined with collect()
.
Here is an example:
MATCH (m:Movie)
WHERE m.year = 2015
RETURN m.title,
[(dir:Person)-[:DIRECTED]->(m) | dir.name] AS directors,
[(actor:Person)-[:ACTED_IN]->(m) | actor.name] AS actors
Notice that for pattern comprehension we specify the list with the square braces to include the pattern followed by the pipe character to then specify what value will be placed in the list from the pattern.
[<pattern> | value]
This query returns the movies released in 2015. Each row contains the title of the movie, the list of director names, and the list of actor names.
Here is another example of using pattern comprehension to create a list where we specify a filter for the pattern.
MATCH (a:Person {name: 'Tom Hanks'})
RETURN [(a)-->(b:Movie)
WHERE b.title CONTAINS "Toy" | b.title + ": " + b.year]
AS movies
For this pattern comprehension, the title of the movie is concatenated with the year of the movie as a value to add as an element of the list returned.
Working with maps
A Cypher map
is list of key/value pairs where each element of the list is of the format 'key': value.
A node or relationship can have a property that is a map.
For example, a map of months and the number of days per month could be:
{Jan: 31, Feb: 28, Mar: 31, Apr: 30 , May: 31, Jun: 30 , Jul: 31, Aug: 31, Sep: 30, Oct: 31, Nov: 30, Dec: 31}
Using this map, we can return the value for one of its elements:
RETURN {Jan: 31, Feb: 28, Mar: 31, Apr: 30 ,
May: 31, Jun: 30 , Jul: 31, Aug: 31, Sep: 30,
Oct: 31, Nov: 30, Dec: 31}['Feb'] AS daysInFeb
Here we use the key, 'Feb'
to access its value.
Alternatively, you can access a value with the '.' notation:
RETURN {Jan: 31, Feb: 28, Mar: 31, Apr: 30 ,
May: 31, Jun: 30 , Jul: 31, Aug: 31, Sep: 30,
Oct: 31, Nov: 30, Dec: 31}.Feb AS daysInFeb
And you can return a list of keys of a map as follows:
RETURN keys({Jan: 31, Feb: 28, Mar: 31, Apr: 30 ,
May: 31, Jun: 30 ,Jul: 31, Aug: 31, Sep: 30,
Oct: 31, Nov: 30, Dec: 31}) AS months
A node in the graph, when returned in Neo4j Browser is a map, when displayed as table rows. For example, a Movie node:
MATCH (m:Movie)
RETURN m LIMIT 1
Read more in the Neo4j Cypher Manual about maps.
Map projections
Map projections are when you can use retrieved nodes to create or return some of the information in the nodes. A Movie node can have the properties title, released, and tagline. Suppose we want to return the Movie node information, but without the tagline property? You can do so using map projections:
MATCH (m:Movie)
WHERE m.title CONTAINS 'Matrix'
RETURN m { .title, .released } AS movie
Check your understanding
1. List of actors in the movie Toy Story
We want to return the list of names of actors in the movie Toy Story as a single row. What code do you use?
Once you have selected your option, click the Check Results query button to continue.
MATCH (movie:Movie {title:'Toy Story'})<-[:ACTED_IN]-(actor:Person)
/*select:RETURN collect(actor.name) AS actors*/
-
❏
RETURN actor.name AS actors
-
❏
RETURN collect(actor) AS actors
-
❏
RETURN list(actor.name) AS actors
-
✓
RETURN collect(actor.name) AS actors
Hint
You want to return a single row that contains the names of actors as a list.
Solution
RETURN collect(actor.name) AS actors
is the correct answer.
RETURN collect(actor) AS actors
returns a list of nodes.
`RETURN actor.name AS actors` will return multiple rows, each with a name of an actor.
`list()` is not a valid Cypher function.
2. Aggregating data
What Cypher function can you use to return the number of elements in a list of Movie nodes, movies?
-
❏
size(collect(movies))
-
✓
size(movies)
-
❏
count(movies)
-
❏
count(collect(movies))
Hint
This function is used to return the size of a list. The variable, movies is a list.
Solution
size(movies)
is the correct answer.
size(collect(movies))
is incorrect because movies is already a list so you do not need to create another list.
count()
is used to count the number of rows or variables returned, but is not used to return the number of elements in a list.
Summary
In this lesson, you learned how to create lists, how aggregation works in Cypher, counting, and using maps.
You can learn more about Cypher aggregation in the course: Cypher Aggregations
In the next challenge, you will write a query to create a list.