Aggregating Data

Aggregation in Cypher

In this lesson you will learn how aggregation works in a Cypher statement and how to work with properties that contain multiple values.

Using count() to aggregate data

Cypher has a count() function that you can use to perform a count of nodes, relationships, paths, rows during query processing. When you aggregate in a Cypher statement, this means that the query must process all patterns in the MATCH clause to complete the aggregation to either return results or perform the next part of the query.

Here is an example:

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE a.name = 'Tom Hanks'
RETURN a.name AS actorName,
count(*) AS numMovies

This query returns the number of movies Tom Hanks acted in.

Here is another example:

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
RETURN a.name AS actorName,
d.name AS directorName,
count(*) AS numMovies
ORDER BY numMovies DESC

In this query we can see that actors and directors worked together in the same movie and how many collaborations there were.

Eager aggregation

When you aggregate in a Cypher statement, the query must process all patterns in the MATCH clause to complete the aggregation to return results. This is called eager aggregation.

Here is an example:

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
RETURN a.name, d.name,
count(*) AS numMovies
ORDER BY numMovies DESC

The query engine processes all nodes and relationships in the pattern so that it can perform a count of all movies for a particular actor/director pair in the graph. Then, it returns the results grouped by the name of the director and actor pair.

For this query:

  • Clint Eastwood acted in and directed 20 movies.

  • Woody Allen acted in and directed 19 movies.

  • Adam Sandler acted in the same 8 movies that Dennis Dugan directed.

Aggregation in Cypher is different from aggregation in SQL. In Cypher, you need not specify a grouping key. As soon as an aggregation function like count() is used, all non-aggregated result columns become grouping keys.

The grouping is implicitly done, based upon the fields in the RETURN clause.

If you specify count(n), the graph engine calculates the number of non-null occurrences of n. If you specify count(*), the graph engine calculates the number of rows retrieved, including those with null values.

Lists in the graph

You have already seen data in the graph that is stored as lists:

cypher
MATCH (m:Movie)
RETURN m.languages AS languages,
m.countries AS countries
LIMIT 10

For this query, notice that these lists are shown in square brackets and each element of the list is separated by a comma. These lists contain a set of strings.

Returning a list

You can return a list by specifying the square brackets:

cypher
MATCH (p:Person)
RETURN p.name, [p.born, p.died] AS lifeTime
LIMIT 10

This query returns the name, and the lifeTime list that includes the born and died values for each person.

Using collect() to create a list

Cypher has a built-in aggregation function, collect() that enables you to aggregate values into a list. The value can be any expression, for instance a property value, a node, or result of a function or operation.

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN a.name AS actor,
count(*) AS total,
collect(m.title) AS movies
ORDER BY total DESC LIMIT 10

This query returns a list of movie titles associated with each actor. The rows are presented such that actors with the greatest number of titles are returned first.

Eliminating duplication in lists

Just as you have learned to eliminate duplication in data returned using DISTINCT, you can also eliminate duplication in a list.

Here is a query that has duplication in its result:

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.year = 1920
RETURN  collect(m.title) AS movies,
collect( a.name) AS actors

When this query executes, it returns duplicate titles in the Movies list.

You can eliminate duplication in the list as follows:

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.year = 1920
RETURN  collect( DISTINCT m.title) AS movies,
collect( a.name) AS actors

Collecting nodes

Rather than collecting the values of the title properties for movies, you can collect the nodes. For this simple query, it is the same visual result as returning m, but for more complex queries, you will learn later in this course that collecting nodes and using them for a later step of the query is useful.

cypher
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name ='Tom Cruise'
RETURN collect(m) AS tomCruiseMovies

This query returns a list of all Movie nodes for Tom Cruise. In table view, you can see that the returned data is a list.

If you were to replace RETURN collect(m) AS `tomCruiseMovies clause with RETURN m, you would see that in table view the returned data is no longer a list.

Accessing elements of a list

You can access particular elements of the list using the [index-value] notation where a list begins with index 0.

In this example we return the first cast member for each movie.

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN m.title AS movie,
collect(a.name)[0] AS castMember,
size(collect(a.name)) as castSize

You can also return a slice of a collection.

cypher
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN m.title AS movie,
collect(a.name)[2..] AS castMember,
size(collect(a.name)) as castSize

This query returns the second to the end of the list names of actors.

You can read about the many ways you can access and work with lists in Cypher such as using head(), tail(), etc. in the Neo4j Cypher Manual.

Other aggregating functions

There are more aggregating functions that may be useful for your application such as:

  • min()

  • max()

  • avg()

  • stddev()

  • sum()

These are described in the Aggregating Functions section of the Neo4j Cypher Manual.

count() versus size()

You can either use count() to count the number of rows, or alternatively, you can return the size of the collected results. The size() function returns the number of elements in a list.

Here we use size() to return the number of elements in the list:

cypher
MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, director.name,
size(collect(m)) AS collaborations,
collect(m.title) AS movies

As you gain more experience with query tuning, you will learn that count() may be more efficient because it gets its values for node counts or relationships from a node from the internal count store of the graph.

If we profile this query (run it at least twice):

cypher
//Query 1
PROFILE MATCH (g:Genre)<-[:IN_GENRE]-(m)
RETURN g.name AS genre,
size(collect(m)) AS numMovies
ORDER BY size(collect(m)) DESC

It returns the same results (table view) as this query. This query uses count():

cypher
//Query 2
PROFILE MATCH (g:Genre)<-[:IN_GENRE]-(m:Movie)
RETURN g.name AS genre,
count(m) AS numMovies
ORDER BY count(m) DESC

The Query 1 has different performance characteristics. You should always test the performance of your queries before you deploy your application.

And you can answer the question, What genre of movies is the highest and how many movies are in that genre?

cypher
MATCH (g:Genre)<-[:IN_GENRE]-(m:Movie)
RETURN g.name AS genre,
count(m) AS numMovies
ORDER BY count(m) DESC LIMIT 1

List comprehension

You can create a list by evaluating an expression that tests for list inclusion.

Here is an example:

cypher
MATCH (m:Movie)
RETURN m.title as movie,
[x IN m.countries WHERE x CONTAINS 'USA' OR x CONTAINS 'Germany']
AS country LIMIT 500

Pattern comprehension

Pattern comprehension is a very powerful way to create lists without changing the cardinality of the query. It behaves like an OPTIONAL MATCH combined with collect().

Here is an example:

cypher
MATCH (m:Movie)
WHERE m.year = 2015
RETURN m.title,
[(dir:Person)-[:DIRECTED]->(m) | dir.name] AS directors,
[(actor:Person)-[:ACTED_IN]->(m) | actor.name] AS actors

Notice that for pattern comprehension we specify the list with the square braces to include the pattern followed by the pipe character to then specify what value will be placed in the list from the pattern.

[<pattern> | value]

This query returns the movies released in 2015. Each row contains the title of the movie, the list of director names, and the list of actor names.

Here is another example of using pattern comprehension to create a list where we specify a filter for the pattern.

cypher
MATCH (a:Person {name: 'Tom Hanks'})
RETURN [(a)-->(b:Movie)
WHERE b.title CONTAINS "Toy" | b.title + ": " + b.year]
AS movies

For this pattern comprehension, the title of the movie is concatenated with the year of the movie as a value to add as an element of the list returned.

Working with maps

A Cypher map is list of key/value pairs where each element of the list is of the format 'key': value. A node or relationship can have a property that is a map.

For example, a map of months and the number of days per month could be:

{Jan: 31, Feb: 28, Mar: 31, Apr: 30 , May: 31, Jun: 30 , Jul: 31, Aug: 31, Sep: 30, Oct: 31, Nov: 30, Dec: 31}

Using this map, we can return the value for one of its elements:

cypher
RETURN {Jan: 31, Feb: 28, Mar: 31, Apr: 30 ,
May: 31, Jun: 30 , Jul: 31, Aug: 31, Sep: 30,
Oct: 31, Nov: 30, Dec: 31}['Feb'] AS daysInFeb

Here we use the key, 'Feb' to access its value.

Alternatively, you can access a value with the '.' notation:

cypher
RETURN {Jan: 31, Feb: 28, Mar: 31, Apr: 30 ,
May: 31, Jun: 30 , Jul: 31, Aug: 31, Sep: 30,
Oct: 31, Nov: 30, Dec: 31}.Feb AS daysInFeb

And you can return a list of keys of a map as follows:

cypher
RETURN keys({Jan: 31, Feb: 28, Mar: 31, Apr: 30 ,
May: 31, Jun: 30 ,Jul: 31, Aug: 31, Sep: 30,
Oct: 31, Nov: 30, Dec: 31}) AS months

A node in the graph, when returned in Neo4j Browser is a map, when displayed as table rows. For example, a Movie node:

cypher
MATCH (m:Movie)
RETURN m LIMIT 1

Read more in the Neo4j Cypher Manual about maps.

Map projections

Map projections are when you can use retrieved nodes to create or return some of the information in the nodes. A Movie node can have the properties title, released, and tagline. Suppose we want to return the Movie node information, but without the tagline property? You can do so using map projections:

cypher
MATCH (m:Movie)
WHERE m.title CONTAINS 'Matrix'
RETURN m { .title, .released } AS movie

Check your understanding

1. List of actors in the movie Toy Story

We want to return the list of names of actors in the movie Toy Story as a single row. What code do you use?

Once you have selected your option, click the Check Results query button to continue.

cypher
MATCH (movie:Movie {title:'Toy Story'})<-[:ACTED_IN]-(actor:Person)
/*select:RETURN collect(actor.name) AS actors*/
  • RETURN actor.name AS actors

  • RETURN collect(actor) AS actors

  • RETURN list(actor.name) AS actors

  • RETURN collect(actor.name) AS actors

Hint

You want to return a single row that contains the names of actors as a list.

Solution

RETURN collect(actor.name) AS actors is the correct answer.

RETURN collect(actor) AS actors returns a list of nodes.

`RETURN actor.name AS actors` will return multiple rows, each with a name of an actor.
`list()` is not a valid Cypher function.

2. Aggregating data

What Cypher function can you use to return the number of elements in a list of Movie nodes, movies?

  • size(collect(movies))

  • size(movies)

  • count(movies)

  • count(collect(movies))

Hint

This function is used to return the size of a list. The variable, movies is a list.

Solution

size(movies) is the correct answer.

size(collect(movies)) is incorrect because movies is already a list so you do not need to create another list.

count() is used to count the number of rows or variables returned, but is not used to return the number of elements in a list.

Summary

In this lesson, you learned how to create lists, how aggregation works in Cypher, counting, and using maps.

You can learn more about Cypher aggregation in the course: Cypher Aggregations

In the next challenge, you will write a query to create a list.