Understanding graph data science

Now that you understand what graph data science is, let’s explore what makes it different from traditional data science approaches.

How graph databases differ from traditional databases

Most data science workflows store and analyze tabular data.

In a tabular database, entities live in rows and their properties live in columns. To access relationships between entities, you need explicit joins or reshaped tables. That reshaping isn’t fast, and it isn’t computationally efficient either.

Graph databases take a different approach: they store relationships with the same level of importance as entities.

[Figure: a WORKS_AT relationship between Michael and Neo4j, and a FOUNDED_IN relationship between Neo4j and Sweden]

A graph lets you view, understand, and analyze relationships between every data point in your database. This provides context for analyses that would be impossible in a tabular format.
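As a rough illustration of what "storing relationships alongside entities" can mean, here is a minimal Python sketch of the Michael, Neo4j, and Sweden example. It is not how Neo4j stores data internally. The node labels (Person, Company, Country), the dictionary layout, and the `follow` helper are assumptions made for this example.

```python
# Toy property graph: relationships are first-class records, like nodes.
# Illustrative only; this is not Neo4j's storage format.
nodes = {
    "michael": {"label": "Person", "name": "Michael"},   # label assumed
    "neo4j": {"label": "Company", "name": "Neo4j"},      # label assumed
    "sweden": {"label": "Country", "name": "Sweden"},    # label assumed
}
relationships = [
    ("michael", "WORKS_AT", "neo4j"),
    ("neo4j", "FOUNDED_IN", "sweden"),
]

# Index outgoing relationships by start node so each hop is a direct lookup.
outgoing = {}
for start, rel_type, end in relationships:
    outgoing.setdefault(start, []).append((rel_type, end))


def follow(start, *rel_types):
    """Follow one relationship type per hop (hypothetical helper)."""
    current = [start]
    for rel_type in rel_types:
        current = [
            end
            for node in current
            for candidate_type, end in outgoing.get(node, [])
            if candidate_type == rel_type
        ]
    return [nodes[node_id]["name"] for node_id in current]


print(follow("michael", "WORKS_AT", "FOUNDED_IN"))
```

Each hop here is a lookup from a node to its own relationships, with no join between separate tables.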

Graph databases are optimized for efficient relationship traversal—and it’s much easier to see this than describe it.

You’ll work with a dataset of movies, actors, directors, users, and genres. Run the query below to see its data model.

cypher

Visualize the database schema

CALL db.schema.visualization()

If you have taken any other courses in the GraphAcademy catalog, you should already be familiar with this dataset. It’s essentially a pared-down version of IMDb.

This Cypher query searches for the top ten shortest paths between Charlie Chaplin and Kevin Bacon. Click 'Run' to see what happens.

cypher

Find shortest paths between Charlie Chaplin and Kevin Bacon

MATCH path = SHORTEST 10 (chaz:Person {name: 'Charlie Chaplin'})
    -[:ACTED_IN*..8]-
        (kevin:Person {name:'Kevin Bacon'})
RETURN [n IN nodes(path) | COALESCE(n.title, n.name)]
    AS worked_with
ORDER BY size(worked_with) ASC

The query returns a table with ten rows: the ten shortest paths between Charlie Chaplin and Kevin Bacon in the movies network, ordered from shortest to longest. The first row is the shortest path of all.

The next query returns the same paths, this time as a graph visualization.

cypher

Visualize paths between Charlie Chaplin and Kevin Bacon

MATCH path = SHORTEST 10 (chaz:Person {name: 'Charlie Chaplin'})
    -[:ACTED_IN]-+
        (kevin:Person {name:'Kevin Bacon'})
RETURN path

Both queries finish in well under a second. Even for something this simple, a tabular approach would require serious refactoring or a series of joins to get the same data.
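To make the idea of a shortest-path search concrete, here is a small breadth-first search in Python over a toy network. It is a conceptual sketch, not a description of how Neo4j evaluates `SHORTEST`, and it returns a single path rather than the top ten. The actor and movie names are placeholders, not real film credits.

```python
from collections import deque

# Hypothetical ACTED_IN pairs (person, movie); not real film credits.
acted_in = [
    ("Actor A", "Movie 1"),
    ("Actor B", "Movie 1"),
    ("Actor B", "Movie 2"),
    ("Actor C", "Movie 2"),
    ("Actor A", "Movie 3"),
    ("Actor D", "Movie 3"),
]

# Treat ACTED_IN as undirected, as the queries above do with -[:ACTED_IN]-.
neighbors = {}
for person, movie in acted_in:
    neighbors.setdefault(person, set()).add(movie)
    neighbors.setdefault(movie, set()).add(person)


def shortest_path(start, goal):
    """Breadth-first search: returns one shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sort neighbors so the result is deterministic.
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


print(shortest_path("Actor A", "Actor C"))
```

The returned list alternates between people and movies, which mirrors the `worked_with` column in the first query.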

When you analyze the structure of information—not just properties alone—you can gain insights about user behaviors, identify supply chain bottlenecks, predict service failures, and more.

How graph data science differs from data science

In the previous example, you searched for paths between actors using Cypher. The Graph Data Science (GDS) library takes this many steps further.

Imagine you want to create a recommendation engine for a content platform, but all your transactional data is tabular.

user_id	item_id	user_rating
user_01	item_422437	4
user_01	item_752639	5
user_01	item_782392	3
user_01	item_125565	2

Just because a user enjoyed three Liam Neeson action films doesn’t mean they love Liam Neeson — or even action movies in general. Recommending movies based only on metadata similarities and user ratings is both naive and extremely common.

An effective recommendation engine would analyze user behaviors alongside user and movie metadata. It would also account for how those data interact — but let’s not get ahead of ourselves.

To assess behavioral similarities between users in a tabular format, you’d need to reshape the dataset into multiple tables, each with a specific purpose.

user_id	item_422437	item_752639	item_782392	item_125565
user_01	4	5	3	2
user_02	3	5	1	1
user_03	4	4	4	3
user_04	1	1	2	1

Then, to account for various patterns, you would need to use expensive JOIN operations, which can only get you so far.
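The reshaping step described above can be sketched in plain Python. This example pivots long-format ratings into one row per user and then compares users with cosine similarity. The choice of cosine similarity and the helper name are assumptions for illustration; the rating values are taken from the tables above.

```python
import math

# Long-format ratings, using the values from the tables above.
long_rows = [
    ("user_01", "item_422437", 4), ("user_01", "item_752639", 5),
    ("user_01", "item_782392", 3), ("user_01", "item_125565", 2),
    ("user_02", "item_422437", 3), ("user_02", "item_752639", 5),
    ("user_02", "item_782392", 1), ("user_02", "item_125565", 1),
    ("user_03", "item_422437", 4), ("user_03", "item_752639", 4),
    ("user_03", "item_782392", 4), ("user_03", "item_125565", 3),
    ("user_04", "item_422437", 1), ("user_04", "item_752639", 1),
    ("user_04", "item_782392", 2), ("user_04", "item_125565", 1),
]

# Reshape: one row per user, one column per item.
items = sorted({item for _, item, _ in long_rows})
wide = {}
for user, item, rating in long_rows:
    wide.setdefault(user, {})[item] = rating


def cosine_similarity(user_a, user_b):
    """Cosine similarity between two users' rating vectors."""
    a = [wide[user_a].get(item, 0) for item in items]
    b = [wide[user_b].get(item, 0) for item in items]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for other in ("user_02", "user_03", "user_04"):
    print(other, round(cosine_similarity("user_01", other), 3))
```

This answers one question: which users rate items similarly. A different question, such as which users share actors or genres, would need its own reshaped table.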

Graph data represents relationships explicitly. You don’t need one table to connect users to actors, another to connect actors to movies, and another to relate metadata to user journeys. You need only one dataset to store and retrieve all of these perspectives at once.

The image below shows Liam Neeson’s node in the movie graph—he’s the large red one. The mid-sized orange nodes are movies he appeared in. The smaller blue nodes are users who rated those movies.

[Figure: a graph showing the connections between Liam Neeson, the movies he appeared in, and the users who rated those movies]

Run this query to explore the graph yourself.

cypher

Users who rated movies starring Liam Neeson

MATCH path = (a:Actor {name: 'Liam Neeson'}) // (1)
    -[:ACTED_IN]->(m:Movie) // (2)
    <-[:RATED]-(u:User) // (3)
RETURN path // (4)

Query breakdown

(1) Find the Actor node for Liam Neeson
(2) Follow ACTED_IN relationships to Movie nodes
(3) Follow RATED relationships from User nodes back to those movies
(4) Return the complete path showing all connections

You can click on any node in the returned graph to see its properties.

The graph format is powerful on its own, even when you use only Cypher for analysis. GDS goes further: it lets you analyze your data at various levels of granularity and from a wide range of approaches.

Let’s see that in action.

The following GDS procedures will:

Create a subgraph of Users, Movies, and Genres
Group Users and Movies into communities
Return groups of users with similar watching habits

Run them in sequence and check out the results.

This first query creates a new subgraph (an in-memory projection) from the larger main graph, containing only Users, Movies, Genres, and their relationships.

cypher

Create a subgraph of Users, Movies, and Genres

MATCH (source) // (1)
WHERE source:User OR source:Movie OR source:Genre // (2)
OPTIONAL MATCH (source)-[r]->(target) // (3)
WHERE target:User OR target:Movie OR target:Genre // (4)
WITH gds.graph.project( // (5)
  'rec-simple',
  source,
  target,
  {},
  {
    undirectedRelationshipTypes: ['*']
  }
) AS g
RETURN g.graphName, g.nodeCount, g.relationshipCount // (6)

Projection breakdown

(1) Match all nodes in the database
(2) Filter to only User, Movie, or Genre nodes
(3) Optionally match relationships from these nodes
(4) Filter targets to only User, Movie, or Genre nodes
(5) Project an in-memory graph with undirected relationships
(6) Return the projection details
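Conceptually, the projection steps above amount to filtering nodes by label and storing each surviving relationship in both directions. The Python sketch below shows that idea on a made-up mini-graph. It does not reproduce GDS internals; the node IDs, the `project` function, and the IN_GENRE relationship type are assumptions for illustration.

```python
# Hypothetical mini-graph: node labels and directed relationships.
labels = {
    "u1": "User", "u2": "User",
    "m1": "Movie", "g1": "Genre",
    "p1": "Person",  # excluded by the label filter
}
relationships = [
    ("u1", "RATED", "m1"),
    ("u2", "RATED", "m1"),
    ("m1", "IN_GENRE", "g1"),   # relationship type assumed
    ("p1", "ACTED_IN", "m1"),
]
KEEP = {"User", "Movie", "Genre"}


def project(labels, relationships, keep):
    """Keep only selected labels, then store each relationship undirected."""
    node_ids = {node for node, label in labels.items() if label in keep}
    adjacency = {node: set() for node in node_ids}
    for source, _rel_type, target in relationships:
        if source in node_ids and target in node_ids:
            adjacency[source].add(target)
            adjacency[target].add(source)  # undirected: add the reverse too
    return adjacency


projection = project(labels, relationships, KEEP)
print(sorted(projection))
print(sorted(projection["m1"]))
```

The Person node and its ACTED_IN relationship drop out, while the movie stays connected to both users and its genre.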

This second query runs an algorithm called Leiden, which analyzes how nodes connect and clusters them into meaningful groups.

cypher

Run the Leiden community detection algorithm

CALL gds.leiden.mutate('rec-simple', { // (1)
    mutateProperty: 'communityId', // (2)
    maxLevels: 10, // (3)
    gamma: 1, // (4)
    theta: 0.05}) // (5)

Algorithm breakdown

(1) Run the Leiden algorithm on the 'rec-simple' projection
(2) Store community assignments as the 'communityId' property
(3) Allow up to 10 hierarchical levels of community detection
(4) Resolution parameter controlling community size
(5) Randomness applied when splitting a community into smaller ones
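Leiden searches for a grouping that scores well on modularity, a measure of how many relationships fall inside communities compared with what chance would predict. The sketch below computes the textbook modularity formula with a resolution parameter `gamma` for a fixed grouping. It is not the Leiden algorithm itself, and GDS may scale `gamma` internally, so treat the numbers as illustrative. The toy graph and function name are assumptions.

```python
def modularity(edges, community_of, gamma=1.0):
    """Modularity of an undirected, unweighted graph for a given partition.

    Uses Q = sum over communities c of [ L_c / m - gamma * (d_c / (2m))^2 ],
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal = {}
    degree_sum = {}
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - gamma * (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(modularity(edges, two_groups))
print(modularity(edges, one_group))
print(modularity(edges, two_groups, gamma=2.0))
```

Splitting the graph into its two triangles scores higher than lumping everything together. Raising `gamma` increases the penalty on each community's total degree, which is why higher resolution values tend to favor smaller communities.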

This final query retrieves nodes in the same group and presents them as a table. The results show groups of users whose watching behaviors are more similar to each other than to users outside their group.

cypher

Retrieve and display community members

CALL gds.graph.nodeProperties.stream('rec-simple', 'communityId') // (1)
    YIELD nodeId, propertyValue // (2)
WITH gds.util.asNode(nodeId).name AS node, propertyValue AS communityId // (3)
WITH communityId, collect(node) AS members // (4)
WITH communityId, members, size(members) AS communitySize // (5)
RETURN communityId, communitySize, members // (6)
ORDER BY communitySize DESC // (7)

Query breakdown

(1) Stream the communityId property from the projection
(2) Get each node’s ID and community assignment
(3) Convert node IDs to names and rename the property
(4) Group nodes by community and collect their names
(5) Calculate the size of each community
(6) Return community details
(7) Order by largest communities first
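The grouping and ordering steps above follow a common pattern: collect members per key, count them, and sort by count. Here is the same pattern in Python, using made-up rows in place of the streamed results. The names and community IDs are placeholders.

```python
from collections import defaultdict

# Hypothetical (node name, communityId) rows standing in for streamed results.
rows = [
    ("user_a", 7), ("user_b", 7), ("user_c", 3),
    ("user_d", 7), ("user_e", 3), ("user_f", 9),
]


def group_by_community(rows):
    """Collect members per community, largest community first."""
    members = defaultdict(list)
    for name, community_id in rows:
        members[community_id].append(name)
    return sorted(
        ((cid, len(names), names) for cid, names in members.items()),
        key=lambda row: row[1],
        reverse=True,
    )


for community_id, size, names in group_by_community(rows):
    print(community_id, size, names)
```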

Right now, you might not fully understand what you just did. That’s okay. By the end of this course, you’ll be able to use Graph Data Science to solve a variety of problems.

Summary

Graph databases store relationships with the same level of importance as entities, enabling analyses that would be impossible in a tabular format. When you analyze the structure of your data—not just individual properties—you can reveal insights about behaviors, patterns, and connections.

GDS extends this power by providing algorithms that can detect communities, measure influence, find optimal paths, and much more.

In the next lesson, you’ll learn how to set up your own GDS environment so you can explore these concepts hands-on.

Check your understanding

Relationship traversal efficiency

Why are graph databases more efficient for relationship traversal?

Graph databases have larger storage capacity than tabular databases
✓ Graph databases don’t require joins or table reformatting to access relationships
Tabular databases cannot store relationship data at all
Graph databases are faster at storing individual properties

Hint

Think about what you need to do in a tabular database to find connections between entities.

Solution

Graph databases store relationships with the same importance as entities, allowing direct traversal without joins or table reformatting. Tabular databases require explicit joins or reshaping to access relationships, which is neither fast nor computationally efficient.

GDS advantages

What advantage does GDS provide over pure Cypher?

GDS is faster at storing data than Cypher
GDS replaces the need for Cypher queries entirely
✓ GDS enables analyzing data at various levels of granularity with specialized algorithms
GDS can only work with tabular data formats

Hint

Consider what the Leiden algorithm did in the example—could you easily replicate that with standard Cypher queries?

Solution

While Cypher is powerful for querying graph data, GDS provides specialized algorithms that can analyze your data at various levels of granularity and from a wide range of approaches. For example, GDS algorithms can detect communities, measure influence, and find patterns that would be extremely complex to implement with Cypher alone.
