Centrality and Importance

Introduction

Centrality algorithms are used to determine the importance of distinct nodes in a graph.

Common use cases of centrality include:

  • Recommendations: Identify and recommend the most influential or popular items in your content or product offering catalog

  • Supply chain analytics: find the most critical node in your supply chain, whether it be a supplier in a network, a raw material that is part of a manufactured product, or a port in a route

  • Fraud & Anomaly Detection: Find users with many shared identifiers or who otherwise act as a bridge between many communities

Degree Centrality Example

Degree centrality is one of the most ubiquitous and simple centrality algorithms. It counts the number of relationships a node has. In the GDS implementation, we specifically calculate out-degree centrality which is the count of outgoing relationships from a node. Below is an example of using degree centrality to count the number of movies each actor has acted in.

During this module, you will be using a movie recommendations dataset that contains information about movies, actors, and users who have rated movies.

The recommendations data model showing Movie

To use a GDS algorithm, you must first create a graph projection. A projection is an in-memory graph you can quickly query and manipulate.

You can learn more about graph projections in the GraphAcademy course Introduction to Neo4j Graph Data Science.

Create a graph projection of Actor and Movie nodes:

cypher
CALL gds.graph.project(
  'proj',
  ['Actor','Movie'],
  'ACTED_IN'
  );

Then stream the degree centrality to find the actors who have acted in the most movies:

cypher
//get top 5 most prolific actors (those in the most movies)
//using degree centrality which counts number of `ACTED_IN` relationships
CALL gds.degree.stream('proj')
YIELD nodeId, score
RETURN
  gds.util.asNode(nodeId).name AS actorName,
  score AS numberOfMoviesActedIn
ORDER BY numberOfMoviesActedIn DESCENDING, actorName LIMIT 5

The top three actors should be "Robert De Niro", "Bruce Willis", and "Nicolas Cage."

PageRank Example

Another common centrality algorithm is PageRank. PageRank is a good algorithm for measuring the influence of nodes in a directed graph, particularly where the relationships imply some form of flow of movement such as in payment networks, supply chain and logistics, communications, routing, and graphs of website and links.

PageRank was originally developed by Google co-founders Larry Page and Sergey Brin at Stanford University in 1996 as part of a research project about a new kind of search engine. It has since been used by Google Search to rank web pages in their search engine results.

In summary, PageRank estimates the importance of a node by counting the number of incoming relationships from neighboring nodes weighted by the importance and out-degree centrality of those neighbors. The underlying assumption is that more important nodes are likely to have proportionately more incoming relationships from other important nodes. Our PageRank documentation offers a thorough technical explanation of PageRank if you are interested in digging in deeper.

Below is an example of applying PageRank to find the most influential persons in the Director → Actor network from movies released on or after 1990 with a revenue of at least 10 Million dollars.

First, create the graph projection. We can use a Cypher projection to create an in-memory graph with :DIRECTED_ACTOR relationships between two (:Person) nodes. This graph can be traversed to understand the influence across directors and actors.

cypher
// drop last graph projection
CALL gds.graph.drop('proj', false);

// create Cypher projection for network of people directing actors
// filter to recent high grossing movies
MATCH (source:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(target)
WHERE m.year >= 1990 AND m.revenue >= 10000000
WITH source, target, count(*) as actedWithCount
WITH gds.graph.project(
  'proj',
  source,
  target,
  {
    relationshipType: "DIRECTED_ACTOR"
  }
) as g
RETURN
  g.graphName AS graph, g.nodeCount AS node, g.relationshipCount AS rels

Next stream PageRank to find the top 5 most influential people in director-actor network.

cypher
CALL gds.pageRank.stream('proj')
YIELD nodeId, score
RETURN
  gds.util.asNode(nodeId).name AS personName,
  score AS influence
ORDER BY influence DESCENDING, personName LIMIT 5

The top three persons should be "Robert De Niro", "Greg Kinnear", and "Sandra Bullock."

Other Centrality Algorithms

Other GDS production tier centrality algorithms include:

  • Betweenness Centrality: Measures the extent to which a node stands between the other nodes in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another.

  • Eigenvector Centrality: Measures the transitive influence of nodes. Similar to PageRank, but works only on the largest eigenvector of the adjacency matrix so does not converge in the same way and tends to more strongly favor high degree nodes. It can be more appropriate in certain use cases, particularly those with undirected relationships.

  • Article Rank: A variant of PageRank which assumes that relationships originating from low-degree nodes have a higher influence than relationships from high-degree nodes.

A full list of centrality algorithms across all product tiers can be found in the Centrality section of the GDS documentation.

Check your understanding

1. Algorithm Purpose

What do centrality algorithms generally measure?

  • ❏ The importance of distinct relationships in a graph

  • ❏ The central tendency of clusters in a graph

  • ✓ The importance of distinct nodes in a graph

  • ❏ Which relationship paths are most central in the graph

Hint

Centrality algorithms generally measure the importance, influence, or prominence of nodes within a network/graph. They provide metrics to assess the relative significance of nodes based on their position and connections within the network.

Solution

Centrality algorithms general measure the importance of distinct nodes in a graph.

2. Name the Algorithm

Which centrality algorithm is based solely on a count of relationships for each node

  • ✓ degree centrality

  • ❏ pageRank

  • ❏ HITS

  • ❏ eigenvector centrality

Hint

The number of relationships into or out of a relationship can also be referred to as a degree count.

Solution

The degree centrality algorithm is a measure of centrality that calculates the importance of a node in a network based on the number of direct connections it has. It quantifies the number of edges connected to a node, indicating its level of influence or prominence within the network.

Summary

In this lesson you learned about centrality in GDS. We went over common use cases and how to run some of the most popular centrality algorithms in GDS.