Similarity

Introduction

Similarity algorithms, as the name implies, are used to infer similarity between pairs of nodes. In GDS these algorithms run over the graph projection in bulk. When similar node pairs are identified according to the user specified metric and threshold, a relationship with a similarity score property is drawn between the pair. Depending on which execution mode is used when running the algorithm, these similarity relationships can be streamed, mutated to the in-memory graph, or written back to the database.

Common use cases for similarity include:

Fraud detection: finding potential fraud user accounts by analyzing how similar a set of new user accounts is to flagged accounts
Recommendation Systems: In an online retail store, identifying items that pair to the one currently being viewed by a user to inform impressions and increase rate of purchase
Entity Resolution: Identify nodes that are similar to one another based on activity or identifying information in the graph

Similarity Algorithms in GDS

GDS has two primary similarity algorithms:

Node Similarity: Determines similarity between nodes based on the relative proportion of shared neighboring nodes in the graph. Node Similarity is a good choice where explainability is important, and you can narrow down the universe of comparisons to a subset of your data. Examples of narrowing down include focusing on just single communities, newly added nodes, or nodes within a specific proximity to a subgraph of interest.
K-Nearest Neighbor (KNN): Determines similarity based off node properties. The GDS KNN implementation can scale well for global inference over large graphs when tuned appropriately. it can be used in conjunction with embeddings and other graph algorithms to determine the similarity between nodes based on proximity in the graph, node properties, community structure, importance/centrality, etc.

Choice of Similarity Metric

Both Node Similarity and KNN provide choices between different similarity metrics. Node Similarity has choices between Jaccard and Overlap similarity. KNN choice of metric is driven by the node property types. List of integers are subject to Jaccard and Overlap, list of floating point numbers to Cosine Similarity, Pearson, and Euclidean. Using different metrics will of course alter the similarity score and change the interpretation slightly. You can read more about the different metrics for Node Similarity in the Node Similarity documentation and for KNN in the K-Nearest Neighbors documentation.

Controlling Scope of Comparisons

Comparing every node to every other node in the graph is a computationally expensive task of roughly O(n^2) complexity. The GDS implementations for both Node Similarity and KNN have internal mechanisms to intelligently select node pairs for comparison allowing them to work faster and scale better. They also have parameters that can be adjusted by the user to tune how node pairs are sampled and selected for comparison.

Node Similarity has a degreeCutOff parameter for nodes which allows you to set a lower limit on the degree centrality for nodes to be selected.

KNN has various parameters that can be tuned to affect the speed vs completeness trade-off of node comparisons, including sampling rate, initial sampler methodology, random join counts between iteration, and a few others. You can read more about how these work in the K-Nearest Neighbors documentation.

Controlling Scope of Results

For similarity comparisons we may also want to control the number of results returned to only consider the most relevant node pairs. Both Node Similarity and KNN have a topK parameter to limit the number similarity comparisons returned per node. With node similarity there is also the capability to limit the results globally as opposed to just a per node basis.

Applied Example with KNN

Let’s take the embeddings we calculated from the node embedding lesson and use them to determine similarity between the actors and directors based on movies they have been involved with. You can regenerate the projection with the below:

cypher

CALL gds.graph.project('proj', ['Movie', 'Person'], {
    ACTED_IN:{orientation:'UNDIRECTED'},
    DIRECTED:{orientation:'UNDIRECTED'}
});

We will then run FastRP like in the last lesson except in mutate mode so the embeddings will be saved in the projection.

cypher

CALL gds.fastRP.mutate('proj',  {
    embeddingDimension:64,
    randomSeed:7474,
    mutateProperty:'embedding'
})

After that we can run similarity. We will use the default cosine metric. For purposes of demonstration, I will limit topK to one so we can see the top pairs for each node.

cypher

CALL gds.knn.stream('proj', {nodeLabels:['Person'], nodeProperties:['embedding'], topK:1})
YIELD  node1, node2, similarity
RETURN gds.util.asNode(node1).name AS actorName1,
    gds.util.asNode(node2).name AS actorName2,
    similarity
LIMIT 10

Similarity Functions

In addition to the node similarity and KNN algorithms, GDS also provides a set of functions that can be used to calculate similarity between two arrays of numbers using various similarity metrics including jaccard, overlap, pearson, cosine similarity and a few others. The full documentation can be found in the Similarity Functions documentation. These functions are useful when you are interested in measuring similarity between a single select node pair at a time as opposed to calculating similarity over the entire graph.

Check your Understanding

1. Algorithm Purpose

What can similarity algorithms accomplish in GDS? (select all that apply)

❏ Finds similarity relationship pairs in the graph
✓ Find similarity between pairs of nodes based on the neighboring nodes that they share
✓ Calculate the similarity between node pairs based solely on the similarity of node properties
✓ Draw relationships to connect similar node pairs together in the graph

Hint

Similarity algorithms determine the similarity between pairs of nodes by examining shared neighboring nodes, evaluating the similarity based on node properties alone, and establishing connections between similar node pairs within the graph.

Solution

The correct answers are:

Find similarity between pairs of nodes based on the neighboring nodes that they share
Calculate the similarity between node pairs based solely on the similarity of node properties
Draw relationships to connect similar node pairs together in the graph

2. Name the Algorithm

Which Similarity algorithm can be used to find node pairs based off of node properties, including properties that were assigned from other graph algorithms like embeddings or community sizes?

❏ Cosine
❏ Node Similarity
✓ K-Nearest Neighbor (KNN)
❏ Dijkstra Source-Target

Hint

The algorithm you are looking for uses nodes connected to a source node, in other words its neighbors, to compute simularity.

Solution

The answer is K-Nearest Neighbor (KNN). You can Read more about the kNN algorithm in the GDS documentation.

Summary

In this lesson you learned about similarity algorithms and how to practically apply them in GDS.

In the next lesson, you will learn more about Node Embeddings.

Neo4j Graph Data Science Fundamentals

Graph Algorithms

Graph Machine Learning

Similarity

Introduction

Similarity Algorithms in GDS

Choice of Similarity Metric

Controlling Scope of Comparisons

Controlling Scope of Results

Applied Example with KNN

Similarity Functions

Check your Understanding

1. Algorithm Purpose

2. Name the Algorithm

Summary

Chatbot