Introduction
Similarity algorithms, as the name implies, are used to infer similarity between pairs of nodes. In GDS these algorithms run over the graph projection in bulk. When similar node pairs are identified according to the user specified metric and threshold, a relationship with a similarity score property is drawn between the pair. Depending on which execution mode is used when running the algorithm, these similarity relationships can be streamed, mutated to the in-memory graph, or written back to the database.
Common use cases for similarity include:
-
Fraud detection: finding potential fraud user accounts by analyzing how similar a set of new user accounts is to flagged accounts
-
Recommendation Systems: In an online retail store, identifying items that pair to the one currently being viewed by a user to inform impressions and increase rate of purchase
-
Entity Resolution: Identify nodes that are similar to one another based on activity or identifying information in the graph
Similarity Algorithms in GDS
GDS has two primary similarity algorithms:
-
Node Similarity: Determines similarity between nodes based on the relative proportion of shared neighboring nodes in the graph. Node Similarity is a good choice where explainability is important, and you can narrow down the universe of comparisons to a subset of your data. Examples of narrowing down include focusing on just single communities, newly added nodes, or nodes within a specific proximity to a subgraph of interest.
-
K-Nearest Neighbor (KNN): Determines similarity based off node properties. The GDS KNN implementation can scale well for global inference over large graphs when tuned appropriately. it can be used in conjunction with embeddings and other graph algorithms to determine the similarity between nodes based on proximity in the graph, node properties, community structure, importance/centrality, etc.
Choice of Similarity Metric
Both Node Similarity and KNN provide choices between different similarity metrics. Node Similarity has choices between Jaccard and Overlap similarity. KNN choice of metric is driven by the node property types. List of integers are subject to Jaccard and Overlap, list of floating point numbers to Cosine Similarity, Pearson, and Euclidean. Using different metrics will of course alter the similarity score and change the interpretation slightly. You can read more about the different metrics for Node Similarity in the Node Similarity documentation and for KNN in the K-Nearest Neighbors documentation.
Controlling Scope of Comparisons
Comparing every node to every other node in the graph is a computationally expensive task of roughly O(n^2)
complexity. The GDS implementations for both Node Similarity and KNN have internal mechanisms to intelligently select node pairs for comparison allowing them to work faster and scale better. They also have parameters that can be adjusted by the user to tune how node pairs are sampled and selected for comparison.
Node Similarity has a degreeCutOff
parameter for nodes which allows you to set a lower limit on the degree centrality for nodes to be selected.
KNN has various parameters that can be tuned to affect the speed vs completeness trade-off of node comparisons, including sampling rate, initial sampler methodology, random join counts between iteration, and a few others. You can read more about how these work in the K-Nearest Neighbors documentation.
Controlling Scope of Results
For similarity comparisons we may also want to control the number of results returned to only consider the most relevant node pairs. Both Node Similarity and KNN have a topK
parameter to limit the number similarity comparisons returned per node. With node similarity there is also the capability to limit the results globally as opposed to just a per node basis.
Applied Example with KNN
Let’s take the embeddings we calculated from the node embedding lesson and use them to determine similarity between the actors and directors based on movies they have been involved with. You can regenerate the projection with the below:
CALL gds.graph.project('proj', ['Movie', 'Person'], {
ACTED_IN:{orientation:'UNDIRECTED'},
DIRECTED:{orientation:'UNDIRECTED'}
});
We will then run FastRP like in the last lesson except in mutate mode so the embeddings will be saved in the projection.
CALL gds.fastRP.mutate('proj', {
embeddingDimension:64,
randomSeed:7474,
mutateProperty:'embedding'
})
After that we can run similarity. We will use the default cosine metric. For purposes of demonstration, I will limit topK
to one so we can see the top pairs for each node.
CALL gds.knn.stream('proj', {nodeLabels:['Person'], nodeProperties:['embedding'], topK:1})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name AS actorName1,
gds.util.asNode(node2).name AS actorName2,
similarity
LIMIT 10
Similarity Functions
In addition to the node similarity and KNN algorithms, GDS also provides a set of functions that can be used to calculate similarity between two arrays of numbers using various similarity metrics including jaccard, overlap, pearson, cosine similarity and a few others. The full documentation can be found in the Similarity Functions documentation. These functions are useful when you are interested in measuring similarity between a single select node pair at a time as opposed to calculating similarity over the entire graph.
Check your Understanding
1. Algorithm Purpose
What can similarity algorithms accomplish in GDS? (select all that apply)
-
❏ Finds similarity relationship pairs in the graph
-
✓ Find similarity between pairs of nodes based on the neighboring nodes that they share
-
✓ Calculate the similarity between node pairs based solely on the similarity of node properties
-
✓ Draw relationships to connect similar node pairs together in the graph
Hint
Similarity algorithms determine the similarity between pairs of nodes by examining shared neighboring nodes, evaluating the similarity based on node properties alone, and establishing connections between similar node pairs within the graph.
Solution
The correct answers are:
Find similarity between pairs of nodes based on the neighboring nodes that they share
Calculate the similarity between node pairs based solely on the similarity of node properties
Draw relationships to connect similar node pairs together in the graph
2. Name the Algorithm
Which Similarity algorithm can be used to find node pairs based off of node properties, including properties that were assigned from other graph algorithms like embeddings or community sizes?
-
❏ Cosine
-
❏ Node Similarity
-
✓ K-Nearest Neighbor (KNN)
-
❏ Dijkstra Source-Target
Hint
The algorithm you are looking for uses nodes connected to a source node, in other words its neighbors, to compute simularity.
Solution
The answer is K-Nearest Neighbor (KNN). You can Read more about the kNN algorithm in the GDS documentation.
Summary
In this lesson you learned about similarity algorithms and how to practically apply them in GDS.
In the next lesson, you will learn more about Node Embeddings.