Introduction
Centrality and community detection reveal graph properties. Embeddings encode them into vectors for machine learning.
In this lesson, you’ll scale features, create embeddings with FastRP, and cluster papers to compare with official subject labels.
What you’ll learn
By the end of this lesson, you’ll be able to:
- Scale features for machine learning with scaleProperties
- Create node embeddings with FastRP that combine structure, content, and centrality
- Cluster papers using K-Means on embeddings
- Analyze cluster quality and compare to known labels
The ML Problem
Machine learning algorithms need fixed-size feature vectors:
- Neural networks expect consistent input dimensions
- sklearn classifiers can’t handle graph structures directly
- How do you represent a node’s "position" in a network?
Embeddings solve this by encoding graph structure into dense vectors.
What are Node Embeddings?
Each node gets a fixed-length vector (e.g., 128 dimensions) where:
- Similar nodes get similar vectors
- Graph structure is encoded
- The result is ready for ML algorithms
Why FastRP?
Alternatives:
- Node2Vec: Requires random walks, slower
- GraphSAGE: Deep learning, needs training data
- DeepWalk: Similar to Node2Vec, also based on random walks
FastRP advantages:
- Fast: Just a few passes of sparse matrix operations, no training loop
- Scalable: Handles millions of nodes
- Feature-aware: Can incorporate node properties
- No training: Deterministic algorithm (given a fixed random seed)
How FastRP Works
FastRP creates embeddings through four steps:
1. Initialize random vectors for each node
2. Propagate information along edges over multiple iterations
3. Combine propagated vectors with configurable weights
4. Project to final dimension using random projection
Nodes connected to similar neighbors end up with similar embeddings.
The mechanics
If two papers are cited by similar papers, they’re probably about similar topics - even if they don’t cite each other directly.
FastRP encodes this by letting nodes "absorb" information from their neighbors.
After a few rounds, nodes in similar network positions end up with similar vectors.
FastRP: Initialization
FastRP initializes random vectors for each node.
The vectors don’t need to be meaningful - they just need to be different from each other.
The meaning emerges during propagation.
FastRP: Propagation
FastRP propagates information along edges over multiple iterations.
Each node averages its vector with its neighbors' vectors. After several rounds, nodes in similar network positions end up with similar vectors - even if they’re not directly connected.
FastRP: Combination
FastRP combines propagated vectors with configurable weights.
The weights control how far the algorithm "looks." A weight of 0.0 on iteration 0 ignores self-information. Higher weights on later iterations emphasize broader neighborhood structure.
FastRP: Projection
FastRP then projects to the final dimension using random projection.
This compression seems like it should lose information, but high-dimensional geometry is counterintuitive - random projections preserve distances between points surprisingly well.
FastRP: Results
Nodes connected to similar neighbors end up with similar embeddings.
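To make these steps concrete, here is a toy sketch of the same idea in plain numpy. It is illustrative only, not the GDS implementation: real FastRP uses very sparse random projections, degree normalization, and the node feature vectors.

import numpy as np

rng = np.random.default_rng(42)

# Toy graph: a 4-cycle (0-1, 0-2, 1-3, 2-3); nodes 1 and 2 have identical neighborhoods
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A = A / A.sum(axis=1, keepdims=True)   # row-normalize so propagation averages over neighbors

d = 8                                   # tiny embedding dimension, for illustration only
E0 = rng.normal(size=(4, d))            # step 1: random initial vectors
E1 = A @ E0                             # step 2: propagate once (1-hop averages)
E2 = A @ E1                             # propagate again (2-hop information)

# step 3: weighted combination, mirroring iterationWeights=[0.0, 1.0, 1.0]
embedding = 0.0 * E0 + 1.0 * E1 + 1.0 * E2

# nodes 1 and 2 share the same neighbors, so their vectors come out identical
print(np.allclose(embedding[1], embedding[2]))   # True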
Feature Engineering First
Before creating embeddings, we need to prepare our features.
Our features have very different ranges:
- The features array contains word frequencies between 0 and 1
- Betweenness scores range from 0 to thousands
- PageRank scores range from 0 to about 20
Without scaling, algorithms would be dominated by the high-magnitude features.
Why We Need to Scale
Machine learning algorithms treat all dimensions equally.
If Betweenness values are 1000x larger than word frequencies, the algorithm will effectively ignore the word frequencies.
Scaling brings all features to comparable ranges so they contribute equally.
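A quick illustration of the problem (plain numpy, with made-up values rather than real Cora data): with unscaled features, the distance between two papers is driven almost entirely by the large betweenness values.

import numpy as np

# Two hypothetical papers: [word_freq_1, word_freq_2, betweenness]
paper_a = np.array([0.9, 0.1, 1500.0])
paper_b = np.array([0.1, 0.9, 1520.0])

# The word frequencies disagree completely, yet the distance (~20.03)
# reflects almost nothing but the 20-point betweenness gap
print(np.linalg.norm(paper_a - paper_b))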
Available Scalers in GDS
GDS provides several scaling options:
| Scaler | Output range | Best for |
|---|---|---|
| MinMax | [0, 1] | Bounded, interpretable values |
| Max | [-1, 1] | When preserving sign matters |
| Mean | [-1, 1] | Centering around average |
| Standard | Unbounded | Statistical normalization (z-scores) |
| Log | Unbounded | Exponential distributions |
We’ll Use MinMax Scaling
MinMax scaling transforms values to the range [0, 1], where 0 represents the minimum value and 1 represents the maximum.
This works well with embedding algorithms like FastRP, and the results are easy to interpret.
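The transform itself is simple. Here is an illustrative sketch of the per-dimension formula (not the GDS internals), applied to some made-up betweenness values:

import numpy as np

def min_max(x):
    # (x - min) / (max - min); a real implementation also guards against constant columns
    return (x - x.min()) / (x.max() - x.min())

betweenness = np.array([0.0, 120.0, 1500.0, 3000.0])
print(min_max(betweenness))   # [0.   0.04 0.5  1.  ]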
Creating a New Projection for ML
To use scaleProperties, we need a projection that includes all the properties we want to scale.
G2, result = gds.graph.project(
    'cora-graph-ml',
    {
        'Paper': {
            'properties': {
                'features': {'property': 'features'},
                'betweenness': {'property': 'betweenness', 'defaultValue': 0.0},  # (1)
                'pageRank': {'property': 'pageRank', 'defaultValue': 0.0}
            }
        }
    },
    {
        'CITES': {
            'orientation': 'UNDIRECTED',  # (2)
            'aggregation': 'SINGLE'
        }
    }
)

1. defaultValue: 0.0 handles nodes missing a centrality score; without this, the projection would fail on nodes without the property
2. UNDIRECTED orientation lets FastRP propagate information in both directions along citation edges
Scaling Properties with GDS
Now we use gds.scaleProperties.mutate() to scale and combine our features.
scaled_result = gds.scaleProperties.mutate(
    G2,
    nodeProperties=['features', 'betweenness', 'pageRank'],  # (1)
    scaler='MinMax',  # (2)
    mutateProperty='scaledFeatures'
)

print(f"Scaled properties for {scaled_result['nodePropertiesWritten']:,} papers")

1. All three property sources are combined into a single vector: the 1,433-dim word frequencies plus 2 centrality scores
2. MinMax scales each dimension independently to [0, 1], preventing high-magnitude features from dominating
Understanding the Scaled Output
The scaleProperties procedure combines all input properties into a single feature vector.
Our 1,433-dimensional word frequency vector plus Betweenness plus PageRank becomes a 1,435-dimensional scaled feature vector.
All values are now in the [0, 1] range and ready for FastRP.
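You can verify this by streaming the property back from the projection. The call below assumes a recent graphdatascience client; the exact result column names can vary between versions.

# Stream one scaled vector back to sanity-check its shape and range
scaled = gds.graph.nodeProperties.stream(G2, ['scaledFeatures'])
first = scaled['propertyValue'].iloc[0]
print(len(first))               # expect 1,435 dimensions
print(min(first), max(first))   # expect values within [0, 1]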
Running FastRP
Now we create embeddings using our scaled features.
result = gds.fastRP.mutate(
    G2,
    mutateProperty='embedding',
    embeddingDimension=128,  # (1)
    featureProperties=['scaledFeatures'],  # (2)
    randomSeed=42,
    iterationWeights=[0.0, 1.0, 1.0]  # (3)
)

print(f"Created {result['nodePropertiesWritten']:,} embeddings")

1. Compresses 1,435 input dimensions down to 128, a balance between capacity and efficiency
2. Incorporates our scaled node properties so embeddings reflect content and centrality, not just structure
3. Ignores self-information (0.0), then weights 1-hop and 2-hop neighbors equally
The featureProperties parameter incorporates our scaled node properties into the embeddings.
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| embeddingDimension | (required) | Size of output vectors |
| featureProperties | [] (optional) | Node properties to incorporate |
| iterationWeights | [0.0, 1.0, 1.0] | Weights for each propagation step |
| randomSeed | (random) | Seed for reproducibility |
| normalizationStrength | 0.0 | Degree-based scaling of the initial random vectors |
Iteration Weights Explained
iterationWeights = [0.0, 1.0, 1.0] means:
- Iteration 0 (weight 0.0): Self-information, ignored
- Iteration 1 (weight 1.0): 1-hop neighbors, included
- Iteration 2 (weight 1.0): 2-hop neighbors, included
More iterations capture longer-range structure.
For broader context, you might use:
iterationWeights=[0.0, 0.5, 1.0, 1.0, 0.5]

Embedding Dimension
Common dimensions: 64, 128, 256
Trade-offs:
- Higher dimension: More capacity, but slower, with overfitting risk
- Lower dimension: Faster with less capacity, may lose information
For Cora (2,708 nodes): 128 is reasonable.
For millions of nodes: Consider 64 or 128.
What the Embedding Encodes
Our embedding captures multiple signals:
- Content: What the paper is about (word frequencies)
- Influence: How important it is (PageRank)
- Connectivity: How much it bridges areas (Betweenness)
- Structure: Who cites it and who it cites (graph topology)
All compressed from 1,435 dimensions into 128 dimensions.
K-Means Clustering
K-Means clustering is a simple and popular unsupervised learning algorithm that groups similar data points into clusters.
It works by iteratively assigning points to the nearest cluster center and updating the cluster centers until convergence.
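For intuition, here is a minimal sketch of that loop in plain numpy (illustrative only; we will use the GDS implementation next):

import numpy as np

def kmeans(X, k, iters=20, seed=42):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick k random points as initial centers
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # update step: each center moves to the mean of its assigned points
        # (a robust version would also handle clusters that become empty)
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers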
Clustering with K-Means
Now we’ll use the embeddings to cluster papers.
result = gds.kmeans.write(
    G2,
    nodeProperty='embedding',  # (1)
    k=7,  # (2)
    writeProperty='kmeans7_cluster',
    randomSeed=42
)

print(f"Created {result['communityCount']} clusters")

1. Clusters based on the FastRP embeddings, which encode structure + content + centrality
2. k=7 matches the number of official subject categories, so we can directly compare clusters to known labels
We use k=7 to match the number of official subject categories.
K-Means vs Louvain
Compare with Louvain from Lesson 8:
| Aspect | Louvain | K-Means on Embeddings |
|---|---|---|
| Input | Graph structure only | Structure + content + centrality |
| Method | Modularity optimization | Distance in embedding space |
| Clusters | Found automatically | You specify k |
| Best for | Finding natural communities | Comparing to known categories |
Analyzing Cluster Quality
Compare K-Means clusters to official subject labels:
q = """
MATCH (p:Paper)
WHERE p.kmeans7_cluster IS NOT NULL
RETURN p.kmeans7_cluster AS cluster,  // (1)
       p.subject AS subject,
       count(*) AS count
ORDER BY cluster, count DESC          // (2)
"""
df = gds.run_cypher(q)

1. Each paper was assigned a cluster (0-6) by K-Means based on embedding similarity
2. Ordering by count DESC within each cluster reveals the dominant subject; high purity means the embedding captured meaningful structure
Cluster purity measures how well clusters align with known labels.
High purity suggests the embedding captured meaningful structure.
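One way to quantify this from the DataFrame above is to compute, per cluster, the share of papers belonging to its dominant subject. A small pandas sketch using the cluster, subject, and count columns returned by the query:

# Purity = fraction of each cluster's papers that belong to its most common subject
totals = df.groupby('cluster')['count'].sum()
dominant = df.groupby('cluster')['count'].max()
purity = (dominant / totals).sort_values(ascending=False)
print(purity.round(2))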
Interpreting Cluster Results
When analyzing clusters:
- Pure clusters (>70%): Tightly focused areas that align with official subjects
- Mixed clusters (<50%): Interdisciplinary areas spanning multiple subjects
- Mismatches: May reveal research groupings not captured by formal labels
This is powerful because clusters emerge from citation patterns AND content AND centrality.
Other Uses for Embeddings
Embeddings enable many downstream tasks beyond clustering:
- Similarity search: Find papers most similar to a given paper using KNN (sketched below)
- Classification: Train ML models to predict paper subjects
- Link prediction: Predict future citations
In this lesson, we focused on clustering to compare with known labels.
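As a taste of similarity search, K-Nearest Neighbors can be run directly on the in-memory graph. The snippet below is a sketch that assumes the G2 projection and the embedding property created earlier.

# Find the 5 most similar papers to each paper, measured in embedding space
knn_df = gds.knn.stream(
    G2,
    topK=5,
    nodeProperties=['embedding']
)
print(knn_df.head())   # columns: node1, node2, similarity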
Reproducibility
Use randomSeed for consistent results:
gds.fastRP.mutate(G2, ..., randomSeed=42)

Important for debugging, reproducible experiments, and production consistency.
Without a seed, results will vary slightly between runs.
Common Pitfalls
Pitfall 1: Missing features
If featureProperties references non-existent properties, FastRP fails. Check with G2.node_properties().
Pitfall 2: Unscaled features
Mix of large and small values? FastRP may be dominated by large values. Always scale first.
Pitfall 3: Wrong dimension
Too small loses information. Too large risks overfitting and slows computation. Start with 128.
Pitfall 4: Wrong interpretation
FastRP encodes proximity: nodes that share neighborhoods end up with similar embeddings. It does not encode structural roles.
If two nodes have exactly the same local structure but sit in different parts of the graph, FastRP will not give them similar embeddings, because they share no neighbors.
Adding featureProperties partially bridges this gap: nodes with similar properties contribute similar information to the embeddings, even when they are far apart in the graph.
Summary
In this lesson, you:
- Scaled features with scaleProperties to prepare for machine learning
- Created 128-dimensional embeddings with FastRP combining structure, content, and centrality
- Clustered papers with K-Means and compared to official subjects
- Learned that embeddings enable many downstream ML tasks
This completes Module 5. You’ve built a full GDS workflow from data loading through embeddings and clustering.