Introduction
Centrality and community detection reveal graph properties. Embeddings encode them into vectors for machine learning.
In this lesson, you’ll scale features, create embeddings with FastRP, and cluster papers to compare with official subject labels.
What you’ll learn
By the end of this lesson, you’ll be able to:
- Scale features for machine learning with scaleProperties
- Create node embeddings with FastRP that combine structure, content, and centrality
- Cluster papers using K-Means on embeddings
- Analyze cluster quality and compare to known labels
The ML Problem
Machine learning algorithms need fixed-size feature vectors:
- Neural networks expect consistent input dimensions
- sklearn classifiers can’t handle graph structures directly
- How do you represent a node’s "position" in a network?
Embeddings solve this by encoding graph structure into dense vectors.
What are Node Embeddings?
Each node gets a fixed-length vector (e.g., 128 dimensions) where:
- Similar nodes get similar vectors
- Graph structure is encoded
- The result is ready for ML algorithms
Why FastRP?
Alternatives:
- Node2Vec: Requires random walks, slower
- GraphSAGE: Deep learning, needs training data
- DeepWalk: Similar to Node2Vec, also based on random walks
FastRP advantages:
- Fast: Just a few passes of sparse matrix operations, no training loop
- Scalable: Handles millions of nodes
- Feature-aware: Can incorporate node properties
- No training: Deterministic algorithm (given a fixed random seed)
How FastRP Works
FastRP creates embeddings through four steps:
1. Initialize random vectors for each node
2. Propagate information along edges over multiple iterations
3. Combine propagated vectors with configurable weights
4. Project to final dimension using random projection
Nodes connected to similar neighbors end up with similar embeddings.
The mechanics
If two papers are cited by similar papers, they’re probably about similar topics - even if they don’t cite each other directly.
FastRP encodes this by letting nodes "absorb" information from their neighbors.
After a few rounds, nodes in similar network positions end up with similar vectors.
FastRP: Initialization
FastRP initializes random vectors for each node.
The vectors don’t need to be meaningful - they just need to be different from each other.
The meaning emerges during propagation.
FastRP: Propagation
FastRP propagates information along edges over multiple iterations.
Each node averages its vector with its neighbors' vectors. After several rounds, nodes in similar network positions end up with similar vectors - even if they’re not directly connected.
FastRP: Combination
FastRP combines propagated vectors with configurable weights.
The weights control how far the algorithm "looks." A weight of 0.0 on iteration 0 ignores self-information. Higher weights on later iterations emphasize broader neighborhood structure.
FastRP: Projection
FastRP then projects to the final dimension using random projection.
This compression seems like it should lose information, but high-dimensional geometry is counterintuitive - random projections preserve distances between points surprisingly well.
FastRP: Results
Nodes connected to similar neighbors end up with similar embeddings.
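To make these steps concrete, here is a toy sketch of the same idea in plain numpy. It is illustrative only, not the GDS implementation: real FastRP uses very sparse random projections, degree normalization, and the node feature vectors.

import numpy as np

rng = np.random.default_rng(42)

# Toy graph: a 4-cycle (0-1, 0-2, 1-3, 2-3); nodes 1 and 2 have identical neighborhoods
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A = A / A.sum(axis=1, keepdims=True)   # row-normalize so propagation averages over neighbors

d = 8                                   # tiny embedding dimension, for illustration only
E0 = rng.normal(size=(4, d))            # step 1: random initial vectors
E1 = A @ E0                             # step 2: propagate once (1-hop averages)
E2 = A @ E1                             # propagate again (2-hop information)

# step 3: weighted combination, mirroring iterationWeights=[0.0, 1.0, 1.0]
embedding = 0.0 * E0 + 1.0 * E1 + 1.0 * E2

# nodes 1 and 2 share the same neighbors, so their vectors come out identical
print(np.allclose(embedding[1], embedding[2]))   # True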
Feature Engineering First
Before creating embeddings, we need to prepare our features.
Our features have very different ranges:
- The features array contains word frequencies between 0 and 1
- Betweenness scores range from 0 to thousands
- PageRank scores range from 0 to about 20
Without scaling, algorithms would be dominated by the high-magnitude features.
Why We Need to Scale
Machine learning algorithms treat all dimensions equally.
If Betweenness values are 1000x larger than word frequencies, the algorithm will effectively ignore the word frequencies.
Scaling brings all features to comparable ranges so they contribute equally.
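A quick illustration of the problem (plain numpy, with made-up values rather than real Cora data): with unscaled features, the distance between two papers is driven almost entirely by the large betweenness values.

import numpy as np

# Two hypothetical papers: [word_freq_1, word_freq_2, betweenness]
paper_a = np.array([0.9, 0.1, 1500.0])
paper_b = np.array([0.1, 0.9, 1520.0])

# The word frequencies disagree completely, yet the distance (~20.03)
# reflects almost nothing but the 20-point betweenness gap
print(np.linalg.norm(paper_a - paper_b))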
Available Scalers in GDS
GDS provides several scaling options:
| Scaler | Output range | Best for |
|---|---|---|
| MinMax | [0, 1] | Bounded, interpretable values |
| Max | [-1, 1] | When preserving sign matters |
| Mean | [-1, 1] | Centering around average |
| Standard | Unbounded | Statistical normalization (z-scores) |
| Log | Unbounded | Exponential distributions |
We’ll Use MinMax Scaling
MinMax scaling transforms values to the range [0, 1], where 0 represents the minimum value and 1 represents the maximum.
This works well with embedding algorithms like FastRP, and the results are easy to interpret.
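The transform itself is simple. Here is an illustrative sketch of the per-dimension formula (not the GDS internals), applied to some made-up betweenness values:

import numpy as np

def min_max(x):
    # (x - min) / (max - min); a real implementation also guards against constant columns
    return (x - x.min()) / (x.max() - x.min())

betweenness = np.array([0.0, 120.0, 1500.0, 3000.0])
print(min_max(betweenness))   # [0.   0.04 0.5  1.  ]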
Creating a New Projection for ML
To use scaleProperties, we need a projection that includes all the properties we want to scale.
G2, result = gds.graph.project(
    'cora-graph-ml',
    {
        'Paper': {
            'properties': {
                'features': {'property': 'features'},
                'betweenness': {'property': 'betweenness', 'defaultValue': 0.0},  # (1)
                'pageRank': {'property': 'pageRank', 'defaultValue': 0.0}
            }
        }
    },
    {
        'CITES': {
            'orientation': 'UNDIRECTED',  # (2)
            'aggregation': 'SINGLE'
        }
    }
)

1. defaultValue: 0.0 handles nodes missing a centrality score; without this, the projection would fail on nodes without the property
2. UNDIRECTED orientation lets FastRP propagate information in both directions along citation edges
Scaling Properties with GDS
Now we use gds.scaleProperties.mutate() to scale and combine our features.
scaled_result = gds.scaleProperties.mutate(
    G2,
    nodeProperties=['features', 'betweenness', 'pageRank'],  # (1)
    scaler='MinMax',  # (2)
    mutateProperty='scaledFeatures'
)

print(f"Scaled properties for {scaled_result['nodePropertiesWritten']:,} papers")

1. All three property sources are combined into a single vector: the 1,433-dim word frequencies plus 2 centrality scores
2. MinMax scales each dimension independently to [0, 1], preventing high-magnitude features from dominating
Understanding the Scaled Output
The scaleProperties procedure combines all input properties into a single feature vector.
Our 1,433-dimensional word frequency vector plus Betweenness plus PageRank becomes a 1,435-dimensional scaled feature vector.
All values are now in the [0, 1] range and ready for FastRP.
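You can verify this by streaming the property back from the projection. The call below assumes a recent graphdatascience client; the exact result column names can vary between versions.

# Stream one scaled vector back to sanity-check its shape and range
scaled = gds.graph.nodeProperties.stream(G2, ['scaledFeatures'])
first = scaled['propertyValue'].iloc[0]
print(len(first))               # expect 1,435 dimensions
print(min(first), max(first))   # expect values within [0, 1]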
Running FastRP
Now we create embeddings using our scaled features.
result = gds.fastRP.mutate(
    G2,
    mutateProperty='embedding',
    embeddingDimension=128,  # (1)
    featureProperties=['scaledFeatures'],  # (2)
    randomSeed=42,
    iterationWeights=[0.0, 1.0, 1.0]  # (3)
)

print(f"Created {result['nodePropertiesWritten']:,} embeddings")

1. Compresses 1,435 input dimensions down to 128, a balance between capacity and efficiency
2. Incorporates our scaled node properties so embeddings reflect content and centrality, not just structure
3. Ignores self-information (0.0), then weights 1-hop and 2-hop neighbors equally
The featureProperties parameter incorporates our scaled node properties into the embeddings.
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| embeddingDimension | (required) | Size of output vectors |
| featureProperties | [] (optional) | Node properties to incorporate |
| iterationWeights | [0.0, 1.0, 1.0] | Weights for each propagation step |
| randomSeed | (random) | Seed for reproducibility |
| normalizationStrength | 0.0 | Degree-based scaling of the initial random vectors |
Iteration Weights Explained
iterationWeights = [0.0, 1.0, 1.0] means:
- Iteration 0 (weight 0.0): Self-information, ignored
- Iteration 1 (weight 1.0): 1-hop neighbors, included
- Iteration 2 (weight 1.0): 2-hop neighbors, included
More iterations capture longer-range structure.
For broader context, you might use:
iterationWeights=[0.0, 0.5, 1.0, 1.0, 0.5]

Embedding Dimension
Common dimensions: 64, 128, 256
Trade-offs:
- Higher dimension: More capacity, but slower, with overfitting risk
- Lower dimension: Faster with less capacity, may lose information
For Cora (2,708 nodes): 128 is reasonable.
For millions of nodes: Consider 64 or 128.
What the Embedding Encodes
Our embedding captures multiple signals:
- Content: What the paper is about (word frequencies)
- Influence: How important it is (PageRank)
- Connectivity: How much it bridges areas (Betweenness)
- Structure: Who cites it and who it cites (graph topology)
All compressed from 1,435 dimensions into 128 dimensions.
K-Means Clustering
K-Means clustering is a simple and popular unsupervised learning algorithm that groups similar data points into clusters.
It works by iteratively assigning points to the nearest cluster center and updating the cluster centers until convergence.
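For intuition, here is a minimal sketch of that loop in plain numpy (illustrative only; we will use the GDS implementation next):

import numpy as np

def kmeans(X, k, iters=20, seed=42):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick k random points as initial centers
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # update step: each center moves to the mean of its assigned points
        # (a robust version would also handle clusters that become empty)
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers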
Clustering with K-Means
Now we’ll use the embeddings to cluster papers.
result = gds.kmeans.write(
    G2,
    nodeProperty='embedding',  # (1)
    k=7,  # (2)
    writeProperty='kmeans7_cluster',
    randomSeed=42
)

print(f"Created {result['communityCount']} clusters")

1. Clusters based on the FastRP embeddings, which encode structure + content + centrality
2. k=7 matches the number of official subject categories, so we can directly compare clusters to known labels
We use k=7 to match the number of official subject categories.
K-Means vs Louvain
Compare with Louvain from Lesson 8:
| Aspect | Louvain | K-Means on Embeddings |
|---|---|---|
| Input | Graph structure only | Structure + content + centrality |
| Method | Modularity optimization | Distance in embedding space |
| Clusters | Found automatically | You specify k |
| Best for | Finding natural communities | Comparing to known categories |
Analyzing Cluster Quality
Compare K-Means clusters to official subject labels:
q = """
MATCH (p:Paper)
WHERE p.kmeans7_cluster IS NOT NULL
RETURN p.kmeans7_cluster AS cluster,  // (1)
       p.subject AS subject,
       count(*) AS count
ORDER BY cluster, count DESC          // (2)
"""
df = gds.run_cypher(q)

1. Each paper was assigned a cluster (0-6) by K-Means based on embedding similarity
2. Ordering by count DESC within each cluster reveals the dominant subject; high purity means the embedding captured meaningful structure
Cluster purity measures how well clusters align with known labels.
High purity suggests the embedding captured meaningful structure.
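One way to quantify this from the DataFrame above is to compute, per cluster, the share of papers belonging to its dominant subject. A small pandas sketch using the cluster, subject, and count columns returned by the query:

# Purity = fraction of each cluster's papers that belong to its most common subject
totals = df.groupby('cluster')['count'].sum()
dominant = df.groupby('cluster')['count'].max()
purity = (dominant / totals).sort_values(ascending=False)
print(purity.round(2))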
Interpreting Cluster Results
When analyzing clusters:
- Pure clusters (>70%): Tightly focused areas that align with official subjects
- Mixed clusters (<50%): Interdisciplinary areas spanning multiple subjects
- Mismatches: May reveal research groupings not captured by formal labels
This is powerful because clusters emerge from citation patterns AND content AND centrality.
Other Uses for Embeddings
Embeddings enable many downstream tasks beyond clustering:
- Similarity search: Find papers most similar to a given paper using KNN (sketched below)
- Classification: Train ML models to predict paper subjects
- Link prediction: Predict future citations
In this lesson, we focused on clustering to compare with known labels.
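As a taste of similarity search, K-Nearest Neighbors can be run directly on the in-memory graph. The snippet below is a sketch that assumes the G2 projection and the embedding property created earlier.

# Find the 5 most similar papers to each paper, measured in embedding space
knn_df = gds.knn.stream(
    G2,
    topK=5,
    nodeProperties=['embedding']
)
print(knn_df.head())   # columns: node1, node2, similarity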
Reproducibility
Use randomSeed for consistent results:
gds.fastRP.mutate(G2, ..., randomSeed=42)

Important for debugging, reproducible experiments, and production consistency.
Without a seed, results will vary slightly between runs.
Common Pitfalls
Pitfall 1: Missing features
If featureProperties references non-existent properties, FastRP fails. Check with G2.node_properties().
Pitfall 2: Unscaled features
Mix of large and small values? FastRP may be dominated by large values. Always scale first.
Pitfall 3: Wrong dimension
Too small loses information. Too large risks overfitting and slows computation. Start with 128.
Pitfall 4: Wrong interpretation
FastRP encodes proximity: nodes that share neighborhoods end up with similar embeddings. It does not encode structural roles.
If two nodes have exactly the same local structure but sit in different parts of the graph, FastRP will not give them similar embeddings, because they share no neighbors.
Adding featureProperties partially bridges this gap: nodes with similar properties contribute similar information to the embeddings, even when they are far apart in the graph.
Summary
In this lesson, you:
- Scaled features with scaleProperties to prepare for machine learning
- Created 128-dimensional embeddings with FastRP combining structure, content, and centrality
- Clustered papers with K-Means and compared to official subjects
- Learned that embeddings enable many downstream ML tasks
This completes Module 5. You’ve built a full GDS workflow from data loading through embeddings and clustering.