Building Fraud Communities

Introduction

You’ve learned the algorithms. Now let’s put them together.

In this lesson, you’ll create relationships that encode fraud hypotheses, use Degree Centrality to filter noise, and run WCC to identify communities of connected suspects.

What You’ll Learn

By the end of this lesson, you’ll be able to:

  • Encode domain hypotheses as graph relationships

  • Combine Degree Centrality filtering with WCC community detection

  • Identify fraud risk users through guilt-by-association

  • Validate results against transaction volume

Why Not Just Use Louvain?

Louvain was valuable for exploration, but it has limitations for formal fraud assignment:

  • Non-deterministic: Results vary between runs

  • Density-based: Groups by connection density, not by fraud patterns

  • No semantic awareness: A fraudster and their victim might land in the same community

We need to encode our hypotheses about what fraud looks like—then use deterministic algorithms to find connected groups.

The Solution: Entity Resolution

Entity Resolution (ER) is the process of determining when multiple records represent the same real-world entity.

In fraud detection, we use ER to identify when multiple user accounts likely belong to the same person or group.

We encode this as new relationships in the graph.

From Exploration to Hypothesis

Louvain revealed a pattern:

Flagged users sending money to unflagged users with whom they share credit cards.

This is suspicious because it suggests:

  • The same person controls both accounts

  • Money is being moved before a chargeback hits

Let’s encode this pattern as a relationship.

The Methodology

  1. Explore: Louvain revealed suspicious patterns

  2. Hypothesize: "Shared cards + transactions = likely same actor"

  3. Encode: Create relationships representing the hypothesis

  4. Resolve: WCC finds connected components

  5. Label: Mark users in fraud-connected communities

Part 1: Encoding Fraud Hypotheses

ER Rule 1: P2P with Shared Card

If User A sent money to User B, and they share a credit card, that’s suspicious.

Two users both share a card

Step 1: Create P2P_SHARED_CARD Relationships

Encode the "sent money to someone sharing my card" pattern:

cypher
Create shared card relationship
MATCH (u1:UserP2P)-[r:P2P]->(u2)
WITH u1, u2, count(r) AS transactions
MATCH (u1)-[:HAS_CC]->(n)<-[:HAS_CC]-(u2) // (1)
WITH u1, u2, count(DISTINCT n) AS sharedCards
MERGE (u1)-[s:P2P_SHARED_CARD]->(u2) // (2)
RETURN count(DISTINCT s) AS P2PWithSharedCard
  1. Finds credit cards shared between the sender and receiver — this two-hop pattern through a Card node is the core fraud signal

  2. Creates a new relationship type encoding our hypothesis — MERGE ensures we don’t create duplicates

This finds user pairs where User 1 sent money to User 2, and they share at least one credit card.

You should see approximately 6,000+ relationships created. Each represents a potentially fraudulent money transfer.

ER Rule 2: Shared Identifiers

Another pattern: users sharing multiple identifiers (cards, devices, IPs).

Sharing one thing might be coincidence. Sharing several suggests coordination.

But there’s a problem: some identifiers connect to hundreds of users, such as coffee shop IP addresses or public library devices.

The High-Degree Problem

Before encoding this pattern, we need to handle noise.

Some identifiers connect to hundreds of users:

  • Coffee shop IP addresses

  • Public library devices

  • Compromised cards sold on the dark web

Illustration showing high-degree identifiers with many user connections

Sharing these isn’t suspicious—it’s meaningless. We need to filter them out.

Step 2: Calculate Identifier Degrees

First, create a projection of users and their shared identifiers:

cypher
Project a graph of User and shared identifiers
MATCH (source:UserP2P)-[r:HAS_CC|USED|HAS_IP]->(target:Card|Device|IP) // (1)
WITH gds.graph.project(
  'identifier-graph',
  source,
  target
) AS g
RETURN g.graphName, g.nodeCount, g.relationshipCount // (2)
  1. Projects all user-to-identifier relationships — this heterogeneous projection includes Cards, Devices, and IPs as targets

  2. Returns projection metadata to verify the expected number of nodes and relationships were captured

Step 3: Run Degree Centrality

Compute degree for all identifier nodes:

cypher
Run Degree Centrality
CALL gds.degree.write('identifier-graph', {
  writeProperty: 'degree',
  orientation: 'REVERSE' // (1)
})
YIELD nodePropertiesWritten
RETURN nodePropertiesWritten
  1. REVERSE counts incoming relationships — for identifier nodes (Cards, Devices, IPs), this measures how many users connect to each one, revealing high-degree noise nodes

Each Card, Device, and IP node now has a degree property indicating how many users connect to it.

Step 4: Clean Up the Projection

cypher
Drop graph
CALL gds.graph.drop('identifier-graph')

Step 5: Create SHARED_IDS Relationships

Now encode the "shares multiple identifiers" pattern, filtering out high-degree noise:

cypher
Create relationships based on shared IDs
MATCH (u1:UserP2P)-[:HAS_CC|USED|HAS_IP]->(n:Card|Device|IP)<-[:HAS_CC|USED|HAS_IP]-(u2:UserP2P)
WHERE n.degree <= 10 // (1)
  AND elementId(u1) < elementId(u2)
WITH u1, u2, count(DISTINCT n) AS sharedIdentifiers
WHERE sharedIdentifiers > 2 // (2)
MERGE (u1)-[s:SHARED_IDS {count: sharedIdentifiers}]->(u2) // (3)
RETURN count(s) AS relationshipsCreated
  1. Filters out high-degree identifiers (public devices, shared proxies) using the degree property we computed in Step 3

  2. Requires more than 2 shared identifiers — sharing one thing is coincidence, sharing several suggests coordination

  3. Stores the count of shared identifiers as a relationship property for potential use as a weight in downstream algorithms

You should see approximately 5,000+ relationships created. Combined with the P2P_SHARED_CARD relationships, these form the basis of our fraud communities.

Understanding the Filters

elementId(u1) < elementId(u2) — Processes each user pair once, preventing duplicate relationships (A→B and B→A)

n.degree <= 10 — Ignores high-degree identifiers (shared proxies, public devices)

sharedIdentifiers > 2 — Requires multiple shared identifiers, not just one

These filters reduce false positives from coincidental sharing.

Business Rules

The degree threshold of 10 is a domain judgment, not a magic number.

  • Too low (2-3): Excludes almost every shared identifier, missing genuine fraud patterns

  • Too high (100+): Admits meaningless shared infrastructure such as public proxies

  • 10: A reasonable starting point for investigation

In production, you’d tune this based on your data and false positive rates. You would likely also include many more rules than these.
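
If you want data to guide that judgment, start by looking at the degree distribution of the identifier nodes before picking a cutoff. Here is a minimal sketch, assuming the degree property written in Step 3 is still present on the Card, Device, and IP nodes:

cypher
Inspect the identifier degree distribution (sketch)
MATCH (n)
WHERE (n:Card OR n:Device OR n:IP) AND n.degree IS NOT NULL
RETURN percentileCont(n.degree, 0.5) AS median,
       percentileCont(n.degree, 0.95) AS p95,
       percentileCont(n.degree, 0.99) AS p99,
       max(n.degree) AS maxDegree

Identifiers far above the upper percentiles are more likely shared infrastructure than shared identity, which makes those percentiles a reasonable place to anchor your cutoff.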

Why Encode Patterns as Relationships?

Once our hypotheses are relationships, we can:

  • Visualize them directly

  • Query them efficiently

  • Run algorithms on them specifically

The graph becomes a model of our fraud theory, not just raw data.
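
For example, "query them efficiently" can be as simple as ranking users by how many fraud-hypothesis connections they now have. A quick sketch using only the relationships created above:

cypher
Rank users by fraud-hypothesis connections (sketch)
MATCH (u:UserP2P)-[r:P2P_SHARED_CARD|SHARED_IDS]-()
RETURN u AS user, count(r) AS fraudPatternConnections
ORDER BY fraudPatternConnections DESC
LIMIT 10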

Step 6: Visualize the ER Relationships

Let’s take a closer look at what we’ve created:

cypher
View the ER communities
MATCH path = (u1:UserP2P)-[:P2P_SHARED_CARD|SHARED_IDS*1..10]-(u2:UserP2P) // (1)
RETURN path
LIMIT 200
  1. Traverses up to 10 hops along only our fraud-hypothesis relationship types — this reveals the emerging community structure before running WCC

These relationships connect users we believe may be the same actors or members of coordinated groups. Notice how they form clusters—these will become our WCC communities.

Part 2: Finding Connected Communities

Step 7: Project the ER Graph

Next, we’ll create a projection using only our ER relationships:

cypher
Create an ER-only projection
MATCH (source:UserP2P)
OPTIONAL MATCH (source)-[r:P2P_SHARED_CARD|SHARED_IDS]->(target:UserP2P) // (1)
RETURN gds.graph.project(
  'er-graph',
  source,
  target,
  {relationshipType: type(r)},
  {undirectedRelationshipTypes: ['*']} // (2)
)
  1. OPTIONAL MATCH ensures all UserP2P nodes are included in the projection, even those with no ER relationships — they’ll become singleton components in WCC

  2. Makes all relationships undirected so WCC treats fraud connections as bidirectional

We project only ER relationships so WCC finds components connected by suspicious patterns—not general network proximity.

Why This Projection Matters

We’re projecting only the fraud-hypothesis relationships—not the original P2P, HAS_CC, or USED relationships.

This means WCC will find components based on our definitions of suspicious behavior, not on general network connectivity.

Users end up in the same component only if they’re connected by patterns we’ve explicitly encoded.
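
If you want to sanity-check what actually landed in the projection before running WCC, the graph catalog is enough. A small sketch using the standard listing procedure:

cypher
Inspect the ER projection (sketch)
CALL gds.graph.list('er-graph')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount

The node count should match the total number of UserP2P nodes (thanks to the OPTIONAL MATCH), while the relationships are only the ER edges we created.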

Step 8: Run WCC

Finally we run WCC to find connected components:

cypher
Write WCC IDs
CALL gds.wcc.write('er-graph', {
  writeProperty: 'wccId' // (1)
})
YIELD componentCount, componentDistribution // (2)
RETURN componentCount, componentDistribution
  1. Writes a wccId property to each node — users in the same component share the same ID, forming our fraud communities

  2. Returns the total number of components and their size distribution to assess the community structure

Interpreting WCC Results

You should see approximately 28,000 components.

Most contain one or two users. The interesting ones are multi-user components—these could be our fraud communities.

Check the componentDistribution:

  • min: 1 (isolated users)

  • max: ~175 (largest community)

  • mean: ~2.4 (most components are small)

Step 9: Examine Component Sizes

See the distribution of community sizes:

cypher
Get community distribution
MATCH (u:UserP2P)
WHERE u.wccId IS NOT NULL
WITH u.wccId AS community, count(*) AS size // (1)
RETURN size, count(*) AS communitiesOfThisSize // (2)
ORDER BY size
  1. Groups users by their WCC component to calculate the size of each community

  2. Aggregates again to show how many communities exist at each size — this reveals the power-law distribution typical of fraud networks

Most communities are size 1—users not connected by suspicious patterns. Communities of size 2+ are our investigation targets.

Inspect distribution

Your previous query should have returned something like this:

size | communitiesOfThisSize
1    | 24283
2    | 3165
3    | 481
4    | 151
5    | 61
…    | …
175  | 1

The Guilt-by-Association Principle

Our hypothesis is "if a user is connected (via fraud patterns) to a known fraudster, they warrant investigation."

This isn’t proof of guilt—it’s prioritization. We’re identifying users who deserve closer scrutiny.

Step 10: Find Communities with Known Fraud

Which communities contain flagged users?

cypher
Get communities with flagged users
MATCH (u:UserP2P)
WHERE u.wccId IS NOT NULL
WITH u.wccId AS community,
  count(*) AS totalUsers,
  sum(u.fraudMoneyTransfer) AS flaggedUsers // (1)
WHERE flaggedUsers > 0
RETURN community, totalUsers, flaggedUsers,
  totalUsers - flaggedUsers AS unflaggedUsers // (2)
ORDER BY unflaggedUsers DESC
LIMIT 10
  1. Sums the fraud flag across each community — communities with flaggedUsers > 0 contain known fraud

  2. Calculates unflagged users per community — these are our investigation targets: users connected to fraudsters by suspicious patterns but not yet identified

Inspecting results

The previous query should return a table that looks like this:

community | totalUsers | flaggedUsers | unflaggedUsers
619       | 37         | 3            | 34
2798      | 11         | 1            | 10
3895      | 11         | 2            | 9
712       | 10         | 1            | 9
…         | …          | …            | …

Communities with flaggedUsers > 0 contain known fraud. Those with unflaggedUsers > 0 contain unlabeled users connected by suspicious patterns—our new fraud risks.

Step 11: Label Fraud Risk Users

Now, we can mark all users in fraud-connected communities as fraud risks:

cypher
Mark users in fraud communities
MATCH (flagged:UserP2P)
WHERE flagged.fraudMoneyTransfer = 1
  AND flagged.wccId IS NOT NULL
WITH collect(DISTINCT flagged.wccId) AS flaggedCommunities // (1)
MATCH (u:UserP2P)
WHERE u.wccId IN flaggedCommunities
  AND u.fraudMoneyTransfer = 0 // (2)
SET u:FraudRisk, u.fraudRisk = 1 // (3)
RETURN count(u) AS newFraudRiskUsers
  1. Collects all WCC community IDs that contain at least one known fraudster

  2. Finds unflagged users in those communities — these are connected to fraudsters through our encoded patterns but haven’t been identified yet

  3. Adds both a label (:FraudRisk) and a property (fraudRisk = 1) for flexible downstream querying

Results

You should identify 211 new fraud risk users.

These are users who:

  • Were not flagged by the original chargeback logic

  • Are connected to flagged users via suspicious patterns (shared cards + transactions, or multiple shared identifiers)

Part 3: Validation

Why Validate?

We’ve identified 211 users. But are they significant, or just noise?

One way to check: measure their transaction volume. If these users handle substantial money flow, our identification is meaningful.

Step 12: Validate the Labeling

How much money flows through these newly identified accounts?

cypher
Find money flowing through accounts
MATCH (risk:FraudRisk)-[p:P2P]-() // (1)
WITH sum(p.totalAmount) AS riskP2P
MATCH ()-[p:P2P]->()
WITH riskP2P, sum(p.totalAmount) AS totalP2P
RETURN round(100.0 * riskP2P / totalP2P, 1) AS percentOfAllP2P // (2)
  1. Matches all P2P transactions involving FraudRisk users — both sent and received — to capture their full economic footprint

  2. Calculates the percentage of total P2P volume flowing through these 211 users — a high percentage relative to their population confirms they are economically significant actors

You should find that these 211 users (less than 1% of total users) account for approximately 12-13% of all P2P transaction volume.

Interpreting the Validation

These 211 users represent less than 1% of all users, yet they handle approximately 13% of all P2P transaction volume.

If we’d selected 211 users at random, we’d expect them to handle a roughly proportional share of the volume (well under 1%). Instead, they handle more than ten times that.

This suggests our methodology has identified economically significant actors, not random noise.
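
To put the population share next to the volume share in your own data, you can compute it directly. A minimal sketch, assuming only the FraudRisk label set in Step 11:

cypher
Compare FraudRisk population share to volume share (sketch)
MATCH (u:UserP2P)
RETURN count(u) AS totalUsers,
       sum(CASE WHEN u:FraudRisk THEN 1 ELSE 0 END) AS fraudRiskUsers,
       round(100.0 * sum(CASE WHEN u:FraudRisk THEN 1 ELSE 0 END) / count(u), 2) AS percentOfUsers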

Step 13: Visualize a Fraud Community

Pick a community and examine it:

cypher
Examine just one community
MATCH (u:UserP2P)
WHERE u.wccId IS NOT NULL AND u.fraudMoneyTransfer = 1
WITH u.wccId AS community, count(*) AS fraudCount
ORDER BY fraudCount DESC LIMIT 1 // (1)
WITH community
MATCH path = (u1:UserP2P)-[:P2P_SHARED_CARD|SHARED_IDS|HAS_CC|USED]-(n)--(u2:UserP2P)
WHERE u1.wccId = community AND u2.wccId = community // (2)
RETURN path
LIMIT 200
  1. Selects the community with the most known fraudsters for the most informative visualization

  2. Filters both endpoints to the same community and traverses through shared infrastructure (cards, devices) to reveal the fraud ring structure

In the visualization:

  • Flagged users (fraudMoneyTransfer = 1) are known fraudsters

  • FraudRisk users are newly identified through ER + WCC

  • Notice how they’re connected by shared cards and devices

Cleanup

Drop the projection:

cypher
Drop the ER projection
CALL gds.graph.drop('er-graph')

What We Built

Step | What We Did | Result
Encode hypotheses | Created P2P_SHARED_CARD and SHARED_IDS relationships | ~11,000 fraud-pattern edges
Filter noise | Used Degree Centrality to exclude high-degree identifiers | Removed false connections
Find communities | Ran WCC on fraud-hypothesis relationships | ~28,000 components
Label risk | Marked users in fraud-connected components | 211 new suspects
Validate | Checked transaction volume | ~13% of P2P volume

Algorithms Are Tools

The algorithms we used—Degree Centrality and WCC—are simple by design.

The power came from:

  • Encoding domain knowledge as relationships

  • Filtering noise before community detection

  • Projecting selectively to control what "connected" means

Algorithms find structure. You define what structure matters through relationship design and projection choices.

Extending This Approach

In production, you’d likely:

  • Add more ER rules — Same device + same IP, velocity patterns, behavioral similarity

  • Weight relationships — More shared identifiers = stronger connection

  • Use thresholds in WCC — Only consider strong connections (see the sketch after this list)

  • Iterate — Validate, tune thresholds, add rules, repeat
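
As a sketch of the last two ideas combined, WCC accepts a relationshipWeightProperty together with a threshold, so components only form across sufficiently strong connections. The example below projects only SHARED_IDS (the one ER relationship that carries a count property) into a hypothetical 'er-weighted' graph, then treats users as connected only when they share more than three identifiers. The graph name and threshold are illustrative, and the two statements should be run separately if your client doesn’t support multiple statements:

cypher
Weighted WCC with a threshold (sketch)
// 1. Project SHARED_IDS with its count property as a relationship weight
MATCH (source:UserP2P)-[r:SHARED_IDS]->(target:UserP2P)
RETURN gds.graph.project(
  'er-weighted',
  source,
  target,
  {relationshipProperties: r {.count}},
  {undirectedRelationshipTypes: ['*']}
);

// 2. Only treat users as connected when they share more than 3 identifiers
CALL gds.wcc.stream('er-weighted', {
  relationshipWeightProperty: 'count',
  threshold: 3.0
})
YIELD nodeId, componentId
RETURN componentId, count(*) AS size
ORDER BY size DESC
LIMIT 10

Raising the threshold shrinks communities to only the most tightly linked accounts; lowering it approaches the unweighted result.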

Summary

You’ve applied the full fraud detection methodology:

  • Created P2P_SHARED_CARD relationships (6,000+)

  • Created SHARED_IDS relationships (5,000+)

  • Ran WCC to find connected components (~28,000)

  • Identified 211 new fraud risk users

  • Validated they represent ~13% of transaction volume

Remember, algorithms find structure, but you define what structure matters through relationship design and projection modeling.

The power isn’t in the algorithms themselves—it’s in encoding your fraud hypotheses as graph relationships, then letting simple algorithms like WCC find the connected groups.
