Build the Themes Tool

Challenge

The themes shape surfaces what nobody named. Read its spec first - docs/theme-format.md - then build skill/scripts/themes.py: the pipeline and renderer are given; the projection query is marked BUILD FROM SPEC.

This completes Building Block 2: "Themes nobody named exist as data" ✓

The reasoning before the query

Documents rarely link to each other directly. But the Falcon manual and bulletin TSB-21-114 both reference coil IC-2042-A, so they are about the same thing. The spec’s key move: parts and codes are glue nodes. Project a document-level graph of documents plus glue, and two documents that co-cite the same part sit two hops apart - close enough to cluster, even with no link between them.
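
To see the two-hop claim in miniature, here is a small sketch in plain Python. The node names (`doc:falcon-manual`, `doc:unrelated-guide`, `part:OTHER-PART`) are made up for illustration; only TSB-21-114 and IC-2042-A come from the lesson, and the real projection lives in the database, not in a Python list.

```python
from collections import deque

# Hypothetical node ids: documents and glue (parts) share one undirected graph.
edges = [
    ("doc:falcon-manual", "part:IC-2042-A"),
    ("doc:TSB-21-114", "part:IC-2042-A"),
    ("doc:unrelated-guide", "part:OTHER-PART"),
]


def hops(edges, start, goal):
    """Breadth-first hop count in an undirected graph; None if unreachable."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


print(hops(edges, "doc:falcon-manual", "doc:TSB-21-114"))
print(hops(edges, "doc:falcon-manual", "doc:unrelated-guide"))
```

The first call finds a path through the shared part; the second finds none, because the two documents share no glue.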

Collapsing section-level edges to their owning documents needs no tree walk - the URIs already encode ownership:

cypher
// Ownership is a string operation on hierarchical URIs
// (s is a section already bound earlier in the query)
MATCH (d:Document {uri: split(s.uri, '#')[0]})
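
The same string operation, sketched in Python for comparison. The URIs below are hypothetical; the only assumption carried over from the Cypher is that everything before the first `#` is the owning document's URI.

```python
def owning_document(uri: str) -> str:
    """Return the part of a URI before the first '#'.

    Mirrors split(s.uri, '#')[0]; a URI without '#' is returned unchanged.
    """
    return uri.split("#")[0]


# Hypothetical URIs, for illustration only.
print(owning_document("manuals/falcon#ignition/coil"))
print(owning_document("manuals/falcon"))
```

Both calls return `manuals/falcon`: a document URI maps to itself, so the same expression works for sections and documents alike.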

Hand the projection spec to your agent and review the Cypher it writes. The query should take the sections' REFERENCES_PART, REFERENCES_CODE, and LINKS_TO edges, collapse them to document pairs and document-glue pairs, treat them as undirected, and weight each pair by mention count.
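
When reviewing the Cypher, it can help to have the intended result in mind. The sketch below does the collapse in plain Python on a handful of invented edges. The URI shapes, the `part:` prefix, and the choice to skip edges that stay inside one document are assumptions for illustration, not the spec's wording.

```python
from collections import Counter


def collapse(section_edges):
    """Collapse section-level edges to undirected, weighted document-level pairs."""
    weights = Counter()
    for source_uri, target in section_edges:
        source_doc = source_uri.split("#")[0]
        # Glue ids carry no '#', so the split leaves them unchanged.
        target_node = target.split("#")[0]
        if source_doc == target_node:
            continue  # assumption: links within one document carry no signal
        # Sorting the pair makes A-B and B-A count as the same undirected edge.
        pair = tuple(sorted((source_doc, target_node)))
        weights[pair] += 1  # weight = number of mentions
    return dict(weights)


# Hypothetical section edges: (section URI, referenced part or linked section).
section_edges = [
    ("manuals/falcon#ignition", "part:IC-2042-A"),
    ("manuals/falcon#tuneup", "part:IC-2042-A"),
    ("bulletins/TSB-21-114#summary", "part:IC-2042-A"),
    ("bulletins/TSB-21-114#summary", "manuals/falcon#ignition"),
]

for pair, weight in sorted(collapse(section_edges).items()):
    print(pair, weight)
```

Two sections of the same manual citing the same part become one document-glue pair with weight 2, which is the "mention count" the projection should carry.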

Run the pipeline

The rest of the script is given and runs the spec’s pipeline: Leiden mutate → per-community conductance (the cohesion words) → write themeId back → renderer queries → drop the projection. Mutate-before-write matters: the conductance metric reads themeId from the in-memory projection, which never sees database writes.
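
For intuition about what the conductance step measures, here is a sketch using one common textbook definition: the weight of edges leaving a community divided by the smaller of the two sides' total edge-endpoint weight. The library the script calls may define or normalize it differently, and the toy graph is invented, so read this as an illustration of the idea rather than the script's implementation.

```python
def conductance(edges, community):
    """Cut weight divided by min(volume inside, volume outside).

    edges: iterable of (u, v, weight) for an undirected graph.
    community: set of node ids.
    Lower values mean a more self-contained community.
    """
    cut = volume_in = volume_out = 0.0
    for u, v, weight in edges:
        u_in, v_in = u in community, v in community
        if u_in != v_in:
            cut += weight  # edge crosses the community boundary
        # Each endpoint adds the edge weight to the volume of its own side.
        volume_in += weight * (u_in + v_in)
        volume_out += weight * ((not u_in) + (not v_in))
    smaller_side = min(volume_in, volume_out)
    return cut / smaller_side if smaller_side else 0.0


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [
    ("a1", "a2", 1), ("a2", "a3", 1), ("a1", "a3", 1),
    ("b1", "b2", 1), ("b2", "b3", 1), ("b1", "b3", 1),
    ("a3", "b1", 1),
]

print(conductance(toy_edges, {"a1", "a2", "a3"}))
```

One bridge edge against a volume of 7 gives a conductance of 1/7, a tight community. A community whose edges mostly point outward scores closer to 1.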

shell
python skill/scripts/themes.py

Two broad themes appear - electrical-and-brakes and ignition-and-catalyst. Structurally defensible, but coarser than how technicians think.

Turn the dial

Leiden’s gamma is the granularity dial - higher values favor more, finer themes:

shell
python skill/scripts/themes.py --gamma 2.0

Four themes now - sensors and hoses, ignition, brakes, charging - and the header still reconciles: every document is grouped or honestly ungrouped. There is no single correct resolution; it depends on the questions you will ask. Note the spec’s stability contract: theme numbers are assigned per run - store URIs, never T<id>.
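
Why does a higher gamma split themes? A sketch, assuming the objective is modularity with a resolution parameter: each community scores its internal edge share minus gamma times its expected share, so raising gamma penalizes large communities more. The graph below is invented (four tight pairs, joined two by two, closed into a loop) and only illustrates the trade-off; it is not the lesson's data or the library's code.

```python
def modularity(edges, communities, gamma=1.0):
    """Modularity with resolution gamma for an undirected weighted graph.

    Q = sum over communities of: internal/m - gamma * (degree / 2m)^2
    """
    total = sum(weight for _, _, weight in edges)
    member = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0.0] * len(communities)
    degree = [0.0] * len(communities)
    for u, v, weight in edges:
        degree[member[u]] += weight
        degree[member[v]] += weight
        if member[u] == member[v]:
            internal[member[u]] += weight
    return sum(
        internal[i] / total - gamma * (degree[i] / (2 * total)) ** 2
        for i in range(len(communities))
    )


# Hypothetical toy graph.
toy_edges = [
    ("a1", "a2", 3), ("b1", "b2", 3), ("c1", "c2", 3), ("d1", "d2", 3),  # tight pairs
    ("a2", "b1", 3), ("c2", "d1", 3),  # strong ties: a with b, c with d
    ("b2", "c1", 1), ("d2", "a1", 1),  # weak ties closing the loop
]
coarse = [{"a1", "a2", "b1", "b2"}, {"c1", "c2", "d1", "d2"}]
fine = [{"a1", "a2"}, {"b1", "b2"}, {"c1", "c2"}, {"d1", "d2"}]

for gamma in (1.0, 2.0):
    print(
        gamma,
        round(modularity(toy_edges, coarse, gamma), 2),
        round(modularity(toy_edges, fine, gamma), 2),
    )
```

At gamma 1.0 the two-community split scores higher (0.4 against 0.35); at gamma 2.0 the four-community split wins (0.1 against -0.1). Same graph, different preferred resolution, which is why neither answer is the correct one.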

Stuck or out of sync?

The complete script is in solutions/scripts/themes.py.

Validate the Themes

Once themes.py has written themeIds back, click the Check Database button to verify.

Hint

Only the write-back step changes the database - the projection and Leiden run in memory.

Check:

  • The projection query you built returns rows (test it in the sandbox first)

  • python skill/scripts/themes.py printed the THEMES header with grouped documents

Solution

Run the complete tool:

shell
python solutions/scripts/themes.py

Then check the assignments:

cypher
MATCH (d:Document) WHERE d.themeId IS NOT NULL
RETURN d.themeId AS theme, collect(d.id) AS documents

Most documents should carry a theme, in at least two groups.

If verification fails:

  • If the projection already exists from a crashed run, the script drops it automatically - re-run it

  • Confirm Module 2’s verification passes first (the references and links must exist)

Summary

You built the themes shape:

  • Glue-node projection - documents + parts/codes, co-citation as clustering signal, ownership by URI prefix

  • Leiden mutate → conductance → write - granularity on the gamma dial, cohesion as words not scores

  • Building Block 2: "Themes nobody named exist as data" ✓

In the next lesson, you will read the theme blocks the way an agent does - and name the themes yourself.
