Test your understanding of the three shapes and the boundary crossing.
The Context Problem
Sam’s vector search returned passages that were semantically similar to the technician’s question, but the copilot still gave unreliable answers. What is the core reason?
- ❏ A. The embedding model was too small
- ✓ B. The answer needs a connected set of documents and records, not a pile of similar paragraphs
- ❏ C. Vector search cannot index PDF files
- ❏ D. The PDFs were not chunked at the right size
Hint
The workshop calls this "right meaning, wrong shape" - what does the answer to Dani’s question actually look like?
Solution
B is correct: Dani’s question needs a connected set - the bulletin, the part it names, the work orders that used that part, and the vehicles they were on. Vector search returns disconnected passages ranked by similarity; no amount of embedding quality changes the shape of what it returns.
Why others are wrong:
- A: A better model returns better-ranked paragraphs - still paragraphs
- C: Vector search indexes parsed PDF text without difficulty
- D: Chunking affects retrieval quality, not the fundamental shape mismatch
Recall Module 1: The three-layer wall - vector search is layer one.
The Tree Shape
What does the (Library)-[:HAS*]->(Section) containment tree give an agent that keyword or vector search cannot?
- ❏ A. Faster text matching
- ❏ B. Automatic summarization of each document
- ✓ C. The ability to navigate document structure - tables of contents, chapters, and what a document covers
- ❏ D. Smaller storage requirements
Hint
Think about the first query you ran against the Falcon manual in Module 2.
Solution
C is correct: The variable-length pattern [:HAS*] walks the library’s structure, producing views like a table of contents or "every section under the Engine chapter".
Search tools have no concept of "contains" - they can only rank fragments.
Why others are wrong:
- A: The tree is about structure, not matching speed
- B: Summarization is an LLM task; the tree tells the LLM what to read
- D: Storing structure adds nodes and relationships - it does not reduce storage
Recall Module 2: The table-of-contents query over the Falcon manual.
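To make the containment walk concrete, here is a minimal sketch in plain Python rather than Cypher. It assumes a toy in-memory dictionary standing in for the HAS relationships; apart from the Falcon manual and the Engine chapter, the section titles are hypothetical and not taken from the workshop data. The recursive walk plays the role of the variable-length [:HAS*] pattern: it follows one or more HAS hops and records the depth of each node it reaches.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: parent -> children, standing in for HAS relationships.
HAS: Dict[str, List[str]] = {
    "Library": ["Falcon Manual"],
    "Falcon Manual": ["Engine", "Brakes"],
    "Engine": ["Cooling System", "Fuel System"],
    "Brakes": ["Brake Pads"],
}


def table_of_contents(root: str, has: Dict[str, List[str]]) -> List[Tuple[int, str]]:
    """Return (depth, title) for every node reachable from root via one or more HAS hops."""
    toc: List[Tuple[int, str]] = []

    def walk(node: str, depth: int) -> None:
        for child in has.get(node, []):
            toc.append((depth, child))
            walk(child, depth + 1)

    walk(root, 1)
    return toc


# Whole-library view: an indented table of contents.
toc = table_of_contents("Library", HAS)
for depth, title in toc:
    print("  " * (depth - 1) + title)

# Scoped view: every section under the Engine chapter.
engine_sections = [title for _, title in table_of_contents("Engine", HAS)]
```

A similarity ranker has no equivalent of the second call: "everything under Engine" is a question about containment, not about how close two pieces of text are.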
Meaningful Themes
Why did Leiden’s document communities correspond to real repair themes instead of arbitrary clusters?
- ❏ A. Leiden is guaranteed to find meaningful clusters in any graph
- ❏ B. The gamma parameter forces communities to match business topics
- ✓ C. Every edge in the projection exists because two documents touch the same part or fault code (or cite each other), so dense clusters are documents about the same repair topic
- ❏ D. The sections were manually tagged with topics before the workshop
Hint
The algorithm only sees nodes and links. What made each link exist in the first place?
Solution
C is correct: Community detection finds dense clusters - the meaning comes from the edges. Because the projection’s edges are shared parts, shared codes (glue nodes), and citations, a dense cluster can only mean "documents that keep talking about the same parts and faults" - a real repair theme.
Why others are wrong:
- A: On arbitrary links (or random similarity), the clusters would be arbitrary too
- B: Gamma controls granularity (how many themes), not meaning
- D: Nothing was tagged - surfacing unnamed patterns was the point
Recall Module 3: "Why these clusters mean something."
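The point that meaning comes from the edges can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workshop's projection: the document names, part numbers, and fault codes are hypothetical, citations are left out, and the grouping step uses plain connected components as a crude stand-in for Leiden (it is not the same algorithm and has no gamma parameter).

```python
from itertools import combinations

# Hypothetical documents and the glue nodes (parts, fault codes) each one mentions.
MENTIONS = {
    "bulletin-1": {"part:P-100", "code:E42"},
    "manual-sec-3": {"part:P-100"},
    "bulletin-2": {"code:E42"},
    "manual-sec-9": {"part:P-200"},
    "bulletin-7": {"part:P-200", "code:E77"},
}


def project(mentions):
    """Link two documents only when they share at least one glue node."""
    edges = {}
    for a, b in combinations(sorted(mentions), 2):
        shared = mentions[a] & mentions[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def components(nodes, edges):
    """Connected components: a simple stand-in for Leiden, used only for illustration."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


edges = project(MENTIONS)
themes = components(MENTIONS, edges)
```

Every edge in `edges` carries the shared part or code that created it, so each group can be explained by pointing at those shared nodes. Feed the same grouping step arbitrary links and it will still return groups; they just will not mean anything.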
The Boundary Crossing
The finale answers a question that spans the documents and the warehouse. How does the agent cross that boundary?
- ❏ A. It merges the warehouse rows into Neo4j and traverses one graph
- ✓ B. It grounds candidates in the Neo4j document graph, then reads the live warehouse rows from BigQuery on the shared key, and joins the two
- ❏ C. It hands the whole question to Text2SQL to write one query over everything
- ❏ D. Vector search retrieves both the documents and the warehouse rows
Hint
Where do the warehouse rows live in this workshop - and did they ever move?
Solution
B is correct: The warehouse rows stay in BigQuery - they fail the four-pains test for migration (sync, performance, modeling, and security). The agent grounds in the document graph (which documents cover the code, which parts they name), reads the real outcomes from BigQuery on the shared part number, and joins the two in Python. The connections graph from Module 2 hands it the correct join paths.
Why others are wrong:
- A: That was the original instinct; copying the rows fails sync, performance, modeling, and security
- C: Text2SQL alone is layer two of the wall - quietly wrong on the multi-hop join chain
- D: Vector search returns passages, not a connected, computed answer
Recall Module 5: "Federate on the shared key."
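The join step in the solution above can be sketched as follows. This is a minimal illustration, not the workshop's agent code: the two lists are hard-coded stand-ins for what a document-graph query and a warehouse query might return, and every field name (`part_number`, `work_order`, `resolved`) and value is an assumption made for the example. No Neo4j or BigQuery client is called.

```python
from collections import defaultdict

# Stand-in for a document-graph result: parts named by documents that cover the fault code.
graph_candidates = [
    {"part_number": "P-100", "document": "bulletin-1"},
    {"part_number": "P-200", "document": "manual-sec-9"},
]

# Stand-in for rows read from the warehouse; the rows themselves never move into the graph.
warehouse_rows = [
    {"part_number": "P-100", "work_order": "WO-1", "resolved": True},
    {"part_number": "P-100", "work_order": "WO-2", "resolved": True},
    {"part_number": "P-200", "work_order": "WO-3", "resolved": False},
    {"part_number": "P-200", "work_order": "WO-4", "resolved": True},
    {"part_number": "P-999", "work_order": "WO-5", "resolved": True},
]


def federate(candidates, rows):
    """Join graph candidates to warehouse rows on part_number and rank by success rate."""
    by_part = defaultdict(list)
    for row in rows:
        by_part[row["part_number"]].append(row)

    ranked = []
    for candidate in candidates:
        matches = by_part.get(candidate["part_number"], [])
        if not matches:
            continue  # a candidate with no recorded outcomes cannot be ranked
        success_rate = sum(r["resolved"] for r in matches) / len(matches)
        ranked.append({
            "part_number": candidate["part_number"],
            "document": candidate["document"],
            "work_orders": len(matches),
            "success_rate": success_rate,
        })
    ranked.sort(key=lambda r: r["success_rate"], reverse=True)
    return ranked


ranked = federate(graph_candidates, warehouse_rows)
```

The graph decides which parts are candidates at all, so the warehouse row for `P-999` never enters the result even though it has a successful outcome. The warehouse supplies the outcomes, and the shared key is the only thing the two sides need to agree on.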
Why the Money Query Needed a Graph
The final query ranked candidate fixes by real repair outcomes on similar vehicles. Why was Text2SQL the wrong tool for this question?
- ❏ A. SQL cannot express multi-table joins
- ❏ B. The data was too large for a SQL engine
- ✓ C. The question implies a long chain of joins across both halves, where generated SQL tends to fail silently - plausible but subtly wrong
- ❏ D. Text2SQL cannot read part numbers
Hint
Layer two of Sam’s wall: what happens to generated SQL as the join chain grows - and how would you know?
Solution
C is correct: The question spans documents, sections, references, parts, work orders, and vehicles - five or more joins, half of them against tables derived from PDFs. Text2SQL is nondeterministic on chains like this and fails silently: the query runs and returns plausible rows that are subtly wrong. The connections graph hands the agent the exact join paths, so the SQL it writes against the warehouse is grounded, not guessed.
Why others are wrong:
- A: SQL can express the joins - the problem is reliably generating the right ones from natural language; the connections graph is what makes them reliable
- B: This dataset is tiny; scale was never the issue
- D: Part numbers are ordinary strings to any tool
Recall Modules 1 and 5: "Text2SQL is quietly wrong" and the federated finale.
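One way to picture "the connections graph hands the agent the exact join paths" is a lookup over known table-to-table links instead of a guess from natural language. The sketch below assumes a hypothetical connections graph: the table names loosely follow the chain listed in the solution, while the join keys and the `refs` table name are invented for illustration. A breadth-first search finds the chain, and the SQL is assembled from it mechanically.

```python
from collections import deque

# Hypothetical connections graph: (table, table) -> shared join key.
CONNECTIONS = {
    ("documents", "sections"): "document_id",
    ("sections", "refs"): "section_id",
    ("refs", "parts"): "part_number",
    ("parts", "work_orders"): "part_number",
    ("work_orders", "vehicles"): "vehicle_id",
}


def join_path(start, goal, connections):
    """Breadth-first search for the shortest chain of joins from start to goal."""
    adjacency = {}
    for (a, b), key in connections.items():
        adjacency.setdefault(a, []).append((b, key))
        adjacency.setdefault(b, []).append((a, key))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for nxt, key in adjacency.get(table, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(table, nxt, key)]))
    return None  # no known path: better to stop than to guess a join


def to_sql(start, path):
    """Render the join chain as SQL; every ON clause comes from the graph, none are guessed."""
    lines = [f"SELECT * FROM {start}"]
    for left, right, key in path:
        lines.append(f"JOIN {right} ON {left}.{key} = {right}.{key}")
    return "\n".join(lines)


path = join_path("documents", "vehicles", CONNECTIONS)
sql = to_sql("documents", path)
print(sql)
```

The chain from documents to vehicles comes out as five joins, matching the "five or more joins" in the solution. If a link is missing from the graph, `join_path` returns `None` instead of a plausible-looking query, which is the opposite of failing silently.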
Summary
Congratulations on completing AI on Your Lakehouse: Context Comes in Shapes, Not Queries!
You’ve successfully:
- Built a navigable document tree from parsed PDFs with shared-key links
- Surfaced repair themes with Leiden community detection
- Federated the document graph with the live BigQuery warehouse on shared keys, without copying its rows
- Written multi-hop Cypher that crosses the document-to-table boundary
- Seen how to port the pattern to your own lakehouse and agents
Continue learning:
- Neo4j & GenAI Fundamentals - retrievers and GraphRAG
- Community Detection - deeper into Leiden and Louvain
- Context Graphs: Agent Memory with Neo4j - persistent, explainable agent memory