Knowledge Check

Test your understanding of the three shapes and the questions they answer.

The Context Problem

Sam’s vector search returned passages that were semantically similar to the technician’s question, but it missed critical linked documents and records, so the copilot gave an unreliable answer. Which of the following is the best hypothesis to examine first?

❏ A. The embedding model was too small
✓ B. The answer needs a connected set of documents and records - not just chunked passages
❏ C. Vector search cannot index PDF files
❏ D. The PDFs were not chunked at the right size

Hint

The workshop calls this "right meaning, wrong shape" - what does the answer to Dani’s question actually look like?

Solution

B is correct: Dani’s question needs a connected set - the bulletin, the part it names, the work orders that used that part, and the vehicles they were on. Vector search returns disconnected passages ranked by similarity; no amount of embedding quality changes the shape of what it returns.

Why others are wrong:

A: A better model returns better-ranked paragraphs - still paragraphs
C: Vector search indexes parsed PDF text without difficulty
D: Chunking affects retrieval quality, not the fundamental shape mismatch

Recall Module 1: the three questions that break the stack - vector search hands back a pile of similar passages, not the connected set.

The Tree Shape

What does the (Library)-[:HAS*]->(Section) containment tree give an agent that keyword or vector search cannot?

❏ A. Faster text matching
❏ B. Automatic summarization of each document
✓ C. The ability to navigate library and document structure - tables of contents, chapters, and what a document covers
❏ D. Smaller storage requirements

Hint

Think about the first query you ran against the Falcon manual in Module 3.

Solution

C is correct: The variable-length pattern [:HAS*] walks the library’s structure, producing views like a table of contents or "every section under the Engine chapter". Search tools have no concept of "contains" - they can only rank fragments.

Why others are wrong:

A: The tree is about structure, not matching speed
B: Summarization is an LLM task; the tree tells the LLM what to read
D: Storing structure adds nodes and relationships - it does not reduce storage

Recall Module 3: The table-of-contents query over the Falcon manual.
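
As a rough illustration of what "navigating structure" means, the sketch below walks a small containment tree in plain Python. The tree contents and the helper names (HAS, table_of_contents, sections_under) are hypothetical stand-ins, not the workshop's data or its Cypher; the point is only that a variable-length walk yields a table of contents and "everything under a chapter", which ranking fragments cannot.

```python
# Hypothetical containment tree: parent title -> child titles.
# Each parent/child pair stands in for one HAS relationship.
HAS = {
    "Library": ["Falcon Manual"],
    "Falcon Manual": ["Engine", "Brakes"],
    "Engine": ["Cooling System", "Fuel System"],
    "Brakes": ["Brake Pads"],
}


def table_of_contents(root, depth=0):
    """Depth-first walk that returns one indented line per node under root."""
    lines = ["  " * depth + root]
    for child in HAS.get(root, []):
        lines.extend(table_of_contents(child, depth + 1))
    return lines


def sections_under(root):
    """Every descendant of root at any depth (the variable-length part of the pattern)."""
    found = []
    for child in HAS.get(root, []):
        found.append(child)
        found.extend(sections_under(child))
    return found


toc = table_of_contents("Library")
print("\n".join(toc))
print(sections_under("Engine"))
```

The same walk answers both "show me the table of contents" and "every section under the Engine chapter" because both are questions about containment, not similarity.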

Meaningful Themes

Why did Leiden’s document communities correspond to real repair themes instead of arbitrary clusters?

❏ A. Leiden is guaranteed to find meaningful clusters in any graph
❏ B. The gamma parameter forces communities to match business topics
✓ C. Every edge in the projection is a cross-reference between documents (LINKS_TO), so a dense cluster is documents that link to each other and converge on the same targets - a real repair topic
❏ D. The sections were manually tagged with topics before the workshop

Hint

The algorithm only sees nodes and links. What made each link exist in the first place?

Solution

C is correct: Community detection finds dense clusters - the meaning comes from the edges. Because the projection’s edges are the documents' cross-references (Leiden over LINKS_TO), a dense cluster can only mean "documents that keep citing each other and converging on the same references" - a real repair theme. (Identifiers like part numbers and trouble codes live in the section text and the BigQuery rows, not in :Part or :DTC glue nodes; the document graph is domain-agnostic.)

Why others are wrong:

A: On arbitrary links (or random similarity), the clusters would be arbitrary too
B: Gamma controls granularity (how many themes), not meaning
D: Nothing was tagged - surfacing unnamed patterns was the point

Recall Module 4: "Why these clusters mean something."
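
To make "the meaning comes from the edges" and "gamma controls granularity" concrete, here is a minimal sketch of a resolution-weighted modularity score, the kind of quality function that Leiden-style algorithms typically optimize. It is not Leiden itself: it only scores partitions that are supplied by hand. The six documents and their links are hypothetical.

```python
# Hypothetical LINKS_TO edges between six documents:
# two tightly linked triangles joined by a single bridge edge.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]


def modularity(communities, gamma=1.0):
    """Sum over communities of L_c / m - gamma * (D_c / 2m) ** 2."""
    m = len(EDGES)
    member = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)  # L_c: edges with both ends inside community c
    degree = [0] * len(communities)    # D_c: total degree of the nodes in community c
    for u, v in EDGES:
        degree[member[u]] += 1
        degree[member[v]] += 1
        if member[u] == member[v]:
            internal[member[u]] += 1
    return sum(l / m - gamma * (d / (2 * m)) ** 2 for l, d in zip(internal, degree))


two_themes = [{"a", "b", "c"}, {"d", "e", "f"}]   # follows the dense links
one_theme = [{"a", "b", "c", "d", "e", "f"}]      # everything merged
arbitrary = [{"a", "d", "e"}, {"b", "c", "f"}]    # ignores the links

for gamma in (0.1, 1.0):
    print(gamma,
          round(modularity(two_themes, gamma), 3),
          round(modularity(one_theme, gamma), 3))
print(round(modularity(arbitrary), 3))
```

In this toy graph, the partition that follows the links scores best at gamma = 1.0, a low gamma favors merging everything into one theme, and a grouping that ignores the links scores below zero. The score depends only on which edges exist, which is why the edges have to mean something.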

Answering With More Than One Shape

The finale answers a question that needs both the documents and the warehouse. How does the agent do it?

❏ A. It merges the warehouse rows into Neo4j and traverses one graph
✓ B. It grounds candidates in the Neo4j document graph, then reads the live warehouse rows from BigQuery on the shared key, and joins the two
❏ C. It hands the whole question to Text2SQL to write one query over everything
❏ D. Vector search retrieves both the documents and the warehouse rows

Hint

Where do the warehouse rows live in this workshop - and did they ever move?

Solution

B is correct: The warehouse rows stay in BigQuery - a system of record already owns them, so there is no reason to migrate. The agent grounds in the document graph (which documents cover the trouble code, which parts they name), reads the real outcomes from BigQuery on the shared part number, and joins the two. The connections graph from Module 2 hands it the correct join paths.

Why others are wrong:

A: That was the original instinct; copying the rows means keeping a second copy in sync with the system of record that already owns them
C: Text2SQL alone guesses the joins, and the guesses go quietly wrong as the multi-hop join chain grows
D: Vector search returns passages, not a connected, computed answer

Recall Module 6: it takes more than one shape.
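
The ground-read-join sequence can be sketched with in-memory stand-ins. Everything below is hypothetical: the two lists imitate the Neo4j document graph and the BigQuery rows, and the function names, documents, part numbers, and outcomes are invented for illustration. No real database client is called.

```python
from collections import defaultdict

# Hypothetical stand-in for the document graph:
# which documents cover a trouble code, and which part each one names.
DOCUMENT_GRAPH = [
    {"document": "Bulletin 12", "code": "P0301", "part_number": "IGN-100"},
    {"document": "Manual section 4.2", "code": "P0301", "part_number": "PLG-220"},
    {"document": "Bulletin 7", "code": "P0420", "part_number": "CAT-900"},
]

# Hypothetical stand-in for warehouse rows, which stay in the system of record.
WAREHOUSE_ROWS = [
    {"work_order": 1, "part_number": "IGN-100", "resolved": True},
    {"work_order": 2, "part_number": "IGN-100", "resolved": True},
    {"work_order": 3, "part_number": "PLG-220", "resolved": False},
    {"work_order": 4, "part_number": "CAT-900", "resolved": True},
]


def ground_candidates(code):
    """Step 1: find the parts named by the documents that cover this code."""
    parts = defaultdict(list)
    for row in DOCUMENT_GRAPH:
        if row["code"] == code:
            parts[row["part_number"]].append(row["document"])
    return dict(parts)


def read_outcomes(part_numbers):
    """Step 2: count warehouse outcomes for just those part numbers."""
    outcomes = {p: {"work_orders": 0, "resolved": 0} for p in part_numbers}
    for row in WAREHOUSE_ROWS:
        if row["part_number"] in outcomes:
            outcomes[row["part_number"]]["work_orders"] += 1
            outcomes[row["part_number"]]["resolved"] += int(row["resolved"])
    return outcomes


def answer(code):
    """Step 3: join the two result sets on the shared part number."""
    candidates = ground_candidates(code)
    outcomes = read_outcomes(candidates)
    return [
        {"part_number": p, "documents": docs, **outcomes[p]}
        for p, docs in candidates.items()
    ]


for row in answer("P0301"):
    print(row)
```

Neither side is copied into the other: the graph narrows the question to a handful of part numbers, and the warehouse is read only for those keys.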

Why the Joins Needed a Graph

To connect records in a real warehouse, the agent joins across many tables. Why is Text2SQL alone the wrong tool at that scale, and what makes the connections shape more reliable?

❏ A. SQL cannot express multi-table joins
❏ B. The data is too large for a SQL engine
✓ C. As the join chain grows across many look-alike tables, generated SQL guesses the wrong joins and fails silently; the connections graph hands the agent the exact foreign-key join paths, so the SQL is grounded, not guessed
❏ D. Text2SQL cannot read part numbers

Hint

What happens to generated SQL as the schema grows large and the tables look alike - and how would you even know the answer was wrong?

Solution

C is correct: On a production warehouse with hundreds of wide, near-identical tables, the schema does not fit in a prompt. Text2SQL brute-forces the metadata, picks a wrong-but-plausible join, and returns rows that are subtly wrong. One independent benchmark reports a 78.57% error rate on queries touching four or more tables [1]. The connections graph retrieves the right tables and join paths by meaning before any SQL is generated, so the agent follows the foreign keys instead of guessing them.

Why others are wrong:

A: SQL can express the joins - the problem is reliably generating the right ones from natural language; the connections graph is what makes them reliable
B: The number of rows was never the issue; the breadth of the schema is
D: Part numbers are ordinary strings to any tool

(AutoFix’s six-table warehouse is small enough that Text2SQL copes - this failure shows up at production scale, which is exactly where the connections shape earns its keep.)

Recall Modules 2 and 6: the connections shape and the finale’s estate questions.
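
A minimal sketch of "grounded, not guessed" joins: treat the foreign keys as a graph, find the path between two tables with a breadth-first search, and emit the JOIN clauses from that path. The table names, columns, and helper functions are hypothetical, and the sketch covers only the path-finding step, not the retrieval of tables by meaning.

```python
from collections import deque

# Hypothetical foreign keys: (table, column, referenced table, referenced column).
FOREIGN_KEYS = [
    ("work_orders", "vehicle_id", "vehicles", "id"),
    ("work_order_parts", "work_order_id", "work_orders", "id"),
    ("work_order_parts", "part_number", "parts", "part_number"),
    ("vehicles", "customer_id", "customers", "id"),
]


def join_path(start, goal):
    """Breadth-first search over foreign keys; returns the shortest list of join steps."""
    neighbours = {}
    for a, a_col, b, b_col in FOREIGN_KEYS:
        condition = f"{a}.{a_col} = {b}.{b_col}"
        neighbours.setdefault(a, []).append((b, condition))
        neighbours.setdefault(b, []).append((a, condition))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, steps = queue.popleft()
        if table == goal:
            return steps
        for nxt, condition in neighbours.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [(nxt, condition)]))
    return None  # no foreign-key path connects the two tables


def to_sql(start, goal):
    """Build the FROM/JOIN skeleton from the foreign-key path instead of guessing it."""
    steps = join_path(start, goal)
    if steps is None:
        raise ValueError(f"no foreign-key path from {start} to {goal}")
    lines = [f"SELECT * FROM {start}"]
    lines += [f"JOIN {table} ON {condition}" for table, condition in steps]
    return "\n".join(lines)


print(to_sql("parts", "vehicles"))
```

Because every ON clause comes from a declared foreign key, a missing path raises an error instead of producing a plausible but wrong join.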

Summary

Congratulations on completing AI on Your Lakehouse: Context Comes in Shapes, Not Queries!

You’ve successfully:

Built a navigable document tree from parsed PDFs with shared-key links
Surfaced repair themes with Leiden community detection
Federated the BigQuery warehouse with the graph, with nothing migrated
Seen how to port the pattern to your own lakehouse and agents

Continue learning:

Neo4j & GenAI Fundamentals - retrievers and GraphRAG
Community Detection - deeper into Leiden and Louvain
Context Graphs: Agent Memory with Neo4j - persistent, explainable agent memory

1. Independent Text2SQL benchmarks: Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation (Luo et al., arXiv:2510.24762, 2025) reports 78.57% error on queries touching four or more tables - https://arxiv.org/abs/2510.24762; Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows (arXiv:2411.07763, ICLR 2025) reports frontier models at 10-17% on real enterprise schemas - https://arxiv.org/abs/2411.07763.
