In this lesson, you will learn how to load embeddings into a Neo4j database.
Questions and Answers Dataset
During this module, you will use a dataset of questions and answers from Quora.
The dataset contains 1000 random questions and answers.
The original Quora dataset is unfiltered and contains questions and answers that some may find offensive or inappropriate. The dataset used in this course has been filtered for sensitive content, but some content that you find inappropriate may remain. Please be aware of this when working with the dataset.
The dataset was filtered by asking an LLM (OpenAI’s GPT-4) to analyze the text for any "sensitive content". You can view the code that filtered the data in the llm-vectors-unstructured repository.
The OpenAI text-embedding-ada-002 model was used to create embeddings for the questions and answers in the dataset.
Using these embeddings, you can find similar questions and answers.
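As a rough illustration of what "similar" means here, the sketch below compares embeddings using cosine similarity, a common measure for text embeddings. The cosine_similarity helper and the three-dimensional vectors are invented for this example; real text-embedding-ada-002 vectors have 1536 dimensions, and this lesson does not prescribe a particular similarity function.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values, not taken from the dataset).
question = [0.1, 0.2, 0.3]
close_answer = [0.1, 0.2, 0.25]
distant_answer = [-0.3, 0.1, -0.2]

close_score = cosine_similarity(question, close_answer)
distant_score = cosine_similarity(question, distant_answer)
```

A score near 1 means the two vectors point in almost the same direction, so close_score should be higher than distant_score.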
The Quora-QuAD-1000-embeddings.csv file contains the embeddings for the questions and answers in the dataset.
The file has the following structure:
question,answer,question_embedding,answer_embedding
"The question","The answer","[0.1, 0.2, 0.3, ...]","[0.4, 0.5, 0.6, ...]"
The solutions/quora_embeddings.py program in the llm-vectors-unstructured repository created the embeddings by calling the OpenAI API for each question and answer, then adding the embeddings to the CSV file.
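The repository program is not reproduced here, but the general shape of the process can be sketched as a round trip: embed each question and answer, write the vectors to CSV as JSON array strings, then decode them when reading the file back. Everything named below is an assumption made for illustration: fake_embed is a stand-in for the real OpenAI API call, and write_embeddings_csv and read_embeddings_csv are hypothetical helpers, not functions from the repository.

```python
import csv
import io
import json

def fake_embed(text):
    """Hypothetical stand-in for an embedding API call; returns a tiny vector."""
    return [float(len(text)), float(text.count(" "))]

def write_embeddings_csv(pairs, out, embed=fake_embed):
    """Write question/answer pairs and their embeddings in the file's column order."""
    writer = csv.writer(out)
    writer.writerow(["question", "answer", "question_embedding", "answer_embedding"])
    for question, answer in pairs:
        writer.writerow([
            question,
            answer,
            json.dumps(embed(question)),  # stored as a JSON array string
            json.dumps(embed(answer)),
        ])

def read_embeddings_csv(source):
    """Yield rows with both embedding columns decoded into lists of floats."""
    for row in csv.DictReader(source):
        row["question_embedding"] = json.loads(row["question_embedding"])
        row["answer_embedding"] = json.loads(row["answer_embedding"])
        yield row

# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_embeddings_csv([("What is a graph?", "Nodes and relationships.")], buffer)
buffer.seek(0)
rows = list(read_embeddings_csv(buffer))
```

Storing each vector as a quoted JSON array keeps it in a single CSV column even though the array itself contains commas.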
Load into Neo4j
You will load the data into two nodes, Question and Answer, with a relationship, ANSWERED_BY. The Question and Answer nodes will store the original text and an embedding as properties.
Review the following Cypher statement to load the data into Neo4j and create the nodes and relationships:
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/llm-vectors-unstructured/Quora-QuAD-1000-embeddings.csv' AS row
MERGE (q:Question {text: row.question})
WITH row, q
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
MERGE (a:Answer {text: row.answer})
WITH row, a, q
CALL db.create.setNodeVectorProperty(a, 'embedding', apoc.convert.fromJsonList(row.answer_embedding))
MERGE (q)-[:ANSWERED_BY]->(a)
You should be able to identify:
- The file is loaded using the LOAD CSV clause.
- The Question and Answer nodes are created using the MERGE clause.
- The embedding property is set using the db.create.setNodeVectorProperty procedure.
- The apoc.convert.fromJsonList function converts the embedding string to a list of numbers.
- The ANSWERED_BY relationship is created between the Question and Answer nodes.
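To see why MERGE suits this load, the following in-memory sketch mimics the statement's match-or-create behaviour with plain Python dictionaries. It is an analogy only: the merge_node helper, the data structures, and the sample rows are invented for illustration and do not touch Neo4j.

```python
import json

nodes = {}             # (label, text) -> property dict
relationships = set()  # (question key, "ANSWERED_BY", answer key)

def merge_node(label, text):
    """Return the key of the node for (label, text), creating it if absent."""
    key = (label, text)
    nodes.setdefault(key, {"text": text})
    return key

row = {
    "question": "Q1",
    "answer": "A1",
    "question_embedding": "[0.1, 0.2]",
    "answer_embedding": "[0.3, 0.4]",
}
# The same row twice: merging must not create duplicates.
rows = [row, dict(row)]

for r in rows:
    q = merge_node("Question", r["question"])
    # Decode the JSON string into a list, as apoc.convert.fromJsonList does in Cypher.
    nodes[q]["embedding"] = json.loads(r["question_embedding"])
    a = merge_node("Answer", r["answer"])
    nodes[a]["embedding"] = json.loads(r["answer_embedding"])
    relationships.add((q, "ANSWERED_BY", a))
```

Processing the duplicate row leaves the sketch with one Question, one Answer, and one relationship, which is the same reason re-running the MERGE-based load does not duplicate data.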
Run the statement to load the data into Neo4j.
You can check that the data was loaded correctly by viewing the Question and Answer nodes:
MATCH (q:Question)-[r:ANSWERED_BY]->(a:Answer)
RETURN q,r,a
LIMIT 100
You should see Question and Answer nodes connected by the ANSWERED_BY relationship.
Select a node to view the text and embedding properties.
Validate Results
Once you have imported the data, click the Check Database button to verify that the task is complete.
Hint
Run the Cypher statement to load the CSV data and create the Question and Answer nodes.
Solution
Run this Cypher statement to create the Question and Answer nodes:
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/llm-vectors-unstructured/Quora-QuAD-1000-embeddings.csv' AS row
MERGE (q:Question {text: row.question})
WITH row, q
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
MERGE (a:Answer {text: row.answer})
WITH row, a, q
CALL db.create.setNodeVectorProperty(a, 'embedding', apoc.convert.fromJsonList(row.answer_embedding))
MERGE (q)-[:ANSWERED_BY]->(a)
Lesson Summary
In this lesson, you loaded a dataset of questions and answers into a Neo4j database.
In the next lesson, you will learn how to create vector indexes to query embeddings.