In this lesson, you will learn how to load embeddings into a Neo4j database.
Questions and Answers Dataset
During this module, you will use a dataset of questions and answers from Quora.
The dataset contains 1000 random questions and answers.
The original Quora dataset is unfiltered and contains questions and answers that some may find offensive or inappropriate. The dataset used in this course has been filtered for sensitive content, but some content that you find inappropriate may remain. Please keep this in mind when working with the dataset.
The dataset was filtered by asking an LLM (OpenAI’s GPT-4) to analyze the text for any "sensitive content". You can view the code that filtered the data in the llm-vectors-unstructured repository.
The OpenAI text-embedding-ada-002 model was used to create embeddings for the questions and answers in the dataset.
Using these embeddings, you can find similar questions and answers.
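As a rough illustration of what "similar" means for embeddings, the sketch below compares vectors using cosine similarity, a common measure for this purpose. The three-dimensional vectors and the function name are made up for the example; the real text-embedding-ada-002 embeddings have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not taken from the dataset.
question = [0.1, 0.2, 0.3]
similar = [0.11, 0.19, 0.31]
unrelated = [0.9, -0.4, 0.0]

# A score closer to 1 means the vectors point in nearly the same direction.
print(cosine_similarity(question, similar))
print(cosine_similarity(question, unrelated))
```

The first score should be close to 1 and the second close to 0, which is the sense in which one text is "more similar" than another.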
The Quora-QuAD-1000-embeddings.csv file contains the embeddings for the questions and answers in the dataset.
The file has the following structure:
question,answer,question_embedding,answer_embedding
"The question","The answer","[0.1, 0.2, 0.3, ...]","[0.4, 0.5, 0.6, ...]"
The solutions/quora_embeddings.py program in the llm-vectors-unstructured repository created the embeddings by calling the OpenAI API for each question and answer, then adding the embeddings to the CSV file.
Load into Neo4j
You will load the data as two types of node, Question and Answer, connected by an ANSWERED_BY relationship. Each Question and Answer node stores the original text and its embedding as properties.
Review the following Cypher statement to load the data into Neo4j and create the nodes and relationships:
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/llm-vectors-unstructured/Quora-QuAD-1000-embeddings.csv' AS row
MERGE (q:Question{text:row.question})
WITH row,q
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
MERGE (a:Answer{text:row.answer})
WITH row,a,q
CALL db.create.setNodeVectorProperty(a, 'embedding', apoc.convert.fromJsonList(row.answer_embedding))
MERGE (q)-[:ANSWERED_BY]->(a)
You should be able to identify the following:
- The file is loaded using the LOAD CSV clause.
- The Question and Answer nodes are created using the MERGE clause.
- The embedding property is set using the db.create.setNodeVectorProperty procedure.
- The apoc.convert.fromJsonList function converts the embedding string to a list of numbers.
- The ANSWERED_BY relationship is created between the Question and Answer nodes.
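The string-to-list conversion is the step most easily overlooked: in the CSV file each embedding is a single quoted string, not a list. The sketch below is a Python analogue of that step, not how APOC is implemented. It parses a made-up row in the same shape as the file (shortened to three-dimensional vectors) and decodes the embedding columns with json.loads. The sample data and the parse_rows helper are hypothetical.

```python
import csv
import io
import json

# Hypothetical sample in the same shape as Quora-QuAD-1000-embeddings.csv,
# shortened to three-dimensional vectors for readability.
SAMPLE_CSV = '''question,answer,question_embedding,answer_embedding
"What is a graph?","A set of nodes and relationships.","[0.1, 0.2, 0.3]","[0.4, 0.5, 0.6]"
'''


def parse_rows(csv_text):
    """Yield one dict per CSV row with the embedding columns decoded to lists."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {
            "question": row["question"],
            "answer": row["answer"],
            # Each embedding arrives as a JSON array string; json.loads
            # turns it into a list of floats.
            "question_embedding": json.loads(row["question_embedding"]),
            "answer_embedding": json.loads(row["answer_embedding"]),
        }


rows = list(parse_rows(SAMPLE_CSV))
print(type(rows[0]["question_embedding"]).__name__)
print(rows[0]["question_embedding"])
```

Without this decoding step, the embedding would be stored as plain text rather than as a list of numbers.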
Run the statement to load the data into Neo4j.
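If you would rather run the statement from a script than from a query window, one option is the official Neo4j Python driver. This is a minimal sketch under several assumptions: the neo4j package is installed and recent enough to provide execute_query, APOC is available on the server, and the function name, default database name, and connection details are placeholders rather than anything defined by this course.

```python
# The same Cypher statement shown above, kept as a module-level constant.
LOAD_STATEMENT = """
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/llm-vectors-unstructured/Quora-QuAD-1000-embeddings.csv' AS row
MERGE (q:Question{text:row.question})
WITH row,q
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
MERGE (a:Answer{text:row.answer})
WITH row,a,q
CALL db.create.setNodeVectorProperty(a, 'embedding', apoc.convert.fromJsonList(row.answer_embedding))
MERGE (q)-[:ANSWERED_BY]->(a)
"""


def load_quora_embeddings(uri, user, password, database="neo4j"):
    """Run the load statement and return the number of nodes created."""
    # Imported here so the module can be read without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        # Fail early with a clear error if the server is unreachable.
        driver.verify_connectivity()
        _, summary, _ = driver.execute_query(LOAD_STATEMENT, database_=database)
        return summary.counters.nodes_created
```

You would call it with your own connection details, for example load_quora_embeddings("neo4j://localhost:7687", "neo4j", "your-password"). Because the statement uses MERGE, running it a second time should match the existing nodes and report no new ones.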
You can check the data was loaded correctly by viewing the Question and Answer nodes:
MATCH (q:Question)-[r:ANSWERED_BY]->(a:Answer)
RETURN q,r,a
LIMIT 100
You should see Question and Answer nodes connected by the ANSWERED_BY relationship.
Select a node to view the text and embedding properties.
Validate Results
Once you have imported the data, click the Check Database button to verify that the task is complete.
Hint
Run the Cypher statement to load the CSV data and create the Question and Answer nodes.
Solution
Run this Cypher statement to create the Question and Answer nodes:
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/llm-vectors-unstructured/Quora-QuAD-1000-embeddings.csv' AS row
MERGE (q:Question{text:row.question})
WITH row,q
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
MERGE (a:Answer{text:row.answer})
WITH row,a,q
CALL db.create.setNodeVectorProperty(a, 'embedding', apoc.convert.fromJsonList(row.answer_embedding))
MERGE (q)-[:ANSWERED_BY]->(a)
Lesson Summary
In this lesson, you loaded a dataset of questions and answers into a Neo4j database.
In the next lesson, you will learn how to create vector indexes to query embeddings.