Split Text Into Chunks and Create Embeddings

Your next task is to split a piece of text into chunks and then create an embedding for each chunk. You will use the Neo4j GraphRAG package for Python and OpenAI to do this.

Getting Started

Open the 1-knowledge-graphs-vectors\create_and_embed_chunks.py file in your code editor.

python
import asyncio

from dotenv import load_dotenv
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import (
    FixedSizeSplitter,
)

load_dotenv()

text = """
London is the capital and largest city of both England and the United Kingdom, with a
population of 8,866,180 in 2022. The wider metropolitan area is the largest in Western
Europe, with a population of 14.9 million. London stands on the River Thames in
southeast England, at the head of a 50-mile (80 km) estuary down to the North Sea, and
has been a major settlement for nearly 2,000 years. Its ancient core and financial
centre, the City of London, was founded by the Romans as Londinium and has retained its
medieval boundaries. The City of Westminster, to the west of the City of London, has
been the centuries-long host of the national government and parliament. London grew
rapidly in the 19th century, becoming the world's largest city at the time. Since the
19th century, the name "London" has referred to the metropolis around the City of
London, historically split between the counties of Middlesex, Essex, Surrey, Kent, and
Hertfordshire, which since 1965 has largely comprised the administrative area of Greater
London, governed by 33 local authorities and the Greater London Authority.
"""

# 1. Split text into chunks


# 2. Create embeddings from chunks

Creating Chunks

We’ll use the first paragraph of the Wikipedia article on London for this challenge. Feel free to substitute any other text if you’d like.

To split the text, we need a text splitter from the Neo4j GraphRAG package for Python.

Here we’ll use the FixedSizeSplitter, which splits text into fixed-size chunks of chunk_size characters, with an overlap of chunk_overlap characters between consecutive chunks.

This is a very simple splitter; our package also supports more advanced splitters from LangChain and LlamaIndex.

Add the following to your script and experiment with the chunk_size and chunk_overlap parameters. How does this change the text chunks that are produced?

python
text_splitter = FixedSizeSplitter(chunk_size=100, chunk_overlap=10)
chunks = asyncio.run(text_splitter.run(text=text)).chunks
print(chunks)
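
To build intuition for how chunk_size and chunk_overlap interact, the basic idea can be sketched in plain Python. This is a simplified illustration, not the FixedSizeSplitter implementation: fixed_size_chunks is a hypothetical helper, and the library may place chunk boundaries differently, for example to avoid cutting words in half.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share chunk_overlap characters. Hypothetical helper
    for illustration only; not part of neo4j_graphrag.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):  # the window has reached the end of the text
            break
        start += step
    return chunks


sample = "abcdefghij"
print(fixed_size_chunks(sample, chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
print(fixed_size_chunks(sample, chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
```

In this sketch, each chunk starts chunk_size - chunk_overlap characters after the previous one, so a larger overlap repeats more text between neighbouring chunks and tends to produce more chunks from the same input.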

Creating Embeddings

Next we’ll create embeddings for our chunks.

In order to do this we need an embedding model.

We can use OpenAI’s text-embedding-3-large model as our embedding model.

Add the following to your script and run it to view the embedding created for the first chunk.

python
embedder = OpenAIEmbeddings(model="text-embedding-3-large")
for chunk in chunks[:1]:
    # Each chunk is a TextChunk object; embed its text content
    print(embedder.embed_query(chunk.text))
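
To embed every chunk rather than only the first, a loop along the following lines could work. This is a sketch under assumptions: embed_chunks is a hypothetical helper (not part of the package), and it assumes each chunk exposes its string content as chunk.text and that the embedder provides an embed_query method that returns a list of floats. Keep in mind that it makes one embedding request per chunk.

```python
from typing import Any, Iterable


def embed_chunks(embedder: Any, chunks: Iterable[Any]) -> list[tuple[str, list[float]]]:
    """Return (text, embedding) pairs, one per chunk.

    Hypothetical helper. Assumes embedder.embed_query(text) returns a list
    of floats and that each chunk has a .text attribute.
    """
    pairs = []
    for chunk in chunks:
        vector = embedder.embed_query(chunk.text)  # one request per chunk
        pairs.append((chunk.text, vector))
    return pairs
```

In the script, this could be called as embed_chunks(embedder, chunks). Printing the length of each returned vector shows the dimensionality of the embeddings.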

Continue

When you are ready, you can move on to the next task.

Summary

You learned how to use the Neo4j GraphRAG package for Python to split text into chunks and create an embedding for each chunk.