Semantic Search, Vectors, and Embeddings

Machine learning and natural language processing (NLP) often use vectors and embeddings to represent and understand data.

Semantic search aims to understand the intent and contextual meaning of a search phrase, rather than focusing on individual keywords.

Traditional keyword search often depends on exact-match keywords or proximity-based algorithms that find similar words.

For example, if you input "apple" in a traditional search, you might predominantly get results about the fruit.

However, in a semantic search, the engine tries to gauge the context: Are you searching about the fruit, the tech company, or something else?

An apple in the middle, with tech icons on the left and food icons on the right

What are Vectors?

A vector is simply a list of numbers. For example, the vector `[1, 2, 3]` is a list of three numbers and could represent a point in three-dimensional space.

A diagram showing a 3D representation of the vector [1, 2, 3] on the x, y, and z axes
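As a minimal sketch in Python, a vector like this is just an ordered list of numbers:

```python
# A vector is just an ordered list of numbers.
# This one could represent a point in 3D space.
vector = [1, 2, 3]

# Each position (dimension) holds one coordinate.
x, y, z = vector
print(f"x={x}, y={y}, z={z}")  # x=1, y=2, z=3
print(len(vector))             # 3 dimensions
```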

You can use vectors to represent many different types of data, including text, images, and audio.

In machine learning and NLP, it is common to use vectors with hundreds or even thousands of dimensions.

What are Embeddings?

When referring to vectors in the context of machine learning and NLP, the term "embedding" is typically used. An embedding is a vector that represents the data in a useful way for a specific task.

Each dimension in a vector can represent a particular semantic aspect of the word or phrase. When multiple dimensions are combined, they can convey the overall meaning of the word or phrase.

For example, the word "apple" might be represented by an embedding with the following dimensions:

  • fruit

  • technology

  • color

  • taste

  • shape
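Purely for intuition, you could sketch such an embedding as a mapping from those aspects to numeric scores. Real embedding dimensions are learned by a model and carry no human-readable labels; the keys and values below are invented:

```python
# A purely hypothetical, hand-made "embedding" for "apple".
# Real embeddings have hundreds or thousands of unlabelled,
# learned dimensions; the named keys here are for intuition only.
apple = {
    "fruit":      0.9,   # strongly associated with the fruit
    "technology": 0.7,   # also associated with the tech company
    "color":      0.5,
    "taste":      0.6,
    "shape":      0.4,
}

# Taken together, the values convey an overall "meaning".
print(list(apple.values()))  # [0.9, 0.7, 0.5, 0.6, 0.4]
```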

You can create embeddings in various ways, but one of the most common methods is to use a large language model.

For example, the embedding for the word "apple" might be [0.0077788467, -0.02306925, -0.007360777, -0.027743412, -0.0045747845, 0.01289164, -0.021863015, -0.008587573, 0.01892967, -0.029854324, -0.0027962727, 0.020108491, -0.004530236, 0.009129008, …] and so on.
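One way to obtain an embedding like this is through an embeddings API. The following is a minimal sketch using the OpenAI Python client (v1+); the model name is an assumption, and other providers expose similar endpoints:

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-ada-002",  # assumed model; any embedding model works
    input="apple",
)

embedding = response.data[0].embedding
print(len(embedding))  # 1536 dimensions for this model
print(embedding[:5])   # the first few values of the embedding
```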

You can use the distance or angle between vectors to gauge the semantic similarity between words or phrases.

A 3-dimensional chart illustrating the distance between the vectors for the words "apple" and "fruit"

Words with similar meanings or contexts will have vectors that are close together, while unrelated words will be farther apart.

This principle is employed in semantic search to find contextually relevant results for a user’s query.
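A common measure of this is cosine similarity, which compares the angle between two vectors. Below is a minimal sketch with invented three-dimensional vectors; real embeddings would have far more dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Tiny, made-up "embeddings" for illustration only.
apple = [0.9, 0.1, 0.2]
fruit = [0.8, 0.2, 0.1]
car   = [0.1, 0.9, 0.7]

print(cosine_similarity(apple, fruit))  # high (~0.99): similar meaning
print(cosine_similarity(apple, car))    # lower (~0.30): unrelated
```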


Summary

You learned about semantic search, vectors, and embeddings.

Next, you will use a Neo4j vector index to find similar data.