Vectors and Embeddings

You have learned about embeddings and used them with vector indexes to find similar data.

In this lesson, you will learn how to generate embeddings.

Embedding Models

You would typically use an embedding model to generate an embedding for a piece of data. As you have learned, the data could be anything - text, images, music, video, or any other data type.

Embedding models are widely available for use with different data types. For example, you can use a text embedding model to generate embeddings for text data or an image embedding model to generate embeddings for image data.

You can train your own embedding model, but it is usually easier to use a pre-trained one. Pre-trained models have already been trained on large datasets and are available for many data types.

Here are some well-known embedding models and types:

  • Word2Vec - A model for generating word embeddings, turning words into vectors based on their context.

  • FastText - An extension of Word2Vec, FastText treats each word as composed of character n-grams, allowing it to generate embeddings for out-of-vocabulary words.

  • Node2Vec - An algorithm that computes embeddings based on random walks through a graph.

  • GPT (Generative Pre-trained Transformer) - A series of transformer-based models (e.g. GPT-4) designed for generating text, which you can also use for generating embeddings.

  • Universal Sentence Encoder - Designed to convert sentences into embeddings.

  • Doc2Vec - An extension of the Word2Vec model to generate embeddings for entire documents or paragraphs, capturing the overall meaning.

  • ResNet (Residual Networks) - Primarily used in image processing, ResNet models can also be used to generate embeddings for images that capture visual features and patterns.

  • VGGNet - VGGNet models are used in image processing to generate embeddings for images, capturing various levels of visual information.
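
The FastText entry in the list above mentions character n-grams. As a rough illustration of that idea only, the following Python sketch splits a word into overlapping 3-character pieces, using `<` and `>` as word-boundary markers. The function name `char_ngrams` is hypothetical, and the sketch leaves out what FastText does beyond this step (it uses a range of n-gram lengths and learns a vector for each n-gram).

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the character n-grams of a word, with < and > marking its boundaries."""
    marked = f"<{word}>"
    # Slide a window of width n across the marked word.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("where"))
# → ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares n-grams with known words, a model built on these pieces can assemble an embedding for it.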

Creating Embeddings

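Each embedding model is different and captures different aspects of the data. As such, you cannot meaningfully compare embeddings created by different models: the vectors can differ in length, and even when the lengths match, the individual dimensions do not carry the same meaning.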

You need to use the same model to generate the embeddings for the data you want to compare.
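
To make the comparison concrete, here is a minimal Python sketch of cosine similarity, one common way to compare two embeddings. The function name and the toy 3-dimensional vectors are assumptions for illustration; real embeddings are much longer (text-embedding-ada-002 returns 1536 values). The length check catches only the most obvious sign of mixed models. It cannot detect two different models that happen to produce vectors of the same length.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError(
            f"dimension mismatch ({len(a)} vs {len(b)}): "
            "were these embeddings created by the same model?"
        )
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embeddings produced by the same model.
plot_a = [0.1, 0.8, 0.3]
plot_b = [0.2, 0.7, 0.4]
print(cosine_similarity(plot_a, plot_b))
```

A result close to 1 means the vectors point in nearly the same direction, which is only meaningful when both came from the same model.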

Many embedding models provide APIs that you can use to generate embeddings for your data.

In a previous lesson, you looked at embeddings for movie plots. OpenAI’s text-embedding-ada-002 model generated those embeddings.

The code to generate the embeddings loaded the text for each movie plot and sent it to the model to generate the embeddings.
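
The course's actual script is not shown here, but the process it describes could look something like the following Python sketch. It assumes the `openai` package (version 1 or later) and an `OPENAI_API_KEY` environment variable for the real call. The names `openai_embed`, `embed_plots`, and `fake_embed`, along with the sample plots, are hypothetical. The offline stand-in lets the sketch run without network access and does not produce meaningful embeddings.

```python
from typing import Callable

Embedder = Callable[[str], list[float]]


def openai_embed(text: str) -> list[float]:
    """Request one embedding from OpenAI (needs the openai package and an API key)."""
    from openai import OpenAI  # imported lazily so the sketch loads without the package

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding


def fake_embed(text: str) -> list[float]:
    """Offline stand-in for a model. NOT a real embedding."""
    return [float(len(text)), float(text.count(" "))]


def embed_plots(plots: dict[str, str], embed: Embedder) -> dict[str, list[float]]:
    """Send each plot to the embedding function and collect the vectors by title."""
    return {title: embed(plot) for title, plot in plots.items()}


# Hypothetical sample data; the course dataset is not reproduced here.
plots = {
    "Movie A": "A young wizard discovers a hidden school of magic.",
    "Movie B": "Two detectives hunt a serial killer in a rainy city.",
}

embeddings = embed_plots(plots, fake_embed)
print({title: len(vector) for title, vector in embeddings.items()})
```

Passing `openai_embed` instead of `fake_embed` would send each plot to the model. Keeping the embedding function as a parameter makes it easy to ensure every plot goes through the same model.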

Check Your Understanding

Comparing embeddings

True or False - you can compare embeddings from different models to find similarities in unstructured data.

  • ❏ True

  • ✓ False

Hint

Each embedding model captures different aspects of the data.

Solution

The statement is False. Each embedding model is different and captures different aspects of the data. As such, you cannot compare embeddings created by different models.

Lesson Summary

In this lesson, you learned about embedding models and how they generate embeddings for different data types.

In the next lesson, you will use Cypher to load a dataset of embeddings into Neo4j.