Voyage AI with MongoDB / What are embeddings?

Imagine asking a friend for something to quench your thirst on a hot day, and they hand you a cold lemonade, even though you never said lemonade. That's semantic understanding. But how would a computer get the same result when it can only match exact words? You'd have to say lemonade precisely, or it would fail. Vector embeddings help machines bridge this gap.

By the end of this video, you'll understand what vector embeddings are, why they matter, and how they transform search capabilities. We'll explore how vector embeddings work and how they enable semantic similarity: the ability to find related concepts even when exact words aren't used. We'll also see how embedding models generate vector embeddings from data and how these numerical representations capture meaning. Finally, you'll learn how we measure similarity in order to perform a semantic search.

So, what are vector embeddings? They're numerical representations of data, like words, sentences, or even images, that capture their meaning in a way that allows for semantic understanding. Think of it this way: vector embeddings turn words and sentences into coordinates in a meaning space, like putting every concept on a map where similar ideas cluster together. King sits near queen, puppy near kitten, and cold drink on a hot day near lemonade. So instead of matching keywords, we're mapping meaning.

So how do we translate a word or phrase into a vector embedding? This is where embedding models come in. Embedding models are specialized AI models trained on a vast corpus of data. Through training, the model discovers patterns in the data and encodes them into numerical representations. While you can use any embedding model with MongoDB, here we'll focus on Voyage AI's embedding models. Voyage AI offers many embedding models for different use cases. To access these models, we'll use the Atlas Embedding and Reranking API, which lets us tap into the latest Voyage AI models for generating embeddings.
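The gap described above is easy to demonstrate: an exact keyword matcher either misses the lemonade entirely or latches onto the wrong word. A minimal sketch (the document strings are purely illustrative):

```python
# Exact keyword matching: a document only matches if it shares a literal word
# with the query. "cold lemonade" shares none, so the right answer is missed,
# while "hot coffee" matches on the incidental word "hot".
docs = ["cold lemonade", "hot coffee", "sparkling water"]
query = "something to quench your thirst on a hot day"

keyword_hits = [d for d in docs if any(word in d.split() for word in query.split())]
print(keyword_hits)  # → ['hot coffee']
```

This is exactly the failure mode that mapping meaning, rather than matching words, is meant to fix.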
You have two options: make direct HTTPS requests to the API endpoint, or use the Python SDK. For convenience, we'll use the Python SDK, since Python is the common language for AI practitioners.

Now let's see how this works in practice. We'll start with an example using the most recent general-purpose model, voyage-4. When we pass the text about Roman architecture to the embedding model, we receive an array of numbers. We'll print the length of the embeddings array to see the number of dimensions, which we'll learn about in a bit. Each number is a coordinate that helps position this phrase in meaning space.

Here's the interesting part: these coordinates aren't plotted on a 2D graph. They exist in high-dimensional space, often hundreds or even thousands of dimensions. This allows them to capture complex relationships and nuances in meaning in a way that keyword matching cannot.

But hundreds of dimensions? How can we even think about that? Let's start simple and build up using a ride-hailing application example. Imagine we want to represent different ride types. We could start with two dimensions plotted on a 2D map: one dimension for vehicle size, from compact to large, and another for service tier, from economy to premium. For example, a family car might sit at coordinates [0.5, 0.2], while a luxury vehicle could be at [0.35, 0.8], and a spacious ride for group travel at [0.9, 0.4].

Notice that in these examples, each dimension uses a scale from 0 to 1.0, where 0 represents the minimum and 1.0 the maximum for that particular feature. Vehicle size goes from compact (0) to large (1.0), so a smaller number means a smaller vehicle. Service tier goes from economy (0) to premium (1.0), so economy rides have lower values and premium rides have higher values. This scaling is intuitive, but the direction could be reversed depending on the application. The important part is consistency in understanding what each value represents.
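The SDK call for generating an embedding might look like the following. This is a hedged sketch, not a verbatim reproduction of the video's code: it assumes the `voyageai` Python package is installed and a `VOYAGE_API_KEY` environment variable is set, and it uses the voyage-4 model name mentioned above.

```python
# Sketch: generating an embedding with the Voyage AI Python SDK.
# Assumes `pip install voyageai` and VOYAGE_API_KEY in the environment.
import voyageai

client = voyageai.Client()  # picks up VOYAGE_API_KEY automatically

result = client.embed(
    ["Roman architecture is known for its arches, domes, and aqueducts."],
    model="voyage-4",  # model name as used in this video
)

embedding = result.embeddings[0]  # a list of floats, one per dimension
print(len(embedding))             # the number of dimensions
```

The sample text about Roman architecture is illustrative; any string works the same way.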
Now let's add a third dimension for eco-friendliness, from traditional fuel (0) to fully electric (1.0). Our family car might now be at [0.5, 0.2, 0.7], the luxury vehicle at [0.35, 0.8, 0.1], and the spacious ride for group travel at [0.9, 0.4, 0.3]. We're starting to move beyond what we can easily draw on paper, but we can still imagine these points floating in 3D space.

But real-world ride hailing needs to capture much more nuance. We can add dimensions for features like availability time (from rush hour to off-peak), typical trip distance, passenger capacity, vehicle amenities, and geographic service area. The Roman architecture embedding we generated earlier had over one thousand dimensions. A luxury SUV for airport trips might have a similar kind of representation, capturing all of these characteristics in that high-dimensional space.

This is how actual embeddings work. When you convert text like comfortable ride for a family of five with luggage into an embedding, you get an array of numbers. The system can then find spacious ride for group travel or luxury vehicle nearby in this space, even if those exact words weren't in your search. This is a simplified example to illustrate the concept; in reality, the dimensions can represent more complex or abstract ideas.

Now let's talk about how we use vector embeddings to measure semantic similarity. The closeness between two points in this high-dimensional space reflects how closely related their meanings are according to the selected model. The closer the points, the more similar the concepts. Remember our ride-hailing example? If family car is at [0.3, 0.2] and luxury vehicle is at [0.35, 0.8], we can calculate the closeness between these points to see how similar they are. Even though they're different ride types, they're closer to each other than either is to spacious ride for group travel at [0.9, 0.4], because they share similar characteristics like vehicle size and service tier. This is semantic similarity in action.
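The closeness claim above can be checked directly with Euclidean distance, using the 2D coordinates from the example:

```python
import math

def distance(a, b):
    """Euclidean (straight-line) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

family_car = [0.3, 0.2]
luxury_vehicle = [0.35, 0.8]
group_travel = [0.9, 0.4]

print(distance(family_car, luxury_vehicle))  # ≈ 0.602
print(distance(family_car, group_travel))    # ≈ 0.632
```

The family car really does sit slightly closer to the luxury vehicle than to the group-travel ride, which is what the example asserts.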
Instead of just matching exact words, we're comparing the meaning of concepts. When a customer searches for spacious and comfortable ride, the system can find family car or luxury vehicle nearby in this meaning space, even though the search used completely different words. This allows computers to understand that these concepts are semantically similar.

Mathematical operations like cosine similarity, which calculates the angle between two vectors, measure the closeness between points in the meaning space. This translates to semantic similarity: a smaller angle means the concepts are more closely related. You're not limited to cosine similarity. You can also use other similarity measures, like Euclidean distance or dot product similarity, to find related concepts. For now, the mathematical details and trade-offs of these measures are not in scope, but it's important to understand that they help quantify how closely related two concepts are in the embedding space.

So how do we use these similarity measures in practice? Let's see it in action. Suppose we want to search for related concepts in a small collection of documents. Our data set includes fifteen entries, each with a title, description, and category. As you can see, we've got an entry on architecture, another on food, and one on machine learning. These examples highlight the diversity of our data and the effectiveness of embeddings in finding connections across very different topics.

Let's walk through how we use vector embeddings to search for semantically similar concepts. The vectors are created and used while the program is running, rather than being saved to a database or file. We generate embeddings for each document and the query, then compare them to find the most similar results. This approach works well for small test data sets like the one we have here. Here's how it works. First, we generate an embedding for each document description in our data set.
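The cosine similarity measure mentioned above takes only a few lines to write by hand; a minimal implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; perpendicular vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that cosine similarity only cares about direction, not magnitude, which is why [1.0, 0.0] and [2.0, 0.0] score a perfect 1.0.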
Each embedding is a unique set of numbers that represents the meaning of that description. Next, we generate an embedding for our search query. This query embedding captures the meaning of what we're looking for. To find the most relevant results, we calculate the similarity between the query embedding and each document embedding. The closer the embeddings are, the more similar they are. Finally, we sort the results by similarity score, so the most relevant documents appear at the top. This makes it easy to find concepts related to our query, even if the exact words aren't used.

In our example, we have only a few documents. What happens if our corpus is large? Searching through a large number of vectors, or recalculating all the embeddings for our corpus every time we want to do a search, becomes impractical. This is where more advanced solutions come in. But for now, it's great to see how semantic search works behind the scenes.

So there you have it. Vector embeddings transform how computers understand meaning. Let's recap what we've learned. Vector embeddings are numerical representations that capture the meaning of data in a high-dimensional space. They enable semantic similarity, which allows applications to understand and compare concepts based on meaning rather than exact word matches. These embeddings are generated using specialized embedding models that learn to represent meaning during training. In our example, we used voyage-4 from Voyage AI. And finally, we can measure how similar two concepts are by calculating the distance between their vectors using techniques like cosine similarity.
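The in-memory search loop described above (embed the documents, embed the query, score, sort) can be sketched end to end. To keep the example self-contained, hand-made 3-dimensional vectors stand in for real model output; in practice each vector would come from an embedding model such as voyage-4, with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy stand-ins for model-generated embeddings (3 dimensions instead of ~1000).
documents = [
    {"title": "Roman aqueducts", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Neapolitan pizza", "embedding": [0.1, 0.9, 0.1]},
    {"title": "Gradient descent", "embedding": [0.0, 0.2, 0.9]},
]

# Pretend this came from embedding the query "ancient Roman engineering".
query_embedding = [0.8, 0.2, 0.1]

# Score every document against the query, then sort best-first.
ranked = sorted(
    documents,
    key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
    reverse=True,
)
for doc in ranked:
    print(doc["title"])  # "Roman aqueducts" comes out on top
```

The document titles and vector values here are made up for illustration; only the pipeline shape (embed, score, sort) mirrors what the video describes.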