Vector Search Fundamentals / Understand Semantic Search

5:20
How do we make computers understand the meaning of different kinds of unstructured data? This question has plagued developers and data scientists for decades. The good news is that with vectors, we have finally made progress on this problem. In this video, we'll learn what vectors are and how they work.

Vectors are numerical representations of unstructured data like text, images, and audio. The numerical data is stored as an array of floating point values, where each value represents a dimension. When you hear terms like vectors, vector embeddings, or embeddings, they all refer to this array. These embeddings let us search and compute over unstructured data as easily as we search and compute over structured data.

That sounds great, but how do we create these vector embeddings? This is where AI comes into the picture. Different embedding models generate vector embeddings from source data; some models are proprietary, while others are open source. For example, Voyage AI offers a diverse range of embedding models. Their voyage-3.5 model is optimized for general-purpose use, while specialized options like voyage-law-2 are tailored to legal document retrieval, and voyage-code-2 provides optimized embeddings for code-related tasks. A notable recent addition is voyage-context-3, a contextualized chunk embedding model that captures both focused chunk-level details and global document context in a single pass, making it particularly valuable for RAG systems that need both granular information and broader document understanding. This is just a sampling of their offerings, with new models added regularly to meet evolving needs.

To generate vector embeddings with such a model, we provide unstructured text data, and the model analyzes and encodes the data into multidimensional numerical vectors.
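To make the shape of the output concrete, here is a minimal sketch. The `embed` function below is a toy stand-in, not a real embedding model: it derives a few deterministic floats from the input text purely to show what an embedding looks like, an array of floating point values with one value per dimension.

```python
import hashlib

def embed(text: str, dimensions: int = 8) -> list[float]:
    """Toy stand-in for an embedding model: maps text to a
    fixed-length array of floats. A real model learns these values
    from a large training corpus; here we just derive them from a
    hash so the example is self-contained and deterministic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map each byte (0-255) into a float in [-1.0, 1.0].
    return [(b / 127.5) - 1.0 for b in digest[:dimensions]]

vector = embed("The quick brown fox")
print(len(vector))   # one float per dimension: 8
print(vector[:3])    # a few of the floating point values
```

A real model would produce hundreds or thousands of dimensions, and semantically similar inputs would get nearby vectors; this sketch only captures the data shape.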
Now let's take a moment to understand what vector embeddings represent. The dimensions of a vector embedding represent the features or attributes of the data item in a high-dimensional space. Each dimension corresponds to a particular aspect or characteristic of the data, and the value in each dimension represents the strength or presence of that characteristic.

For illustration purposes, we'll use two-dimensional vectors that can easily be plotted on a two-dimensional graph. Real vectors are rarely two-dimensional, although there are algorithms that can reduce high-dimensional data down to two dimensions. Say we have vector embeddings for the words car, truck, police car, and ambulance. Once the points corresponding to each vehicle's vector embedding are plotted, we can measure the distance between those points to determine their similarity.

But first, there are two things that impact your results: the embedding model you use to generate vector embeddings, and the similarity function you use to calculate the similarity between vectors. The embedding model is a deep learning model trained on a large corpus of data. The specific model we choose influences the results because each model is trained differently and can have a different number of dimensions, so each model will generate different vector embeddings. For instance, one embedding model might consider a car and a truck more similar to each other, so it will place the car and truck vectors closer together and further from the police car and ambulance. Another model could find a car more similar to a police car than to a truck, so it will place those two vectors closer together and further from the truck and ambulance.

After the vector embeddings are defined, different functions can be used to calculate the similarities between vectors. These functions could measure the distance between the points, measure the angle between the vectors, or use another method. The similarity function you choose impacts the results as well.
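The two kinds of similarity functions mentioned above, distance-based and angle-based, can be sketched in a few lines. The 2-D coordinates for the vehicles are made up for illustration; a real embedding model would assign them.

```python
import math

# Illustrative 2-D "embeddings" -- these coordinates are invented
# for the example, not produced by any real model.
vectors = {
    "car":        [1.0, 0.2],
    "truck":      [1.1, 0.3],
    "police car": [0.9, 0.9],
    "ambulance":  [0.8, 1.0],
}

def euclidean_distance(a, b):
    """Straight-line distance between two points: smaller = more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]: larger = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# With these toy coordinates, car sits much closer to truck than to ambulance.
print(euclidean_distance(vectors["car"], vectors["truck"]))
print(euclidean_distance(vectors["car"], vectors["ambulance"]))
```

Note that the two functions rank pairs differently in general: Euclidean distance cares about magnitude as well as direction, while cosine similarity only compares directions, which is one reason the choice of similarity function affects your results.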
We'll learn more about how distance and similarity are calculated in a later lesson. Another interesting thing that happens with vector embeddings is that as we plot more points, clusters start to appear. These clusters are made up of similar things. For example, we can have a cluster of different types of cars in one area and a cluster of different types of trucks in another.

So far, we've only looked at this in a two-dimensional space, but this isn't an accurate representation. In fact, vectors are mapped in a high-dimensional space. Think hundreds or thousands of dimensions, each of which can represent a different aspect of the data. When generating vector embeddings, the number of dimensions is determined by the embedding model you use. We'll learn more about embedding models when we generate our own vector embeddings.

Great job! In this video, we learned that vectors are numerical representations of unstructured data like text, images, and audio, stored as an array of floating point values where each value represents a dimension. Next, we learned that vectors are generated by embedding models. After that, we learned that vector embeddings are mapped to a high-dimensional space, and that we can use the distance between vectors to calculate similarity.
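As a recap, the whole idea, plot the vectors, then use distance to find the most similar items, comes together in a tiny nearest-neighbor lookup. As before, the labels and 2-D coordinates are invented for illustration only.

```python
import math

# Illustrative 2-D vectors: two "road vehicle" points clustered in one
# region and two "emergency vehicle" points clustered in another.
points = {
    "sedan":      [1.0, 0.1],
    "pickup":     [1.2, 0.2],
    "fire truck": [0.7, 1.1],
    "ambulance":  [0.8, 1.0],
}

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, k=2):
    """Return the k labels whose vectors lie closest to the query vector."""
    return sorted(points, key=lambda label: distance(points[label], query))[:k]

# A query vector near the emergency-vehicle cluster pulls back the
# emergency vehicles first.
print(nearest([0.75, 1.05]))
```

Real vector search engines do the same thing conceptually, just with thousands of dimensions, millions of vectors, and approximate indexes instead of a full scan.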