Vector Search Performance / Understanding Quantization
You've got millions of vectors, a RAM limit, and a performance target to hit. Throwing more hardware at the problem will only get you so far. In this video, we'll revisit why vector indexes demand so much RAM and then look at quantization, a powerful lever you have for reducing that RAM requirement without changing your embedding model. To understand quantization, we first need to understand where all that RAM goes.
The numbers here are rough estimates. Actual index size depends on your model, your data, and the HNSW (hierarchical navigable small world) configuration. But the numbers we use here are close enough to help you understand the scale of the problem.
Each vector is a list of floating point numbers, one per dimension. At float32 precision, each dimension costs 4 bytes. At 1024 dimensions, a single vector takes roughly 4 kilobytes.
Multiply that across hundreds of thousands or millions of documents and the memory footprint grows fast.
Add the HNSW graph overhead on top and a mid-sized production collection can easily push your index well beyond available RAM.
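To make the scale concrete, here's a quick back-of-the-envelope calculation in Python. The document count and graph overhead factor are illustrative assumptions, not measurements of any particular index:

```python
# Rough estimate of raw vector storage for a hypothetical collection.
dims = 1024            # dimensions per embedding
bytes_per_dim = 4      # float32
num_docs = 2_000_000   # assumed document count

raw_vector_bytes = dims * bytes_per_dim * num_docs
print(f"raw vectors: {raw_vector_bytes / 1024**3:.1f} GiB")  # ~7.6 GiB

# HNSW adds graph structures on top of the raw vectors. The overhead
# depends on index configuration, so this factor is just a placeholder.
hnsw_overhead = 1.1
print(f"rough index RAM: {raw_vector_bytes * hnsw_overhead / 1024**3:.1f} GiB")
```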
As we saw earlier, that's when page faults and unpredictable latency follow. So what can we do about it? That's where quantization comes in. In general, quantization reduces the number of bits used to represent each dimension value.
The index itself doesn't get smaller, but the amount of RAM it needs does. You're trading a small amount of precision for a meaningful reduction in memory requirements. It's similar to audio bit depth. A 32-bit audio file captures more nuance than an 8-bit one, but for most listeners, the tune is still recognizable.
You're giving up some fidelity in exchange for a much smaller working footprint.
MongoDB Vector Search supports two approaches, scalar quantization and binary quantization.
Let's talk about scalar quantization first. Scalar quantization, or int8, stores each dimension as a 1-byte integer instead of a 4-byte float. That's a 4x reduction in the vector values themselves, which works out to roughly a 3.75x reduction in RAM once you account for the HNSW graph, which isn't compressed.
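To illustrate the idea (this is a toy sketch, not MongoDB's actual implementation), here's a per-dimension scalar quantizer that maps float32 values onto 1-byte integers and shows the 4x drop in vector storage:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Map each dimension's observed range onto 0..255 (1 byte per dimension)."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    quantized = np.round((vectors - lo) / scale).astype(np.uint8)
    return quantized, lo, scale

def dequantize(quantized, lo, scale):
    """Approximate reconstruction; the rounding error is the precision you give up."""
    return quantized.astype(np.float32) * scale + lo

vectors = np.random.rand(10_000, 1024).astype(np.float32)
quantized, lo, scale = scalar_quantize(vectors)
print(vectors.nbytes // quantized.nbytes, "x smaller vector values")  # 4
```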
Binary quantization, or int1, goes further. Each dimension is stored as a single bit, giving you a 32x reduction in raw vector size and a roughly 24x reduction in RAM with the graph included. mongot holds the quantized vectors and the HNSW graph in search index memory, while the full-precision vectors are stored in mongot's own disk structures. For binary quantization, those full-precision vectors serve an additional purpose: after the initial search, MongoDB reads them back transiently in a rescoring step to re-rank the top candidates before returning results to your application.
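Here's a toy sketch of that two-pass idea; it isn't MongoDB's internals, just the shape of it: a coarse pass over 1-bit vectors picks candidates, and the full-precision vectors (standing in for the ones mongot reads back from disk) re-rank that shortlist:

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension. Stored here as whole bytes for
    simplicity; a real index packs 8 dimensions into each byte."""
    return (vectors > 0).astype(np.uint8)

def search_with_rescoring(query, full_vectors, bit_vectors, k=10, oversample=4):
    # Coarse pass: rank by how many bits agree with the quantized query.
    query_bits = binary_quantize(query[np.newaxis, :])[0]
    agreement = (bit_vectors == query_bits).sum(axis=1)
    candidates = np.argsort(-agreement)[: k * oversample]

    # Rescoring pass: re-rank the shortlist with full-precision similarity.
    scores = full_vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

rng = np.random.default_rng(0)
full_vectors = rng.standard_normal((10_000, 1024)).astype(np.float32)
bit_vectors = binary_quantize(full_vectors)
query = rng.standard_normal(1024).astype(np.float32)
print(search_with_rescoring(query, full_vectors, bit_vectors, k=5))
```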
MongoDB gives you two ways to apply quantization.
You can let the index handle it automatically, or you can ingest precomputed vectors that are already quantized. We'll cover both approaches in detail in the next video; the sketch below just previews the automatic path.
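Here's what enabling quantization in a vector search index definition can look like with PyMongo. Treat this as a hedged sketch: the connection string, database, collection, and field names are placeholders, and it assumes a recent PyMongo version with the Atlas search index helpers:

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

# Placeholder connection string, database, collection, and field names.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
collection = client["media"]["articles"]

index_model = SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",       # field that stores the float32 vectors
                "numDimensions": 1024,
                "similarity": "cosine",
                "quantization": "scalar",  # or "binary"
            }
        ]
    },
)
collection.create_search_index(model=index_model)
```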
As we alluded to earlier, though, quantization isn't free. Because you're storing less information per dimension, the similarity scores computed during search become less precise. That reduced precision can lower recall, meaning your search may return fewer of the true nearest neighbors to your query. The degree of impact on quality depends on which type of quantization you use. Scalar quantization has a modest effect and tends to maintain high recall in practice.
Binary quantization is more aggressive, and the precision loss can be significant. That's why it includes a rescoring step that uses the full-precision vectors on disk to recover most of the accuracy the initial binary search trades away.
For some applications, that trade-off in quality is worthwhile given the significant reduction in memory requirements. The other factor is your embedding model. Some models are trained with quantization in mind, and they hold up much better under binary compression than general-purpose models. If you're applying binary quantization, using one of those models gives you better results.
So that's the full picture: you have a RAM problem, quantization is your lever, and the choice between scalar and binary comes down to how aggressively you want to compress and how much recall you're willing to trade.
Awesome job. Let's take a moment to recap what we've learned. Vector indexes are memory-hungry because float32 precision adds up fast at scale. Quantization solves this by reducing the bits per dimension. Scalar quantization cuts RAM by roughly 3.75x, and binary quantization cuts it by roughly 24x, with a rescoring step to keep accuracy in check. The trade-off is some precision loss, which is more pronounced with binary, especially on models not designed for it.
Regardless of what type of quantization you choose, you can apply it automatically via the index or ingest pre-computed vectors.
