Vector Search Performance / Implementing Quantization

9:17
You know quantization can dramatically shrink your search index memory. Now let's actually build it. In this video, we'll look at the two approaches to applying quantization in MongoDB vector search: automatic quantization and precomputed quantization.

Before we write any code, let's quickly orient around the two approaches we'll be implementing. With automatic quantization, you store raw float vectors in MongoDB as you normally would and add a quantization field to your index definition. mongot handles the quantization at index build time, requiring no changes to your ingestion pipeline, and it works on existing collections right away. With precomputed quantization, you quantize the vectors before they ever reach MongoDB. The quantized vectors are stored directly in your documents, which reduces both the disk footprint in mongod and the RAM the index needs in mongot. We'll get into the specific packages and serialization format that make this possible when we implement that approach.

Both approaches share the same client setup, so let's look at that first. PyMongo is the MongoDB driver for Python, and voyageai is the Python SDK for Voyage AI, MongoDB's embedding provider, which we'll use to generate vectors. The client setup is identical for both approaches; the difference is in how embeddings are generated and stored, which we'll see as we work through each one. (A sketch of the setup, and of the automatic quantization flow, follows below.)

Now that we're set up, let's look at the first approach: automatic quantization. For this, we'll call Voyage AI's embed method with our list of descriptions, get back a list of full precision float vectors, and insert them into MongoDB as is. Nothing unusual here. The quantization happens in the index definition. Notice the quantization field: that single field is all it takes. It accepts either binary or scalar. We're using binary here, but you can swap in scalar for a more conservative reduction in precision.

When mongot builds the index, it reads the float32 vectors from your documents and quantizes them at index time. Your documents stay untouched; full precision vectors remain on disk in MongoDB. What changes is the index mongot builds. Storing quantized vectors instead of full precision vectors produces smaller index files, which require less physical memory to load and maintain. With binary quantization specifically, mongot retrieves the full precision vectors from disk to rescore candidates before returning results. With scalar quantization, the full precision vectors on disk are only used during exact (ENN) search.

One thing to keep in mind: automatic quantization only applies when the stored vectors are floats, either as plain arrays or as BSON BinData float32. If a document's field contains int8 or int1 BinData vectors, MongoDB will silently skip that vector entirely rather than indexing it. The index is still built, but those documents contribute nothing to it.

Before we look at precomputed quantization, we need to understand the storage format that makes it possible: BSON BinData vectors. BinData is a BSON data type for storing binary data. MongoDB introduced a specific subtype for vectors, subtype 0x09, designed to store embeddings in a compact binary format. Rather than encoding each dimension as a text readable float, BinData packs the vector values directly into a binary byte sequence that MongoDB can read and deserialize much more efficiently.
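A minimal sketch of that shared client setup, assuming the connection string and API key live in environment variables; the database and collection names here are placeholders, not the video's exact values:

```python
import os

import voyageai
from pymongo import MongoClient

# Placeholder connection details; substitute your own deployment and names.
mongo_client = MongoClient(os.environ["MONGODB_URI"])
collection = mongo_client["media_db"]["movies"]

# The Voyage AI client reads VOYAGE_API_KEY from the environment by default.
voyage_client = voyageai.Client()
```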
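And a sketch of the automatic quantization flow, reusing the clients above. The sample records, field names, dimension count, and index name are illustrative assumptions:

```python
from pymongo.operations import SearchIndexModel

# Hypothetical records; in practice these come from your dataset.
docs = [
    {"title": "Heat", "description": "A meticulous thief is pursued by a detective."},
    {"title": "Alien", "description": "A mining crew encounters a hostile organism."},
]

# Full precision float vectors, inserted as is; no ingestion changes needed.
result = voyage_client.embed(
    [doc["description"] for doc in docs], model="voyage-4", input_type="document"
)
collection.insert_many(
    {**doc, "embedding": emb} for doc, emb in zip(docs, result.embeddings)
)

# The single quantization field is all it takes; swap "binary" for "scalar"
# for a more conservative reduction in precision.
collection.create_search_index(
    model=SearchIndexModel(
        name="vector_index",
        type="vectorSearch",
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embedding",
                    "numDimensions": 1024,
                    "similarity": "dotProduct",
                    "quantization": "binary",
                }
            ]
        },
    )
)
```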
BinData supports three element types, float32, int8, and int1, each determining how many bits are used per dimension and how much space the vector occupies in storage. To serialize vectors as BinData, you need a supported driver. PyMongo v4.10 or newer provides the necessary helper function. Other supported drivers include Node.js v6.11 or newer and Java v5.3.1 or newer, among others; for the full list, check out the documentation.

Now that we understand BinData, let's put it to use. With precomputed quantization, we quantize the vectors before they ever reach the database. That means we need two additional imports: Binary and BinaryVectorDtype from bson.binary. We'll define two helper functions. The first, generate_embeddings, creates embeddings at a specified data type using the Voyage AI embed method; the dtype argument we pass in ("int8" here) is forwarded to the model's output_dtype parameter, instructing it to return integer values directly instead of floating point values. The second, generate_bson_vector, takes an int8 vector along with its type and serializes it into BinData using the Binary.from_vector method imported earlier. (Both helpers are sketched below.)

With our helper functions in place, let's put them to use. First, we call generate_embeddings to turn each text description into a 1024 dimension int8 vector using the voyage-4 model. This is where quantization happens: voyage-4 is a quantization aware trained model, meaning it produces quantized int8 vectors directly rather than requiring a separate quantization step. After that, we call generate_bson_vector, which converts each of those int8 vectors into MongoDB's binary vector format, BinData subtype 0x09. Storing vectors in this format allows MongoDB to store them more efficiently on disk compared to standard float arrays. Next, we build a list of documents where each original record is merged with its corresponding bson_int8_embedding field holding that BinData vector. Finally, we bulk insert all of these documents with quantized vectors into the collection so they're ready for vector search.

Now let's look at the index definition to see if we need to change anything. There's no quantization field here: MongoDB sees int8 BinData in the document and indexes it directly, with no index time quantization step needed because the vectors are already quantized. The trade off compared to automatic quantization is that there's no full precision copy stored anywhere, so rescoring isn't available. If you need rescoring, the pattern is to keep a separate float32 field and apply automatic binary quantization to that field instead.

Querying against a precomputed BinData index requires one important extra step compared to a standard vector search: the query vector must be in the same BinData format as the stored vectors. Let's take a look at what this looks like in practice. It's similar to how we generated vectors for the documents earlier, but this time we specify the query text, set the input_type to "query", and set the output_dtype to "int8" to match our documents' dtype. We also have to serialize the query vector using the same Binary.from_vector method. Once that's done, you use it in the $vectorSearch stage as usual by supplying it as the queryVector field value. (The query flow is sketched below as well.)

Now that we've implemented both approaches, let's talk about when to choose each one. Start with automatic quantization.
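Here's a rough sketch of the precomputed pipeline: the two helpers, ingestion, and an index definition with no quantization field. It reuses the clients and docs from the earlier sketches; the helper names and the bson_int8_embedding field follow the video, while the index name is an assumption:

```python
from bson.binary import Binary, BinaryVectorDtype
from pymongo.operations import SearchIndexModel


def generate_embeddings(texts, dtype="int8", input_type="document"):
    # dtype is forwarded to output_dtype so voyage-4, a quantization aware
    # trained model, returns int8 values directly.
    result = voyage_client.embed(
        texts, model="voyage-4", input_type=input_type, output_dtype=dtype
    )
    return result.embeddings


def generate_bson_vector(vector, dtype=BinaryVectorDtype.INT8):
    # Serialize an int8 vector into BSON BinData subtype 0x09.
    return Binary.from_vector(vector, dtype)


# Quantize first, then store the BinData vectors directly in the documents.
int8_embeddings = generate_embeddings([doc["description"] for doc in docs])
collection.insert_many(
    {**doc, "bson_int8_embedding": generate_bson_vector(emb)}
    for doc, emb in zip(docs, int8_embeddings)
)

# No quantization field: the stored vectors are already quantized.
collection.create_search_index(
    model=SearchIndexModel(
        name="int8_vector_index",
        type="vectorSearch",
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "bson_int8_embedding",
                    "numDimensions": 1024,
                    "similarity": "dotProduct",
                }
            ]
        },
    )
)
```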
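And a sketch of querying that index; the query text, numCandidates, and limit are arbitrary choices:

```python
# Embed the query with the same int8 output dtype as the stored documents.
query_int8 = generate_embeddings(
    ["a heist movie with a twist ending"], input_type="query"
)[0]

# The query vector must be serialized into the same BinData format.
query_vector = generate_bson_vector(query_int8)

pipeline = [
    {
        "$vectorSearch": {
            "index": "int8_vector_index",
            "path": "bson_int8_embedding",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 5,
        }
    },
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
]
for doc in collection.aggregate(pipeline):
    print(doc["title"], doc["score"])
```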
It's the lowest friction path to reducing the RAM and page fault problems we identified in the diagnostics lesson. You don't need to change your ingestion pipeline, you don't need a special driver version, and it works on existing collections right away. That makes it the right first move in most situations. Move to precomputed quantization when mongod storage is also a concern, but know that it comes with real pipeline complexity: you need a supported driver, a serialization step to convert vectors to BSON BinData format before ingestion, and matching BinData formatting on every query vector.

One last thing on the recall side. Quantization is lossy by definition, and how much recall you lose depends on which type you use and how well your model handles quantization. Before deploying either approach to production, validate your recall by comparing ANN search results against ENN search results on a representative sample of queries; a sketch of that check follows below. If recall drops more than you're comfortable with, scalar is the safer fallback over binary.

Nice work. You've implemented both quantization approaches, seen how BinData vectors work, and learned how to query against a precomputed index. Whichever approach you use, validate your recall before going to production.
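As mentioned, here's a rough sketch of that recall check, comparing ANN results against exact (ENN) results obtained with the exact option of $vectorSearch. The index and field names follow the precomputed sketch, and the 20 * k candidate multiplier is an arbitrary assumption:

```python
def recall_at_k(collection, query_vectors, k=10):
    """Average fraction of the exact top-k results that ANN search also returns."""
    total = 0.0
    for qv in query_vectors:
        base = {
            "index": "int8_vector_index",
            "path": "bson_int8_embedding",
            "queryVector": qv,
            "limit": k,
        }
        # Approximate (ANN) search over the quantized index.
        ann_ids = {doc["_id"] for doc in collection.aggregate(
            [{"$vectorSearch": {**base, "numCandidates": 20 * k}}]
        )}
        # Exact (ENN) search as ground truth; exact excludes numCandidates.
        enn_ids = {doc["_id"] for doc in collection.aggregate(
            [{"$vectorSearch": {**base, "exact": True}}]
        )}
        total += len(ann_ids & enn_ids) / k
    return total / len(query_vectors)
```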