Vector Search Performance / Choosing the Right Optimization: Search Nodes, Quantization, Views

6:44
You've confirmed your vector index is too big for the memory you have, and users are feeling it. You have three tools in the toolbox: dedicated search nodes, vector quantization, and views. Which do you reach for first, and what does that mean for your code and your budget? In this video, we'll step back from the mechanics of each technique and focus on the decision layer. Not new concepts, but choosing and sequencing them for your workload, your team, and your constraints.

Before we choose between them, let's orient ourselves with a one-line summary of each. Search nodes move mongot onto separate, isolated hardware, giving your vector indexes a predictable memory budget free from WiredTiger competition. They don't make the index itself any smaller. Vector quantization reduces the precision of stored values to shrink the mongot RAM footprint. Scalar quantization converts float32 to int8 and reduces RAM by 3.75x, while binary quantization converts float32 to one bit and reduces RAM by 24x. The reductions aren't a clean 4x and 32x because the HNSW graph itself isn't compressed. Views filter the collection before indexing, so documents that don't pass the pipeline never enter the HNSW graph. All three target the same root cause: a vector search index too large to stay in memory. But they pull on different levers: infrastructure, data representation, and data shape. The right choice depends on cost, engineering effort, and where in your system you're willing to make a change.

Now that we know what each technique does, let's look at where in your system each one lives, because the amount of change required varies significantly. Search nodes are an infrastructure-only change. You enable dedicated search nodes in Atlas, and Atlas deploys separate mongot nodes. Your schema, embedding pipeline, and $vectorSearch queries are untouched. The hidden work is sizing and cost modeling: which tier fits your index, and how to scale search independently from your database nodes.

Quantization sits at two different levels. With automatic quantization, you add a quantization field to your index definition and mongot handles the rest, leaving application code largely untouched. You still need a recall test plan before shipping. Precomputed quantization is a different story. Your ingestion pipeline must produce quantized embeddings, and your query code must do the same. There's no full-precision copy for rescoring unless you store one separately. Treat it like a proper feature project.

Views are a data model and query routing change. You define a view with a filter pipeline, build a vector index on that view, and update your application to query the view instead of the base collection. The view itself is cheap to create, but you pay in data modeling work: designing the right filter and routing queries correctly. Get the filter wrong and you either end up not indexing relevant documents or you negate the mongot memory savings. The sketches below make each of these change surfaces concrete.
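First, sizing. To make that conversation concrete, here's a rough back-of-envelope sketch of the mongot RAM one HNSW index needs at each precision. The corpus size, dimension, and per-vector graph overhead below are illustrative assumptions, not Atlas figures; the point is that the graph term doesn't shrink, which is why the savings land near 3.75x and 24x rather than a clean 4x and 32x.

    NUM_VECTORS = 10_000_000          # illustrative corpus size
    DIMS = 1536                       # illustrative embedding dimension
    GRAPH_BYTES_PER_VECTOR = 112      # assumed HNSW edge overhead; NOT an official Atlas figure

    def index_ram_gb(bytes_per_component: float) -> float:
        """Rough mongot RAM for one HNSW index at a given storage precision."""
        vector_bytes = NUM_VECTORS * DIMS * bytes_per_component
        graph_bytes = NUM_VECTORS * GRAPH_BYTES_PER_VECTOR  # unchanged by quantization
        return (vector_bytes + graph_bytes) / 1e9

    print(f"float32: {index_ram_gb(4):6.1f} GB")    # full precision
    print(f"int8:    {index_ram_gb(1):6.1f} GB")    # scalar quantization
    print(f"binary:  {index_ram_gb(1/8):6.1f} GB")  # binary quantization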
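Next, automatic quantization, where the change really is one field in the index definition. A minimal sketch using PyMongo, with hypothetical connection string, database, collection, and index names:

    from pymongo import MongoClient
    from pymongo.operations import SearchIndexModel

    # Hypothetical names throughout; requires an Atlas cluster and a recent PyMongo.
    collection = MongoClient("mongodb+srv://...")["catalog"]["products"]

    model = SearchIndexModel(
        name="product_embedding_index",
        type="vectorSearch",
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embedding",
                    "numDimensions": 1536,
                    "similarity": "cosine",
                    # The one-field change: mongot quantizes at index build time.
                    # Use "binary" for the one-bit variant. Your documents keep
                    # their full-precision vectors, so rescoring remains possible.
                    "quantization": "scalar",
                }
            ]
        },
    )
    collection.create_search_index(model)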
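Precomputed quantization, by contrast, touches ingestion and query code. Here's a sketch of the shape of that change, assuming PyMongo 4.10+ for BSON vector support; the quantization function is a naive stand-in, and embed() is a hypothetical placeholder for your embedding model call:

    from pymongo import MongoClient
    from bson.binary import Binary, BinaryVectorDtype  # BSON vectors need PyMongo 4.10+

    collection = MongoClient("mongodb+srv://...")["catalog"]["products"]  # hypothetical

    def embed(text: str) -> list[float]:
        # Hypothetical stand-in for your embedding model call.
        raise NotImplementedError

    def quantize_to_int8(vec: list[float]) -> list[int]:
        # Naive symmetric scalar quantization, for illustration only. Production
        # code should calibrate the scale on a representative sample and keep it
        # fixed across ingestion and query time.
        scale = max(abs(v) for v in vec) or 1.0
        return [max(-128, min(127, round(v / scale * 127))) for v in vec]

    # Ingestion: store the quantized embedding as a BSON int8 vector.
    doc_vec = quantize_to_int8(embed("some document text"))
    collection.insert_one({
        "text": "some document text",
        "embedding": Binary.from_vector(doc_vec, BinaryVectorDtype.INT8),
    })

    # Query: the query vector must pass through the SAME quantization path.
    query_vec = quantize_to_int8(embed("user query"))
    results = collection.aggregate([
        {"$vectorSearch": {
            "index": "product_embedding_index",
            "path": "embedding",
            "queryVector": Binary.from_vector(query_vec, BinaryVectorDtype.INT8),
            "numCandidates": 200,
            "limit": 10,
        }}
    ])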
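And for views, the change is a view definition plus query routing. A sketch with hypothetical names; note that Atlas restricts which pipeline stages and operators a search-indexable view may use, so the $expr-based $match below is an assumption to verify against the current docs:

    from pymongo import MongoClient

    db = MongoClient("mongodb+srv://...")["catalog"]  # hypothetical database

    # A view that admits only documents that actually carry an embedding, so
    # everything else never enters the HNSW graph.
    db.create_collection(
        "products_searchable",  # hypothetical view name
        viewOn="products",
        pipeline=[{"$match": {"$expr": {"$ne": [{"$type": "$embedding"}, "missing"]}}}],
    )

    # The vector index is then built on the view (same definition sketch as
    # above), and application queries are routed to the view, not the base
    # collection.
    query_vector = [0.0] * 1536  # placeholder; use a real query embedding
    results = db["products_searchable"].aggregate([
        {"$vectorSearch": {
            "index": "product_embedding_index",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 200,
            "limit": 10,
        }}
    ])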
With the change surface mapped, let's look at cost and risk. Search nodes carry the highest infrastructure cost but the lowest engineering risk. No application code changes, just configuration and sizing. The real effort is sizing and cost modeling; this is a line item that needs to be scoped and approved.

Vector quantization shrinks the mongot memory footprint, which can delay or avoid the need to move to a larger search node tier. But the engineering effort depends heavily on which approach you take. Automatic quantization is a moderate lift: you update the index definition and put together a test plan to validate recall. Precomputed quantization is a high-effort change that touches your ingestion pipeline, your driver code, and your query layer. The quality risk is real for both. Quantization is lossy, and if you don't validate recall carefully, you may ship a degraded search experience without realizing it.

Views are cheap from an infrastructure perspective. The risk is in the filter. Too aggressive, and you end up not indexing relevant documents; too weak, and you don't get the memory savings. Unsupported operators in the pipeline can also leave your index in a stale state.

With those trade-offs in hand, the question becomes: when do you reach for each one? Here's how to think about sequencing these optimizations across the lifecycle of a vector search feature. Start quantization experiments early. The worst time to discover that binary quantization drops recall by fifteen percent is during a production incident. Run tests on staging indexes before you need them. If quantization is viable, you'll already have data to back the decision. Treat search nodes as a first-class option for key workloads, not an afterthought. If vector search is in the critical path and indexes are expected to grow to several gigabytes, factor dedicated search nodes into your production design from the start. The contention and cold start problems from previous lessons get harder to address after users are already affected. Let views evolve with your understanding of the data. Start with a simple filter that excludes documents without embeddings, then layer in tenant-scoped or segment-specific filters as your domain model matures.

These techniques aren't mutually exclusive. Teams typically reach for all three: search nodes to isolate resources, quantization to shrink the memory footprint per index, and views to remove documents that shouldn't be indexed. For an urgent production problem, search nodes are often the fastest path to stability. Quantization and views are the deliberate optimizations you phase in behind them.

Nice work. Let's recap. Search nodes, quantization, and views all target the same problem: an index that has outgrown the mongot memory budget. But they attack it differently. Search nodes give mongot more memory without changing application code. Quantization reduces how much memory each vector consumes. Views reduce the number of vectors indexed in the first place. If you need relief now, reach for search nodes. To shrink memory without changing what you index, start with automatic quantization and build a recall test plan. To shrink memory by indexing less, use views, but budget time for filter design. For the long term, expect to use all three.
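Since a recall test plan comes up at every step above, here's one minimal way to measure it: run the same queries through approximate search and through exact nearest-neighbor search (the exact option of $vectorSearch, which ignores numCandidates) and compare the overlap. A sketch, assuming the hypothetical collection and index names from the earlier examples and a held-out list of query vectors:

    from pymongo import MongoClient

    collection = MongoClient("mongodb+srv://...")["catalog"]["products"]  # hypothetical

    def top_ids(qv, k, exact):
        # exact=True runs exhaustive (ENN) search; numCandidates is ANN-only.
        stage = {"index": "product_embedding_index", "path": "embedding",
                 "queryVector": qv, "limit": k}
        if exact:
            stage["exact"] = True
        else:
            stage["numCandidates"] = 20 * k
        return {d["_id"] for d in collection.aggregate([{"$vectorSearch": stage}])}

    def recall_at_k(query_vectors, k=10):
        # Fraction of exact top-k neighbors the quantized ANN search recovers,
        # averaged over the held-out query set.
        scores = [len(top_ids(qv, k, exact=False) & top_ids(qv, k, exact=True)) / k
                  for qv in query_vectors]
        return sum(scores) / len(scores)

    # e.g., block the rollout if recall_at_k(held_out_queries) < 0.95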