Vector Search Performance / How Vector Search Works
You added vector search to your app and it's blazing fast on your laptop. Queries return in milliseconds. Everything feels instant. Staging looks fine too, but a week into production, query latency starts climbing.
Then the tickets start coming in. Users are reporting slow responses and you're not sure where to look. In this video, we'll explore how MongoDB Vector Search works under the hood, how vectors are stored, how the default HNSW (hierarchical navigable small world) index type finds results, and why RAM is one of the biggest factors in query latency. Before we talk about what's happening in production, we need to understand the data vector search is working with.
A vector is a list of floating point numbers produced by an embedding model.
When you pass a piece of text, an image, or any other data to an embedding model, it outputs a vector that encodes the semantic meaning of that data as a point in a high dimensional space.
Two pieces of data that are semantically similar will produce vector embeddings that are close together in that space. Within a vector, each number in the list is called a dimension. A vector from Voyage AI's voyage-4 model has 1024 dimensions by default.
More dimensions give the model more expressive power to capture meaning, but they also mean more storage and more computation per vector.
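To make "close together in that space" concrete, here's a minimal sketch that measures closeness with cosine similarity, one common similarity metric. The vectors here are toy four-dimensional values standing in for real 1024-dimension embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; a real embedding from a 1024-dimension model
# is just a much longer list of float32 values.
vec_reset_password  = np.array([0.12, 0.88, 0.05, 0.40])
vec_change_password = np.array([0.10, 0.85, 0.07, 0.38])
vec_pizza_recipe    = np.array([0.90, 0.02, 0.75, 0.01])

print(cosine_similarity(vec_reset_password, vec_change_password))  # high: similar meaning
print(cosine_similarity(vec_reset_password, vec_pizza_recipe))     # lower: unrelated meaning
```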
Now that we know what vectors are, the next question is, how do you search through millions of them fast enough for the search to feel instant?
The answer is an algorithm called HNSW, which stands for hierarchical navigable small world.
Rather than comparing your query vector against every single stored vector, HNSW builds a multilayered graph where each node represents a vector and connects to its nearest neighbors.
When a query comes in, the search engine enters the graph at the top layer and traverses downward, hopping between nodes that are progressively closer to the query vector, finding approximate nearest neighbors in far fewer comparisons than a brute force scan. This is why vector search can return results in milliseconds over large datasets. But there's a catch, one that becomes very visible when your queries are exposed to production workloads where the index size tends to be larger.
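To give a feel for the traversal, here's a heavily simplified, single-layer greedy search over a neighbor graph. This is an illustration of the idea, not MongoDB's implementation; real HNSW adds multiple layers and a candidate list, but the core move, hopping to whichever neighbor is closest to the query, is the same:

```python
import numpy as np

def greedy_search(query, vectors, neighbors, entry_point):
    """Single-layer greedy graph search.

    vectors:   dict of node_id -> np.ndarray (the stored embeddings)
    neighbors: dict of node_id -> list of connected node_ids
    Returns the node whose vector is a local nearest neighbor of the query.
    """
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        improved = False
        for n in neighbors[current]:
            d = np.linalg.norm(vectors[n] - query)
            if d < current_dist:           # hop to a closer neighbor
                current, current_dist = n, d
                improved = True
        if not improved:                   # no neighbor is closer: stop here
            return current, current_dist

# Tiny illustrative graph: five stored vectors in 2-D, each linked to nearby nodes.
vectors = {i: np.array(v, dtype=np.float32)
           for i, v in enumerate([[0, 0], [1, 0], [2, 1], [3, 3], [5, 5]])}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(np.array([4.0, 4.0]), vectors, neighbors, entry_point=0))
```

Each query only touches the nodes along that path instead of comparing against every stored vector, which is where the speedup over a brute force scan comes from.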
MongoDB Vector Search depends on a two-process architecture.
It runs through mongot, a dedicated search engine process that owns and manages all vector and full text search indexes, including the HNSW graph. But mongot doesn't operate in isolation. It relies on mongod, the core database process, to receive queries, coordinate results, and return data to your application.
When a vector search query comes in, here's what happens. First, your application generates a query embedding and sends an aggregation pipeline to mongod, which proxies it to its associated mongot process. Next, mongot searches its HNSW index for the stored vectors closest to the query, according to the index's similarity metric, and returns the matching document IDs and search metadata back to mongod.
Finally, mongod performs a full document lookup on those IDs and returns the results to your application. Notice that your application never talks to mongot directly; all communication flows through mongod. But mongot runs on the same node as mongod by default, which means both processes are drawing from the same pool of provisioned RAM. That shared memory is where the contention lives.
And understanding how it's divided is key to understanding vector search performance.
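To make that query path concrete before we look at memory, here's a minimal pymongo sketch of the aggregation the application sends. The connection string, namespace, index name, and field path are illustrative assumptions rather than values tied to any particular deployment:

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<your-cluster-uri>")   # illustrative URI
collection = client["mydb"]["articles"]                     # illustrative namespace

# In practice this comes from your embedding model (for example, 1024 floats).
query_vector = [0.01] * 1024

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",      # assumed vector search index name
            "path": "embedding",          # assumed field that holds the vectors
            "queryVector": query_vector,
            "numCandidates": 100,         # candidates mongot considers during traversal
            "limit": 5,                   # documents returned to the application
        }
    },
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
]

# mongod receives this pipeline, proxies the $vectorSearch stage to mongot,
# then looks up the matching documents and returns them here.
for doc in collection.aggregate(pipeline):
    print(doc)
```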
Here's how that RAM is used. On M40 and larger clusters, MongoDB's WiredTiger storage engine claims roughly 50% of available RAM for its database cache. On M30 and smaller, that drops to 25%. That chunk is effectively reserved for mongod to hold frequently accessed data.
The rest of the RAM is shared among everything else: mongod operations like sorting, connection handling, and aggregation; mongot's JVM heap; and mongot's index data. Regular mongod operations are critical, but they don't involve search indexes, so let's focus on mongot's behavior.
Search index data managed by mongot lives in the OS page cache. Under the hood, mongot stores your vector and full text search indexes as Lucene index files on disk. Lucene is the search library mongot is built on, and it accesses those files using memory-mapped IO, meaning the OS treats them as if they were directly addressable in memory.
As mongot reads a vector search index during HNSW traversal, any index pages that are not already resident in memory are loaded from disk into the OS page cache. If the pages are already cached, the OS serves them directly from RAM without additional disk IO. The page cache isn't a fixed allocation though. It's dynamic. The OS fills whatever free memory is available, and under memory pressure, it will evict pages to make room. mongot's Lucene index data is a candidate for eviction just like anything else.
When those pages get evicted, the next HNSW traversal has to fetch them from disk.
So mongot's total memory footprint is its JVM heap plus however much of the page cache your search indexes are occupying. Both draw from the same remaining pool after WiredTiger takes its share.
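As a rough way to reason about that split, here's a back-of-the-envelope sketch. The node size, JVM heap, and mongod working overhead are illustrative assumptions, not published figures:

```python
def search_index_budget_gb(total_ram_gb: float,
                           wiredtiger_fraction: float = 0.5,   # ~50% on M40 and larger
                           mongot_jvm_heap_gb: float = 2.0,    # assumed heap size
                           mongod_overhead_gb: float = 2.0):   # assumed working memory
    """Very rough estimate of page-cache room left for Lucene index files
    on a node running mongod and mongot together. All inputs are assumptions."""
    remaining = total_ram_gb * (1 - wiredtiger_fraction)
    return remaining - mongot_jvm_heap_gb - mongod_overhead_gb

# Example: a node with 16 GB of RAM leaves roughly 4 GB for index pages
# under these assumptions.
print(search_index_budget_gb(16.0))
```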
Now that you understand the contention, let's try to understand just how big a vector search index can be. Every vector in your index takes up space in memory, and that space is driven by two things. The number of documents in your collection that have vectors, and the number of dimensions your embedding model produces per vector. To get a rough sense of scale, let's walk through a back of the envelope estimate using float32 vectors.
Each dimension costs about 4 bytes. So for a 1024 dimension model, a single vector takes up roughly 4 kilobytes. This means that a collection of a hundred thousand documents is around 400 megabytes. A million documents, around 4 gigabytes. And ten million, around 40 gigabytes.
These are intentionally simplified numbers. Your actual footprint will vary based on your model's dimensionality, your data, and HNSW graph overhead on top. But they give you a sense of how fast the math compounds as collections grow. We can express this as: index size in bytes ≈ number of vectors × dimensions × bytes per value.
This is a rough estimate, not an exact figure, but it's close enough to size your infrastructure and spot potential problems before they happen.
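Here's the same estimate in code, reproducing the numbers above:

```python
def estimated_index_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    # float32 vectors: 4 bytes per dimension; ignores HNSW graph overhead
    return num_vectors * dimensions * bytes_per_value

for docs in (100_000, 1_000_000, 10_000_000):
    gb = estimated_index_bytes(docs, dimensions=1024) / 1e9
    print(f"{docs:>10,} vectors x 1024 dims ~= {gb:.1f} GB")
```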
So why does all of this point back to RAM? It comes down to how HNSW traversal works at a low level. When mongot traverses the HNSW graph during a query, it hops between nodes in an essentially random access pattern. Each hop lands at a different location in memory, and at each stop, it needs to read a vector value to compute similarity.
If that value is in RAM, the read takes nanoseconds. If it has to be fetched from disk, that same read takes microseconds. That's a thousand times slower. And because a single query may traverse dozens to hundreds of graph nodes, one slow disk read per hop adds up fast.
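A quick back-of-the-envelope illustration; the hop count and per-read latencies are round, illustrative numbers rather than measurements:

```python
hops_per_query = 200      # illustrative number of graph nodes touched per query
ram_read_s  = 100e-9      # ~100 nanoseconds for a read served from RAM
disk_read_s = 100e-6      # ~100 microseconds for a read that goes to disk

print(f"all hops served from RAM : {hops_per_query * ram_read_s * 1e3:.3f} ms")
print(f"all hops served from disk: {hops_per_query * disk_read_s * 1e3:.1f} ms")
```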
This is the RAM rule for vector search. The entire HNSW graph and your vector data need to fit in mongot's available memory for consistently low-latency queries. The moment your index exceeds available RAM, mongot starts reading from disk and query latency becomes unpredictable. This is the thread that connects every technique we'll cover in this skill badge. Quantization, partial indexing with views, dedicated search nodes: all of these are ultimately about one thing, keeping your index in RAM.
Awesome work. Before we move on to the techniques for optimizing vector search performance, let's recap what we've learned. In this video, we covered what vectors are and how embedding models produce them. We also looked at how the HNSW algorithm enables fast approximate search through a multilayered graph. From there, we explored the two-process architecture of mongod and mongot, how memory is divided between them, and why keeping your Lucene index data in RAM is the defining factor in query latency.
