RAG with MongoDB / Describe Chunking Strategies

7:13
By now, you're familiar with how a RAG system works. In this process, how you break up your data into chunks matters. The strategy that you choose can determine how accurate and relevant the LLM's responses will be to the user's query. In this video, we're going to discuss why chunking matters and the components that make up a chunking strategy. Let's get started.

Chunking is the process of breaking up content into smaller parts. As mentioned earlier, those parts are stored in a database like an Atlas cluster. We can then use Atlas Vector Search to find relevant chunks based on a user's query. Chunks can be the exact data from the source document or include some transformed version that provides more useful information to the LLM. For example, imagine our source data is code for a mobile app. Some chunking strategies use LLMs to summarize each chunk of code that is created when we break up the source data. Then vector embeddings are created for these text summaries and stored in a document alongside the corresponding chunk of raw code. We could have a summary that tells us something like "the code in this chunk is an excerpt from a file that handles the mobile app's network connections," and so on. Now, if we're searching for code related to network connections, it is more likely that relevant chunks will be retrieved. Summarization, like in this example, is a more advanced technique, so we won't cover it in this lesson. But it's important to know what's possible when you're considering chunking strategies for your use case. And to that end, it's also important to keep in mind that modern AI models are increasingly multimodal, meaning that they can understand more than just text. So while this video will cover simple code and text examples, you can chunk data that includes images, videos, and audio too.

You may be wondering, why do we have to chunk at all? Why don't we just feed the LLM the entire source document and let it determine what is relevant to the user's query? Well, we chunk for two reasons. First, LLMs have context windows, and we chunk data so that it can fit into an LLM's context window. The context window is the maximum number of tokens that the LLM can consider at one time. Each token is a group of roughly four characters, so the size of the context window directly limits how much input you can give the model. If a model reaches its token limit, it will either begin to forget earlier tokens to make room for new ones, or stop generating and return its response immediately, even if it's midway through a word or sentence. So instead of inputting a single document with thousands or millions of tokens, which could exceed the LLM's token limit and overload the context window, we break the source document into chunks that each contain fewer tokens. This way, we only send the LLM relevant chunks that fit in the context window.

For example, imagine our LLM has a token limit of thirty-two thousand tokens and our source has fifty thousand tokens' worth of data. Instead of feeding the LLM the source with fifty thousand tokens and overflowing its context window, we could break up the source into smaller chunks using a chunk size that makes sense for our data. For example, we could end up with fifty chunks, with each chunk containing approximately one thousand tokens. Then we'd send only the most relevant chunks, which will now easily fit into the context window along with the necessary metadata.
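To make that concrete, here is a minimal sketch of naive fixed-size chunking in plain Python. The function name is hypothetical, and it uses the rough four-characters-per-token estimate from this video; a production pipeline would count tokens with the model's actual tokenizer rather than this approximation.

```python
# A minimal sketch of breaking a large source into fixed-size chunks so each
# one fits in a context window. Uses the rough "4 characters per token"
# estimate from the video; a real pipeline would use the model's tokenizer.

def chunk_by_approx_tokens(text: str, chunk_size_tokens: int = 1000) -> list[str]:
    """Split text into pieces of roughly `chunk_size_tokens` tokens each."""
    chars_per_chunk = chunk_size_tokens * 4  # ~4 characters per token
    return [
        text[i : i + chars_per_chunk]
        for i in range(0, len(text), chars_per_chunk)
    ]

# A 50,000-token source (~200,000 characters) becomes roughly 50 chunks of
# about 1,000 tokens each; only the most relevant chunks are later sent to
# the LLM, so they fit comfortably in a 32,000-token context window.
source_text = "..."  # placeholder for the full source document
chunks = chunk_by_approx_tokens(source_text, chunk_size_tokens=1000)
print(f"Produced {len(chunks)} chunks")
```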
You might have heard that the amount of data that LLMs can process is constantly growing. For example, at the time of this video, several LLMs can fit well over a hundred thousand tokens in their context window. So will we still need to chunk data as LLMs improve? The answer is yes, because chunking also improves the accuracy, relevancy, and precision of results. Even with a larger context window, LLMs are slow to ingest that many tokens, they can still lose track of tokens when processing a large input because it is not obvious which tokens are most important, and they require enormous resources to do so. Maybe you've heard of the needle-in-a-haystack test that is used to evaluate LLM and RAG system performance. This test assesses an LLM's ability to retrieve or recognize a specific piece of information from a vast amount of unstructured text. Breaking a large source of data into small chunks can help your RAG application pass this test by more efficiently searching through and retrieving only the most relevant data.

Now that we know why we chunk, let's look at how we would develop a strategy to break up our source data. All chunking strategies have three components that you can adjust to fine-tune results: chunk size, chunk overlap, and a splitting technique.

Chunk size is the maximum number of tokens contained in each chunk. This value should always be less than the token limit for the LLM that you are using in order for your data to fit in the context window. In most cases, we will end up sending multiple chunks along with metadata, so all of that data must fit comfortably in the context window. So if an LLM has a context window of two thousand forty-eight tokens, each chunk should probably be no bigger than two to three hundred tokens. Just keep in mind that the smaller the chunk size, the more chunks you will have, which could require additional memory and storage.

Chunk overlap is the number of overlapping tokens between two adjacent chunks. This overlap will create duplicate data across chunks. Overlap can help preserve context between chunks and improve your results. A larger chunk overlap will result in chunks sharing more common tokens, while a smaller chunk overlap will result in chunks sharing fewer common tokens. While this duplication might appear unusual to the human eye, it significantly increases the chances of generating a more contextually rich prompt for the LLM. It also makes it less likely that we will send incomplete information to an LLM. However, depending on your use case, you may end up using a strategy where it doesn't make sense to have overlapping chunks.

And finally, the splitting technique determines where one chunk will end and the next will begin. Splitting techniques can range from naive, like splitting a text by character or token, to incredibly complex, like using an LLM to semantically split your data for you. To make our chunking strategy, we combine these three components, as in the sketch after the recap below.

Let's recap what we covered. Chunking takes place in the transformation stage of the RAG data ingestion pipeline. It is the process of breaking up source data into smaller chunks. We chunk for two reasons: to break data into smaller chunks that can easily fit into an LLM's context window, and to improve the accuracy, relevancy, and precision of results. All chunking strategies have three key components that you can adjust to fine-tune results: chunk size, chunk overlap, and splitting technique.
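Here's what those three components can look like together in code. This is a minimal sketch, assuming the langchain-text-splitters package (one common choice, not the only option); the file name is a placeholder for your own source document.

```python
# A sketch combining the three components from this lesson: chunk size,
# chunk overlap, and a splitting technique. Assumes langchain-text-splitters.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # chunk size: maximum length of each chunk (measured in characters here)
    chunk_overlap=20,  # chunk overlap: length shared between adjacent chunks
    separators=["\n\n", "\n", " ", ""],  # splitting technique: prefer paragraph, then line, then word boundaries
)

source_text = open("source_document.txt").read()  # placeholder source document
chunks = splitter.split_text(source_text)
print(f"Produced {len(chunks)} chunks")
```

Note that this splitter measures chunk size in characters by default; if you want it measured in tokens instead, text splitters can typically be configured with a tokenizer-based length function (for example, one built on tiktoken).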
Once you decide on a chunking strategy or strategies for your application, it's important to continue to evaluate them.