RAG with MongoDB / Describe Chunking Strategies

9:30

Chapters

0:00 - Introduction

1:30 - Identifying data sources

2:41 - Sanitize the data

3:56 - Chunk the data

5:58 - Generate metadata

7:39 - Generate embeddings and store data

8:58 - Recap

Large language models are trained on a wide array of diverse data, but there are still gaps in their knowledge. RAG helps fill in these gaps by supplementing the LLM with additional sources of data. With RAG, we can use all kinds of data. It can be proprietary or public, and it can come in a variety of unstructured formats like text, video, or even images. You might be asking yourself, how? Well, let's take a look together by preparing the data for our own RAG system.

In this lesson, we'll learn how to ingest data. Ingestion can be broken down into a three-step process: identifying data sources, preparing the data, and storing the data. We'll spend most of our time preparing the data, where we'll sanitize it, chunk it into smaller sections, and create embeddings. If these terms are new to you, don't worry. We'll cover everything you need to get started in this video.

Before we get into ingestion, let's recap how RAG works. The user provides a query, which is vectorized and used to perform a vector search on the chunks of data in our MongoDB database. Chunks of data are returned and combined with the original query to form a prompt, which is sent to the LLM. The LLM then uses the chunks of data along with its own training data to generate a response to the query. Now let's learn more about identifying, preparing, and storing data. Just a quick note before we begin: we'll provide all the code you need after this video, and you'll see in the code that we've already set up PyMongo and are connected to our cluster.

Let's start by identifying which data sources to use for our RAG system. This is an important step since it'll inform how we go about sanitizing the data in the next step. We want to create an app that can answer questions about MongoDB. However, we don't want to rely only on the LLM's own training data because the LLM wasn't specifically trained to answer questions about MongoDB. The good news is we have tons of documentation and books we can rely on to help answer these questions. For this example, we'll use The Little MongoDB Book as our data source since it has a good explanation of MongoDB's functionality. Books come in various formats, but for our purposes we'll use the PDF version.

We'll use the PyPDFLoader package to parse the PDF since LangChain has built-in support for it. To use it, we specify the location of the PDF inside the loader. Then we load the contents of the PDF. The package splits it up by page and stores each page as an element in an array. Let's print the first page in the terminal. Here we see that the first page is empty, which makes sense since it's the cover of the book. We can also see that it returns metadata. We'll explore metadata a little later in the video.

Now that we have our data, let's sanitize it. Sanitizing is the process of removing any sensitive or unnecessary data. Keep in mind, we'll create embeddings for the data we ingest. We don't want to compromise performance and accuracy, or waste money, by generating embeddings for unnecessary data. When working with data in multiple formats, sanitizing can be tricky. Say you have one unstructured data source that's a PDF and another that's structured, like an HTML page. Parsing a PDF is different from parsing an HTML page, so it can get complicated. One way to manage this is by converting everything to the same format. Markdown is a good option since it can be programmatically transformed and parsed easily. In our case, we're just using a PDF, so we don't have to convert it to markdown, but we still have work to do.
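Before we move on, here's a minimal sketch of the loading step described above. The file path is illustrative, and the exact import path can vary between LangChain versions.

```python
# Load the PDF and split it into one Document per page.
# Note: "mongodb-book.pdf" is a placeholder path for the book's PDF.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("mongodb-book.pdf")
pages = loader.load()

# The first element is the empty cover page, but it still carries
# metadata such as the source file and page number.
print(pages[0].page_content)
print(pages[0].metadata)
```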
Remember how the first page of our PDF was empty? Let's make sure we filter out any pages with no content. To do this, we'll create a cleaned pages array for the pages we want to keep. We'll loop through the pages of our PDF, and if a page has more than twenty words, we'll push it to the cleaned pages array. Now let's print the first element of the cleaned pages array. Instead of returning the empty cover page, it returns the first page with content. Nice. Now that we have the data we need, it's time to chunk it.

In a RAG system, chunking means breaking large text into smaller pieces. This allows the retrieval component of a RAG system to efficiently index and search through the data. Remember, the chunks are used as context in the prompt that is sent to the model. Having small, relevant chunks helps the model generate more precise and contextually appropriate responses. There are many ways to chunk data: by sentence, paragraph, page, or section, or even semantically. New chunking methods are constantly being developed. So how do you determine which method to use? You need to experiment. This will help you see what returns the best results for your data and common queries. Sadly, there is no one-size-fits-all solution.

Currently, our PDF is split by page. We could keep it like this, but I know that each page has multiple paragraphs that cover different topics. I think it would be best if we chunk each paragraph. That way, each chunk is relevant to a specific topic. To do this, I'll import the RecursiveCharacterTextSplitter from LangChain. We'll assign the recursive text splitter to a variable named text splitter. Next, I need to set the size of each chunk and the chunk overlap. I'll estimate the size of a paragraph here, and we'll measure the size of each chunk in tokens. Each chunk will be a maximum of five hundred tokens, with a one hundred and fifty token overlap with the previous chunk. Token overlap ensures that important contextual information is preserved across different chunks of text. After that, we'll use the split documents method from the text splitter, pass in the cleaned pages, and assign the result to a variable named split docs, which contains our chunks.

Let's go ahead and print one of the chunks. Last time we printed the first element, so this time let's check out a different one, such as the chunk at index twenty-one. After running it, we notice that our chunks are smaller now. Additionally, this specific chunk focuses on a single topic, MongoDB collections, while the overlapping sections are about MongoDB databases and documents. So now we have meaningful chunks.
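Here's a rough sketch of the filtering and chunking steps we just walked through. It assumes the `pages` list from the loader sketch above, and it uses a tiktoken-based splitter so that chunk size is measured in tokens; the import path and the tiktoken dependency are assumptions about your LangChain setup.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Keep only pages with real content (more than twenty words),
# which drops empty pages like the cover.
cleaned_pages = [page for page in pages if len(page.page_content.split()) > 20]
print(cleaned_pages[0].page_content)  # first page with actual text

# Split each page into roughly paragraph-sized chunks of at most 500 tokens,
# with a 150-token overlap so context carries across chunk boundaries.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=150,
)
split_docs = text_splitter.split_documents(cleaned_pages)
print(split_docs[21].page_content)  # inspect an arbitrary chunk
```

If you'd rather measure chunks in characters instead of tokens, the plain RecursiveCharacterTextSplitter constructor accepts the same chunk_size and chunk_overlap arguments.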
Now it's time to think about metadata. We can leverage metadata to improve performance in our vector search. Atlas Vector Search allows you to pre-filter the data before performing a search on the vectors, which narrows the search space and helps you efficiently find relevant results. To take advantage of this, we need meaningful metadata to filter on. Currently, we have metadata on the data source and the page number, but let's collect additional metadata. There are many ways we could do this, but we'll use an LLM to generate additional metadata fields. Let's use the create metadata tagger from LangChain. For the LLM, we'll use GPT, but feel free to use any generative model. We'll put this code between the text splitter and split docs variables.

First, we define a schema with the metadata we want, which instructs the LLM on what data to extract from each chunk. We'll get the title, an array of keywords, and a Boolean indicating whether the chunk has code. Next, we specify the LLM and provide the API key. After that, we use the create metadata tagger method, passing in the schema and the LLM, and assign the result to a variable called document transformer. Then we create a variable named docs by calling the document transformer's transform documents method with the cleaned pages as the argument. This generates metadata for each page of the PDF. Finally, we update split docs by passing the docs variable into the text splitter. This chunks the pages and ensures each chunk has relevant metadata. Let's print the first element in the split docs array to see what the metadata looks like. The metadata includes keywords relevant to the chunk along with the title and other fields. Nice!

Now that we have our chunks and metadata, let's generate embeddings for each chunk and store them in Atlas. For this we'll use OpenAI's embedding model, which we'll import. We'll also import MongoDB Atlas Vector Search from LangChain. Now let's generate embeddings. First, we assign the embedding model to a variable named embeddings. After that, we call MongoDB Atlas Vector Search, which accepts the chunks we created, the embedding model, and the collection where we'll store our chunks and their embeddings. That's it! Now we can run the script and wait for the magic to happen. It can take some time to ingest all the data. You'll find a rough sketch of these metadata and storage steps in code at the end of this lesson.

Once the ingestion is complete, we can inspect the data in Atlas. As you can see, the collection has one hundred and seventy-three documents. When we click on the collection, we can see each chunk along with its metadata and embeddings. This is pretty cool, right? Our PDF source probably won't be updated frequently, but keeping your chunks up to date could be a concern depending on your source data. One way to keep your chunk data up to date is with a scheduled Atlas trigger or a cron job. You can pull in new or updated information from your data sources on a set schedule, then chunk it, generate metadata, create embeddings, and store it. This ensures that your RAG system always has fresh data.

Great job on learning a lot of new information. Let's recap what we learned. First, we learned that data ingestion can be broken down into a three-step process: identifying the data, preparing it, and storing it. When preparing our data, we sanitized it by removing unnecessary data. Next, we chunked our data, which means we broke large texts into smaller pieces. This allows the retrieval component of a RAG system to efficiently index and search through the data. Finally, we wrote a script that generates metadata and embeddings for each chunk and then inserts them into Atlas. Stick around for the next video, where we'll build the retrieval component of our RAG system. See you there.
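As mentioned above, here's a rough sketch of the metadata and storage steps from this lesson. It assumes the `cleaned_pages` and `text_splitter` variables from the earlier snippets, a PyMongo `collection` object that's already connected to your cluster, and an OpenAI API key in your environment; the model name, schema field names, and import paths are assumptions, so adjust them to your setup.

```python
from langchain_community.document_transformers.openai_functions import create_metadata_tagger
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

# Schema that tells the LLM which metadata fields to extract from each page:
# a title, a list of keywords, and whether the page contains code.
schema = {
    "properties": {
        "title": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "hasCode": {"type": "boolean"},
    },
    "required": ["title", "keywords", "hasCode"],
}

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # any capable chat model works here
document_transformer = create_metadata_tagger(schema, llm)

# Tag each page with metadata, then chunk the tagged pages so every
# chunk inherits the metadata of the page it came from.
docs = document_transformer.transform_documents(cleaned_pages)
split_docs = text_splitter.split_documents(docs)
print(split_docs[0].metadata)

# Generate an embedding for every chunk and store the chunks, their
# metadata, and the vectors in the Atlas collection.
embeddings = OpenAIEmbeddings()
vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents=split_docs,
    embedding=embeddings,
    collection=collection,  # PyMongo collection set up earlier in the course
)
```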