RAG with MongoDB / Identify RAG Architecture

If you've chatted with a large language model like GPT or Claude, you may have noticed that the responses aren't always accurate. Despite being trained on massive and diverse datasets, these LLMs still have limitations. For instance, they lack access to proprietary information and to data newer than their last training update. It's akin to viewing a pixelated image of the entire Internet: you can see the basic outlines, but the specifics blur as you zoom in. But what if there were a way to make that image clearer? The good news is that with Retrieval Augmented Generation, or RAG, we can. In this video, we'll explore what Retrieval Augmented Generation is and how it works.

Before we dive into RAG, let's take a moment to discuss why RAG is necessary. LLMs are incredible, but they have limitations. As I mentioned, LLMs don't always provide a correct response; an incorrect or fabricated response is commonly referred to as a hallucination. There are a few reasons why an LLM might hallucinate. It could be because of insufficient data or stale data. It could also occur if we've overloaded the LLM's context window. Let's talk more about each of these issues.

Starting with insufficient data: the model may not provide accurate answers to questions about internal company data. This is because the model was trained on public data, not private data. It will attempt to answer with some general information, but it won't be able to answer accurately.

LLMs can also produce incorrect responses when we're inquiring about public data. This is generally because of stale data. It's important to remember that models aren't updated in real time. LLMs are trained on a dataset that includes data up to a specific date, and they don't have accurate information about developments after that date. So you can't depend on an LLM to reply with the most recent information.

Finally, a model may hallucinate if its context window is overloaded. The context window is the amount of information in the prompt, like a question and additional context, that the LLM can consider at one time, and this is limited. This is also referred to as the model's token limit. You might be wondering what a token is. A token roughly corresponds to a word or a piece of a word. If a model reaches its token limit, it begins forgetting earlier tokens to make room for new ones, and this loss of context can result in subpar responses.

While these limitations are serious, we can mitigate them by using RAG. RAG lets us use our own data to extend the LLM's knowledge. It does this by combining an embedding model and a generative model to retrieve relevant information from a dataset and provide detailed responses based on that information. For example, we could provide our own internal company data to the model as context so its responses to queries about the company are more relevant. We could also provide recent data to the model to make sure its responses are current, instead of waiting for new versions of the model. And finally, we could avoid overloading the model's context window by designing our RAG system to use the model's tokens efficiently: we should only provide the most relevant and necessary context to answer the question. This will result in the most accurate responses.

Okay, so we know what RAG is and why it improves the accuracy of LLM responses. Now let's see how it works with Atlas. To start, we need to identify relevant data that can be used as context. This data can be in any form, such as PDFs, media, MongoDB documents, tabular data, and more.
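For instance, here's a minimal sketch of pulling raw text out of one such source, a PDF. It assumes the pypdf package, and the file name handbook.pdf is a hypothetical stand-in for your own document; any loader that suits your data source would work just as well.

```python
# A minimal sketch: extract raw text from a PDF so it can be prepared for RAG.
# Assumes the pypdf package is installed; "handbook.pdf" is a hypothetical file.
from pypdf import PdfReader

reader = PdfReader("handbook.pdf")
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"Extracted {len(raw_text)} characters from {len(reader.pages)} pages")
```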
Once we have the data, in most cases it needs to be sanitized and split into chunks. Don't worry if these terms are new to you; we'll learn more about them in future videos. After the chunks are created, we send them to an embedding model to generate embeddings for each chunk. Finally, the chunks and their embeddings are stored in an Atlas cluster.

Now, let's examine what happens when a user asks a question. The user's query will be vectorized by an embedding model, and Atlas Vector Search will use it to find chunks of data relevant to the query. Once we've got the relevant chunks, we piece together the prompt. The prompt consists of the user's query and the relevant chunks, which are provided as context in natural language. Then the prompt is sent to the generative model. Note that the generative model is different from the embedding model. The generative model uses the provided context to formulate a response to the query. Get excited, because throughout this unit we'll have an opportunity to explore each component of RAG.

Before we move on, let's recap what we learned. First, we learned about the limitations of LLMs, such as hallucinations, operating off of insufficient or stale data, and token limits. We can solve these problems by using RAG to provide specific data to the model as context along with our queries. After that, we learned that RAG combines both an embedding model and a generative model to retrieve relevant information from a dataset and generate detailed responses based on that information. Finally, we explored how RAG works. In the next video, we'll take a look at some integrations and frameworks that could help us create a RAG system. See you there.
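For reference, here's a minimal sketch of the flow described in this video: storing chunks with their embeddings in an Atlas cluster, then answering a question by retrieving relevant chunks with Atlas Vector Search and passing them to a generative model. The connection string, database, collection, and index names are placeholders, and embed() and generate() are hypothetical helpers standing in for whichever embedding and generative models you choose.

```python
# A minimal sketch of the RAG flow with Atlas Vector Search, under the assumptions
# described above. The angle-bracket value and the embed()/generate() helpers are
# placeholders, not a specific product API.
from pymongo import MongoClient

client = MongoClient("<ATLAS_CONNECTION_STRING>")   # placeholder connection string
collection = client["rag_db"]["chunks"]             # hypothetical database and collection

def embed(text: str) -> list[float]:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call your generative model of choice here."""
    raise NotImplementedError

def ingest(chunks: list[str]) -> None:
    # Store each chunk alongside its embedding in the Atlas cluster.
    collection.insert_many([{"text": c, "embedding": embed(c)} for c in chunks])

def answer(question: str) -> str:
    # 1. Vectorize the user's query with the embedding model.
    query_vector = embed(question)

    # 2. Use Atlas Vector Search to find the chunks most relevant to the query.
    #    Assumes a vector search index named "vector_index" on the "embedding" field.
    results = collection.aggregate([
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": 100,
                "limit": 5,
            }
        },
        {"$project": {"_id": 0, "text": 1}},
    ])
    context = "\n\n".join(doc["text"] for doc in results)

    # 3. Piece together the prompt: the user's query plus the retrieved chunks as context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 4. Send the prompt to the generative model (a different model from the embedding model).
    return generate(prompt)
```

Keeping the retrieval limit small is one way to use the generative model's tokens efficiently, which addresses the context window concern discussed earlier.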