Code Summary: Preparing the Data
Below is an overview of the code that sanitizes the data, chunks it, generates metadata, creates embeddings, and stores it in MongoDB. You can also view and fork the code from the Curriculum GitHub repository.
Prerequisites
- Atlas Cluster Connection String
- OpenAI API Key
Usage
Install the requirements:
pip3 install langchain langchain_community langchain_core langchain_openai langchain_mongodb pymongo pypdf
Create a key_param.py file with the following content:
MONGODB_URI = "<your_atlas_connection_string>"
LLM_API_KEY = "<your_llm_api_key>"
load_data.py file
This code ingests a PDF, removes near-empty pages, tags each page with metadata using an LLM, splits the pages into overlapping chunks, creates embeddings for each chunk, and stores the chunks and their embeddings in a MongoDB collection.
from pymongo import MongoClient
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_transformers.openai_functions import (
    create_metadata_tagger,
)

import key_param

# Set the MongoDB URI, database, and collection names
client = MongoClient(key_param.MONGODB_URI)
dbName = "book_mongodb_chunks"
collectionName = "chunked_data"
collection = client[dbName][collectionName]

# Load the PDF and keep only pages with more than 20 words
loader = PyPDFLoader("./sample_files/mongodb.pdf")
pages = loader.load()

cleaned_pages = []
for page in pages:
    if len(page.page_content.split(" ")) > 20:
        cleaned_pages.append(page)

# Split text into 500-character chunks with a 150-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=150)

# Schema describing the metadata the LLM should extract for each page
schema = {
    "properties": {
        "title": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "hasCode": {"type": "boolean"},
    },
    "required": ["title", "keywords", "hasCode"],
}

llm = ChatOpenAI(
    openai_api_key=key_param.LLM_API_KEY, temperature=0, model="gpt-3.5-turbo"
)

# Tag each cleaned page with title, keywords, and hasCode metadata, then chunk it
document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)
docs = document_transformer.transform_documents(cleaned_pages)
split_docs = text_splitter.split_documents(docs)

# Generate embeddings and store the chunks and their vectors in MongoDB Atlas
embeddings = OpenAIEmbeddings(openai_api_key=key_param.LLM_API_KEY)

vectorStore = MongoDBAtlasVectorSearch.from_documents(
    split_docs, embeddings, collection=collection
)
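Once load_data.py has run, a quick pymongo check confirms what landed in the collection. The sketch below is not part of the curriculum code; it assumes the same key_param.py module, the database and collection names used above, and the default field names ("text" and "embedding") that MongoDBAtlasVectorSearch writes, with the tagger's metadata (title, keywords, hasCode) stored as top-level fields.

from pymongo import MongoClient

import key_param

# Connect to the same database and collection that load_data.py populated
client = MongoClient(key_param.MONGODB_URI)
collection = client["book_mongodb_chunks"]["chunked_data"]

# Count the stored chunks
print("Chunks stored:", collection.count_documents({}))

# Inspect one chunk: its metadata, embedding size, and a text preview
# (field names assume the MongoDBAtlasVectorSearch defaults used above)
doc = collection.find_one()
if doc:
    print("Title:", doc.get("title"))
    print("Keywords:", doc.get("keywords"))
    print("Has code:", doc.get("hasCode"))
    print("Embedding length:", len(doc.get("embedding", [])))
    print("Text preview:", doc.get("text", "")[:200])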