RAG with MongoDB / Describe Chunking Strategies

Code Summary: Preparing the Data 

Below is an overview of the code that sanitizes the data, chunks it, generates metadata, creates embeddings, and stores it in MongoDB. You can also view and fork the code from the Curriculum GitHub repository.

Prerequisites

  • Atlas Cluster Connection String
  • OpenAI API Key
  • Voyage AI API Key

Usage

Install the requirements:

pip3 install langchain langchain_community langchain_core langchain_voyageai langchain_openai langchain_mongodb pymongo pypdf


Create a key_param.py file with the following content (the script imports it as a Python module):

MONGODB_URI = "<your_atlas_connection_string>"
VOYAGE_API_KEY = "<your_voyageai_api_key>"
LLM_API_KEY = "<your_llm_api_key>"


Note: Replace the MONGODB_URI, VOYAGE_API_KEY, and LLM_API_KEY placeholders with your own values.

Load the sample data into your Atlas Cluster:

python load_data.py


load_data.py file

This code ingests a PDF, removes empty pages, chunks the pages into paragraphs, generates metadata for the chunks, creates embeddings, and stores the chunks and their embeddings in a MongoDB collection.
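Before walking through the script, the core chunking idea can be sketched in plain Python. This is a simplified, fixed-window illustration only: the real script uses LangChain's RecursiveCharacterTextSplitter, which prefers paragraph and sentence boundaries rather than hard character offsets. The 20-word threshold and the 500/150 chunk sizes mirror the values used below.

```python
def clean_pages(pages, min_words=20):
    """Keep only pages with more than min_words words (drops near-empty pages)."""
    return [p for p in pages if len(p.split(" ")) > min_words]

def chunk_text(text, chunk_size=500, overlap=150):
    """Split text into fixed-size chunks, each overlapping the previous by `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

pages = ["Title page", "MongoDB stores data as BSON documents. " * 30]
kept = clean_pages(pages)          # the title page is dropped
chunks = chunk_text(kept[0])       # overlapping 500-character windows
```

The overlap means each chunk repeats the last 150 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.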

The embedding model used in the examples below is voyage-3.5-lite, a state-of-the-art embedding model from Voyage AI designed for efficient, high-quality text retrieval. It supports multiple embedding dimensions (2048, 1024, 512, and 256) and offers various quantization options, including int8 and binary. voyage-3.5-lite is suitable for a wide range of domains, such as technical documentation, code, law, finance, web reviews, long documents, and conversations.
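The dimension and quantization options trade retrieval quality against storage and index size. A rough back-of-the-envelope calculation (illustrative only, not Voyage AI's exact on-disk format) shows why the smaller options matter at scale:

```python
def bytes_per_vector(dims, dtype_bits):
    """Storage for one embedding: dims values at dtype_bits bits each."""
    return dims * dtype_bits // 8

full = bytes_per_vector(1024, 32)   # 1024-dim float32: 4096 bytes
small = bytes_per_vector(256, 8)    # 256-dim int8: 256 bytes
ratio = full // small               # 16x smaller per vector
```

For a million chunks, that is the difference between roughly 4 GB and 256 MB of raw vector data.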

To get started with Voyage AI, check out the Voyage AI documentation.

from pymongo import MongoClient
from langchain_openai import ChatOpenAI
from langchain_voyageai import VoyageAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_transformers.openai_functions import (
    create_metadata_tagger,
)

import key_param

# Set the MongoDB URI, DB, Collection Names

client = MongoClient(key_param.MONGODB_URI)
dbName = "book_mongodb_chunks"
collectionName = "chunked_data"
collection = client[dbName][collectionName]

# Load the PDF and keep only pages with more than 20 words
loader = PyPDFLoader("./sample_files/mongodb.pdf")
pages = loader.load()
cleaned_pages = []

for page in pages:
    if len(page.page_content.split(" ")) > 20:
        cleaned_pages.append(page)

# Split pages into 500-character chunks with 150 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=150)

# Metadata schema the LLM will populate for each page
schema = {
    "properties": {
        "title": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "hasCode": {"type": "boolean"},
    },
    "required": ["title", "keywords", "hasCode"],
}

llm = ChatOpenAI(
    openai_api_key=key_param.LLM_API_KEY, temperature=0, model="gpt-3.5-turbo"
)

document_transformer = create_metadata_tagger(metadata_schema=schema, llm=llm)

docs = document_transformer.transform_documents(cleaned_pages)

split_docs = text_splitter.split_documents(docs)

embeddings = VoyageAIEmbeddings(voyage_api_key=key_param.VOYAGE_API_KEY, model="voyage-3.5-lite")


# Embed each chunk and store the chunks and embeddings in the Atlas collection
vectorStore = MongoDBAtlasVectorSearch.from_documents(
    split_docs, embeddings, collection=collection
)