Code Summary: Using vector embeddings with MongoDB

The following summarizes the code used to generate Voyage AI embeddings and store them in MongoDB Atlas.

Prerequisites

  • MongoDB Atlas Cluster
  • Python
  • Voyage AI API key

Usage

Setting Up the Environment

The following loads environment variables from a .env file, then initializes a Voyage AI client and a MongoDB client and connects to a specific database and collection. voyageai.Client() reads the API key from the VOYAGE_API_KEY environment variable, and the MongoDB connection string comes from MONGODB_URI. In production, use a secrets manager instead of a .env file.

import voyageai
import pymongo
from dotenv import load_dotenv
import os

load_dotenv()

vo = voyageai.Client()
client = pymongo.MongoClient(os.getenv("MONGODB_URI"))
db = client["mydatabase"]
collection = db["mycollection"]
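
If either variable is missing, the clients above fail later with less obvious errors. As a minimal sketch, assuming the variable names VOYAGE_API_KEY and MONGODB_URI, the settings could be validated up front. The helpers require_env and load_settings are hypothetical names introduced for illustration, not part of either library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def load_settings() -> dict:
    """Collect the two settings the setup code depends on (hypothetical helper)."""
    # VOYAGE_API_KEY is the variable voyageai.Client() reads by default;
    # MONGODB_URI is the name used in the setup snippet.
    return {
        "voyage_api_key": require_env("VOYAGE_API_KEY"),
        "mongodb_uri": require_env("MONGODB_URI"),
    }
```

Calling load_settings() right after load_dotenv() would surface a missing key before any network call is made.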

Creating the Sample Data

The following defines a list of 15 documents, each with a title, description, and category, spanning four categories: History, Health, Technology, and Art.

data = [
    {
        "title": "Ancient Roman Architecture",
        "description": "The Romans perfected the use of arches, vaults, and concrete in their monumental buildings. The Colosseum and Pantheon showcase their engineering brilliance and aesthetic vision.",
        "category": "History",
    },
    {
        "title": "Mediterranean Diet Benefits",
        "description": "A dietary pattern rich in olive oil, fish, vegetables, and whole grains. Studies show it reduces heart disease risk and promotes longevity through anti-inflammatory compounds.",
        "category": "Health",
    },
    {
        "title": "Machine Learning Fundamentals",
        "description": "Algorithms that enable computers to learn from data without explicit programming. Neural networks, decision trees, and gradient descent are core concepts in this field.",
        "category": "Technology",
    },
    {
        "title": "Greek Classical Architecture",
        "description": "Ancient Greek structures featured columns, symmetry, and mathematical precision. The Parthenon exemplifies their dedication to proportion and harmony in building design.",
        "category": "History",
    },
    {
        "title": "Artificial Intelligence in Healthcare",
        "description": "AI systems analyze medical images, predict patient outcomes, and assist in diagnosis. Deep learning models can detect diseases earlier than human experts in some cases.",
        "category": "Technology",
    },
    {
        "title": "Nutritional Science and Longevity",
        "description": "Research links specific eating patterns with extended lifespan. Caloric restriction, antioxidant-rich foods, and healthy fats play crucial roles in cellular health and aging.",
        "category": "Health",
    },
    {
        "title": "Renaissance Art Techniques",
        "description": "Artists developed linear perspective, chiaroscuro, and realistic human anatomy rendering. Masters like Leonardo and Michelangelo revolutionized visual representation.",
        "category": "Art",
    },
    {
        "title": "Deep Learning Neural Networks",
        "description": "Multi-layered networks that process information hierarchically, mimicking brain structure. Convolutional and recurrent architectures excel at image and sequence tasks respectively.",
        "category": "Technology",
    },
    {
        "title": "Plant-Based Nutrition",
        "description": "Diets centered on vegetables, fruits, legumes, and nuts provide fiber, vitamins, and phytonutrients. Research suggests reduced cancer and diabetes risk compared to meat-heavy diets.",
        "category": "Health",
    },
    {
        "title": "Gothic Cathedral Construction",
        "description": "Medieval builders created soaring structures with pointed arches, flying buttresses, and stained glass. Notre-Dame and Chartres demonstrate vertical emphasis and light manipulation.",
        "category": "History",
    },
    {
        "title": "Computer Vision Applications",
        "description": "Systems that interpret visual information from cameras and sensors. Object detection, facial recognition, and autonomous vehicle navigation rely on these technologies.",
        "category": "Technology",
    },
    {
        "title": "Impressionist Painting Movement",
        "description": "Artists like Monet and Renoir captured light and movement through loose brushwork and pure color. They painted outdoors to depict changing atmospheric conditions authentically.",
        "category": "Art",
    },
    {
        "title": "Gut Microbiome and Health",
        "description": "Trillions of bacteria in the digestive system influence immunity, metabolism, and mental health. Fermented foods and diverse fiber sources promote beneficial microbial communities.",
        "category": "Health",
    },
    {
        "title": "Natural Language Processing",
        "description": "Computational techniques for understanding and generating human language. Transformers and attention mechanisms enable translation, summarization, and conversational AI systems.",
        "category": "Technology",
    },
    {
        "title": "Baroque Artistic Drama",
        "description": "Characterized by intense emotion, dramatic lighting, and dynamic movement. Caravaggio and Bernini created theatrical compositions that engaged viewers emotionally.",
        "category": "Art",
    },
]

Generate and Store Vector Embeddings

The following embeds each document's description using voyage-4 and attaches the resulting vectors back to their source documents. It then clears the collection, inserts all 15 documents into MongoDB, and prints a confirmation with the embedding dimensionality.

# Extract text for embedding - use description only
texts_to_embed = [item["description"] for item in data]

# Generate embeddings
result = vo.embed(texts_to_embed, model="voyage-4", input_type="document")

# Add embeddings to the data
for i, item in enumerate(data):
    item["embedding"] = result.embeddings[i]

# Insert into MongoDB
collection.delete_many({})  # Clear existing data
collection.insert_many(data)

print(f"Successfully embedded and inserted {len(data)} documents")
print(f"Sample embedding dimensions: {len(data[0]['embedding'])}")
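
The snippet above sends all 15 descriptions in one request, which is fine for a small sample. For larger datasets, requests usually need to be split into batches. The following is a minimal sketch of that idea, not the documented approach. The helper embed_in_batches and the default batch size of 128 are assumptions chosen for illustration, and the embedding call is passed in as a function so the batching logic stays independent of the client.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    embed_batch: Callable[[List[str]], List[List[float]]],
    texts: Sequence[str],
    batch_size: int = 128,  # assumed value; check the API limits for your model
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_batch(batch)
        # Guard against a response that does not line up with the request.
        if len(vectors) != len(batch):
            raise ValueError(
                f"Expected {len(batch)} embeddings, got {len(vectors)}"
            )
        embeddings.extend(vectors)
    return embeddings
```

With the client from the setup step, embed_batch could be written as lambda batch: vo.embed(batch, model="voyage-4", input_type="document").embeddings.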

Define the Vector Search Index

The following creates and registers a vector search index on the collection that holds the documents and their generated embeddings. The index searches the embedding field using dot product similarity over 1024-dimensional vectors, so numDimensions must match the embedding length printed in the previous step.

from pymongo.operations import SearchIndexModel

search_index_model = SearchIndexModel(
  definition={
    "fields": [
      {
        "numDimensions": 1024,
        "path": "embedding",
        "similarity": "dotProduct",
        "type": "vector"
      }
    ]
  },
  name="vector_index",
  type="vectorSearch"
)
collection.create_search_index(model=search_index_model)
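
Atlas builds search indexes asynchronously, so a query issued immediately after create_search_index may run before the index is ready. The following is a minimal polling sketch under that assumption. The helper wait_until_queryable and its timeout values are hypothetical; it takes a function that lists the index documents and checks the queryable field of the first one.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_until_queryable(
    list_indexes: Callable[[], Iterable[Mapping]],
    timeout_s: float = 300.0,       # assumed upper bound on build time
    poll_interval_s: float = 5.0,   # assumed polling interval
) -> bool:
    """Poll until the first listed index reports queryable=True.

    Returns True once the index is queryable, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        indexes = list(list_indexes())
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

With the collection from the setup step, this could be called as wait_until_queryable(lambda: collection.list_search_indexes("vector_index")) before running any search.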

Generate the Query Embeddings

The following embeds a single query string using voyage-4 and extracts the resulting vector for use in downstream vector search.

query = "ancient construction methods"
query_embedding = vo.embed([query], model="voyage-4", input_type="query").embeddings[0]
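
The query vector must have the same length as the numDimensions value in the index definition. As a small sketch, a local check can surface a mismatch with a clear message before the query is sent. The helper check_query_vector is a hypothetical name, and the default of 1024 mirrors the index definition used in this summary.

```python
from typing import Sequence


def check_query_vector(
    vector: Sequence[float], num_dimensions: int = 1024
) -> Sequence[float]:
    """Return the vector unchanged if its length matches the index definition."""
    if len(vector) != num_dimensions:
        raise ValueError(
            f"Query vector has {len(vector)} dimensions, "
            f"but the index expects {num_dimensions}"
        )
    return vector
```

For example, query_embedding = check_query_vector(query_embedding) could be placed directly after the embedding call.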

Perform a Vector Search

The following runs a vector search aggregation against the collection holding the data, retrieving the 10 closest documents to the query embedding from 100 candidates, and prints each matching document's title.

results = collection.aggregate([
  {
    "$vectorSearch": {
      "index": "vector_index",
      "path": "embedding",
      "queryVector": query_embedding,
      "numCandidates": 100,
      "limit": 10
    }
  }
])

for doc in results:
    print(doc['title'])
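
The search above returns full documents, including the stored embedding, and prints only the title. As a sketch of one possible refinement, a $project stage can drop the embedding and expose the similarity score through the vectorSearchScore metadata. The helper build_vector_search_pipeline is a hypothetical name; the index name, path, and limits mirror the values used above.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    index: str = "vector_index",
    path: str = "embedding",
    num_candidates: int = 100,
    limit: int = 10,
) -> List[Dict[str, Any]]:
    """Build a $vectorSearch pipeline that also returns each hit's score."""
    # numCandidates may not be smaller than the number of results requested.
    if num_candidates < limit:
        raise ValueError("num_candidates must be greater than or equal to limit")
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Keep the readable fields and add the similarity score.
            "$project": {
                "_id": 0,
                "title": 1,
                "category": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

Passing the result to collection.aggregate(...) would yield documents with title, category, and score fields, which can be printed as f"{doc['score']:.3f} {doc['title']}".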

Create an Auto-Embedding Vector Search Index

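The following creates a vector search index called vector_index that uses MongoDB's autoEmbed feature to automatically generate voyage-4 embeddings from the description field at index time, eliminating the need to pre-compute and store embeddings manually. Search index names must be unique within a collection, so if the vector_index created earlier still exists on this collection, drop it or choose a different name first.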

search_index_model = SearchIndexModel(
  definition={
    "fields": [
      {
        "type": "autoEmbed",
        "modality": "text",
        "path": "description",
        "model": "voyage-4"
      }
    ]
  },
  name="vector_index",
  type="vectorSearch"
)
result = collection.create_search_index(model=search_index_model)

Run an Auto-Embedding Vector Search

The following runs a vector search aggregation that automatically embeds the query text using voyage-4 at query time, retrieving the top 10 results from 100 candidates against the vector_index, and prints each matching document.

result = client['mydatabase']['mycollection'].aggregate([
  {
    "$vectorSearch": {
      "index": "vector_index",
      "path": "description",
      "query": {
        "text": "ancient construction methods"
      },
      "model": "voyage-4",
      "numCandidates": 100,
      "limit": 10
    }
  }
])

for i in result:
    print(i)