Code Summary: Using vector embeddings with MongoDB
The following summarizes the code for generating Voyage AI embeddings and storing them in MongoDB Atlas.
Prerequisites
- MongoDB Atlas Cluster
- Python
- Voyage AI API key
Usage
Set Up the Environment
The following loads environment variables from a .env file, initializes a Voyage AI client and a MongoDB client, and connects to a specific database and collection. voyageai.Client() reads the API key from the VOYAGE_API_KEY environment variable, and the MongoDB connection string is read from MONGODB_URI. In production, use a secret store instead of a .env file.
import voyageai
import pymongo
from dotenv import load_dotenv
import os
load_dotenv()
vo = voyageai.Client()
client = pymongo.MongoClient(os.getenv("MONGODB_URI"))
db = client["mydatabase"]
collection = db["mycollection"]
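Because os.getenv returns None for an unset variable, a missing MONGODB_URI would make MongoClient fall back to its default host instead of your Atlas cluster. The following is a minimal sketch of failing fast on missing configuration. The require_env helper is hypothetical and not part of either client library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage after load_dotenv():
# mongodb_uri = require_env("MONGODB_URI")
# require_env("VOYAGE_API_KEY")  # read implicitly by voyageai.Client()
```

Calling the helper before constructing the clients turns a silent misconfiguration into an immediate, readable error.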
Create the Sample Data
The following defines a list of 15 documents, each with a title, description, and category. The documents span four categories: History, Health, Technology, and Art.
data = [
    {
        "title": "Ancient Roman Architecture",
        "description": "The Romans perfected the use of arches, vaults, and concrete in their monumental buildings. The Colosseum and Pantheon showcase their engineering brilliance and aesthetic vision.",
        "category": "History",
    },
    {
        "title": "Mediterranean Diet Benefits",
        "description": "A dietary pattern rich in olive oil, fish, vegetables, and whole grains. Studies show it reduces heart disease risk and promotes longevity through anti-inflammatory compounds.",
        "category": "Health",
    },
    {
        "title": "Machine Learning Fundamentals",
        "description": "Algorithms that enable computers to learn from data without explicit programming. Neural networks, decision trees, and gradient descent are core concepts in this field.",
        "category": "Technology",
    },
    {
        "title": "Greek Classical Architecture",
        "description": "Ancient Greek structures featured columns, symmetry, and mathematical precision. The Parthenon exemplifies their dedication to proportion and harmony in building design.",
        "category": "History",
    },
    {
        "title": "Artificial Intelligence in Healthcare",
        "description": "AI systems analyze medical images, predict patient outcomes, and assist in diagnosis. Deep learning models can detect diseases earlier than human experts in some cases.",
        "category": "Technology",
    },
    {
        "title": "Nutritional Science and Longevity",
        "description": "Research links specific eating patterns with extended lifespan. Caloric restriction, antioxidant-rich foods, and healthy fats play crucial roles in cellular health and aging.",
        "category": "Health",
    },
    {
        "title": "Renaissance Art Techniques",
        "description": "Artists developed linear perspective, chiaroscuro, and realistic human anatomy rendering. Masters like Leonardo and Michelangelo revolutionized visual representation.",
        "category": "Art",
    },
    {
        "title": "Deep Learning Neural Networks",
        "description": "Multi-layered networks that process information hierarchically, mimicking brain structure. Convolutional and recurrent architectures excel at image and sequence tasks respectively.",
        "category": "Technology",
    },
    {
        "title": "Plant-Based Nutrition",
        "description": "Diets centered on vegetables, fruits, legumes, and nuts provide fiber, vitamins, and phytonutrients. Research suggests reduced cancer and diabetes risk compared to meat-heavy diets.",
        "category": "Health",
    },
    {
        "title": "Gothic Cathedral Construction",
        "description": "Medieval builders created soaring structures with pointed arches, flying buttresses, and stained glass. Notre-Dame and Chartres demonstrate vertical emphasis and light manipulation.",
        "category": "History",
    },
    {
        "title": "Computer Vision Applications",
        "description": "Systems that interpret visual information from cameras and sensors. Object detection, facial recognition, and autonomous vehicle navigation rely on these technologies.",
        "category": "Technology",
    },
    {
        "title": "Impressionist Painting Movement",
        "description": "Artists like Monet and Renoir captured light and movement through loose brushwork and pure color. They painted outdoors to depict changing atmospheric conditions authentically.",
        "category": "Art",
    },
    {
        "title": "Gut Microbiome and Health",
        "description": "Trillions of bacteria in the digestive system influence immunity, metabolism, and mental health. Fermented foods and diverse fiber sources promote beneficial microbial communities.",
        "category": "Health",
    },
    {
        "title": "Natural Language Processing",
        "description": "Computational techniques for understanding and generating human language. Transformers and attention mechanisms enable translation, summarization, and conversational AI systems.",
        "category": "Technology",
    },
    {
        "title": "Baroque Artistic Drama",
        "description": "Characterized by intense emotion, dramatic lighting, and dynamic movement. Caravaggio and Bernini created theatrical compositions that engaged viewers emotionally.",
        "category": "Art",
    },
]
Generate and Store Vector Embeddings
The following embeds each document's description using voyage-4 and attaches the resulting vectors to their source documents. It then deletes any existing documents in the collection, inserts all 15 new documents into MongoDB, and prints a confirmation with the embedding dimensionality.
# Extract text for embedding - use description only
texts_to_embed = [item["description"] for item in data]

# Generate embeddings
result = vo.embed(texts_to_embed, model="voyage-4", input_type="document")

# Add embeddings to the data
for i, item in enumerate(data):
    item["embedding"] = result.embeddings[i]

# Insert into MongoDB
collection.delete_many({})  # Clear existing data
collection.insert_many(data)

print(f"Successfully embedded and inserted {len(data)} documents")
print(f"Sample embedding dimensions: {len(data[0]['embedding'])}")
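With 15 short descriptions, a single embed call is enough. For larger datasets, requests are typically split into batches. The following is a minimal sketch of that pattern. The helper names chunked and embed_in_batches are hypothetical, and the default batch size of 128 is an illustrative assumption rather than a documented limit, so check the current Voyage AI request limits before relying on it.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 128,  # assumed value, not an official limit
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving input order."""
    embeddings: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        embeddings.extend(embed_batch(list(batch)))
    return embeddings
```

With the client defined earlier, embed_batch could be a small wrapper such as lambda batch: vo.embed(batch, model="voyage-4", input_type="document").embeddings. Passing the embedding call in as a function keeps the batching logic independent of the client.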
Define the Vector Search Index
The following creates a vector search index on the collection that holds the documents and their embeddings. The index searches the embedding field using dot product similarity over 1024-dimensional vectors. The numDimensions value must match the embedding length printed in the previous step.
from pymongo.operations import SearchIndexModel

search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "numDimensions": 1024,
                "path": "embedding",
                "similarity": "dotProduct",
                "type": "vector"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)
collection.create_search_index(model=search_index_model)
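Search indexes are built asynchronously, so a query issued immediately after create_search_index may return no results. The following is a minimal sketch of waiting until the index is ready, assuming the index documents returned by list_search_indexes expose a queryable flag. The helper name wait_until_queryable and its timeout defaults are hypothetical.

```python
import time


def wait_until_queryable(collection, index_name, timeout=300.0, poll_interval=5.0):
    """Poll list_search_indexes until the named index reports queryable=True.

    Returns True once the index is queryable, or False if `timeout`
    seconds elapse first.
    """
    deadline = time.monotonic() + timeout
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A possible call is wait_until_queryable(collection, "vector_index"), placed before running the first search.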
Generate the Query Embeddings
The following embeds a single query string using voyage-4 and extracts the resulting vector for use in downstream vector search.
query = "ancient construction methods"
query_embedding = vo.embed([query], model="voyage-4", input_type="query").embeddings[0]
Perform a Vector Search
The following runs a vector search aggregation against the collection holding the data, retrieving the 10 closest documents to the query embedding from 100 candidates, and prints each matching document's title.
results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_embedding,
            "numCandidates": 100,
            "limit": 10
        }
    }
])

for doc in results:
    print(doc['title'])
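To see how closely each result matches the query, the similarity score can be returned alongside the title. The following is a minimal sketch that adds a $project stage using the vectorSearchScore metadata field. The helper name build_vector_search_pipeline is hypothetical, and the index name and path are taken from the example above.

```python
def build_vector_search_pipeline(query_vector, limit=10, num_candidates=100):
    """Build a $vectorSearch pipeline that also returns each match's score."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Return only the readable fields plus the score, omitting
            # the large embedding array from the output.
            "$project": {
                "_id": 0,
                "title": 1,
                "category": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

A possible usage is to iterate over collection.aggregate(build_vector_search_pipeline(query_embedding)) and print each document's title and score.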
Create an Auto-Embedding Vector Search Index
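The following creates a vector search index called vector_index that uses MongoDB's autoEmbed feature to generate voyage-4 embeddings from the description field automatically at index time, so you do not need to compute and store embeddings yourself. Because this definition reuses the name vector_index, drop the index created earlier or choose a different name if both are defined on the same collection.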
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "autoEmbed",
                "modality": "text",
                "path": "description",
                "model": "voyage-4"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)
result = collection.create_search_index(model=search_index_model)
Run an Auto-Embedding Vector Search
The following runs a vector search aggregation that automatically embeds the query text using voyage-4 at query time, retrieving the top 10 results from 100 candidates against the vector_index, and prints each matching document.
result = client['mydatabase']['mycollection'].aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "description",
            "query": {
                "text": "ancient construction methods"
            },
            "model": "voyage-4",
            "numCandidates": 100,
            "limit": 10
        }
    }
])

for doc in result:
    print(doc)