Code Summary: What are embeddings?

This page summarizes the code used to generate Voyage AI embeddings and perform a similarity search.

Prerequisites

Usage

Generate Vector Embeddings:

The following initializes a Voyage AI client, embeds a single document string using the voyage-4 model, and prints the dimensionality of the resulting embedding vector.

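import voyageai

# With no arguments, the client reads the API key from the VOYAGE_API_KEY environment variable
vo = voyageai.Client()

document = "The Romans perfected the use of arches, vaults, and concrete in their monumental buildings. The Colosseum and Pantheon showcase their engineering brilliance and aesthetic vision."

# Pass the text as a one-item list and take the first (only) embedding
embeddings = vo.embed([document], model="voyage-4").embeddings[0]

# Number of dimensions in the embedding vector
print(len(embeddings))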

Sample Data:

The following defines a list of 15 documents, each with a title, description, and category, spanning four categories: History, Health, Technology, and Art.

data = [
    {
        "title": "Ancient Roman Architecture",
        "description": "The Romans perfected the use of arches, vaults, and concrete in their monumental buildings. The Colosseum and Pantheon showcase their engineering brilliance and aesthetic vision.",
        "category": "History",
    },
    {
        "title": "Mediterranean Diet Benefits",
        "description": "A dietary pattern rich in olive oil, fish, vegetables, and whole grains. Studies show it reduces heart disease risk and promotes longevity through anti-inflammatory compounds.",
        "category": "Health",
    },
    {
        "title": "Machine Learning Fundamentals",
        "description": "Algorithms that enable computers to learn from data without explicit programming. Neural networks, decision trees, and gradient descent are core concepts in this field.",
        "category": "Technology",
    },
    {
        "title": "Greek Classical Architecture",
        "description": "Ancient Greek structures featured columns, symmetry, and mathematical precision. The Parthenon exemplifies their dedication to proportion and harmony in building design.",
        "category": "History",
    },
    {
        "title": "Artificial Intelligence in Healthcare",
        "description": "AI systems analyze medical images, predict patient outcomes, and assist in diagnosis. Deep learning models can detect diseases earlier than human experts in some cases.",
        "category": "Technology",
    },
    {
        "title": "Nutritional Science and Longevity",
        "description": "Research links specific eating patterns with extended lifespan. Caloric restriction, antioxidant-rich foods, and healthy fats play crucial roles in cellular health and aging.",
        "category": "Health",
    },
    {
        "title": "Renaissance Art Techniques",
        "description": "Artists developed linear perspective, chiaroscuro, and realistic human anatomy rendering. Masters like Leonardo and Michelangelo revolutionized visual representation.",
        "category": "Art",
    },
    {
        "title": "Deep Learning Neural Networks",
        "description": "Multi-layered networks that process information hierarchically, mimicking brain structure. Convolutional and recurrent architectures excel at image and sequence tasks respectively.",
        "category": "Technology",
    },
    {
        "title": "Plant-Based Nutrition",
        "description": "Diets centered on vegetables, fruits, legumes, and nuts provide fiber, vitamins, and phytonutrients. Research suggests reduced cancer and diabetes risk compared to meat-heavy diets.",
        "category": "Health",
    },
    {
        "title": "Gothic Cathedral Construction",
        "description": "Medieval builders created soaring structures with pointed arches, flying buttresses, and stained glass. Notre-Dame and Chartres demonstrate vertical emphasis and light manipulation.",
        "category": "History",
    },
    {
        "title": "Computer Vision Applications",
        "description": "Systems that interpret visual information from cameras and sensors. Object detection, facial recognition, and autonomous vehicle navigation rely on these technologies.",
        "category": "Technology",
    },
    {
        "title": "Impressionist Painting Movement",
        "description": "Artists like Monet and Renoir captured light and movement through loose brushwork and pure color. They painted outdoors to depict changing atmospheric conditions authentically.",
        "category": "Art",
    },
    {
        "title": "Gut Microbiome and Health",
        "description": "Trillions of bacteria in the digestive system influence immunity, metabolism, and mental health. Fermented foods and diverse fiber sources promote beneficial microbial communities.",
        "category": "Health",
    },
    {
        "title": "Natural Language Processing",
        "description": "Computational techniques for understanding and generating human language. Transformers and attention mechanisms enable translation, summarization, and conversational AI systems.",
        "category": "Technology",
    },
    {
        "title": "Baroque Artistic Drama",
        "description": "Characterized by intense emotion, dramatic lighting, and dynamic movement. Caravaggio and Bernini created theatrical compositions that engaged viewers emotionally.",
        "category": "Art",
    },
]
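A quick sanity check on the sample data can catch a mistyped category or a missing field before any embeddings are generated. The following is a minimal sketch rather than part of the original example: the helper names summarize_categories and missing_fields are hypothetical, and the three-item sample list is a shortened stand-in for the full data list above.

```python
from collections import Counter

REQUIRED_FIELDS = ("title", "description", "category")


def summarize_categories(records):
    """Count how many records fall into each category."""
    return Counter(record["category"] for record in records)


def missing_fields(records, required=REQUIRED_FIELDS):
    """Return the indexes of records that lack a non-empty required field."""
    return [
        index
        for index, record in enumerate(records)
        if any(not record.get(field) for field in required)
    ]


# Shortened stand-in for the full `data` list (placeholder descriptions)
sample = [
    {"title": "Ancient Roman Architecture", "description": "placeholder text", "category": "History"},
    {"title": "Mediterranean Diet Benefits", "description": "placeholder text", "category": "Health"},
    {"title": "Greek Classical Architecture", "description": "", "category": "History"},
]

category_counts = summarize_categories(sample)
incomplete = missing_fields(sample)

print(dict(category_counts))
print(incomplete)
```

Running the same two helpers on the full data list should report 15 records across the categories History, Health, Technology, and Art, with no incomplete records.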

Calculate the Similarity of Two Vector Embeddings Using dotProduct:

The following initializes a Voyage AI client, then embeds a list of documents and a query string separately using voyage-4, with input_type="document" for the documents and input_type="query" for the query. It computes the dot product similarity score between the query and each document and prints the 5 most similar documents, ranked by score.

import voyageai
import numpy as np
from dotenv import load_dotenv
import examples.data as data

# Load VOYAGE_API_KEY from a local .env file, then create the client
load_dotenv()
vo = voyageai.Client()

# Sample documents
documents = [item["description"] for item in data.data]

query = "ancient construction methods"

# Generate embeddings for documents
doc_embeddings = vo.embed(
    texts=documents,
    model="voyage-4",
    input_type="document"
).embeddings

# Generate embedding for query
query_embedding = vo.embed(
    texts=[query],
    model="voyage-4",
    input_type="query"
).embeddings[0]

# Calculate similarity scores using dot product
similarities = np.dot(doc_embeddings, query_embedding)

# Sort by similarity (np.argsort with negative sign sorts high to low)
ranked_indices = np.argsort(-similarities)

for doc_index in ranked_indices[:5]:  # Top 5 results
    print(f"Document: {documents[doc_index][:100]}...")
    print(f"Similarity score: {similarities[doc_index]:.4f}\n")
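The ranking step can be tried without an API key by swapping in small hand-written vectors. The following is a minimal sketch under stated assumptions: the helper names top_k_by_dot_product and cosine_similarity are hypothetical, and the three-dimensional toy vectors are made up for illustration and are not real Voyage AI embeddings. It also shows that for unit-length vectors the dot product equals cosine similarity. Voyage AI's documentation describes its embeddings as normalized, but that is worth verifying for the model and settings you use.

```python
import numpy as np


def top_k_by_dot_product(doc_vectors, query_vector, k=5):
    """Return (index, score) pairs for the k highest dot-product scores."""
    docs = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    scores = docs @ query               # one score per document row
    order = np.argsort(-scores)[:k]     # negate to sort high to low
    return [(int(i), float(scores[i])) for i in order]


def cosine_similarity(a, b):
    """Cosine similarity, which divides the dot product by both vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy unit-length vectors (illustrative stand-ins, not model output)
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.8, 0.6, 0.0]

results = top_k_by_dot_product(toy_docs, toy_query, k=2)
for index, score in results:
    print(f"Document {index}: {score:.4f}")
```

Because every toy vector has length 1, cosine_similarity(toy_docs[1], toy_query) returns the same value as the dot-product score for that document. If the vectors were not normalized, the two measures could rank documents differently.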