This embedding function relies on the google-genai python package, which you can install with pip install google-genai.

You can optionally specify the dimension parameter to control the output dimensionality of the embeddings (supported range: 128–3072).

You can view a more complete example of chatting over documents with Gemini embedding and language models. For more info, please visit the official Google docs.

Multimodal Embeddings
The GoogleGeminiEmbeddingFunction supports the new gemini-embedding-2-preview model from Google. It is Google’s first fully multimodal embedding model, capable of mapping text, images, video, audio, and PDFs, as well as interleaved combinations of them, into a single, unified vector space. By natively handling interleaved data without intermediate processing steps, this model simplifies complex pipelines and unlocks new capabilities for RAG, agentic search, recommendation systems, and more.
What are Multimodal Embeddings?
Traditional embedding models work with a single modality, typically text. If you wanted to search across images, you’d need a separate image embedding model, and the two vector spaces wouldn’t be compatible. Searching for “a red sports car” in a text collection and an image collection would require different queries and different indices.

Multimodal embeddings solve this by projecting different types of content into the same vector space. A text description like “a chef mixing ingredients in a bowl” and an image of that scene will have similar embeddings, allowing you to:

- Search images with text: Find frames in a video that match a natural language description
- Search text with images: Find documents that describe what’s shown in an image
- Cross-modal retrieval: Build unified search experiences across documents, images, videos, and audio
- Simplified pipelines: No need to maintain separate indices or embedding models for different content types
These capabilities enable use cases such as:

- Video understanding: Search through hours of video content using natural language
- Product search: Find products by uploading a photo or describing what you want
- Document analysis: Search PDFs that contain both text and images
- Agentic applications: Give AI agents the ability to see and reason about visual content
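As an illustrative sketch of what "one shared vector space" buys you, cross-modal retrieval reduces to nearest-neighbor search over embeddings, regardless of which modality produced them. The vectors below are toy 4-dimensional stand-ins, not real model outputs:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy unified index: in reality every entry would be embedded by the same
# multimodal model, whether it came from an image frame or a transcript line.
index = {
    "frame_021.jpg (chef mixing batter)": [0.9, 0.1, 0.0, 0.4],
    "frame_095.jpg (finished tart)":      [0.1, 0.8, 0.3, 0.0],
    "transcript: 'preheat the oven'":     [0.2, 0.3, 0.9, 0.1],
}

# Stand-in for the embedding of the text query "a chef mixing ingredients"
query = [0.85, 0.15, 0.05, 0.35]

best = max(index, key=lambda k: cosine(query, index[k]))
print(best)  # the mixing frame is the nearest neighbor
```

Because text and image entries live in the same space, the same text query ranks image frames and transcript segments against each other directly, with no second index or second model.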
Example: Video Search
In the Chroma Cookbooks repo, we feature an example using multimodal embeddings to search through YouTube videos. The project downloads a video, extracts frames and transcript, embeds everything into a single Chroma collection, and then uses an agentic search loop with Gemini to answer questions about the video. For example, given a cooking video like this apple tart recipe, you can ask questions like:

- “How many bowls are shown in the video?”
- “What ingredients are being mixed?”
- “What happens at the end of the video?”
The agent uses a semantic_search tool to query the collection, and can actually see the retrieved images, making it capable of answering visual questions that would be impossible with text-only search.
How It Works
- Video Processing: The video is downloaded with yt-dlp, frames are extracted at 1-second intervals using ffmpeg, and the transcript is fetched via the YouTube API
- Embedding: Each frame is uploaded to Google’s Files API and embedded using gemini-embedding-2-preview
- Storage: Frames are stored as embeddings, and transcript segments are stored as documents (auto-embedded by Chroma) in a collection named multimodal-video-{video_id}
- Agentic Search: Gemini 3.1 Pro runs in a loop with a semantic_search tool. When it retrieves image results, the actual images are passed to the model so it can see them
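The agentic search step can be sketched as below. This is a simplified, self-contained mock, not the project's actual code: semantic_search is stubbed in place of a real Chroma collection.query call, and call_model stands in for the Gemini request that would attach retrieved image bytes so the model can inspect them:

```python
def semantic_search(query: str, n_results: int = 3):
    # Stub for a Chroma query; a real version would call something like
    # collection.query(query_texts=[query], n_results=n_results) and return
    # the matching frames and transcript segments.
    if "bowls" in query:
        return [("frame_012.jpg", "image"), ("frame_047.jpg", "image")]
    return [("transcript segment", "text")]

def call_model(question: str, context) -> str:
    # Stub for a Gemini call. When retrieved results include images, the
    # actual image files would be passed along with the question so the
    # model can see them; here we only branch on the result type.
    if any(kind == "image" for _, kind in context):
        return f"Answer based on {len(context)} retrieved frames."
    return "Answer based on transcript text."

def answer(question: str) -> str:
    # One round of the retrieve-then-answer loop; the real agent keeps
    # calling the tool until it decides it has enough context to respond.
    results = semantic_search(question)
    return call_model(question, results)

print(answer("How many bowls are shown in the video?"))
```

The key design point is that image results flow back to the model as images, not as captions, which is what lets the agent answer visual questions the transcript never mentions.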
Setup
Use the “Create Database” button on the top right of the Chroma Cloud dashboard, and name your DB multimodal-video-search (or any name of your choice). If you’re a first-time user, you will be greeted with the “Create Database” modal after creating your account.

Once your database is created, choose the “Settings” tab. At the bottom of the page, choose the .env tab. Create an API key, and copy the environment variables you will need for running the project: CHROMA_API_KEY, CHROMA_TENANT, and CHROMA_DATABASE.

Clone the Chroma Cookbooks repo:
To run this project, you will also need a Google AI API key with access to gemini-embedding-2-preview. Set it in your .env file along with the Chroma credentials.

This project uses uv for package management. Install dependencies:
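For reference, the resulting .env might look like the following. The values are placeholders, and GOOGLE_API_KEY is an assumed name for the Google AI key variable, so check the project's README for the exact name it expects:

```
CHROMA_API_KEY=<your-chroma-api-key>
CHROMA_TENANT=<your-tenant-id>
CHROMA_DATABASE=multimodal-video-search
GOOGLE_API_KEY=<your-google-ai-api-key>
```

With uv installed, running uv sync in the project directory resolves and installs the dependencies.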