Chroma Sync is a managed ingestion service for Chroma Cloud. Point a source (an S3 bucket, a GitHub repository, a website, or an individual file upload) at a Chroma database, and Chroma parses, chunks, embeds, and indexes the data into a collection that’s ready to query. No ingest infrastructure to write, no embedding API keys to manage. The Sync API is available to all Chroma Cloud users, and the first $5 of usage is free with a new account.
How Chroma Sync Works
Sync runs the same pipeline regardless of source:
- Managed ingestion. Connect a source once; every invocation runs through Chroma’s queue-based pipeline with automatic retries, rate-limit awareness, and error recovery. Monitor invocations in the dashboard or through the Sync API.
- High throughput. The pipeline is designed to maximize throughput without dropping work, whether you’re syncing a handful of files or millions of documents.
- Parse. Best-in-class PDF and document parsing. PDFs, Office documents, HTML, ebooks, and images are converted to clean markdown with tables, headings, lists, and layout preserved — so chunks reflect the actual structure of the document, not just the raw text stream. Images inside documents are described in text so their content remains searchable. Code files are kept as-is.
- Chunk. Tree-sitter syntax-aware chunking for code; structured markdown chunking for documents; line-based fallback for plain text. The strategy is configurable per source.
- Embed. Dense embeddings are generated automatically with Qwen3-Embedding-0.6B. Optional sparse embeddings are available via Splade or BM25. No extra API keys needed.
- Index. Output is written into the target Chroma collection, ready for vector, full-text, regex, sparse, and hybrid search.
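To make the line-based fallback in the Chunk step concrete, here is a minimal sketch of a chunker that packs whole lines under a line cap and a byte cap. It illustrates the idea only and is not Chroma's implementation; the function name `chunk_lines` and the default limits are assumptions chosen for the example.

```python
def chunk_lines(text: str, max_lines: int = 50, max_size_bytes: int = 2048) -> list[str]:
    """Greedily pack whole lines into chunks bounded by line count and UTF-8 size.

    Illustrative only: a single line larger than max_size_bytes is emitted as
    its own oversized chunk rather than being split mid-line.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_bytes = 0

    for line in text.splitlines(keepends=True):
        line_bytes = len(line.encode("utf-8"))
        would_overflow = (
            len(current) + 1 > max_lines
            or current_bytes + line_bytes > max_size_bytes
        )
        # Close the running chunk before it would exceed either limit.
        if current and would_overflow:
            chunks.append("".join(current))
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += line_bytes

    if current:
        chunks.append("".join(current))
    return chunks
```

Because lines are never dropped or rewritten, joining the chunks back together reproduces the input text exactly, which makes the behavior easy to check on a small sample before relying on it.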
Source Types
Chroma Sync supports four source types. Each has its own walkthrough and configuration reference:
- S3 buckets: sync files from Amazon S3, with optional auto-sync on upload.
- GitHub repositories: sync code from public or private repos, with diff-based incremental updates.
- Web: crawl and ingest websites starting from a seed URL.
- File upload: upload individual files directly from the dashboard or via the API.
Concepts
Chroma Sync has three primary concepts: source types, sources, and invocations. A source type defines a kind of entity that can be chunked, embedded, and indexed (e.g. S3, GitHub, Web, File Upload). A source is a configured instance of a source type, for example a specific S3 bucket with credentials and a path prefix. An invocation is one sync run over a source’s data; each invocation produces or appends to one Chroma collection.
Global Source Configuration
Every source, regardless of type, is configured with a target database and an embedding configuration. Source-type-specific fields (bucket name, repository, starting URL, etc.) are documented on each source type’s page.
- `database_name` is the Chroma database in which collections will be created. The database must already exist.
- `embedding.dense.model` is the dense embedding model. Currently only `Qwen/Qwen3-Embedding-0.6B` is supported. Reach out to engineering@trychroma.com to request additional models.
- `embedding.sparse.model`: `Chroma/BM25` or `prithivida/Splade_PP_en_v1`.
- `embedding.sparse.key`: metadata key under which sparse embeddings are stored.
- `chunking.type`: `tree_sitter` (syntax-aware, with `max_size_bytes`) or `lines` (line-based, with `max_lines` and `max_size_bytes`).
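As a rough illustration of how these fields fit together, the sketch below writes a source configuration as a nested Python dictionary and checks it against the values listed above. The nesting (dotted names mapped to nested keys), the example values, and the helper `validate_source_config` are assumptions made for this example; the actual request shape is defined by the Sync API reference.

```python
SUPPORTED_DENSE_MODELS = {"Qwen/Qwen3-Embedding-0.6B"}
SUPPORTED_SPARSE_MODELS = {"Chroma/BM25", "prithivida/Splade_PP_en_v1"}
SUPPORTED_CHUNKING_TYPES = {"tree_sitter", "lines"}


def validate_source_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: it only mirrors the field values named in the text.
    """
    problems: list[str] = []

    if not config.get("database_name"):
        problems.append("database_name is required")

    embedding = config.get("embedding", {})
    dense_model = embedding.get("dense", {}).get("model")
    if dense_model not in SUPPORTED_DENSE_MODELS:
        problems.append(f"unsupported dense model: {dense_model!r}")

    # Sparse embeddings are optional, so only check them when present.
    sparse = embedding.get("sparse")
    if sparse is not None and sparse.get("model") not in SUPPORTED_SPARSE_MODELS:
        problems.append(f"unsupported sparse model: {sparse.get('model')!r}")

    chunking = config.get("chunking")
    if chunking is not None and chunking.get("type") not in SUPPORTED_CHUNKING_TYPES:
        problems.append(f"unsupported chunking type: {chunking.get('type')!r}")

    return problems


# Example values are placeholders, not defaults documented by Chroma.
example_config = {
    "database_name": "my-database",
    "embedding": {
        "dense": {"model": "Qwen/Qwen3-Embedding-0.6B"},
        "sparse": {"model": "Chroma/BM25", "key": "sparse_embedding"},
    },
    "chunking": {"type": "lines", "max_lines": 50, "max_size_bytes": 2048},
}
```

Calling `validate_source_config(example_config)` returns an empty list, while a misspelled model name or chunking type shows up as a readable message before any request is sent.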
Global Invocation Configuration
Each invocation may specify a target collection:
- `target_collection_name` is the Chroma collection to write into. The collection is created on first use, or appended to if it already exists. Required for GitHub and Web invocations; optional for S3 (defaults to the source’s `collection_name`); set automatically for file uploads via the `collection_name` form field. If a collection has already finished an ingest (`finished_ingest=true` metadata), invocation creation returns `409 Conflict`.
Source-type-specific invocation fields (`object_key`, GitHub `ref_identifier`, etc.) are documented on each source type’s page.
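The target-collection rules above can be sketched as two small functions. This is a client-side illustration of the documented behavior, not Chroma's server code; the source type strings (`"s3"`, `"github"`, `"web"`), the function names, and the `ConflictError` class are hypothetical, and representing `finished_ingest` as a boolean `True` is an assumption.

```python
from typing import Optional


class ConflictError(Exception):
    """Stands in for the 409 Conflict response described in the text."""


def resolve_target_collection(
    source_type: str,
    target_collection_name: Optional[str] = None,
    source_collection_name: Optional[str] = None,
) -> str:
    """Pick the collection an invocation would write into."""
    if target_collection_name:
        return target_collection_name
    # Only S3 falls back to the collection name configured on the source.
    if source_type == "s3" and source_collection_name:
        return source_collection_name
    raise ValueError(f"target_collection_name is required for {source_type!r} invocations")


def check_can_ingest(collection_metadata: Optional[dict]) -> None:
    """Raise if the collection is already marked as having finished an ingest."""
    if collection_metadata and collection_metadata.get("finished_ingest") is True:
        raise ConflictError("409 Conflict: collection has already finished an ingest")
```

For example, `resolve_target_collection("s3", None, "docs")` falls back to `"docs"`, whereas the same call with `"github"` raises because no explicit target was given.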
Authentication
The Sync API authenticates with a Chroma Cloud API key sent in the `x-chroma-token` header.
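A minimal sketch of attaching that header with Python's standard library follows. The URL is a deliberate placeholder rather than a real Sync API endpoint, and the helper name `build_sync_request` and the `CHROMA_API_KEY` environment variable are assumptions for this example; the request is only constructed here, not sent.

```python
import os
import urllib.request


def build_sync_request(url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request carrying the API key in the x-chroma-token header."""
    request = urllib.request.Request(url, method="GET")
    # HTTP header names are case-insensitive; urllib stores this one as "X-chroma-token".
    request.add_header("x-chroma-token", api_key)
    return request


# Read the key from the environment so it is not hard-coded in source control.
api_key = os.environ.get("CHROMA_API_KEY", "placeholder-key")
request = build_sync_request("https://sync.example.invalid/placeholder", api_key)
```

Passing the resulting object to `urllib.request.urlopen` would send it; substitute the endpoint from the Sync API reference before doing so.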