Key Concepts
Chroma Sync has three primary concepts: source types, sources, and invocations.
Source Types
A source type defines a kind of entity that contains data that can be chunked, embedded, and indexed. Each source type defines its own schema for configuring sources of its type. Chroma Sync currently supports three source types: S3 buckets, GitHub repositories, and web scraping. If there is a specific source type for which you would like support, please reach out to engineering@trychroma.com.
S3
The S3 source type allows developers to sync files from Amazon S3 buckets into Chroma. It supports documents (PDFs, Office files, images, ebooks), code, and plain text. S3 sources can be configured with auto-sync to automatically index files as they are uploaded to S3. For a detailed walkthrough, see the S3 Sync docs.
GitHub Repositories
The GitHub repository source type allows developers to sync code from public and private GitHub repositories. Public repositories require no setup other than creating a Chroma Cloud account and issuing an API key. Chroma Sync for private repositories is available in two tiers: direct and platform.
Direct Sync
The direct tier requires you to install Chroma’s GitHub App into each repository you wish to sync. The direct tier is only available via the Chroma Cloud UI and does not let you perform Sync-related operations via the API. This tier is ideal for developers who wish to sync private repositories that they own. If you are interested in using the direct tier via the API, please reach out to us at engineering@trychroma.com.
Platform Sync
The platform tier requires you to grant Chroma access to a GitHub App that you own and that has been installed into the private repositories you wish to sync. This GitHub App must have read-only access to the “Contents” and “Metadata” permissions in the list of “Repository permissions”. The platform tier grants access to the Chroma Sync API and is ideal for companies and organizations that offer services which access their users’ codebases. For a detailed walkthrough, see the Platform Sync docs.
Web
The web source type allows developers to scrape the contents of web pages into Chroma. Given a starting URL, Sync will crawl the page and its links up to a specified depth.
Sources
A source is a specific instance of a source type, configured according to the global and source-type-specific configuration schemas. The global source configuration schema covers the configuration parameters that are required across sources of all types, while the source-type-specific configuration schema covers the configuration parameters required for a specific source type. The global source configuration schema requires the following parameters:
database_name defines the Chroma database in which collections should be created by invocations run on this source. A database must exist before creating sources that point to it.
embedding.dense.model defines the embedding model that should be used to generate dense embeddings for chunked documents. Currently, only the Qwen3-Embedding-0.6B model is supported, but if there is a model you would like to use, please let us know by reaching out to engineering@trychroma.com.
embedding.sparse.model defines the sparse embedding model. Supported models: Chroma/BM25, prithivida/Splade_PP_en_v1.
embedding.sparse.key defines the metadata key under which sparse embeddings are stored.
chunking.type can be tree_sitter (syntax-aware, with max_size_bytes) or lines (line-based, with max_lines and max_size_bytes).
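To make the global parameters concrete, here is a minimal sketch of a client-side check. The nested-dict layout, the example values (my-database, sparse_embedding, the size limits), and the validate_global_config helper are assumptions for illustration and are not the Sync API's request schema; only the parameter names and supported values come from the list above. Rejecting chunking fields that are not listed for a given type is also an assumption.

```python
# Hypothetical helper, not part of Chroma Sync: checks a global source
# configuration against the parameter values listed in the text above.
DENSE_MODELS = {"Qwen3-Embedding-0.6B"}
SPARSE_MODELS = {"Chroma/BM25", "prithivida/Splade_PP_en_v1"}
# Fields the text associates with each chunking type.
CHUNKING_FIELDS = {
    "tree_sitter": {"max_size_bytes"},
    "lines": {"max_lines", "max_size_bytes"},
}


def validate_global_config(config: dict) -> list:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if not config.get("database_name"):
        problems.append("database_name is required")

    embedding = config.get("embedding", {})
    dense_model = embedding.get("dense", {}).get("model")
    if dense_model not in DENSE_MODELS:
        problems.append(f"unsupported dense model: {dense_model!r}")
    sparse_model = embedding.get("sparse", {}).get("model")
    if sparse_model not in SPARSE_MODELS:
        problems.append(f"unsupported sparse model: {sparse_model!r}")

    chunking = config.get("chunking", {})
    allowed = CHUNKING_FIELDS.get(chunking.get("type"))
    if allowed is None:
        problems.append(f"unknown chunking type: {chunking.get('type')!r}")
    else:
        # Assumption: fields not listed for this chunking type are rejected.
        extra = set(chunking) - allowed - {"type"}
        if extra:
            problems.append(f"unexpected chunking fields: {sorted(extra)}")
    return problems


# Example values are placeholders, not defaults documented by Chroma.
example = {
    "database_name": "my-database",
    "embedding": {
        "dense": {"model": "Qwen3-Embedding-0.6B"},
        "sparse": {"model": "Chroma/BM25", "key": "sparse_embedding"},
    },
    "chunking": {"type": "lines", "max_lines": 100, "max_size_bytes": 8192},
}
```

Calling validate_global_config(example) returns an empty list, while a tree_sitter chunking block that carries max_lines is reported as a problem.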
S3
A source of the S3 type is configured with a bucket name, region, collection name, and AWS credentials:
bucket_name is the name of the S3 bucket to sync from.
region is the AWS region of the bucket.
collection_name is the default target collection name for synced data.
aws_credential_id is the ID of the AWS credentials.
path_prefix (optional) limits which S3 keys can be synced. Only keys starting with this prefix are allowed.
auto_sync (optional) sets the auto-sync mode: none (default), direct, or metadata. See S3 Auto-Sync.
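The path_prefix and auto_sync rules above can be sketched as two small helpers. Both function names are hypothetical, and the sketch assumes that the prefix check is a plain string-prefix comparison on the object key, which the text implies but does not spell out.

```python
from typing import Optional

# The three auto-sync modes named in the text; "none" is the default.
AUTO_SYNC_MODES = {"none", "direct", "metadata"}


def key_allowed(object_key: str, path_prefix: Optional[str] = None) -> bool:
    """True if the key may be synced under the source's optional path_prefix."""
    # Assumption: a simple prefix comparison with no normalization.
    return path_prefix is None or object_key.startswith(path_prefix)


def resolve_auto_sync(mode: Optional[str] = None) -> str:
    """Apply the documented default and reject modes the text does not list."""
    mode = "none" if mode is None else mode
    if mode not in AUTO_SYNC_MODES:
        raise ValueError(f"unsupported auto_sync mode: {mode!r}")
    return mode
```

With a path_prefix of docs/, the key docs/report.pdf would be accepted and images/logo.png would not.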
GitHub Repositories
A source of the GitHub repository type is an individual GitHub repository configured with the global source configuration parameters and the GitHub source-specific configuration parameters:
repository defines the GitHub repository whose code should be synced. This must be the forward slash-separated combination of the repository owner’s GitHub username and the repository name (e.g., chroma-core/chroma). Note that renaming a repository after creating a Chroma Sync source for it will cause invocations on that source to fail, so a new source with the updated repository name must be created.
app_id defines the GitHub App ID of the GitHub App that has access to the provided repository. This parameter should only be supplied if the provided repository is private.
include_globs defines a set of glob patterns for which matching files should be synced. If this parameter is not provided, files matching "*" will be synced. Note that Chroma will not sync binary data, images, and other large or non-UTF-8 files.
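As a rough illustration of the repository format and the include_globs default, the sketch below validates an owner/name string and filters a list of paths. The helper names are hypothetical, and Python's fnmatch semantics (where * also matches across /) are an assumption that may differ from the glob engine Chroma Sync actually uses.

```python
import re
from fnmatch import fnmatchcase

# Assumption: owner and repository name are non-empty and contain
# neither slashes nor whitespace.
REPOSITORY_PATTERN = re.compile(r"^[^/\s]+/[^/\s]+$")


def is_valid_repository(repository: str) -> bool:
    """Check the owner/name shape, e.g. chroma-core/chroma."""
    return REPOSITORY_PATTERN.match(repository) is not None


def select_files(paths, include_globs=None):
    """Keep the paths matching any glob; default to "*" when none are given."""
    globs = include_globs or ["*"]
    return [p for p in paths if any(fnmatchcase(p, g) for g in globs)]
```

For example, select_files(["src/main.py", "README.md"], ["*.py"]) keeps only src/main.py, and omitting include_globs keeps both paths.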
Web
A source of the web type is configured with a starting URL and a few other optional parameters.
Invocations
Invocations are runs of the Sync Function over the data in a source. One invocation corresponds to one sync pass through all of the data in a source. A single invocation will result in the creation of exactly one collection in the database specified by the invocation’s source. This collection will contain the chunked, embedded, and indexed data that represents the state of the source at the time of the invocation’s creation. Invocations, like sources, have some global configuration parameters, as well as parameters specific to the type of the source for which the invocation is being run. The global invocation configuration parameters are:
target_collection_name defines the name of the Chroma collection in which synced data should be stored. This is required for GitHub and Web sources. For S3 sources, it is optional and defaults to the collection_name configured on the source. The target collection must not already contain synced data. Chroma Sync uses the metadata key finished_ingest to indicate whether a collection contains synced data. If an invocation creation request is received for a collection whose metadata has this key present and set to true, the API will return a 409 Conflict.
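The finished_ingest rule can be expressed as a small client-side pre-check before requesting an invocation. This is a sketch only: has_synced_data is a hypothetical helper, how you fetch the collection's metadata is left out, and treating only the boolean True as "set to true" is an assumption.

```python
from typing import Optional


def has_synced_data(collection_metadata: Optional[dict]) -> bool:
    """True if the metadata marks the collection as already holding synced data.

    Per the text, an invocation targeting such a collection is rejected
    with 409 Conflict.
    """
    if not collection_metadata:
        # No metadata at all means the finished_ingest key is absent.
        return False
    # Assumption: only a boolean True counts as "set to true".
    return collection_metadata.get("finished_ingest") is True
```

A caller could pick a fresh target_collection_name whenever has_synced_data returns True for the intended target.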
S3
Invocations on sources of the S3 type sync individual files from the bucket. The configuration parameters specific to S3 invocations are:
object_key (required) is the full S3 object key to sync. Must include the path_prefix if one is configured on the source.
custom_id (optional) is a custom document ID (max 120 bytes). Chunk IDs become custom_id-{chunk} instead of sha256(object_key)-{chunk}.
metadata (optional) is additional metadata merged with standard chunk metadata. Values must be scalars (string, number, boolean, or null).
target_collection_name (optional) overrides the source’s collection_name. If not provided, defaults to the collection_name configured on the source.
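The chunk ID scheme and the metadata constraint above can be sketched as follows. The function names are hypothetical, and several details are assumptions not stated in the text: that the SHA-256 digest is hex-encoded, that the key is hashed as UTF-8 bytes, and that {chunk} is an integer chunk index.

```python
import hashlib
from typing import Optional

MAX_CUSTOM_ID_BYTES = 120  # limit stated in the text


def chunk_id(object_key: str, chunk: int, custom_id: Optional[str] = None) -> str:
    """Build a chunk ID of the form custom_id-{chunk} or sha256(object_key)-{chunk}."""
    if custom_id is not None:
        if len(custom_id.encode("utf-8")) > MAX_CUSTOM_ID_BYTES:
            raise ValueError("custom_id exceeds 120 bytes")
        return f"{custom_id}-{chunk}"
    # Assumption: hex digest of the UTF-8 encoded object key.
    digest = hashlib.sha256(object_key.encode("utf-8")).hexdigest()
    return f"{digest}-{chunk}"


def validate_metadata(metadata: dict) -> None:
    """Raise ValueError if any value is not a string, number, boolean, or null."""
    for key, value in metadata.items():
        if value is not None and not isinstance(value, (str, int, float, bool)):
            raise ValueError(f"metadata value for {key!r} must be a scalar")
```

Under these assumptions, chunk_id("docs/a.pdf", 0, custom_id="report") gives report-0, while omitting custom_id gives a 64-character hex digest followed by -0.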
GitHub Repositories
Invocations on sources of the GitHub repository type are sync runs over an individual GitHub repository with some set of configuration parameters. The configuration parameters that are specific to invocations on sources of this type are:
ref_identifier is either the commit SHA or the name of the branch from which to retrieve the code to be synced. If a branch is provided, the code will be retrieved from the branch’s latest commit.