Chroma Sync exposes endpoints for developers to chunk, embed, and index various data sources. The API is intended for Chroma Cloud users and can be accessed for free (up to $5 in credits) by creating a Chroma Cloud account.

Key Concepts

Chroma Sync has three primary concepts: source types, sources, and invocations.

Source Types

A source type defines a kind of entity that contains data that can be chunked, embedded, and indexed. Each source type defines its own schema for configuring sources of its type. Chroma Sync currently supports three source types: S3 buckets, GitHub repositories, and web scraping. If there is a specific source type for which you would like support, please reach out to engineering@trychroma.com.

S3

The S3 source type allows developers to sync files from Amazon S3 buckets into Chroma. It supports documents (PDFs, Office files, images, ebooks), code, and plain text. S3 sources can be configured with auto-sync to automatically index files as they are uploaded to S3. For a detailed walkthrough, see S3 Sync docs.

GitHub Repositories

The GitHub repository source type allows developers to sync code in public and private GitHub repositories. Public repositories require no setup other than creating a Chroma Cloud account and issuing an API key. Chroma Sync for private repositories is available at two different tiers: direct and platform.

Direct Sync

The direct tier requires you to install Chroma’s GitHub App into any repository for which you wish to perform syncing. The direct tier is only available via the Chroma Cloud UI and does not enable you to perform Sync-related operations via the API. This tier is ideal for developers who wish to sync private repositories that they own. If you are interested in using the direct tier via API, please reach out to us at engineering@trychroma.com.

Platform Sync

The platform tier requires you to grant Chroma access to a GitHub App that you own, which has been installed into the private repositories you wish to sync. This GitHub App must have read-only access to the “Contents” and “Metadata” permissions on the list of “Repository permissions”. The platform tier grants access to the Chroma Sync API and is ideal for companies and organizations that offer services which access their users’ codebases. For a detailed walkthrough, see Platform Sync docs.

Web

The web source type allows developers to scrape the contents of web pages into Chroma. Given a starting URL, Sync will crawl the page and its links up to a specified depth.

Sources

A source is a specific instance of a source type, configured according to two schemas. The global source configuration schema covers the parameters required for sources of every type, while the source type-specific configuration schema covers the parameters required only for a particular source type. The global source configuration schema requires the following parameters:
{
  "database_name": "string",
  "embedding": {
    "dense": {
        "model": "Qwen/Qwen3-Embedding-0.6B"
    }
  }
}
  • database_name defines the Chroma database in which collections should be created by invocations run on this source. A database must exist before creating sources that point to it.
  • embedding.dense.model defines the embedding model that should be used to generate dense embeddings for chunked documents. Currently, only the Qwen3-Embedding-0.6B model is supported, but if there is a model you would like to use, please let us know by reaching out to engineering@trychroma.com.
You can optionally configure sparse embeddings alongside dense embeddings:
{
  "embedding": {
    "dense": {
      "model": "Qwen/Qwen3-Embedding-0.6B"
    },
    "sparse": {
      "model": "Chroma/BM25",
      "key": "sparse_embedding"
    }
  }
}
  • embedding.sparse.model defines the sparse embedding model. Supported models: Chroma/BM25, prithivida/Splade_PP_en_v1.
  • embedding.sparse.key defines the metadata key under which sparse embeddings are stored.
You can also configure chunking behavior:
{
  "chunking": {
    "type": "tree_sitter",
    "max_size_bytes": 8192
  }
}
  • chunking.type can be tree_sitter (syntax-aware, with max_size_bytes) or lines (line-based, with max_lines and max_size_bytes).
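As an illustration of how the pieces above fit together, here is a minimal sketch that assembles a global source configuration as a Python dict. The helper name `build_global_source_config` is hypothetical, and placing `chunking` at the top level next to `database_name` and `embedding` is an assumption based on the snippets above, not something confirmed by the API reference.

```python
# Values copied from the documentation above; the helper itself is hypothetical.
DENSE_MODEL = "Qwen/Qwen3-Embedding-0.6B"
SPARSE_MODELS = {"Chroma/BM25", "prithivida/Splade_PP_en_v1"}
# Options listed for each chunking type in the text above.
CHUNKING_OPTIONS = {
    "tree_sitter": {"max_size_bytes"},
    "lines": {"max_lines", "max_size_bytes"},
}


def build_global_source_config(database_name, sparse_model=None,
                               sparse_key="sparse_embedding", chunking=None):
    """Assemble the global source configuration as a plain dict."""
    if not database_name:
        raise ValueError("database_name is required")
    config = {
        "database_name": database_name,
        "embedding": {"dense": {"model": DENSE_MODEL}},
    }
    if sparse_model is not None:
        if sparse_model not in SPARSE_MODELS:
            raise ValueError(f"unsupported sparse model: {sparse_model}")
        config["embedding"]["sparse"] = {"model": sparse_model, "key": sparse_key}
    if chunking is not None:
        kind = chunking.get("type")
        if kind not in CHUNKING_OPTIONS:
            raise ValueError(f"unsupported chunking type: {kind}")
        unexpected = set(chunking) - {"type"} - CHUNKING_OPTIONS[kind]
        if unexpected:
            raise ValueError(f"options not listed for {kind}: {sorted(unexpected)}")
        # Assumption: chunking sits at the top level of the source configuration.
        config["chunking"] = dict(chunking)
    return config


example = build_global_source_config(
    "my-database",
    sparse_model="Chroma/BM25",
    chunking={"type": "tree_sitter", "max_size_bytes": 8192},
)
```

Validating locally like this only catches typos early; the API remains the authority on which values are accepted.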

S3

A source of the S3 type is configured with a bucket name, region, collection name, and AWS credentials:
{
    "bucket_name": "string",
    "region": "string",
    "collection_name": "string",
    "aws_credential_id": 0,
    "path_prefix": "string",
    "auto_sync": "none"
}
  • bucket_name is the name of the S3 bucket to sync from.
  • region is the AWS region of the bucket.
  • collection_name is the default target collection name for synced data.
  • aws_credential_id is the ID of the AWS credentials used to access the bucket.
  • path_prefix (optional) limits which S3 keys can be synced. Only keys starting with this prefix are allowed.
  • auto_sync (optional) sets the auto-sync mode: none (default), direct, or metadata. See S3 Auto-Sync.
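The following sketch shows one way to build an S3 source configuration and to check locally whether a key falls under the configured path_prefix. Both helper names (`build_s3_source_config`, `key_allowed`) are hypothetical, and the exact prefix-matching rules applied by the service are assumed to be a plain "starts with" comparison, as the description above suggests.

```python
# auto_sync modes listed in the documentation above.
AUTO_SYNC_MODES = ("none", "direct", "metadata")


def build_s3_source_config(bucket_name, region, collection_name,
                           aws_credential_id, path_prefix=None, auto_sync="none"):
    """Build the S3-specific part of a source configuration (hypothetical helper)."""
    if auto_sync not in AUTO_SYNC_MODES:
        raise ValueError(f"auto_sync must be one of {AUTO_SYNC_MODES}")
    config = {
        "bucket_name": bucket_name,
        "region": region,
        "collection_name": collection_name,
        "aws_credential_id": aws_credential_id,
        "auto_sync": auto_sync,
    }
    if path_prefix is not None:
        config["path_prefix"] = path_prefix
    return config


def key_allowed(config, object_key):
    """Return True if object_key starts with the source's path_prefix, if any."""
    return object_key.startswith(config.get("path_prefix", ""))


s3_config = build_s3_source_config(
    "my-bucket", "us-east-1", "reports", 42,
    path_prefix="reports/", auto_sync="direct",
)
```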

GitHub Repositories

A source of the GitHub repository type is an individual GitHub repository configured with the global source configuration parameters, and the GitHub source-specific configuration parameters:
{
    "repository": "string",
    "app_id": "string" | null, // optional
    "include_globs": ["string", ...] | null // optional
}
  • repository defines the GitHub repository whose code should be synced. This must be the forward slash-separated combination of the repository owner’s GitHub username and the repository name (e.g., chroma-core/chroma). If a repository is renamed after a Chroma Sync source has been created for it, invocations on that source will fail; create a new source with the updated repository name.
  • app_id defines the GitHub App ID of the GitHub App that has access to the provided repository. This parameter should only be supplied if the provided repository is private.
  • include_globs defines a set of glob patterns for which matching files should be synced. If this parameter is not provided, files matching "*" will be synced. Note that Chroma will not sync binary data, images, and other large or non-UTF-8 files.
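To make the repository format and the include_globs default concrete, here is a small sketch. The helper names are hypothetical, and Python's `fnmatch` is used only as an approximation: the glob dialect Chroma Sync actually applies (for example, how `*` treats path separators) is not specified above and may differ.

```python
import re
from fnmatch import fnmatchcase

# "owner/name": exactly one forward slash, no whitespace (a loose local check).
REPO_PATTERN = re.compile(r"^[^/\s]+/[^/\s]+$")


def build_github_source_config(repository, app_id=None, include_globs=None):
    """Build the GitHub-specific part of a source configuration (hypothetical helper)."""
    if not REPO_PATTERN.match(repository):
        raise ValueError("repository must look like 'owner/name'")
    config = {"repository": repository}
    if app_id is not None:
        # Only supplied for private repositories, per the description above.
        config["app_id"] = app_id
    if include_globs is not None:
        config["include_globs"] = list(include_globs)
    return config


def would_match(path, include_globs=None):
    """Approximate glob matching; falls back to "*" when no globs are given."""
    patterns = include_globs or ["*"]
    return any(fnmatchcase(path, pattern) for pattern in patterns)


github_config = build_github_source_config(
    "chroma-core/chroma", include_globs=["*.py", "docs/*.md"]
)
```

Note that `would_match` only models the glob filter; files the service skips for other reasons (binary or non-UTF-8 content) are outside this sketch.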

Web

A source of the web type is configured with a starting URL and a few other optional parameters:
{
    "starting_url": "https://docs.trychroma.com",
    // all below are optional
    "page_limit": 5,
    "include_path_regexes": ["/cloud/*"],
    "exclude_path_regexes": ["/blog/*"],
    "max_depth": 2
}
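The sketch below illustrates what a depth-limited crawl with these parameters might look like, using a breadth-first traversal over an in-memory link graph instead of real HTTP requests. It is not Chroma Sync's crawler: the function `plan_crawl`, the traversal order, the use of `re.search` on the URL path, and the choice to always visit the starting URL are all assumptions made for illustration.

```python
import re
from collections import deque
from urllib.parse import urlparse


def plan_crawl(starting_url, links, max_depth=2, page_limit=5,
               include_path_regexes=None, exclude_path_regexes=None):
    """Return the pages a breadth-first, depth-limited crawl would visit.

    `links` maps a URL to the URLs it links to (a stand-in for fetching pages).
    """
    def path_ok(url):
        path = urlparse(url).path or "/"
        if exclude_path_regexes and any(re.search(p, path) for p in exclude_path_regexes):
            return False
        if include_path_regexes:
            return any(re.search(p, path) for p in include_path_regexes)
        return True

    visited = []
    seen = {starting_url}
    queue = deque([(starting_url, 0)])  # the starting page is depth 0 (assumption)
    while queue and len(visited) < page_limit:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for linked in links.get(url, []):
            if linked not in seen and path_ok(linked):
                seen.add(linked)
                queue.append((linked, depth + 1))
    return visited


# Hypothetical link graph used only for this example.
base = "https://docs.trychroma.com"
link_graph = {
    base: [base + "/cloud/a", base + "/blog/x"],
    base + "/cloud/a": [base + "/cloud/b"],
    base + "/cloud/b": [base + "/cloud/c"],
}
pages = plan_crawl(base, link_graph, max_depth=2, page_limit=5,
                   exclude_path_regexes=["/blog/"])
```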

Invocations

An invocation is a run of the Sync Function over the data in a source. One invocation corresponds to one sync pass through all of the data in a source, and results in the creation of exactly one collection in the database specified by the invocation’s source. This collection contains the chunked, embedded, and indexed data that represents the state of the source at the time the invocation was created. Invocations, like sources, have global configuration parameters as well as parameters specific to the type of the source for which the invocation is run. The global invocation configuration parameters are:
{
	"target_collection_name": "string"
}
  • target_collection_name defines the name of the Chroma collection in which synced data should be stored. This is required for GitHub and Web sources. For S3 sources, it is optional and defaults to the collection_name configured on the source. The target collection must not already contain synced data. Chroma Sync uses the metadata key finished_ingest to indicate whether a collection contains synced data. If an invocation creation request targets a collection whose metadata contains this key set to true, the API returns a 409 Conflict.
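The rules above can be sketched as two small client-side checks: resolving which collection an invocation would target, and predicting whether the request would conflict. The function names and the source type labels ("s3", "github", "web") are hypothetical; only the finished_ingest key and the defaulting behavior come from the description above.

```python
def resolve_target_collection(source_type, target_collection_name=None,
                              source_collection_name=None):
    """Pick the target collection following the defaulting rule described above."""
    if target_collection_name:
        return target_collection_name
    if source_type == "s3" and source_collection_name:
        # S3 invocations fall back to the collection_name set on the source.
        return source_collection_name
    raise ValueError("target_collection_name is required for this source type")


def has_synced_data(collection_metadata):
    """True when the collection metadata marks a finished sync."""
    return bool(collection_metadata) and collection_metadata.get("finished_ingest") is True


def would_conflict(existing_collections, target_collection_name):
    """Predict a 409 Conflict: the target exists and already holds synced data.

    `existing_collections` maps collection names to their metadata dicts.
    """
    return has_synced_data(existing_collections.get(target_collection_name))


collections = {
    "docs-v1": {"finished_ingest": True},
    "docs-v2": {},
}
```

A check like this is only advisory, since another invocation could finish between the check and the request; the API response remains the source of truth.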

S3

Invocations on sources of the S3 type sync individual files from the bucket. The configuration parameters specific to S3 invocations are:
{
    "object_key": "string",
    "custom_id": "string",
    "metadata": {},
    "target_collection_name": "string"
}
  • object_key (required) is the full S3 object key to sync. Must include the path_prefix if one is configured on the source.
  • custom_id (optional) is a custom document ID (max 120 bytes). Chunk IDs become custom_id-{chunk} instead of sha256(object_key)-{chunk}.
  • metadata (optional) is additional metadata merged with standard chunk metadata. Values must be scalars (string, number, boolean, or null).
  • target_collection_name (optional) overrides the source’s collection_name. If not provided, defaults to the collection_name configured on the source.
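As a rough illustration of the ID and metadata rules above, the sketch below derives a chunk ID and validates metadata values. The helper names are hypothetical, and several details are assumptions: that the SHA-256 digest is hex-encoded, that the object key is hashed as UTF-8, and that `{chunk}` is an integer chunk index.

```python
import hashlib

MAX_CUSTOM_ID_BYTES = 120  # limit stated in the documentation above


def chunk_id(object_key, chunk_index, custom_id=None):
    """Derive a chunk ID: custom_id-{chunk} if given, else sha256(object_key)-{chunk}."""
    if custom_id is not None:
        if len(custom_id.encode("utf-8")) > MAX_CUSTOM_ID_BYTES:
            raise ValueError("custom_id exceeds 120 bytes")
        return f"{custom_id}-{chunk_index}"
    # Assumption: hex digest of the UTF-8 encoded object key.
    digest = hashlib.sha256(object_key.encode("utf-8")).hexdigest()
    return f"{digest}-{chunk_index}"


def validate_metadata(metadata):
    """Reject metadata values that are not scalars (string, number, boolean, null)."""
    for key, value in metadata.items():
        if value is not None and not isinstance(value, (str, int, float, bool)):
            raise ValueError(f"metadata[{key!r}] must be a scalar")
    return metadata


invocation = {
    "object_key": "reports/2024/q1.pdf",
    "custom_id": "q1-report",
    "metadata": validate_metadata({"quarter": 1, "final": True, "reviewer": None}),
}
```

Note that the 120-byte limit is measured on the encoded bytes, so a custom_id containing non-ASCII characters can exceed it with fewer than 120 characters.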

GitHub Repositories

Invocations on sources of the GitHub repository type are sync runs over an individual GitHub repository with some set of configuration parameters. The configuration parameters that are specific to invocations on sources of this type are:
{
	"ref_identifier": {
		"$oneOf": {
			"branch": "string",
			"sha": "string"
		}
	}
}
  • ref_identifier is either a commit SHA or the name of the branch from which to retrieve the code to be synced. If a branch is provided, the code is retrieved from the branch’s latest commit.
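One reading of the `$oneOf` notation above is that ref_identifier carries exactly one of the two keys. The sketch below builds such a payload under that assumption; the helper `build_ref_identifier` is hypothetical and the exact wire format should be confirmed against the API reference.

```python
def build_ref_identifier(branch=None, sha=None):
    """Build the GitHub invocation parameters with exactly one of branch or sha."""
    if (branch is None) == (sha is None):
        raise ValueError("provide exactly one of branch or sha")
    if branch is not None:
        # The latest commit on this branch is used, per the description above.
        return {"ref_identifier": {"branch": branch}}
    return {"ref_identifier": {"sha": sha}}


# Combined with the global invocation parameter described earlier.
github_invocation = {
    "target_collection_name": "chroma-main",
    **build_ref_identifier(branch="main"),
}
```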
For all API endpoints, see the Sync API Reference.