
# Chroma BM25


Chroma provides a built-in BM25 sparse embedding function. BM25 (Best Matching 25) is a ranking function used to estimate the relevance of documents to a given search query. This embedding function runs locally and does not require any external API keys.

Sparse embeddings are useful for retrieval tasks where you want to match on specific keywords or terms rather than on semantic similarity.
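
As a rough illustration of why sparse vectors favor exact term matches, the sketch below hashes each word to an index and counts occurrences. The `toy_sparse_vector` helper, the CRC32 hash, and the 65,536-slot index space are assumptions made for illustration; this is not how Chroma's BM25 function tokenizes, hashes, or weights terms.

```python
import re
import zlib
from collections import Counter


def toy_sparse_vector(text: str, dims: int = 2**16) -> dict[int, int]:
    """Hypothetical helper: hash each lowercase word to an index and count it."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # CRC32 is a stand-in hash, chosen only because it is deterministic across runs.
    return dict(Counter(zlib.crc32(tok.encode("utf-8")) % dims for tok in tokens))


doc = toy_sparse_vector("Chroma stores sparse vectors")
query = toy_sparse_vector("sparse retrieval")

# Only indices present in both vectors contribute to a dot-product score,
# so the shared word "sparse" is what links the query to the document.
shared = set(doc) & set(query)
score = sum(doc[i] * query[i] for i in shared)
```

Hash collisions aside, a query and a document share an index only when they share a word, which is the keyword-matching behavior described above.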

<Tabs>
  <Tab title="Python" icon="python">
    This embedding function uses [snowballstemmer](https://pypi.org/project/snowballstemmer/)
    to stem tokens when tokenizing documents.

    ```bash theme={null}
    pip install snowballstemmer
    ```

    ```python theme={null}
    from chromadb.utils.embedding_functions import ChromaBm25EmbeddingFunction

    bm25_ef = ChromaBm25EmbeddingFunction(
        k=1.2,
        b=0.75,
        avg_doc_length=256.0,
        token_max_length=40
    )

    texts = ["Hello, world!", "How are you?"]
    sparse_embeddings = bm25_ef(texts)
    ```

    You can customize the BM25 parameters:

    * `k`: Controls term frequency saturation (default: 1.2)
    * `b`: Controls document length normalization (default: 0.75)
    * `avg_doc_length`: Average document length in tokens (default: 256.0)
    * `token_max_length`: Maximum token length (default: 40)
    * `stopwords`: Optional list of stopwords to exclude
  </Tab>

  <Tab title="TypeScript" icon="js">
    ```typescript theme={null}
    // npm install @chroma-core/chroma-bm25

    import { ChromaBm25EmbeddingFunction } from "@chroma-core/chroma-bm25";

    const embedder = new ChromaBm25EmbeddingFunction({
      k: 1.2,
      b: 0.75,
      avgDocLength: 256.0,
      tokenMaxLength: 40,
    });

    // Generate sparse embeddings directly, without a collection
    const sparseEmbeddings = await embedder.generate(["document1", "document2"]);
    ```

    You can customize the BM25 parameters:

    * `k`: Controls term frequency saturation (default: 1.2)
    * `b`: Controls document length normalization (default: 0.75)
    * `avgDocLength`: Average document length in tokens (default: 256.0)
    * `tokenMaxLength`: Maximum token length (default: 40)
    * `stopwords`: Optional list of stopwords to exclude
  </Tab>

  <Tab title="Rust" icon="rust">
    Use the built-in BM25 sparse embedding helper, then pass embeddings to Chroma.

    ```rust theme={null}
    use chroma::embed::bm25::BM25SparseEmbeddingFunction;

    let bm25 = BM25SparseEmbeddingFunction::default_murmur3_abs();
    let sparse_vector = bm25.encode("document text")?;
    ```
  </Tab>
</Tabs>
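
To show what the `k`, `b`, and average document length parameters listed above control, here is a minimal sketch of the textbook BM25 term-frequency component. The function name `bm25_term_weight` is hypothetical, and the formula is the standard one from the BM25 literature, not necessarily the exact computation Chroma performs internally.

```python
def bm25_term_weight(
    tf: float,
    doc_len: float,
    k: float = 1.2,
    b: float = 0.75,
    avg_doc_length: float = 256.0,
) -> float:
    """Textbook BM25 term-frequency component (illustrative, not Chroma's code)."""
    # b blends between no length normalization (b=0) and full normalization (b=1).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_length)
    # k sets how quickly repeated occurrences stop adding weight;
    # the result approaches k + 1 as tf grows.
    return tf * (k + 1.0) / (tf + k * length_norm)


# A term seen once, twice, and ten times in an average-length document.
weights = [bm25_term_weight(tf, doc_len=256) for tf in (1, 2, 10)]

# The same single occurrence in a document twice the average length.
long_doc_weight = bm25_term_weight(1, doc_len=512)
```

With the defaults, the weight grows with term frequency but stays below `k + 1`, and the same occurrence counts for less in a longer document unless `b` is set to 0.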
