Nov 11, 2023
The goal of this doc is to align core and community efforts for the project and to share what's in store for this year!
- What is the core Chroma team working on right now?
- What will Chroma prioritize over the next 6mo?
- What areas are great for community contributions?
What is the core Chroma team working on right now?
- 🌩️ Standing up a distributed system as a managed service (aka "Hosted Chroma"; sign up for the waitlist!)
What did the Chroma team just complete?
- New - Chroma 0.4 - our first production-oriented release
- 🐍 A more minimal Python-client-only build target
- ✋ Google PaLM embedding support
- 🎣 OpenAI ChatGPT Retrieval Plugin
What will Chroma prioritize over the next 6mo?
Next Milestone: ☁️ Launch Hosted Chroma
Areas we will invest in
Not an exhaustive list, but these are some of the core team’s biggest priorities over the coming few months. Use caution when contributing in these areas and please check in with the core team first.
- ⏩ Workflow: Building tools for answering questions like: what embedding model should I use? And how should I chunk up my documents?
- 🌌 Visualization: Building a visualization tool to give developers greater intuition about embedding spaces
- 🔀 Query Planner: Building tools to enable pre-query and post-query transforms
- 🔧 Developer experience: Extending Chroma into a CLI
- 📦 Easier Data Sharing: Working on formats for serialization and easier data sharing of embedding Collections
- 🔍 Improving recall: Fine-tuning embedding transforms through human feedback
- 🧠 Analytical horsepower: Clustering, deduplication, classification and more
What areas are great for community contributions?
This is where you have a lot more free rein to contribute (without having to sync with us first)!
If you're unsure about your contribution idea, feel free to chat with us (@chroma) in the
#general channel in our Discord! We'd love to support you however we can.
⚙️ Example Templates
We can always use more integrations with the rest of the AI ecosystem. Please let us know if you're working on one and need help!
Other great starting points for Chroma (please send PRs for more here):
For the integrations we already have, like
LlamaIndex, we always want more tutorials, demos, workshops, videos, and podcasts (we've linked some podcasts on our blog).
📦 Example Datasets
It doesn’t make sense for developers to embed the same information over and over again with the same embedding model.
We'd like suggestions for:
- "small" (<100 rows)
- "medium" (<5MB)
- "large" (>1GB)
datasets for people to stress test Chroma in a variety of scenarios.
⚖️ Embeddings Comparison
Chroma ships with Sentence Transformers by default for embeddings, but we are otherwise unopinionated about which embeddings you use. A library of information that has been embedded with many models, alongside example query sets, would make it much easier to do empirical work on the effectiveness of various models across different domains.
- Preliminary reading on Embeddings
- Hugging Face benchmark of a bunch of embeddings
- Notable issues with GPT-3 embeddings and alternatives to consider
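As a rough illustration of the kind of empirical comparison described above, here is a minimal sketch that scores embedding functions by recall@k on a small labeled query set. It does not use Chroma's API; `recall_at_k`, `letter_counts`, and `length_only` are hypothetical names, and the two toy embedders only stand in for real models so the example runs without downloads.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]
EmbedFn = Callable[[str], Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed: EmbedFn,
    docs: Dict[str, str],      # doc id -> document text
    queries: Dict[str, str],   # query text -> id of the doc it should retrieve
    k: int = 1,
) -> float:
    """Fraction of queries whose expected doc id appears in the top-k results."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, expected_id in queries.items():
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        if expected_id in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy stand-ins for real embedding models (hypothetical, illustration only).
def letter_counts(text: str) -> Vector:
    text = text.lower()
    return [float(text.count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]


def length_only(text: str) -> Vector:
    return [float(len(text)), 1.0]


docs = {
    "cats": "cats purr and nap in the sun",
    "rust": "rust compiles to fast native code",
}
queries = {"napping cats": "cats", "native code compilers": "rust"}

for name, embed in [("letter_counts", letter_counts), ("length_only", length_only)]:
    print(name, recall_at_k(embed, docs, queries, k=1))
```

Swapping the toy functions for real models (and the two documents for a shared example dataset) would give a like-for-like comparison across domains, assuming every model is evaluated on the same documents and queries.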
⚗️ Experimental Algorithms
If you have a research background, please consider adding to our
ExperimentalAPIs. For example:
- Projections (t-SNE, UMAP, the new hotness, the one you just wrote) and lightweight visualization
- Clustering (HDBSCAN, PCA)
- Multimodal (CLIP)
- Fine-tuning manifold with human feedback eg
- Expanded vector search (MMR, Polytope)
- Your research
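To make one item from the list above concrete, here is a minimal sketch of Maximal Marginal Relevance (MMR) re-ranking in plain Python. It is not Chroma's implementation or API; the function name `mmr`, the `lambda_mult` parameter, and the use of cosine similarity are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query_vec: Vector,
    candidate_vecs: Sequence[Vector],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Pick up to k candidate indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks purely by similarity to the query;
    lower values penalize candidates similar to ones already selected.
    """
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    remaining = list(range(len(candidate_vecs)))
    selected: List[int] = []

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Highest similarity to anything already picked (0.0 on the first pick).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected
```

In practice the candidates would be the top results of an ordinary nearest-neighbor query, and MMR would re-rank them so that near-duplicate documents do not crowd out other relevant ones.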
🧑‍💻 Additional Client SDKs
We will be happy to work with people maintaining additional client SDKs as part of the community. Specifically:
You can find the REST OpenAPI spec at
localhost:8000/openapi.json when the backend is running.
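For SDK authors, a quick way to see what a client needs to cover is to list the operations in that spec. The sketch below assumes the backend is reachable over plain `http` at the address above and that the document follows the standard OpenAPI layout with a top-level `paths` object; `fetch_spec` and `list_operations` are hypothetical helper names, not part of Chroma.

```python
import json
from typing import List
from urllib.request import urlopen

# Address taken from the text; the http scheme is an assumption.
SPEC_URL = "http://localhost:8000/openapi.json"

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def fetch_spec(url: str = SPEC_URL, timeout: float = 5.0) -> dict:
    """Download and parse the OpenAPI document (requires a running backend)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def list_operations(spec: dict) -> List[str]:
    """Return 'METHOD /path' strings for every operation in the spec."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method in sorted(item):
            # Path items can also hold non-operation keys such as "parameters".
            if method in HTTP_METHODS:
                operations.append(f"{method.upper()} {path}")
    return operations
```

With the backend running, `print("\n".join(list_operations(fetch_spec())))` prints one line per endpoint, which can serve as a checklist when building a new client.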
Please reach out and talk to us before you get too far into your project so that we can offer technical guidance and align on the roadmap.