Filtering with Where
Learn how to filter search results using Where expressions and the Key/K class to narrow down your search to specific documents, IDs, or metadata values.
The Key/K Class
The Key class (aliased as K for brevity) provides a fluent interface for building filter expressions. Use K to reference document fields, IDs, and metadata properties.
from chromadb import K
# K is an alias for Key - use K for more concise code
# Filter by metadata field
K("status") == "active"
# Filter by document content
K.DOCUMENT.contains("machine learning")
# Filter by document IDs
K.ID.is_in(["doc1", "doc2", "doc3"])
Filterable Fields
| Field | Usage | Description |
|---|---|---|
| K.ID | K.ID.is_in(["id1", "id2"]) | Filter by document IDs |
| K.DOCUMENT | K.DOCUMENT.contains("text") | Filter by document content |
| K("field_name") | K("status") == "active" | Filter by any metadata field |
Comparison Operators
Supported operators:
- == - Equality (all types: string, numeric, boolean)
- != - Inequality (all types: string, numeric, boolean)
- > - Greater than (numeric only)
- >= - Greater than or equal (numeric only)
- < - Less than (numeric only)
- <= - Less than or equal (numeric only)
# Equality and inequality (all types)
K("status") == "published" # String equality
K("views") != 0 # Numeric inequality
K("featured") == True # Boolean equality
# Numeric comparisons (numbers only)
K("price") > 100 # Greater than
K("rating") >= 4.5 # Greater than or equal
K("stock") < 10 # Less than
K("discount") <= 0.25 # Less than or equal
Chroma supports three data types for metadata: strings, numbers (int/float), and booleans. Order comparison operators (>, <, >=, <=) currently only work with numeric types.
Set and String Operators
Supported operators:
- is_in() - Value matches any in the list
- not_in() - Value doesn't match any in the list
- contains() - String contains substring (case-sensitive, currently K.DOCUMENT only)
- not_contains() - String doesn't contain substring (currently K.DOCUMENT only)
- regex() - String matches regex pattern (currently K.DOCUMENT only)
- not_regex() - String doesn't match regex pattern (currently K.DOCUMENT only)
# Set membership operators (work on all fields)
K.ID.is_in(["doc1", "doc2", "doc3"]) # Match any ID in list
K("category").is_in(["tech", "science"]) # Match any category
K("status").not_in(["draft", "deleted"]) # Exclude specific values
# String content operators (currently K.DOCUMENT only)
K.DOCUMENT.contains("machine learning") # Substring search in document
K.DOCUMENT.not_contains("deprecated") # Exclude documents with text
K.DOCUMENT.regex(r"\bAPI\b") # Match whole word "API" in document
# Note: String pattern matching on metadata fields is not yet supported
# K("title").contains("Python") # NOT YET SUPPORTED
# K("email").regex(r".*@company\.com$") # NOT YET SUPPORTED
String operations like contains() and regex() are case-sensitive by default. The is_in() operator is efficient even with large lists.
Logical Operators
Supported operators:
- & - Logical AND (all conditions must match)
- | - Logical OR (any condition can match)
Combine multiple conditions using these operators. Always use parentheses to ensure correct precedence.
# AND operator (&) - all conditions must match
(K("status") == "published") & (K("year") >= 2020)
# OR operator (|) - any condition can match
(K("category") == "tech") | (K("category") == "science")
# Combining with document and ID filters
(K.DOCUMENT.contains("machine learning")) & (K("author") == "Smith")
(K.ID.is_in(["id1", "id2"])) | (K("featured") == True)
# Complex nesting - use parentheses for clarity
(
    (K("status") == "published") &
    ((K("category") == "tech") | (K("category") == "science")) &
    (K("rating") >= 4.0)
)
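Always wrap each condition in parentheses when using logical operators. In Python, & and | bind more tightly than comparison operators such as == and >=, so an unparenthesized expression like K("status") == "published" & K("year") >= 2020 is not parsed as two conditions joined by AND.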
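To see why the parentheses matter, you can ask Python how it parses an expression without evaluating it. The sketch below uses only the standard-library ast module, so it makes no assumptions about how K itself is implemented. The helper name top_level_node is introduced here purely for illustration.

```python
import ast


def top_level_node(expression: str) -> str:
    """Return the name of the outermost AST node Python builds for an expression."""
    # mode="eval" parses a single expression; nothing is executed, so K need not exist.
    return type(ast.parse(expression, mode="eval").body).__name__


unparenthesized = 'K("status") == "published" & K("year") >= 2020'
parenthesized = '(K("status") == "published") & (K("year") >= 2020)'

print(top_level_node(unparenthesized))  # Compare
print(top_level_node(parenthesized))    # BinOp
```

Without parentheses, & is applied first, so Python builds a chained comparison around "published" & K("year") rather than an AND of two conditions. With parentheses, the outermost node is the & operation, as intended.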
Dictionary Syntax (MongoDB-style)
You can also use dictionary syntax instead of K expressions. This is useful when building filters programmatically.
Supported dictionary operators:
- Direct value - Shorthand for equality
- $eq - Equality
- $ne - Not equal
- $gt - Greater than (numeric only)
- $gte - Greater than or equal (numeric only)
- $lt - Less than (numeric only)
- $lte - Less than or equal (numeric only)
- $in - Value in list
- $nin - Value not in list
- $contains - String contains
- $not_contains - String doesn't contain
- $regex - Regex match
- $not_regex - Regex doesn't match
- $and - Logical AND
- $or - Logical OR
# Direct equality (shorthand)
{"status": "active"} # Same as K("status") == "active"
# Comparison operators
{"status": {"$eq": "published"}} # Same as K("status") == "published"
{"count": {"$ne": 0}} # Same as K("count") != 0
{"price": {"$gt": 100}} # Same as K("price") > 100 (numbers only)
{"rating": {"$gte": 4.5}} # Same as K("rating") >= 4.5 (numbers only)
{"stock": {"$lt": 10}} # Same as K("stock") < 10 (numbers only)
{"discount": {"$lte": 0.25}} # Same as K("discount") <= 0.25 (numbers only)
# Set membership operators
{"#id": {"$in": ["id1", "id2"]}} # Same as K.ID.is_in(["id1", "id2"])
{"category": {"$in": ["tech", "ai"]}} # Same as K("category").is_in(["tech", "ai"])
{"status": {"$nin": ["draft", "deleted"]}} # Same as K("status").not_in(["draft", "deleted"])
# String operators (currently K.DOCUMENT only)
{"#document": {"$contains": "API"}} # Same as K.DOCUMENT.contains("API")
# {"title": {"$not_contains": "draft"}} # Not yet supported - metadata fields
# {"email": {"$regex": ".*@example\\.com"}} # Not yet supported - metadata fields
# {"version": {"$not_regex": "^beta"}} # Not yet supported - metadata fields
# Logical operators
{"$and": [
    {"status": "published"},
    {"year": {"$gte": 2020}},
    {"#document": {"$contains": "machine learning"}}
]} # Combines multiple conditions with AND
{"$or": [
    {"category": "tech"},
    {"category": "science"},
    {"featured": True}
]} # Combines multiple conditions with OR
# Complex nested example
{
    "$and": [
        {"$or": [
            {"category": "tech"},
            {"category": "science"}
        ]},
        {"status": "published"},
        {"quality_score": {"$gte": 0.8}}
    ]
}
Each dictionary can only contain one field or one logical operator ($and/$or). For field dictionaries, only one operator is allowed per field.
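Because each field dictionary takes a single operator, a range such as 100 <= price <= 500 has to be written as two conditions joined by $and. The following is a minimal sketch of building such filters programmatically. The helpers all_of and between are hypothetical names, not part of Chroma, and the result is a plain dictionary in the syntax listed above.

```python
from typing import Any, Dict

Filter = Dict[str, Any]


def all_of(*conditions: Filter) -> Filter:
    """Combine conditions with $and; a single condition is returned unchanged."""
    if not conditions:
        raise ValueError("at least one condition is required")
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": list(conditions)}


def between(field: str, low: float, high: float) -> Filter:
    """Inclusive numeric range, written as two single-operator conditions."""
    return all_of({field: {"$gte": low}}, {field: {"$lte": high}})


where = all_of({"status": "published"}, between("price", 100, 500))
print(where)
```

Every dictionary produced this way holds exactly one field or one logical operator, which keeps the output within the rule stated above.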
Common Filtering Patterns
from chromadb import Search, K
# Filter by specific document IDs
search = Search().where(K.ID.is_in(["doc_001", "doc_002", "doc_003"]))
# Exclude already processed documents
processed_ids = ["doc_100", "doc_101"]
search = Search().where(K.ID.not_in(processed_ids))
# Full-text search in documents
search = Search().where(K.DOCUMENT.contains("quantum computing"))
# Combine document search with metadata
search = Search().where(
    K.DOCUMENT.contains("machine learning") &
    (K("language") == "en")
)
# Price range filtering
search = Search().where(
    (K("price") >= 100) &
    (K("price") <= 500)
)
# Multi-field filtering
search = Search().where(
    (K("status") == "active") &
    (K("category").is_in(["tech", "ai", "ml"])) &
    (K("score") >= 0.8)
)
Edge Cases and Important Behavior
Missing Keys
When filtering on a metadata field that doesn't exist for a document:
- Most operators (==, >, <, >=, <=, is_in()) evaluate to false - the document won't match
- != evaluates to true - documents without the field are considered "not equal" to any value
- not_in() evaluates to true - documents without the field are not in any list
# If a document doesn't have a "category" field:
K("category") == "tech" # false - won't match
K("category") != "tech" # true - will match
K("category").is_in(["tech"]) # false - won't match
K("category").not_in(["tech"]) # true - will match
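The missing-key rules can be easier to reason about with a small in-memory model. The sketch below is not Chroma's implementation. It is a hypothetical matches function over plain dictionaries that reproduces only the behavior described in this section, for the equality and set operators.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel: distinguishes "no such key" from any stored value


def matches(metadata: Dict[str, Any], field: str, op: str, value: Any) -> bool:
    """Evaluate one condition against one document's metadata (illustrative model only)."""
    actual = metadata.get(field, _MISSING)
    if actual is _MISSING:
        # Only the negated operators match a document that lacks the field.
        return op in ("$ne", "$nin")
    if op == "$eq":
        return actual == value
    if op == "$ne":
        return actual != value
    if op == "$in":
        return actual in value
    if op == "$nin":
        return actual not in value
    raise ValueError(f"unsupported operator: {op}")


docs = [{"category": "tech"}, {"category": "science"}, {}]
not_tech = [d for d in docs if matches(d, "category", "$ne", "tech")]
print(not_tech)  # [{'category': 'science'}, {}]
```

Note that the document with no category at all lands in the "not tech" result. If that is not what you want, one option is to store an explicit default value for the field at ingestion time so that every document has the key.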
Mixed Types
Avoid storing different data types under the same metadata key across documents. Query behavior is undefined when comparing values of different types.
# DON'T DO THIS - undefined behavior
# Document 1: {"score": 95} (numeric)
# Document 2: {"score": "95"} (string)
# Document 3: {"score": true} (boolean)
K("score") > 90 # Undefined results when mixed types exist
# DO THIS - consistent types
# All documents: {"score": <numeric>} or all {"score": <string>}
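One way to catch mixed types early is to scan your metadata before adding it to a collection. The sketch below works on plain Python dictionaries and does not call Chroma. The helper names type_name and find_mixed_type_keys are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def type_name(value: Any) -> str:
    """Map a Python value to one of the three metadata types."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported metadata value: {value!r}")


def find_mixed_type_keys(metadatas: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Return every key that is stored with more than one metadata type."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for metadata in metadatas:
        for key, value in metadata.items():
            seen[key].add(type_name(value))
    return {key: types for key, types in seen.items() if len(types) > 1}


metadatas = [
    {"score": 95, "lang": "en"},
    {"score": "95", "lang": "fr"},
    {"score": True},
]
print(find_mixed_type_keys(metadatas))
```

Here "score" is reported because it appears as a number, a string, and a boolean, while "lang" is consistently a string and is not reported.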
String Pattern Matching Limitations
Currently, contains(), not_contains(), regex(), and not_regex() operators only work on K.DOCUMENT. These operators do not yet support metadata fields.
Additionally, the pattern must contain at least 3 literal characters to ensure accurate results.
# Currently supported - K.DOCUMENT only
K.DOCUMENT.contains("API") # ✓ Works
K.DOCUMENT.regex(r"v\d\.\d\.\d") # ✓ Works
K.DOCUMENT.contains("machine learning") # ✓ Works
# NOT YET SUPPORTED - metadata fields
K("title").contains("Python") # ✗ Not supported yet
K("description").regex(r"API.*") # ✗ Not supported yet
# Pattern length requirements (for K.DOCUMENT)
K.DOCUMENT.contains("API") # ✓ 3 characters - good
K.DOCUMENT.contains("AI") # ✗ Only 2 characters - may give incorrect results
K.DOCUMENT.regex(r"\d+") # ✗ No literal characters - may give incorrect results
String pattern matching currently works only on K.DOCUMENT, and patterns with fewer than 3 literal characters may return incorrect results. Support for metadata fields is planned for a future release, which will let users opt in to additional indexes for string pattern matching on specific metadata fields.
Complete Example
Here's a practical example combining different filter types:
from chromadb import Search, K, Knn
# Complex filter combining IDs, document content, and metadata
search = (Search()
    .where(
        # Exclude specific documents
        K.ID.not_in(["excluded_001", "excluded_002"]) &
        # Must contain specific content
        K.DOCUMENT.contains("artificial intelligence") &
        # Metadata conditions
        (K("status") == "published") &
        (K("quality_score") >= 0.75) &
        (
            (K("category") == "research") |
            (K("category") == "tutorial")
        ) &
        (K("year") >= 2023)
    )
    .rank(Knn(query="latest AI research developments"))
    .limit(10)
    .select(K.DOCUMENT, "title", "author", "year")
)
results = collection.search(search)
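For comparison, the filter part of this example could be written in dictionary syntax roughly as follows. This is a sketch based on the operator mappings listed in the dictionary syntax section. In particular, using $nin with "#id" is inferred from the $in example there and should be treated as an assumption.

```python
# Dictionary form of the filter above; each inner dictionary holds
# exactly one field or one logical operator.
where = {
    "$and": [
        {"#id": {"$nin": ["excluded_001", "excluded_002"]}},  # assumed by analogy with $in
        {"#document": {"$contains": "artificial intelligence"}},
        {"status": "published"},
        {"quality_score": {"$gte": 0.75}},
        {"$or": [{"category": "research"}, {"category": "tutorial"}]},
        {"year": {"$gte": 2023}},
    ]
}
```

A dictionary like this would take the place of the K expression passed to where(); the ranking, limit, and select steps are unaffected.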
Tips and Best Practices
- Use parentheses liberally when combining conditions with & and | to avoid precedence issues
- Filter before ranking when possible to reduce the number of vectors to score
- Be specific with ID filters - using K.ID.is_in() with a small list is very efficient
- String matching is case-sensitive - normalize your data if case-insensitive matching is needed
- Use the right operator - is_in() for multiple exact matches, contains() for substring search
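As an illustration of the normalization tip, one approach is to store a lowercased copy of a metadata field at ingestion time and lowercase the user's input before filtering on that copy. The sketch below operates on plain dictionaries. The "_lower" suffix and the helper name with_lowercase_copies are conventions chosen here, not Chroma features.

```python
from typing import Any, Dict, Iterable


def with_lowercase_copies(metadata: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of metadata with a lowercased "<field>_lower" twin for each string field."""
    result = dict(metadata)
    for field in fields:
        value = metadata.get(field)
        if isinstance(value, str):  # skip missing keys and non-string values
            result[f"{field}_lower"] = value.lower()
    return result


# At ingestion time: keep the original value and add the normalized twin.
stored = with_lowercase_copies({"author": "Smith", "year": 2023}, ["author"])
print(stored)  # {'author': 'Smith', 'year': 2023, 'author_lower': 'smith'}

# At query time: normalize the input the same way and filter on the twin.
user_input = "SMITH"
where = {"author_lower": user_input.lower()}
print(where)  # {'author_lower': 'smith'}
```

The original field stays available for display, while equality filters run against the normalized copy.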
Next Steps
- Learn about ranking and scoring to order your filtered results
- See practical examples of filtering in real-world scenarios
- Explore batch operations for running multiple filtered searches