starling.search.search_utils
Search Utilities
Core utility classes for similarity search: score conversion, candidate representation, and extensible filters.
Overview
This module provides building blocks for the search pipeline:
ScoreConverter: Handles metric-specific score transformations
Candidate: Immutable representation of search results
CandidateFilter: Abstract base for custom filtering logic
Built-in Filters: ValidGid, L2Distance, CosineSim, Length, ExactMatch, SequenceIdentity
These utilities are used internally by SearchEngine but can also be used directly for custom search pipelines.
Score Conversion
The ScoreConverter handles conversions between FAISS raw scores and user-facing outputs:
For Cosine Similarity:
FAISS returns inner product scores (higher = more similar)
return_similarity=True: Output as-is [0, 1]
return_similarity=False: Convert to distance (1 - similarity)
For L2 Distance:
FAISS returns squared L2 distance (lower = more similar)
Always output as distance (no conversion)
Usage:
>>> converter = ScoreConverter(metric="cosine", return_similarity=True)
>>> output_score = converter.convert(raw_faiss_score=0.95)
>>> output_score
0.95
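The conversion rules above can be sketched as a standalone function. This is only an illustration of the described behavior, not the ScoreConverter implementation; the function name convert_score and the metric label "l2" are assumptions made for the example.

```python
def convert_score(raw_score: float, metric: str, return_similarity: bool = True) -> float:
    """Map a raw index score to the user-facing value (illustrative only)."""
    if metric == "cosine":
        # Raw score is an inner product: keep it, or flip it into a distance.
        return raw_score if return_similarity else 1.0 - raw_score
    if metric == "l2":  # assumed label for the L2 metric
        # Raw score is already a (squared) distance, so it passes through unchanged.
        return raw_score
    raise ValueError(f"unsupported metric: {metric!r}")


similarity = convert_score(0.95, metric="cosine", return_similarity=True)
distance = convert_score(0.75, metric="cosine", return_similarity=False)
l2_distance = convert_score(1.5, metric="l2")
```

Note that return_similarity only matters for the cosine branch; the L2 branch ignores it, matching the "no conversion" rule.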
Candidate Representation
The Candidate dataclass provides a clean interface for search results:
- Attributes:
score (float): Converted score/similarity
gid (int): Global sequence ID
header (str | None): Sequence header from database
length (int | None): Sequence length
stored_hash (int | None): 8-byte sequence hash for deduplication
Usage:
>>> candidate = Candidate(
... score=0.95,
... gid=12345,
... header="sp|P12345|PROT_HUMAN",
... length=234,
... stored_hash=123456789
... )
>>> candidate.as_tuple()
(0.95, 12345, 'sp|P12345|PROT_HUMAN', 234)
Custom Filters
Extend CandidateFilter to create custom filtering logic:
Example - Filter by minimum score:
class MinScoreFilter(CandidateFilter):
    def __init__(self, min_score: float):
        self.min_score = min_score
    def apply(self, candidate: Candidate, query_seq: str = None) -> bool:
        return candidate.score >= self.min_score
    def get_name(self) -> str:
        return "min_score"
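To show such a filter in use without importing the library, here is a minimal self-contained sketch. The Candidate and CandidateFilter definitions below are simplified stand-ins written for this example, assuming a frozen dataclass and an abstract base with apply and get_name; the real classes may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:  # stand-in for the library's Candidate
    score: float
    gid: int
    header: Optional[str] = None
    length: Optional[int] = None
    stored_hash: Optional[int] = None


class CandidateFilter(ABC):  # stand-in for the library's base class
    @abstractmethod
    def apply(self, candidate: Candidate, query_seq: Optional[str] = None) -> bool:
        """Return True if the candidate should be kept."""

    @abstractmethod
    def get_name(self) -> str:
        """Return a short identifier for the filter."""


class MinScoreFilter(CandidateFilter):
    def __init__(self, min_score: float):
        self.min_score = min_score

    def apply(self, candidate: Candidate, query_seq: Optional[str] = None) -> bool:
        return candidate.score >= self.min_score

    def get_name(self) -> str:
        return "min_score"


candidates = [Candidate(score=0.95, gid=1), Candidate(score=0.40, gid=2)]
min_score = MinScoreFilter(min_score=0.5)
# Keep only the candidates the filter accepts.
kept_gids = [c.gid for c in candidates if min_score.apply(c)]
```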
Built-in Filters
- ValidGidFilter
Filters out invalid GIDs (< 0). Always active in search pipeline.
Usage:
filter = ValidGidFilter()
passes = filter.apply(candidate)  # False if gid < 0
- L2DistanceFilter
Filters by minimum L2 distance (for L2 metric).
- Parameters:
min_distance (float): Minimum distance threshold
Usage:
filter = L2DistanceFilter(min_distance=0.5)
passes = filter.apply(candidate)  # True if distance >= 0.5
- CosineSimFilter
Filters by maximum cosine similarity (for cosine metric).
- Parameters:
max_similarity (float): Maximum similarity threshold
return_similarity (bool): Whether scores are similarities or distances
Usage:
filter = CosineSimFilter(max_similarity=0.99, return_similarity=True)
passes = filter.apply(candidate)  # True if similarity <= 0.99
- LengthFilter
Filters by sequence length range.
- Parameters:
min_len (int | None): Minimum length (inclusive)
max_len (int | None): Maximum length (inclusive)
Usage:
filter = LengthFilter(min_len=50, max_len=500)
passes = filter.apply(candidate)  # True if 50 <= length <= 500
- ExactMatchFilter
Filters out exact sequence matches using hash + full comparison.
- Parameters:
query_hash (int): Hash of query sequence
seq_store (SequenceStore): Database for sequence lookup
Usage:
query_hash = SequenceStore.hash8(query_seq)
filter = ExactMatchFilter(query_hash, seq_store)
passes = filter.apply(candidate, query_seq)  # False if exact match
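The "hash + full comparison" strategy can be sketched as follows: compare the cheap 8-byte hashes first, and fetch the stored sequence only when they agree. The hash8 function here is a hypothetical stand-in built on BLAKE2b (the algorithm behind SequenceStore.hash8 is not stated in this document), and is_exact_match and fetch_sequence are names invented for the example.

```python
import hashlib
from typing import Callable, List


def hash8(seq: str) -> int:
    """Hypothetical 8-byte hash; SequenceStore.hash8 may use another algorithm."""
    digest = hashlib.blake2b(seq.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def is_exact_match(query_seq: str, query_hash: int, stored_hash: int,
                   fetch_sequence: Callable[[], str]) -> bool:
    # Cheap check: different hashes guarantee different sequences.
    if stored_hash != query_hash:
        return False
    # Hashes agree: fetch the full sequence to rule out a hash collision.
    return fetch_sequence() == query_seq


fetched: List[str] = []  # records which sequences were actually fetched


def make_fetch(seq: str) -> Callable[[], str]:
    def fetch() -> str:
        fetched.append(seq)
        return seq
    return fetch


query = "MKTAYIAKQR"
other = "MKTAYIAKQL"
query_hash = hash8(query)
same = is_exact_match(query, query_hash, hash8(query), make_fetch(query))
different = is_exact_match(query, query_hash, hash8(other), make_fetch(other))
```

In this sketch the non-matching candidate is rejected on the hash alone, so only one sequence fetch is recorded.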
- SequenceIdentityFilter
Filters by maximum sequence identity.
- Parameters:
max_identity (float): Maximum identity threshold (0-1)
denominator (str): Identity denominator ("query", "target", "min", "max", "avg")
seq_store (SequenceStore): Database for sequence lookup
identity_func (callable): Function computing identity
Usage:
def compute_identity(seq1, seq2, denom="query"):
    # Your alignment logic here
    return identity_score
filter = SequenceIdentityFilter(
    max_identity=0.95,
    denominator="query",
    seq_store=seq_store,
    identity_func=compute_identity
)
passes = filter.apply(candidate, query_seq)  # True if identity < 0.95
Filter Pipeline
Filters are applied sequentially in SearchEngine. The first filter that fails stops evaluation of that candidate, so later filters never run on it.
- Pipeline order:
ValidGidFilter (always first)
L2DistanceFilter or CosineSimFilter (embedding-level)
LengthFilter (metadata-level)
ExactMatchFilter (sequence-level, per-query)
SequenceIdentityFilter (alignment-level, per-query)
This ordering minimizes expensive operations (sequence fetches, alignments).
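The short-circuit behavior described above can be illustrated with a small sketch. It uses plain dictionaries and (name, callable) pairs instead of the library's classes; first_failure and the filter labels are hypothetical names chosen for the example, not SearchEngine internals.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[dict], bool]]


def first_failure(candidate: dict, checks: List[Check]) -> Optional[str]:
    """Return the name of the first check that rejects the candidate, or None."""
    for name, check in checks:
        if not check(candidate):
            return name  # stop here; later (costlier) checks never run
    return None


identity_calls: List[int] = []  # records which gids reached the expensive check


def identity_check(candidate: dict) -> bool:
    identity_calls.append(candidate["gid"])
    return candidate["identity"] <= 0.95


# Cheap checks come first, the expensive alignment-level check last.
checks: List[Check] = [
    ("valid_gid", lambda c: c["gid"] >= 0),
    ("length", lambda c: 50 <= c["length"] <= 500),
    ("sequence_identity", identity_check),
]

too_short = {"gid": 3, "length": 10, "identity": 0.20}
good = {"gid": 7, "length": 234, "identity": 0.20}
rejected_by = first_failure(too_short, checks)
accepted = first_failure(good, checks)
```

Here the too-short candidate is rejected by the length check, so the identity check runs only for the candidate that passed every cheaper check.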
Optimization Tips:
Use length_min/length_max to pre-filter via SQL index (much faster than post-filtering)
Place cheap filters before expensive ones
Use hash comparison before full sequence comparison
Consider overfetch parameter when using aggressive filters
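The reason overfetching helps with aggressive filters can be shown with a toy example. This sketch assumes overfetch acts as a multiplier on k when retrieving raw hits, which is an assumption about its semantics rather than something this document states; top_k_after_filter is a hypothetical helper.

```python
from typing import Callable, List


def top_k_after_filter(ranked_scores: List[float], keep: Callable[[float], bool],
                       k: int, overfetch: int = 1) -> List[float]:
    """Take the best k * overfetch raw hits, filter them, then truncate to k."""
    pool = ranked_scores[: k * overfetch]
    return [score for score in pool if keep(score)][:k]


# Raw hits sorted by similarity; the top three are near-duplicates of the query.
ranked = [0.999, 0.998, 0.997, 0.90, 0.85, 0.80]


def keep(score: float) -> bool:
    return score <= 0.99  # aggressive filter: drop near-identical hits


without_overfetch = top_k_after_filter(ranked, keep, k=3)
with_overfetch = top_k_after_filter(ranked, keep, k=3, overfetch=2)
```

Without overfetching, the filter removes every retrieved hit and the result is empty; with a factor of 2, enough candidates survive to fill all k slots.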
Integration with SearchEngine
SearchEngine automatically builds and applies filters based on search parameters:
results = engine.search(
queries=queries,
k=100,
nprobe=128,
# These parameters create filters internally:
length_min=50, # -> LengthFilter
length_max=500, # -> LengthFilter
max_cosine_similarity=0.99, # -> CosineSimFilter
exclude_exact_sequence=True,# -> ExactMatchFilter (per-query)
sequence_identity_max=0.95 # -> SequenceIdentityFilter (per-query)
)
See also
SearchEngine: Main search interface using these utilities
SequenceStore: Database for sequence lookups in filters
starling.search: Main search module
Classes
Candidate: Represents a search result candidate.
CandidateFilter: Base class for candidate filters.
CosineSimFilter: Filter by maximum cosine similarity.
ExactMatchFilter: Filter out exact sequence matches.
L2DistanceFilter: Filter by minimum L2 distance.
LengthFilter: Filter by sequence length range.
ScoreConverter: Handles score/similarity conversion for different metrics.
SequenceIdentityFilter: Filter by maximum sequence identity.
ValidGidFilter: Filter out invalid GIDs.