starling.search.search_utils

Search Utilities

Core utility classes for similarity search: score conversion, candidate representation, and extensible filters.

Overview

This module provides building blocks for the search pipeline:

  • ScoreConverter: Handles metric-specific score transformations

  • Candidate: Immutable representation of search results

  • CandidateFilter: Abstract base for custom filtering logic

  • Built-in Filters: ValidGidFilter, L2DistanceFilter, CosineSimFilter, LengthFilter, ExactMatchFilter, SequenceIdentityFilter

These utilities are used internally by SearchEngine but can also be used directly for custom search pipelines.

Score Conversion

The ScoreConverter handles conversions between FAISS raw scores and user-facing outputs:

For Cosine Similarity:

  • FAISS returns inner product scores (higher = more similar)

  • return_similarity=True: Output as-is (nominally in [0, 1]; inner products of normalized vectors can be as low as -1)

  • return_similarity=False: Convert to distance (1 - similarity)

For L2 Distance:

  • FAISS returns squared L2 distance (lower = more similar)

  • Always output as distance (no conversion)

Usage:

>>> converter = ScoreConverter(metric="cosine", return_similarity=True)
>>> output_score = converter.convert(raw_faiss_score=0.95)
>>> output_score
0.95
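
The conversion rules above fit in a few lines. The following is a minimal sketch under stated assumptions, not the library's implementation: convert_score is a hypothetical standalone function, and the metric names "cosine" and "l2" are assumed spellings.

```python
def convert_score(raw: float, metric: str, return_similarity: bool = True) -> float:
    """Hypothetical stand-in mirroring the conversion rules described above."""
    if metric == "cosine":
        # Raw score is an inner product (higher = more similar).
        return raw if return_similarity else 1.0 - raw
    if metric == "l2":
        # Raw score is already a squared distance; pass it through unchanged.
        return raw
    raise ValueError(f"unsupported metric: {metric!r}")


similarity = convert_score(0.95, "cosine", return_similarity=True)
distance = convert_score(0.95, "cosine", return_similarity=False)
l2_score = convert_score(2.5, "l2")
```

Note that return_similarity has no effect for the L2 metric in this sketch, matching the "no conversion" rule above.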

Candidate Representation

The Candidate dataclass provides a clean interface for search results:

Attributes:

score (float): Converted score/similarity
gid (int): Global sequence ID
header (str | None): Sequence header from database
length (int | None): Sequence length
stored_hash (int | None): 8-byte sequence hash for deduplication

Usage:

>>> candidate = Candidate(
...     score=0.95,
...     gid=12345,
...     header="sp|P12345|PROT_HUMAN",
...     length=234,
...     stored_hash=123456789
... )
>>> candidate.as_tuple()
(0.95, 12345, 'sp|P12345|PROT_HUMAN', 234)
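
To illustrate what "immutable" means in practice, here is a small sketch using a frozen dataclass. CandidateSketch is a hypothetical stand-in with the attributes listed above; whether the real Candidate uses frozen=True is an assumption.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class CandidateSketch:
    """Hypothetical stand-in, not the starling Candidate class."""
    score: float
    gid: int
    header: Optional[str] = None
    length: Optional[int] = None
    stored_hash: Optional[int] = None

    def as_tuple(self):
        # Mirrors the documented output: stored_hash is left out.
        return (self.score, self.gid, self.header, self.length)


hit = CandidateSketch(score=0.95, gid=12345, header="sp|P12345|PROT_HUMAN", length=234)
try:
    hit.score = 0.5  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```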

Custom Filters

Extend CandidateFilter to create custom filtering logic:

Example - Filter by minimum score:

class MinScoreFilter(CandidateFilter):
    def __init__(self, min_score: float):
        self.min_score = min_score

    def apply(self, candidate: Candidate, query_seq: str | None = None) -> bool:
        return candidate.score >= self.min_score

    def get_name(self) -> str:
        return "min_score"
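
A custom filter of this kind could be exercised as follows. This is a self-contained sketch: the Candidate and CandidateFilter classes below are simplified stand-ins defined only for illustration, and the assumption that apply and get_name are the abstract methods is inferred from the example above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:  # stand-in, not the starling class
    score: float
    gid: int


class CandidateFilter(ABC):  # stand-in for the abstract base
    @abstractmethod
    def apply(self, candidate: Candidate, query_seq: Optional[str] = None) -> bool: ...

    @abstractmethod
    def get_name(self) -> str: ...


class MinScoreFilter(CandidateFilter):
    def __init__(self, min_score: float):
        self.min_score = min_score

    def apply(self, candidate, query_seq=None):
        return candidate.score >= self.min_score

    def get_name(self):
        return "min_score"


hits = [Candidate(0.91, 1), Candidate(0.42, 2), Candidate(0.77, 3)]
min_score = MinScoreFilter(0.75)
# Keep only the GIDs of candidates that pass the filter.
kept = [c.gid for c in hits if min_score.apply(c)]
```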

Built-in Filters

ValidGidFilter

Filters out candidates with an invalid GID (gid < 0). Always active in the search pipeline.

Usage:

filter = ValidGidFilter()
passes = filter.apply(candidate)  # False if gid < 0

L2DistanceFilter

Keeps candidates whose L2 distance is at least min_distance (for L2 metric).

Parameters:

min_distance (float): Minimum distance threshold

Usage:

filter = L2DistanceFilter(min_distance=0.5)
passes = filter.apply(candidate)  # True if distance >= 0.5

CosineSimFilter

Keeps candidates whose cosine similarity is at most max_similarity (for cosine metric).

Parameters:

max_similarity (float): Maximum similarity threshold
return_similarity (bool): Whether scores are similarities or distances

Usage:

filter = CosineSimFilter(max_similarity=0.99, return_similarity=True)
passes = filter.apply(candidate)  # True if similarity <= 0.99

LengthFilter

Filters by sequence length range.

Parameters:

min_len (int | None): Minimum length (inclusive)
max_len (int | None): Maximum length (inclusive)

Usage:

filter = LengthFilter(min_len=50, max_len=500)
passes = filter.apply(candidate)  # True if 50 <= length <= 500

ExactMatchFilter

Filters out exact sequence matches using hash + full comparison.

Parameters:

query_hash (int): Hash of query sequence
seq_store (SequenceStore): Database for sequence lookup

Usage:

query_hash = SequenceStore.hash8(query_seq)
filter = ExactMatchFilter(query_hash, seq_store)
passes = filter.apply(candidate, query_seq)  # False if exact match

SequenceIdentityFilter

Filters by maximum sequence identity.

Parameters:

max_identity (float): Maximum identity threshold (0-1)
denominator (str): Identity denominator ("query", "target", "min", "max", "avg")
seq_store (SequenceStore): Database for sequence lookup
identity_func (callable): Function computing identity

Usage:

def compute_identity(seq1, seq2, denom="query"):
    # Your alignment logic here
    return identity_score

filter = SequenceIdentityFilter(
    max_identity=0.95,
    denominator="query",
    seq_store=seq_store,
    identity_func=compute_identity
)
passes = filter.apply(candidate, query_seq)  # True if identity < 0.95
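
For experimentation, an identity function with the same signature might look like the sketch below. It is only a rough placeholder: difflib.SequenceMatcher stands in for a real sequence aligner, and treating seq1 as the query and seq2 as the target is an assumption, as is the way each denominator option is interpreted.

```python
from difflib import SequenceMatcher


def compute_identity(seq1: str, seq2: str, denom: str = "query") -> float:
    """Rough identity estimate; difflib is a placeholder for a real aligner."""
    matcher = SequenceMatcher(None, seq1, seq2, autojunk=False)
    # Total number of characters in the matching blocks found by difflib.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    lengths = {
        "query": len(seq1),
        "target": len(seq2),
        "min": min(len(seq1), len(seq2)),
        "max": max(len(seq1), len(seq2)),
        "avg": (len(seq1) + len(seq2)) / 2,
    }
    denominator = lengths[denom]  # KeyError on an unknown denominator name
    return matches / denominator if denominator else 0.0


by_query = compute_identity("ACDE", "ACDEFG", "query")
by_target = compute_identity("ACDE", "ACDEFG", "target")
by_avg = compute_identity("ACDE", "ACDEFG", "avg")
```

The choice of denominator matters most when the two sequences differ in length: a short query fully contained in a long target scores 1.0 under "query" or "min" but much lower under "target" or "max".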

Filter Pipeline

Filters are applied sequentially in SearchEngine. The first filter that a candidate fails stops evaluation for that candidate:

Pipeline order:

  1. ValidGidFilter (always first)

  2. L2DistanceFilter or CosineSimFilter (embedding-level)

  3. LengthFilter (metadata-level)

  4. ExactMatchFilter (sequence-level, per-query)

  5. SequenceIdentityFilter (alignment-level, per-query)

This ordering minimizes expensive operations (sequence fetches, alignments).
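
The short-circuit behavior described above can be sketched as follows. The toy filters here are hypothetical and only record when they run; this is not how SearchEngine is implemented, just an illustration of why ordering cheap filters first saves work.

```python
calls = []


def make_filter(name, result):
    """Build a toy filter that records when it runs (illustrative only)."""
    def apply(candidate):
        calls.append(name)
        return result
    return apply


pipeline = [
    make_filter("valid_gid", True),
    make_filter("length", False),            # cheap metadata check fails here
    make_filter("sequence_identity", True),  # expensive; never reached
]


def passes_all(candidate, filters):
    # all() stops at the first falsy result, so later filters are skipped.
    return all(f(candidate) for f in filters)


accepted = passes_all({"gid": 7, "length": 12}, pipeline)
```

After this runs, calls contains only the first two filter names, showing that the alignment-level filter was never invoked.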

Optimization Tips:

  1. Use length_min/length_max to pre-filter via SQL index (much faster than post-filter)

  2. Place cheap filters before expensive ones

  3. Use hash comparison before full sequence comparison

  4. Consider using the overfetch parameter with aggressive filters so that enough candidates survive filtering
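
Tip 3 can be illustrated with a small sketch. The hash8 function below is a hypothetical substitute built on hashlib.blake2b (the actual SequenceStore.hash8 algorithm is not specified here), and fetch_sequence is an assumed callable standing in for a database lookup.

```python
import hashlib


def hash8(seq: str) -> int:
    # Assumption: any stable 8-byte digest works for illustration.
    digest = hashlib.blake2b(seq.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def is_exact_match(query_seq, query_hash, stored_hash, fetch_sequence):
    """Return True only if the stored sequence equals the query."""
    if stored_hash != query_hash:
        return False  # cheap rejection, no sequence fetch needed
    # Hashes can collide, so confirm with a full comparison.
    return fetch_sequence() == query_seq


fetches = []


def fetch(seq):
    """Build a fake lookup that records each fetch (illustrative only)."""
    def _fetch():
        fetches.append(seq)
        return seq
    return _fetch


query = "MKTAYIAK"
query_hash = hash8(query)
same = is_exact_match(query, query_hash, hash8("MKTAYIAK"), fetch("MKTAYIAK"))
different = is_exact_match(query, query_hash, hash8("MKTAYIAR"), fetch("MKTAYIAR"))
```

Only the candidate whose hash matches triggers a sequence fetch; the mismatching candidate is rejected from its stored hash alone.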

Integration with SearchEngine

SearchEngine automatically builds and applies filters based on search parameters:

results = engine.search(
    queries=queries,
    k=100,
    nprobe=128,
    # These parameters create filters internally:
    length_min=50,              # -> LengthFilter
    length_max=500,             # -> LengthFilter
    max_cosine_similarity=0.99, # -> CosineSimFilter
    exclude_exact_sequence=True,# -> ExactMatchFilter (per-query)
    sequence_identity_max=0.95  # -> SequenceIdentityFilter (per-query)
)

See also

  • SearchEngine: Main search interface using these utilities

  • SequenceStore: Database for sequence lookups in filters

  • starling.search: Main search module

Classes

Candidate

Represents a search result candidate.

CandidateFilter

Base class for candidate filters.

CosineSimFilter

Filter by maximum cosine similarity.

ExactMatchFilter

Filter out exact sequence matches.

L2DistanceFilter

Filter by minimum L2 distance.

LengthFilter

Filter by sequence length range.

ScoreConverter

Handles score/similarity conversion for different metrics.

SequenceIdentityFilter

Filter by maximum sequence identity.

ValidGidFilter

Filter out invalid GIDs.