Similarity Search
STARLING pairs its ensemble-aware sequence encoder with FAISS to deliver fast
nearest-neighbour searches over millions of protein sequences. Use this guide to
build a reusable index, embed queries, and interpret the rich filtering options
exposed by starling.search.SearchEngine.
Prerequisites
Install the
search-gpuextra (pip install idptools-starling[search-gpu]) when working with CUDA-enabled FAISS; the standard installation already includesfaiss-cpu.Prepare tokenised sequence shards with
starling-pretokenizeor your own data pipeline. Each shard stores per-residue latent codes used during indexing.The first invocation of
starling-search query(orSearchEngine.load()withindex_path="default") downloads the pre-built UniRef50 reference index. The archive is cached under~/.starling_searchso subsequent runs reuse the local copy without additional downloads.
Building indexes
from starling.search import IndexBuilder
builder = IndexBuilder(
root="/data/starling_corpus",
metric="cosine",
verbose=True,
shard_id_regex=r"shard_(\d+)\.h5",
)
builder.build_index(
index_path="/data/indexes/uniref50.faiss",
tokens_dir="/data/starling_corpus/tokens",
sample_size=1_000_000,
nlist=32768,
m=64,
nbits=8,
use_gpu=True,
gpu_fp16_lut=True,
compress_sequences=True,
)
IndexBuilder writes a FAISS index alongside a SQLite-backed
SequenceStore that retains the original headers,
lengths, and (optionally) sequences for reranking. For a shell-first workflow
use the bundled CLI:
starling-search build \
--root /data/starling_corpus \
--tokens /data/starling_corpus/tokens \
--index /data/indexes/uniref50.faiss \
--sample-size 1000000 \
--nlist 32768 --m 64 --nbits 8 \
--use-gpu --gpu-device 0 --opq
Querying from Python
import torch
from starling.search import SearchEngine
from starling.inference.generation import sequence_encoder_backend
index_path = "/data/indexes/uniref50.faiss"
engine = SearchEngine.load(index_path, metric="cosine", verbose=True)
sequences = {
"hnrnpa1": "GGRSGRGGGFGGGGGGGGGY...",
"synuclein": "MDVFMKGLSKAKEGVVAAAEKTKQGVAE...",
}
embeddings = sequence_encoder_backend(
sequence_dict=sequences,
device="cuda:0",
batch_size=64,
ionic_strength=150,
aggregate=True,
output_directory=None,
)
queries = torch.stack([embeddings[name] for name in sequences]).float()
queries = torch.nn.functional.normalize(queries, dim=1)
results = engine.search(
queries=queries,
k=50,
nprobe=128,
return_similarity=True,
query_sequences=list(sequences.values()),
exclude_exact=True,
length_min=50,
length_max=600,
max_cosine_similarity=0.95,
rerank=True,
rerank_device="cuda:0",
)
Each row in results is a list of tuples
(score, gid, header, length). When return_similarity is True and the
metric is cosine, score is the similarity value; otherwise it contains the
raw FAISS distance.
Filtering and reranking
The search engine composes several filter primitives:
Length gating - quickly discard candidates outside a residue range before reranking.
Exact match suppression - exclude entries that exactly match the input sequence when
exclude_exact=True.Identity threshold - use
sequence_identity_maxwithidentity_denominator("query","target","max","min", or"avg") to control how similarity is computed.Metric clamps -
max_cosine_similarityormin_l2_distanceallow coarse filtering straight out of FAISS.Reranking - set
rerank=Trueto re-embed the top candidates with the full encoder for exact scoring.rerank_deviceandrerank_batch_sizemanage resources, whilererank_ionic_strengthlets you score under different ionic-strength conditions.
Command-line querying
starling-search query \
--index /data/indexes/uniref50.faiss \
--seq MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKM \
--seq MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKT \
--k 20 --nprobe 128 \
--exclude-exact --sequence-identity-max 0.9 \
--length-min 40 --length-max 800 \
--rerank --rerank-device cuda:0 \
--out search_results.csv
Results can be saved as CSV/JSONL and optionally exported to FASTA for
inspection. Passing --index default downloads the reference STARLING index
if available and caches it locally.
Next steps
Sequence Embeddings - extract embeddings for downstream analysis or custom similarity metrics.
Ensemble Generation - generate new ensembles for promising hits.
Possible Issues and Solutions - diagnose FAISS installation or GPU related problems.