Similarity Search
=================

STARLING pairs its ensemble-aware sequence encoder with FAISS to deliver fast
nearest-neighbour searches over millions of protein sequences. Use this guide to
build a reusable index, embed queries, and apply the filtering options exposed
by :class:`starling.search.SearchEngine`.

Prerequisites
-------------

* Install the ``search-gpu`` extra (``pip install idptools-starling[search-gpu]``)
  when working with CUDA-enabled FAISS; the standard installation already
  includes ``faiss-cpu``.
* Prepare tokenised sequence shards with ``starling-pretokenize`` or your own
  data pipeline. Each shard stores per-residue latent codes used during indexing.
* The first invocation of ``starling-search query`` (or :meth:`SearchEngine.load`
  with ``index_path="default"``) downloads the pre-built UniRef50 reference
  index. The archive is cached under ``~/.starling_search`` so subsequent runs
  reuse the local copy without additional downloads.
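
To check whether the reference index has already been downloaded, you can
inspect the cache directory. This is a minimal sketch that relies only on the
``~/.starling_search`` location stated above; the helper name
``list_cached_files`` is hypothetical, and the file names inside the cache are
not specified in this guide.

.. code-block:: python

   from pathlib import Path


   def list_cached_files(cache_dir="~/.starling_search"):
       """Return (relative path, size in bytes) for each file in the cache."""
       root = Path(cache_dir).expanduser()
       if not root.is_dir():
           return []  # nothing has been cached yet
       return sorted(
           (str(p.relative_to(root)), p.stat().st_size)
           for p in root.rglob("*")
           if p.is_file()
       )


   for name, size in list_cached_files():
       print(f"{name}\t{size / 1e6:.1f} MB")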

Building indexes
----------------

.. code-block:: python

   from starling.search import IndexBuilder

   builder = IndexBuilder(
       root="/data/starling_corpus",
       metric="cosine",
       verbose=True,
       shard_id_regex=r"shard_(\d+)\.h5",
   )

   builder.build_index(
       index_path="/data/indexes/uniref50.faiss",
       tokens_dir="/data/starling_corpus/tokens",
       sample_size=1_000_000,
       nlist=32768,
       m=64,
       nbits=8,
       use_gpu=True,
       gpu_fp16_lut=True,
       compress_sequences=True,
   )

``IndexBuilder`` writes a FAISS index alongside a SQLite-backed
:class:`~starling.search.store.SequenceStore` that retains the original headers,
lengths, and (optionally) sequences for reranking. For a shell-first workflow
use the bundled CLI:

.. code-block:: bash

   starling-search build \
       --root /data/starling_corpus \
       --tokens /data/starling_corpus/tokens \
       --index /data/indexes/uniref50.faiss \
       --sample-size 1000000 \
       --nlist 32768 --m 64 --nbits 8 \
       --use-gpu --gpu-device 0 --opq
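
The ``nlist``, ``m``, and ``nbits`` values drive both index size and training
requirements. The arithmetic below is a rough sketch that assumes the builder
produces a FAISS IVF-PQ index (the parameter names suggest this, but it is an
assumption) and that ``sample_size`` is the number of training vectors. The
function name and the 50 million corpus size are illustrative only.

.. code-block:: python

   def ivfpq_budget(n_vectors, sample_size, nlist, m, nbits):
       """Estimate PQ code storage and training density for an IVF-PQ index."""
       # Each vector is stored as m sub-quantizer codes of nbits bits each.
       bytes_per_code = m * nbits / 8
       return {
           "bytes_per_code": bytes_per_code,
           # Codes only; vector ids and centroid tables are not counted.
           "code_storage_gb": n_vectors * bytes_per_code / 1e9,
           "training_points_per_list": sample_size / nlist,
       }


   # Parameters from the build example; n_vectors is a hypothetical corpus size.
   budget = ivfpq_budget(
       n_vectors=50_000_000, sample_size=1_000_000, nlist=32768, m=64, nbits=8
   )
   print(budget)

Under these assumptions each vector compresses to 64 bytes. FAISS's k-means
trainer warns by default when it sees fewer than 39 training points per
centroid, so it may be worth comparing a larger ``sample_size`` or a smaller
``nlist`` if you see that warning during training.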

Querying from Python
--------------------

.. code-block:: python

   import torch
   from starling.search import SearchEngine
   from starling.inference.generation import sequence_encoder_backend

   index_path = "/data/indexes/uniref50.faiss"
   engine = SearchEngine.load(index_path, metric="cosine", verbose=True)

   sequences = {
       "hnrnpa1": "GGRSGRGGGFGGGGGGGGGY...",
       "synuclein": "MDVFMKGLSKAKEGVVAAAEKTKQGVAE...",
   }

   embeddings = sequence_encoder_backend(
       sequence_dict=sequences,
       device="cuda:0",
       batch_size=64,
       ionic_strength=150,
       aggregate=True,
       output_directory=None,
   )

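   # Stack the per-sequence embeddings, one row per query, as float32.
   queries = torch.stack([embeddings[name] for name in sequences]).float()
   # L2-normalise each row so inner products equal cosine similarities.
   queries = torch.nn.functional.normalize(queries, dim=1)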

   results = engine.search(
       queries=queries,
       k=50,
       nprobe=128,
       return_similarity=True,
       query_sequences=list(sequences.values()),
       exclude_exact=True,
       length_min=50,
       length_max=600,
       max_cosine_similarity=0.95,
       rerank=True,
       rerank_device="cuda:0",
   )

``results`` contains one row per query, and each row is a list of
``(score, gid, header, length)`` tuples. When ``return_similarity`` is ``True``
and the metric is cosine, ``score`` is the cosine similarity; otherwise it is
the raw FAISS distance.
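
The per-query hit lists can be flattened for downstream analysis. The sketch
below uses hand-written placeholder hits in the documented
``(score, gid, header, length)`` shape rather than real search output, and it
assumes rows come back in the same order as the queries. ``flatten_results`` is
an illustrative helper, not part of STARLING.

.. code-block:: python

   def flatten_results(query_names, results):
       """Turn per-query hit lists into one flat list of dict records."""
       records = []
       for name, hits in zip(query_names, results):
           for rank, (score, gid, header, length) in enumerate(hits, start=1):
               records.append(
                   {
                       "query": name,
                       "rank": rank,
                       "score": score,
                       "gid": gid,
                       "header": header,
                       "length": length,
                   }
               )
       return records


   # Placeholder data for illustration only; not real search hits.
   mock_results = [
       [(0.91, 101, "placeholder_hit_a", 320), (0.88, 102, "placeholder_hit_b", 145)],
       [(0.93, 205, "placeholder_hit_c", 140)],
   ]
   records = flatten_results(["hnrnpa1", "synuclein"], mock_results)
   for rec in records:
       print(rec["query"], rec["rank"], rec["header"], f"{rec['score']:.2f}")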

Filtering and reranking
-----------------------

The search engine composes several filter primitives:

* **Length gating** - quickly discard candidates outside a residue range before
  reranking.
* **Exact match suppression** - exclude entries that exactly match the input
  sequence when ``exclude_exact=True``.
* **Identity threshold** - use ``sequence_identity_max`` with
  ``identity_denominator`` (``"query"``, ``"target"``, ``"max"``, ``"min"``, or
  ``"avg"``) to cap sequence identity and choose which length normalises it.
* **Metric clamps** - ``max_cosine_similarity`` or ``min_l2_distance`` filter
  candidates directly on the scores FAISS returns.
* **Reranking** - set ``rerank=True`` to re-embed the top candidates with the
  full encoder for exact scoring. ``rerank_device`` and ``rerank_batch_size``
  manage resources, while ``rerank_ionic_strength`` lets you score under
  different ionic-strength conditions.
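
To illustrate how the choice of ``identity_denominator`` can change the
reported identity, the sketch below normalises a fixed number of matching
residues by each option. It is an illustration under assumptions: how STARLING
aligns sequences and counts matches is not described here, and
``identity_fraction`` is a hypothetical helper.

.. code-block:: python

   def identity_fraction(n_matches, query_len, target_len, denominator="query"):
       """Normalise a count of identical aligned residues by the chosen length."""
       denominators = {
           "query": query_len,
           "target": target_len,
           "max": max(query_len, target_len),
           "min": min(query_len, target_len),
           "avg": (query_len + target_len) / 2,
       }
       if denominator not in denominators:
           raise ValueError(f"unknown denominator: {denominator!r}")
       return n_matches / denominators[denominator]


   # 90 identical residues between a 100-residue query and a 300-residue target.
   fractions = {
       name: identity_fraction(90, 100, 300, name)
       for name in ("query", "target", "max", "min", "avg")
   }
   print(fractions)

In this example a short query contained in a long target scores high under
``"query"`` or ``"min"`` but low under ``"target"`` or ``"max"``, so pick the
denominator that matches what "too similar" means for your analysis.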

Command-line querying
---------------------

.. code-block:: bash

   starling-search query \
       --index /data/indexes/uniref50.faiss \
       --seq MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKM \
       --seq MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKT \
       --k 20 --nprobe 128 \
       --exclude-exact --sequence-identity-max 0.9 \
       --length-min 40 --length-max 800 \
       --rerank --rerank-device cuda:0 \
       --out search_results.csv

Results can be saved as CSV/JSONL and optionally exported to FASTA for
inspection. Passing ``--index default`` uses the pre-built reference index,
downloading it on first use and caching it under ``~/.starling_search``.

Next steps
----------

* :doc:`sequence_encoder` - extract embeddings for downstream analysis or
  custom similarity metrics.
* :doc:`ensemble_generation` - generate new ensembles for promising hits.
* :doc:`possible_issues` - diagnose FAISS installation or GPU-related
  problems.