starling.search.builder

FAISS Index Builder

Build high-performance FAISS indexes with OPQ+IVF-PQ compression and integrated SQLite sequence metadata storage.

Overview

The IndexBuilder creates FAISS indexes optimized for billion-scale similarity search:

  • OPQ (Optimized Product Quantization): Learned rotation for better quantization

  • IVF (Inverted File): Clustering for fast approximate search

  • PQ (Product Quantization): Compression to ~64-128 bytes per vector

  • Sequence Store: SQLite database with sequences, headers, and metadata
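For orientation, the three FAISS components above map onto a standard FAISS index-factory description string. The sketch below only assembles that string. The helper name factory_string is hypothetical, and whether IndexBuilder calls faiss.index_factory internally is an assumption, not something this page states.

```python
def factory_string(m: int = 64, nlist: int = 16384, nbits: int = 8) -> str:
    """Assemble a FAISS index-factory description for OPQ + IVF + PQ.

    Hypothetical helper for illustration; not part of starling.
    """
    # OPQ{m}:         learned rotation aligned to m subspaces
    # IVF{nlist}:     coarse quantizer with nlist clusters
    # PQ{m}x{nbits}:  m subquantizers, nbits bits each
    return f"OPQ{m},IVF{nlist},PQ{m}x{nbits}"


spec = factory_string()
print(spec)  # OPQ64,IVF16384,PQ64x8
```

With FAISS installed, a string of this form can be passed to faiss.index_factory(d, spec), where d is the vector dimension.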

Basic Usage

>>> from starling.search import IndexBuilder
>>>
>>> # Initialize and discover shards
>>> builder = IndexBuilder(
...     root="/path/to/shards",
...     metric="cosine",
...     verbose=True
... )
>>>
>>> # Build index with sequence store
>>> builder.build_index(
...     index_path="my_index.faiss",
...     tokens_dir="/path/to/tokens",
...     sample_size=655360,     # Training samples
...     nlist=16384,            # IVF clusters
...     m=64,                   # PQ subquantizers
...     nbits=8,                # Bits per subquantizer
...     use_gpu=True,           # GPU acceleration
...     use_opq=True,           # Enable OPQ
...     compress_sequences=True # Compress sequences with zstd
... )

Index Configuration

Training Parameters:

  • sample_size: Number of vectors for training (default: 655,360)

    • Larger = better quantization quality

    • Minimum: 39 * nlist vectors required

    • Recommended: 100-1000 vectors per IVF cluster

  • nlist: Number of IVF clusters (default: 16,384)

    • More clusters = faster search but lower recall when the same number of clusters is probed

    • Rule of thumb: sqrt(N) to N/1000 where N is total vectors

    • Auto-adjusted if there are too few training samples

  • m: PQ subquantizers (default: 64)

    • Must divide vector dimension evenly

    • More = better quality but larger memory footprint

    • Typical: 32-128 depending on dimension

  • nbits: Bits per subquantizer (default: 8)

    • Controls codebook size: 2^nbits entries per subquantizer

    • 8 bits = 256 entries (standard)

    • 16 bits = 65536 entries (higher quality, more memory)
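The rules in the list above can be checked with a few lines of arithmetic before starting a long build. This is a minimal sketch, not the builder's own validation: check_build_params is a hypothetical helper, the dimension of 512 is an assumed example value, and the fallback that shrinks nlist is an assumed policy (the page only says nlist is auto-adjusted).

```python
def check_build_params(dim, sample_size=655_360, nlist=16_384, m=64, nbits=8):
    """Sanity-check build parameters against the rules listed above.

    Hypothetical helper for illustration; not part of starling.
    """
    if dim % m != 0:
        raise ValueError(f"m={m} must divide the vector dimension {dim}")

    # Each vector is stored as m codes of nbits bits each.
    code_bytes = (m * nbits + 7) // 8
    # Each subquantizer has a codebook of 2^nbits entries.
    codebook_entries = 2 ** nbits

    # Training needs at least 39 vectors per IVF cluster.
    min_samples = 39 * nlist
    if sample_size < min_samples:
        # Assumed fallback: shrink nlist until the minimum is met.
        nlist = max(1, sample_size // 39)

    return {
        "code_bytes": code_bytes,
        "codebook_entries": codebook_entries,
        "min_samples": min_samples,
        "nlist": nlist,
    }


params = check_build_params(dim=512)  # dim=512 is an assumed example
print(params)
```

With the defaults, each vector compresses to 64 bytes and the default sample_size of 655,360 sits just above the 638,976-vector minimum, so nlist is left unchanged.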

GPU Options:

  • use_gpu: Train on GPU

  • gpu_device: CUDA device ID

  • gpu_fp16_lut: Use float16 lookup tables (helps the tables fit within the 48 KB shared-memory limit)
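The shared-memory note can be made concrete with a little arithmetic. The sketch assumes the GPU search kernel keeps one distance lookup table of m × 2^nbits entries in shared memory. That is how the constraint is commonly described for FAISS GPU IVF-PQ, but the exact layout is an assumption here, and lut_bytes is a hypothetical helper.

```python
SMEM_LIMIT_BYTES = 48 * 1024  # 48 KB shared-memory limit


def lut_bytes(m: int, nbits: int, fp16: bool) -> int:
    """Size of one PQ lookup table: m sub-tables of 2^nbits distances.

    Hypothetical helper for illustration; not part of starling.
    """
    entry_bytes = 2 if fp16 else 4  # float16 vs. float32 entries
    return m * (2 ** nbits) * entry_bytes


for fp16 in (False, True):
    size = lut_bytes(m=64, nbits=8, fp16=fp16)
    print(f"fp16={fp16}: {size} bytes, fits={size <= SMEM_LIMIT_BYTES}")
```

Under this assumption, the default m=64, nbits=8 configuration needs 65,536 bytes with float32 entries, which exceeds the limit, and 32,768 bytes with float16 entries, which fits.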

Sequence Store:

  • tokens_dir: Directory with tokenized sequences (.tokens.pt files)

  • compress_sequences: Use zstd compression

File Structure

The builder expects sharded feature files in this structure:

root/
├── uniref50_idrs_only_000000/
│   └── sequence_features.pt
├── uniref50_idrs_only_000001/
│   └── sequence_features.pt
└── ...
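To check a directory against this layout before building, a short sketch using only the standard library is shown here. discover_shards is a hypothetical helper, and the builder's own discovery logic may differ (for example, in how it orders or filters shards).

```python
import tempfile
from pathlib import Path


def discover_shards(root):
    """Return sorted sequence_features.pt paths one level below root.

    Hypothetical helper for illustration; not part of starling.
    """
    return sorted(Path(root).glob("*/sequence_features.pt"))


# Demo on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as root:
    for i in range(2):
        shard = Path(root) / f"uniref50_idrs_only_{i:06d}"
        shard.mkdir()
        (shard / "sequence_features.pt").touch()
    shard_names = [p.parent.name for p in discover_shards(root)]

print(shard_names)
```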

And produces these outputs:

my_index.faiss                 # FAISS index
my_index.faiss.manifest.json   # Metadata
my_index.faiss.seqs.sqlite     # Sequence store (if tokens_dir provided)
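The companion file names follow from index_path by appending a suffix. A small sketch of that naming pattern is given below; output_paths is a hypothetical helper, and the suffixes are taken from the listing above.

```python
from pathlib import Path


def output_paths(index_path, with_sequences=True):
    """Derive the companion file paths for a given index path.

    Hypothetical helper for illustration; not part of starling.
    """
    index = Path(index_path)
    paths = {
        "index": index,
        "manifest": Path(f"{index}.manifest.json"),
    }
    # The sequence store is only written when tokens_dir is provided.
    if with_sequences:
        paths["sequences"] = Path(f"{index}.seqs.sqlite")
    return paths


paths = output_paths("my_index.faiss")
for name, path in paths.items():
    print(f"{name}: {path}")
```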

See also

  • SearchEngine: Query the built index

  • SequenceStore: Direct database access

  • starling.search.cli: Command-line interface

Notes

  • Use verbose=True for progress tracking during long builds

  • GPU training is dramatically faster, but index quality is identical

Classes

IndexBuilder

Builds FAISS OPQ+IVF-PQ indexes from sharded feature files.