starling.search.builder
FAISS Index Builder
Build high-performance FAISS indexes with OPQ+IVF-PQ compression and integrated SQLite sequence metadata storage.
Overview
The IndexBuilder creates FAISS indexes optimized for billion-scale similarity search:
OPQ (Optimized Product Quantization): Learned rotation for better quantization
IVF (Inverted File): Clustering for fast approximate search
PQ (Product Quantization): Compression to ~64-128 bytes per vector
Sequence Store: SQLite database with sequences, headers, and metadata
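The first three components above map onto a FAISS index factory description. The sketch below only composes that string. The helper name factory_string is hypothetical, and whether IndexBuilder calls faiss.index_factory internally is an assumption, not something this page states.

```python
def factory_string(m: int = 64, nlist: int = 16384, nbits: int = 8,
                   use_opq: bool = True) -> str:
    """Compose a FAISS index_factory description for (OPQ+)IVF-PQ.

    Hypothetical helper for illustration; not part of starling.search.
    """
    parts = []
    if use_opq:
        parts.append(f"OPQ{m}")      # learned rotation matched to m sub-vectors
    parts.append(f"IVF{nlist}")      # inverted file with nlist clusters
    parts.append(f"PQ{m}x{nbits}")   # m subquantizers, nbits bits each
    return ",".join(parts)


spec = factory_string()
print(spec)  # OPQ64,IVF16384,PQ64x8
# With FAISS installed, an untrained index could then be created with:
#   index = faiss.index_factory(dim, spec)
```

The defaults mirror the values used in Basic Usage, so the printed string describes the same OPQ+IVF-PQ layout.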
Basic Usage
>>> from starling.search import IndexBuilder
>>>
>>> # Initialize and discover shards
>>> builder = IndexBuilder(
...     root="/path/to/shards",
...     metric="cosine",
...     verbose=True
... )
>>>
>>> # Build index with sequence store
>>> builder.build_index(
...     index_path="my_index.faiss",
...     tokens_dir="/path/to/tokens",
...     sample_size=655360,       # Training samples
...     nlist=16384,              # IVF clusters
...     m=64,                     # PQ subquantizers
...     nbits=8,                  # Bits per subquantizer
...     use_gpu=True,             # GPU acceleration
...     use_opq=True,             # Enable OPQ
...     compress_sequences=True   # Compress sequences with zstd
... )
Index Configuration
Training Parameters:
sample_size: Number of vectors for training (default: 655,360)
  Larger = better quantization quality
  Minimum: 39 * nlist vectors required
  Recommended: 100-1000 vectors per IVF cluster
nlist: Number of IVF clusters (default: 16,384)
  More clusters = faster search but lower recall
  Rule of thumb: sqrt(N) to N/1000 where N is total vectors
  Auto-adjusted if training samples insufficient
m: PQ subquantizers (default: 64)
  Must divide vector dimension evenly
  More = better quality but larger memory footprint
  Typical: 32-128 depending on dimension
nbits: Bits per subquantizer (default: 8)
  Controls codebook size: 2^nbits entries per subquantizer
  8 bits = 256 entries (standard)
  16 bits = 65536 entries (higher quality, more memory)
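The constraints listed above reduce to a few lines of arithmetic. The following is a minimal sketch, not the builder's own validation: check_training_params is a hypothetical helper, and the dimension of 1024 and the total of 100 million vectors are illustrative values only.

```python
import math

MIN_POINTS_PER_CLUSTER = 39  # minimum stated above: 39 * nlist training vectors


def check_training_params(dim, n_total, sample_size=655_360,
                          nlist=16_384, m=64, nbits=8):
    """Summarize the training constraints (hypothetical helper)."""
    if dim % m != 0:
        raise ValueError(f"m={m} must divide the vector dimension {dim} evenly")
    return {
        # True when the 39 * nlist minimum is met
        "enough_samples": sample_size >= MIN_POINTS_PER_CLUSTER * nlist,
        # Largest nlist that the sample would still support
        "max_nlist_for_samples": sample_size // MIN_POINTS_PER_CLUSTER,
        "samples_per_cluster": sample_size / nlist,
        # PQ code size: m subquantizers of nbits bits each
        "code_bytes_per_vector": math.ceil(m * nbits / 8),
        "codebook_entries": 2 ** nbits,
        # Rule of thumb: sqrt(N) to N/1000
        "nlist_range": (math.isqrt(n_total), n_total // 1000),
    }


info = check_training_params(dim=1024, n_total=100_000_000)
for key, value in info.items():
    print(f"{key}: {value}")
```

With the defaults, the sample gives 40 vectors per cluster. That clears the 39 * nlist minimum but sits below the recommended 100-1000 per cluster.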
GPU Options:
use_gpu: Train on GPU
gpu_device: CUDA device ID
gpu_fp16_lut: Use float16 lookup tables (to address 48KB SMEM limit)
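The arithmetic behind the float16 lookup-table option can be sketched as follows. It assumes the table in question holds one distance per codebook entry for each of the m subquantizers, which this page does not spell out. lut_bytes is a hypothetical helper.

```python
SMEM_LIMIT_BYTES = 48 * 1024  # the 48KB shared-memory limit mentioned above


def lut_bytes(m, nbits=8, fp16=False):
    """Bytes for one lookup table: m sub-tables of 2**nbits distances each."""
    bytes_per_entry = 2 if fp16 else 4  # float16 vs float32
    return m * (2 ** nbits) * bytes_per_entry


for fp16 in (False, True):
    size = lut_bytes(m=64, nbits=8, fp16=fp16)
    print(f"fp16={fp16}: {size} bytes, fits={size <= SMEM_LIMIT_BYTES}")
# fp16=False: 65536 bytes, fits=False
# fp16=True: 32768 bytes, fits=True
```

Under that assumption, the default m=64 with nbits=8 exceeds 48KB in float32 and fits once the table is stored in float16.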
Sequence Store:
tokens_dir: Directory with tokenized sequences (.tokens.pt files)
compress_sequences: Use zstd compression
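To illustrate the idea of a compressed sequence store, here is a minimal standard-library sketch. Everything in it is an assumption: the table layout is hypothetical (the real .seqs.sqlite schema is not documented on this page), and zlib stands in for zstd so the example runs without extra packages.

```python
import sqlite3
import zlib  # stand-in for zstd, which the builder is documented to use


def open_store(path=":memory:"):
    """Create a hypothetical sequence table; NOT the real starling schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seqs "
        "(id INTEGER PRIMARY KEY, header TEXT, seq BLOB)"
    )
    return conn


def put(conn, row_id, header, sequence, compress=True):
    """Store one sequence, optionally compressed."""
    payload = sequence.encode("ascii")
    if compress:
        payload = zlib.compress(payload)
    conn.execute("INSERT INTO seqs VALUES (?, ?, ?)", (row_id, header, payload))


def get(conn, row_id, compressed=True):
    """Fetch (header, sequence) for the given row id."""
    header, blob = conn.execute(
        "SELECT header, seq FROM seqs WHERE id = ?", (row_id,)
    ).fetchone()
    data = zlib.decompress(blob) if compressed else blob
    return header, data.decode("ascii")


conn = open_store()
put(conn, 0, "example_header", "MKVLAAGIVGSS")
print(get(conn, 0))
```

Keying rows by an integer id makes it possible to map a vector position returned by a search back to its sequence and header, which is the role the sequence store plays alongside the index.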
File Structure
The builder expects sharded feature files in this structure:
root/
├── uniref50_idrs_only_000000/
│ └── sequence_features.pt
├── uniref50_idrs_only_000001/
│ └── sequence_features.pt
└── ...
And produces these outputs:
my_index.faiss # FAISS index
my_index.faiss.manifest.json # Metadata
my_index.faiss.seqs.sqlite # Sequence store (if tokens_dir provided)
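The layout above can be checked before a long build. The sketch below is one way to list the shard files and the output paths to expect. Both helpers are hypothetical, and the builder's own discovery logic may differ.

```python
from pathlib import Path


def discover_shards(root):
    """Return sorted paths to sequence_features.pt, one directory below root."""
    return sorted(Path(root).glob("*/sequence_features.pt"))


def expected_outputs(index_path, with_sequences=True):
    """List the files a build is documented to produce for index_path."""
    outputs = [index_path, index_path + ".manifest.json"]
    if with_sequences:  # the sequence store is written only if tokens_dir is given
        outputs.append(index_path + ".seqs.sqlite")
    return outputs


print(expected_outputs("my_index.faiss"))
```

Calling discover_shards on the shard root and checking that the result is non-empty is a cheap sanity check to run first.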
See also
SearchEngine: Query the built index
SequenceStore: Direct database access
starling.search.cli: Command-line interface
Notes
Use verbose=True for progress tracking during long builds.
GPU training is dramatically faster, but index quality is identical.
Classes
IndexBuilder: Builds FAISS OPQ+IVF-PQ indexes from sharded feature files.