starling.search.builder.IndexBuilder

class IndexBuilder

Bases: object

Builds FAISS OPQ+IVF-PQ indexes from sharded feature files.

Methods

`__init__`
`build_index`	Build an OPQ+IVF-PQ index.
`build_sequence_store`	Build a SequenceStore SQLite database from tokenized sequence files.
`sample_vectors`	Randomly sample vectors across all shards for training.
`save_index`	Save index to disk (converts GPU to CPU if needed).
`save_manifest`	Save a JSON manifest with index metadata.

__init__(root: str, metric: str = 'cosine', verbose: bool = True, shard_id_regex: str | None = None)
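
As a rough orientation, the documented signatures suggest a workflow along these lines. This is only a sketch: the wrapper name `build_default_index` and its arguments are hypothetical, the import path is inferred from the page title, and the page does not state whether `build_index` already writes a manifest itself.

```python
def build_default_index(root: str, index_path: str, manifest_path: str):
    """Build an index with the documented defaults and record its settings."""
    # Imported lazily so this sketch can be loaded without starling installed.
    from starling.search.builder import IndexBuilder

    builder = IndexBuilder(root=root, metric="cosine", verbose=True)

    # These values repeat the documented defaults of build_index.
    params = dict(sample_size=655_360, nlist=16_384, m=64, nbits=8)

    # use_gpu=False is chosen here only so the sketch does not assume a GPU.
    index = builder.build_index(index_path=index_path, use_gpu=False, **params)

    # save_manifest takes its metadata as keyword-only arguments.
    builder.save_manifest(manifest_path, **params)
    return index
```

Passing the same `params` dictionary to both calls keeps the manifest in step with the settings actually used for the build.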

sample_vectors(sample_size: int = 100000, seed: int = 1234) → Tensor

Randomly sample vectors across all shards for training.

Parameters:

sample_size (int, optional) – Number of vectors to sample, by default 100_000
seed (int, optional) – Random seed for reproducibility, by default 1234

Returns:

Sampled vectors of shape (sample_size, dim)

Return type:

torch.Tensor

Raises:

RuntimeError – If no samples are collected
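
One way to draw a uniform, reproducible sample across shards is to sample global row numbers and map each back to a shard and a local offset. The sketch below illustrates that idea with the standard library only; it is an assumption about the general technique, not the library's actual implementation, and `sample_rows_across_shards` and `shard_sizes` are hypothetical names.

```python
import bisect
import random
from itertools import accumulate


def sample_rows_across_shards(shard_sizes, sample_size=100_000, seed=1234):
    """Return {shard_index: sorted local row offsets} for a uniform sample."""
    total = sum(shard_sizes)
    if total == 0:
        raise RuntimeError("no samples collected: all shards are empty")

    # Never ask for more rows than exist.
    k = min(sample_size, total)
    rng = random.Random(seed)
    global_rows = rng.sample(range(total), k)  # without replacement

    # ends[i] is the exclusive global end row of shard i.
    ends = list(accumulate(shard_sizes))
    picks = {}
    for row in global_rows:
        shard = bisect.bisect_right(ends, row)
        start = ends[shard] - shard_sizes[shard]
        picks.setdefault(shard, []).append(row - start)
    return {s: sorted(rows) for s, rows in sorted(picks.items())}
```

Sampling global row numbers means larger shards contribute proportionally more rows, and sorting the offsets within each shard allows a single sequential read per shard.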

build_index(index_path: str, sample_size: int = 655360, nlist: int = 16384, m: int = 64, nbits: int = 8, use_gpu: bool = True, add_batch_size: int = 100000, nprobe: int = 16, gpu_device: int = 0, gpu_fp16_lut: bool = True, use_opq: bool = True, tokens_dir: str | None = None, compress_sequences: bool = False) → faiss.Index

Build an OPQ+IVF-PQ index.

Parameters:

index_path (str) – Path to save the built index (e.g., “my_index.faiss”)
sample_size (int, optional) – Number of samples to use for training, by default 655_360
nlist (int, optional) – Number of inverted file (IVF) lists, i.e. coarse partitions of the vector space, by default 16384
m (int, optional) – Number of subquantizers, by default 64
nbits (int, optional) – Number of bits used to encode each subvector, by default 8
use_gpu (bool, optional) – Whether to use GPU for training, by default True
add_batch_size (int, optional) – Batch size for adding vectors to the index, by default 100_000
nprobe (int, optional) – Number of inverted lists to probe per query at search time, by default 16
gpu_device (int, optional) – GPU device ID to use (if use_gpu is True), by default 0
gpu_fp16_lut (bool, optional) – Whether to use float16 lookup tables (if use_gpu is True), by default True
use_opq (bool, optional) – Whether to use Optimized Product Quantization (OPQ), by default True
tokens_dir (Optional[str], optional) – Directory containing tokenized sequences, by default None
compress_sequences (bool, optional) – Whether to compress sequences, by default False

Returns:

The built FAISS index.

Return type:

faiss.Index

Raises:

RuntimeError – If the index cannot be built.
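
To see how `nlist`, `m`, `nbits` and `use_opq` relate to a FAISS index description, the sketch below maps them to a FAISS index-factory string and derives two quantities that are useful when choosing values. Whether `build_index` uses the index factory internally is not stated here; the helper names are hypothetical.

```python
def ivfpq_factory_string(nlist=16_384, m=64, nbits=8, use_opq=True):
    """Compose a FAISS index-factory description for (OPQ+)IVF-PQ."""
    parts = []
    if use_opq:
        parts.append(f"OPQ{m}")      # learned rotation applied before PQ
    parts.append(f"IVF{nlist}")      # coarse quantizer with nlist lists
    parts.append(f"PQ{m}x{nbits}")   # m subquantizers, nbits bits each
    return ",".join(parts)


def pq_code_bytes(m=64, nbits=8):
    """Bytes needed to store one PQ-encoded vector."""
    return (m * nbits + 7) // 8


def training_points_per_list(sample_size=655_360, nlist=16_384):
    """Average number of training vectors available per inverted list."""
    return sample_size / nlist


# With faiss installed, the string could be used as follows
# (dim must be divisible by m):
#   index = faiss.index_factory(dim, ivfpq_factory_string())
```

With the documented defaults, each vector is stored as a 64-byte code and training sees 40 vectors per inverted list. FAISS's k-means warns when fewer than 39 training points per centroid are supplied, so lowering `sample_size` or raising `nlist` from the defaults is likely to trigger that warning.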

save_index(index: faiss.Index, index_path: str) → None

Save index to disk (converts GPU to CPU if needed).

save_manifest(path: str, *, nlist: int, m: int, nbits: int, sample_size: int) → None

Save a JSON manifest with index metadata.
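
A manifest of this kind can be as simple as the keyword arguments serialized to JSON. The sketch below assumes exactly that. The real manifest may contain additional fields (for example the metric or vector count), and `write_manifest` is a hypothetical stand-in rather than the library's function.

```python
import json


def write_manifest(path, *, nlist, m, nbits, sample_size, **extra):
    """Write index build settings to a JSON file and return them."""
    manifest = {
        "nlist": nlist,
        "m": m,
        "nbits": nbits,
        "sample_size": sample_size,
        **extra,  # any additional metadata the caller wants to record
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
    return manifest
```

Keyword-only arguments, as in the documented signature, prevent `nlist`, `m` and `nbits` from being swapped by position.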

build_sequence_store(db_path: str, tokens_dir: str, compress: bool = False, batch: int = 50000) → None

Build a SequenceStore SQLite database from tokenized sequence files.

Parameters:

db_path (str) – The path to the SQLite database file to create.
tokens_dir (str) – The directory containing the tokenized sequence files.
compress (bool, optional) – Whether to compress the stored sequences, by default False.
batch (int, optional) – The number of rows to insert in each batch, by default 50_000.

Raises:

FileNotFoundError – If a tokenized sequence file is missing.
ValueError – If there is a length mismatch between tokens and features.
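
The `compress` and `batch` parameters suggest a batched insert loop with optional per-row compression. The sketch below shows that pattern with `sqlite3` and `zlib`; the table schema, the use of zlib, the string-valued sequences and the function name `store_sequences` are all assumptions made for illustration, not the actual SequenceStore layout.

```python
import sqlite3
import zlib
from itertools import islice


def store_sequences(conn, rows, compress=False, batch=50_000):
    """Insert (id, sequence) pairs in batches; return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(id INTEGER PRIMARY KEY, seq BLOB NOT NULL)"
    )
    it = iter(rows)
    total = 0
    while True:
        chunk = list(islice(it, batch))
        if not chunk:
            break
        payload = [
            (i, zlib.compress(s.encode()) if compress else s.encode())
            for i, s in chunk
        ]
        # One transaction per batch keeps memory bounded and commits fast.
        with conn:
            conn.executemany(
                "INSERT INTO sequences (id, seq) VALUES (?, ?)", payload
            )
        total += len(payload)
    return total
```

Larger batches mean fewer transactions and faster builds, at the cost of holding more rows in memory at once.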