starling.search.builder.IndexBuilder

class IndexBuilder

Bases: object

Builds FAISS OPQ+IVF-PQ indexes from sharded feature files.

Methods

__init__

build_index

Build an OPQ+IVF-PQ index.

build_sequence_store

Build a SequenceStore SQLite database from tokenized sequence files.

sample_vectors

Randomly sample vectors across all shards for training.

save_index

Save index to disk (converts GPU to CPU if needed).

save_manifest

Save a JSON manifest with index metadata.

__init__(root: str, metric: str = 'cosine', verbose: bool = True, shard_id_regex: str | None = None)

sample_vectors(sample_size: int = 100000, seed: int = 1234) → Tensor

Randomly sample vectors across all shards for training.

Parameters:
  • sample_size (int, optional) – Number of vectors to sample, by default 100_000

  • seed (int, optional) – Random seed for reproducibility, by default 1234

Returns:

Sampled vectors of shape (sample_size, dim)

Return type:

torch.Tensor

Raises:

RuntimeError – If no samples are collected
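
How samples are spread across shards is not documented here. As a minimal sketch, assuming each shard reports its row count, one way to draw a reproducible sample is to pick global row indices with a seeded generator and group them by shard. The `plan_shard_samples` helper and the shard sizes are hypothetical and not part of starling.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def plan_shard_samples(shard_sizes, sample_size=100_000, seed=1234):
    """Pick which rows to read from each shard (hypothetical helper).

    Returns a dict mapping shard position -> sorted list of local row indices.
    """
    total = sum(shard_sizes)
    if total == 0:
        # Analogous to the documented RuntimeError when nothing is collected.
        raise RuntimeError("no samples collected")
    k = min(sample_size, total)
    rng = random.Random(seed)  # local generator, global random state untouched
    # Uniform sample without replacement over the concatenated shards.
    global_rows = sorted(rng.sample(range(total), k))
    # ends[i] is the exclusive global end offset of shard i.
    ends = list(accumulate(shard_sizes))
    plan = {}
    for row in global_rows:
        shard = bisect_right(ends, row)
        start = ends[shard] - shard_sizes[shard]
        plan.setdefault(shard, []).append(row - start)
    return plan
```

Calling the helper twice with the same seed yields the same plan, which is the property the `seed` parameter is meant to provide.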

build_index(index_path: str, sample_size: int = 655360, nlist: int = 16384, m: int = 64, nbits: int = 8, use_gpu: bool = True, add_batch_size: int = 100000, nprobe: int = 16, gpu_device: int = 0, gpu_fp16_lut: bool = True, use_opq: bool = True, tokens_dir: str | None = None, compress_sequences: bool = False) → faiss.Index

Build an OPQ+IVF-PQ index.

Parameters:
  • index_path (str) – Path to save the built index (e.g., “my_index.faiss”)

  • sample_size (int, optional) – Number of samples to use for training, by default 655_360

  • nlist (int, optional) – Number of inverted file (IVF) lists, i.e. coarse partitions of the vector space, by default 16384

  • m (int, optional) – Number of subquantizers, by default 64

  • nbits (int, optional) – Number of bits used to encode each subvector, by default 8

  • use_gpu (bool, optional) – Whether to use GPU for training, by default True

  • add_batch_size (int, optional) – Batch size for adding vectors to the index, by default 100_000

  • nprobe (int, optional) – Number of IVF lists probed per query at search time, by default 16

  • gpu_device (int, optional) – GPU device ID to use (if use_gpu is True), by default 0

  • gpu_fp16_lut (bool, optional) – Whether to use float16 lookup tables (if use_gpu is True), by default True

  • use_opq (bool, optional) – Whether to use Optimized Product Quantization (OPQ), by default True

  • tokens_dir (Optional[str], optional) – Directory containing tokenized sequences, by default None

  • compress_sequences (bool, optional) – Whether to compress sequences, by default False

Returns:

The built FAISS index.

Return type:

faiss.Index

Raises:

RuntimeError – If the index cannot be built.
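
The defaults are easier to reason about with a little arithmetic. The sketch below assumes the parameters map onto a standard FAISS factory string of the form `OPQ{m},IVF{nlist},PQ{m}x{nbits}`. Whether `build_index` calls `faiss.index_factory` internally is not stated here, and the `describe_ivfpq` helper and the 768-dimensional example are hypothetical.

```python
def describe_ivfpq(dim, nlist=16384, m=64, nbits=8, sample_size=655_360, use_opq=True):
    """Summarize an OPQ+IVF-PQ configuration (hypothetical helper)."""
    if dim % m != 0:
        # Product quantization splits each vector into m equal subvectors.
        raise ValueError(f"dim={dim} is not divisible by m={m}")
    prefix = f"OPQ{m}," if use_opq else ""
    return {
        "factory": f"{prefix}IVF{nlist},PQ{m}x{nbits}",
        "subvector_dim": dim // m,
        # Each vector is stored as m codes of nbits bits each.
        "code_bytes": (m * nbits + 7) // 8,
        # Training vectors available per IVF centroid.
        "train_per_list": sample_size / nlist,
    }


info = describe_ivfpq(dim=768)  # 768 is an assumed feature dimension
print(info)
```

With the defaults, each vector compresses to 64 bytes of PQ codes, and the 655,360 training samples give 40 vectors per IVF list. That is just above the 39 points per centroid below which FAISS k-means emits a warning, which may explain the default `sample_size`.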

save_index(index: faiss.Index, index_path: str) → None

Save index to disk (converts GPU to CPU if needed).

save_manifest(path: str, *, nlist: int, m: int, nbits: int, sample_size: int) → None

Save a JSON manifest with index metadata.
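
The manifest schema is not documented here. A minimal sketch, assuming the manifest simply records the keyword-only build parameters as a flat JSON object. `write_manifest` is a hypothetical stand-in, not the library's implementation.

```python
import json
from pathlib import Path


def write_manifest(path, *, nlist, m, nbits, sample_size):
    """Write build parameters to a JSON file (hypothetical stand-in)."""
    manifest = {"nlist": nlist, "m": m, "nbits": nbits, "sample_size": sample_size}
    Path(path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Keeping these values next to the index file makes it possible to check later which training settings produced a given index.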

build_sequence_store(db_path: str, tokens_dir: str, compress: bool = False, batch: int = 50000) → None

Build a SequenceStore SQLite database from tokenized sequence files.

Parameters:
  • db_path (str) – The path to the SQLite database file to create.

  • tokens_dir (str) – The directory containing the tokenized sequence files.

  • compress (bool, optional) – Whether to compress the stored sequences, by default False.

  • batch (int, optional) – The number of rows to insert in each batch, by default 50_000.

Raises:
  • FileNotFoundError – If a tokenized sequence file is missing.

  • ValueError – If there is a length mismatch between tokens and features.
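
The actual SequenceStore schema is not shown here. As a rough sketch of batched inserts with optional compression, assuming a single hypothetical `sequences(id, tokens)` table and zlib as the compressor:

```python
import sqlite3
import zlib
from itertools import islice


def store_sequences(db_path, records, compress=False, batch=50_000):
    """Insert (seq_id, token_bytes) pairs in batches (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sequences (id TEXT PRIMARY KEY, tokens BLOB)"
        )
        it = iter(records)
        total = 0
        while True:
            # Pull at most `batch` records so memory use stays bounded.
            chunk = list(islice(it, batch))
            if not chunk:
                break
            rows = [
                (seq_id, zlib.compress(tokens) if compress else tokens)
                for seq_id, tokens in chunk
            ]
            # One transaction per batch: commits on success, rolls back on error.
            with conn:
                conn.executemany("INSERT INTO sequences VALUES (?, ?)", rows)
            total += len(rows)
        return total
    finally:
        conn.close()
```

Larger values of `batch` reduce the number of commits at the cost of holding more rows in memory at once.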