starling.search.builder.IndexBuilder
- class IndexBuilder
Bases: object
Builds FAISS OPQ+IVF-PQ indexes from sharded feature files.
Methods
build_index – Build an OPQ+IVF-PQ index.
build_sequence_store – Build a SequenceStore SQLite database from tokenized sequence files.
sample_vectors – Randomly sample vectors across all shards for training.
save_index – Save index to disk (converts GPU to CPU if needed).
save_manifest – Save a JSON manifest with index metadata.
- __init__(root: str, metric: str = 'cosine', verbose: bool = True, shard_id_regex: str | None = None)
- sample_vectors(sample_size: int = 100000, seed: int = 1234) → Tensor
Randomly sample vectors across all shards for training.
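- Parameters:
sample_size (int, optional) – Number of vectors to sample, by default 100_000
seed (int, optional) – Seed for the random number generator, by default 1234
- Returns:
Sampled vectors of shape (sample_size, dim)
- Return type:
Tensor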
- Raises:
RuntimeError – If no samples are collected
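The seeded, cross-shard sampling described above can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the function name `sample_across_shards`, the in-memory list-of-lists shard layout, and the choice to return fewer rows when the shards hold fewer than `sample_size` vectors are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def sample_across_shards(
    shards: Sequence[Sequence[List[float]]],
    sample_size: int = 100_000,
    seed: int = 1234,
) -> List[List[float]]:
    """Draw up to sample_size vectors uniformly at random from all shards.

    Hypothetical helper: each shard is an in-memory sequence of vectors.
    """
    total = sum(len(shard) for shard in shards)
    if total == 0:
        # Mirrors the documented behaviour of raising when nothing is collected.
        raise RuntimeError("no samples collected")

    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    k = min(sample_size, total)
    # Sample global row indices without replacement, then sort them so each
    # shard is visited once, in order.
    picks = sorted(rng.sample(range(total), k))

    out: List[List[float]] = []
    offset = 0  # global index of the first row in the current shard
    pos = 0     # next unread entry in picks
    for shard in shards:
        end = offset + len(shard)
        while pos < k and picks[pos] < end:
            out.append(shard[picks[pos] - offset])
            pos += 1
        offset = end
    return out
```

Sampling global indices rather than a fixed quota per shard keeps the draw uniform over vectors even when shard sizes differ.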
- build_index(index_path: str, sample_size: int = 655360, nlist: int = 16384, m: int = 64, nbits: int = 8, use_gpu: bool = True, add_batch_size: int = 100000, nprobe: int = 16, gpu_device: int = 0, gpu_fp16_lut: bool = True, use_opq: bool = True, tokens_dir: str | None = None, compress_sequences: bool = False) → faiss.Index
Build an OPQ+IVF-PQ index.
- Parameters:
index_path (str) – Path to save the built index (e.g., “my_index.faiss”)
sample_size (int, optional) – Number of samples to use for training, by default 655_360
nlist (int, optional) – Number of inverted file (IVF) lists, i.e. coarse partitions, by default 16384
m (int, optional) – Number of subquantizers, by default 64
nbits (int, optional) – Number of bits per subvector, by default 8
use_gpu (bool, optional) – Whether to use GPU for training, by default True
add_batch_size (int, optional) – Batch size for adding vectors to the index, by default 100_000
nprobe (int, optional) – Number of IVF lists probed per query at search time, by default 16
gpu_device (int, optional) – GPU device ID to use (if use_gpu is True), by default 0
gpu_fp16_lut (bool, optional) – Whether to use float16 lookup tables (if use_gpu is True), by default True
use_opq (bool, optional) – Whether to use Optimized Product Quantization (OPQ), by default True
tokens_dir (Optional[str], optional) – Directory containing tokenized sequences, by default None
compress_sequences (bool, optional) – Whether to compress sequences, by default False
- Returns:
The built FAISS index.
- Return type:
faiss.Index
- Raises:
RuntimeError – If the index cannot be built.
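To make the default parameters concrete, the sketch below works through the arithmetic they imply. The helper names are hypothetical, and it is an assumption that the builder composes its index in a way equivalent to a FAISS factory string of the form `OPQ{m},IVF{nlist},PQ{m}x{nbits}`; the source does not say how the index is constructed internally.

```python
def factory_string(nlist: int = 16384, m: int = 64, nbits: int = 8,
                   use_opq: bool = True) -> str:
    """Return a FAISS-style factory description for the given parameters.

    Hypothetical helper; the real builder may assemble the index differently.
    """
    ivfpq = f"IVF{nlist},PQ{m}x{nbits}"
    return f"OPQ{m},{ivfpq}" if use_opq else ivfpq


def code_bytes_per_vector(m: int = 64, nbits: int = 8) -> int:
    """Size of one PQ code: m sub-codes of nbits each, rounded up to bytes."""
    return (m * nbits + 7) // 8


def training_points_per_list(sample_size: int = 655_360,
                             nlist: int = 16384) -> float:
    """Average number of training vectors available per IVF centroid."""
    return sample_size / nlist


print(factory_string())            # OPQ64,IVF16384,PQ64x8
print(code_bytes_per_vector())     # 64
print(training_points_per_list())  # 40.0
```

With the defaults, each stored vector compresses to a 64-byte code, and the training sample supplies 40 vectors per IVF list. That sits just above the 39 points per centroid that FAISS's k-means asks for by default before it emits a warning, which may be why 655_360 was chosen, though the source does not state this.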
- save_index(index: faiss.Index, index_path: str) → None
Save the index to disk, converting a GPU index to CPU first if needed.
- save_manifest(path: str, *, nlist: int, m: int, nbits: int, sample_size: int) → None
Save a JSON manifest with index metadata.
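As an illustration of what such a manifest might contain, here is a minimal sketch that writes the four keyword arguments to a JSON file. The function name `write_manifest` and the flat key layout are assumptions; the actual manifest schema is not documented here and may include further fields.

```python
import json


def write_manifest(path: str, *, nlist: int, m: int, nbits: int,
                   sample_size: int) -> None:
    """Write index build parameters to a JSON file (hypothetical schema)."""
    manifest = {
        "nlist": nlist,
        "m": m,
        "nbits": nbits,
        "sample_size": sample_size,
    }
    with open(path, "w", encoding="utf-8") as fh:
        # sort_keys gives a stable file that diffs cleanly between builds.
        json.dump(manifest, fh, indent=2, sort_keys=True)


def read_manifest(path: str) -> dict:
    """Load a manifest written by write_manifest."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

Keeping the build parameters next to the index file lets a loader check, for example, that a query pipeline uses settings compatible with how the index was trained.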
- build_sequence_store(db_path: str, tokens_dir: str, compress: bool = False, batch: int = 50000) → None
Build a SequenceStore SQLite database from tokenized sequence files.
- Parameters:
db_path (str) – The path to the SQLite database file to create.
tokens_dir (str) – The directory containing the tokenized sequence files.
compress (bool, optional) – Whether to compress the stored sequences, by default False.
batch (int, optional) – The number of rows to insert in each batch, by default 50_000.
- Raises:
FileNotFoundError – If a tokenized sequence file is missing.
ValueError – If there is a length mismatch between tokens and features.
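The roles of `compress` and `batch` can be sketched with the standard library alone. This is not the SequenceStore implementation: the table name `sequences`, its two-column schema, the use of `zlib`, and the helper names `build_store` and `read_tokens` are all assumptions chosen for illustration.

```python
import sqlite3
import zlib
from itertools import islice
from typing import Iterable, Tuple


def build_store(db_path: str, rows: Iterable[Tuple[int, bytes]],
                compress: bool = False, batch: int = 50_000) -> None:
    """Insert (id, token_bytes) rows into SQLite in batches (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sequences "
            "(id INTEGER PRIMARY KEY, tokens BLOB NOT NULL)"
        )
        it = iter(rows)
        while True:
            chunk = list(islice(it, batch))  # at most `batch` rows at a time
            if not chunk:
                break
            payload = [
                (seq_id, zlib.compress(tokens) if compress else tokens)
                for seq_id, tokens in chunk
            ]
            with conn:  # one transaction per batch, committed on success
                conn.executemany(
                    "INSERT INTO sequences (id, tokens) VALUES (?, ?)", payload
                )
    finally:
        conn.close()


def read_tokens(db_path: str, seq_id: int, compress: bool = False) -> bytes:
    """Fetch one sequence, decompressing if the store was built with compress=True."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT tokens FROM sequences WHERE id = ?", (seq_id,)
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        raise KeyError(seq_id)
    return zlib.decompress(row[0]) if compress else row[0]
```

Committing once per batch rather than once per row keeps insert throughput high while bounding how much work is lost if the build is interrupted.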