starling.search.store

SQLite Sequence Store

High-performance SQLite-backed storage for sequence metadata with compression support.

Overview

The SequenceStore provides fast, indexed access to sequence data:

Indexed Lookups: O(log N) access by GID, length, or hash
Compression: Optional zstd compression
Batch Operations: Efficient bulk insert and multi-get operations
Thread-Safe Reads: Immutable read-only connections never lock
Atomic Writes: Build in temp file, atomically publish when complete

Database Schema

The store uses a single table with indexes:

CREATE TABLE sequences (
    gid       INTEGER PRIMARY KEY,  -- Global sequence ID
    len       INTEGER NOT NULL,     -- Sequence length (amino acids)
    hash8     INTEGER,               -- 8-byte SHA1 hash for dedup
    seq       BLOB NOT NULL,         -- Compressed sequence
    shard     INTEGER,               -- Source shard ID
    local_idx INTEGER,               -- Index within shard
    header    BLOB                   -- Compressed header (optional)
);

CREATE INDEX idx_len ON sequences(len);      -- Length-based filtering
CREATE INDEX idx_hash8 ON sequences(hash8);  -- Hash-based dedup
CREATE INDEX idx_header ON sequences(header);

Basic Usage

Writing (Build Time):

>>> from starling.search import SequenceStore
>>>
>>> # Open writer (builds in temp file)
>>> store = SequenceStore.open_writer("sequences.sqlite")
>>>
>>> # Prepare rows: (gid, len, hash8, seq_blob, shard, local_idx, header_blob)
>>> rows = []
>>> for gid, seq in enumerate(sequences):
...     seq_blob = SequenceStore.encode_seq(seq, compress=True)
...     header_blob = SequenceStore.encode_header(header, compress=True)
...     hash8 = SequenceStore.hash8(seq)
...     rows.append((gid, len(seq), hash8, seq_blob, shard_id, local_idx, header_blob))
>>>
>>> # Bulk insert (very fast)
>>> store.insert_rows(rows)
>>>
>>> # Commit and atomically publish
>>> store.close_publish()

Reading (Query Time):

>>> # Open reader (immutable, never locks)
>>> store = SequenceStore.open_reader("sequences.sqlite")
>>>
>>> # Get single sequence
>>> seq = store.get_seq(gid=12345)
>>>
>>> # Get header and length
>>> header, length = store.get_header_len(gid=12345)
>>>
>>> # Batch fetch (efficient)
>>> gids = [100, 200, 300, 400, 500]
>>> metadata = store.get_many_meta(gids)
>>> for gid, header, length, hash8 in metadata:
...     print(f"{gid}: {header} (len={length})")
>>>
>>> # Length-based filtering (uses index)
>>> gids_in_range = store.get_gids_by_length_range(min_len=50, max_len=500)

Compression

The store supports optional zstd compression if available.

Blob Encoding Format

Sequences and headers are stored as BLOBs with a 1-byte flag:

[flag: 1 byte][payload: N bytes]

flag = 0x00: Plain UTF-8
flag = 0x01: zstd compressed UTF-8

This allows mixing compressed and uncompressed data in the same database.

Writer Workflow

The writer uses a safe build-and-publish pattern:

Build in temp file: Unique temp path prevents conflicts
Write-optimized PRAGMAs: journal_mode=OFF, synchronous=OFF
Bulk inserts: Use executemany() for batching
Atomic publish: os.replace() ensures all-or-nothing

This prevents corrupting existing databases during builds.

Reader Workflow

Readers use immutable connections:

mode=ro: Read-only access
immutable=1: Never checks for schema changes
cache=private: Per-connection page cache
Never blocks: Multiple concurrent readers

Methods Reference

Class Methods:

open_writer(path): Create writer for building database
open_reader(path): Open immutable reader connection

Writer Methods:

insert_rows(rows): Bulk insert rows
close_publish(): Commit, optimize, and atomically publish
close(): Close connection without publishing

Reader Methods:

get_seq(gid): Get sequence string by GID
get_header_len(gid): Get (header, length) tuple
get_many_header_len(gids): Batch fetch header+length
get_many_meta(gids): Batch fetch header+length+hash8
get_gids_by_length_range(min_len, max_len): Find GIDs by length

Static Methods:

hash8(seq): Compute 8-byte hash of sequence
encode_seq(seq, compress): Encode sequence to BLOB
decode_seq(blob): Decode BLOB to sequence
encode_header(header, compress): Encode header to BLOB
decode_header(blob): Decode BLOB to header

Threading and Concurrency

Readers: Thread-safe and lock-free

Multiple threads can share one reader
Multiple processes can open separate readers
Never blocks other readers or writers

Writers: Single-threaded

One writer at a time per database
Builds in isolated temp file
Atomically publishes when complete

Common Patterns

Pattern 1: Build during index creation

>>> store = SequenceStore.open_writer("index.faiss.seqs.sqlite")
>>> for shard_id, sequences in enumerate(shards):
...     rows = [(gid, len(s), hash8(s), encode_seq(s, True), ...)
...             for gid, s in sequences]
...     store.insert_rows(rows)
>>> store.close_publish()

Pattern 2: Query sequences during search

>>> store = SequenceStore.open_reader("index.faiss.seqs.sqlite")
>>> gids = [result[1] for result in search_results]  # Extract GIDs
>>> metadata = store.get_many_meta(gids)
>>> for gid, header, length, hash8 in metadata:
...     seq = store.get_seq(gid)  # Lazy fetch if needed

Pattern 3: Length-based pre-filtering

>>> # Find all sequences of exact length (exact match search)
>>> gids = store.get_gids_by_length_range(min_len=68, max_len=68)
>>> # Use as selector for FAISS search (huge speedup!)