starling.search.store

SQLite Sequence Store

High-performance SQLite-backed storage for sequence metadata with compression support.

Overview

The SequenceStore provides fast, indexed access to sequence data:

  • Indexed Lookups: O(log N) access by GID, length, or hash

  • Compression: Optional zstd compression

  • Batch Operations: Efficient bulk insert and multi-get operations

  • Thread-Safe Reads: Immutable read-only connections never lock

  • Atomic Writes: Build in temp file, atomically publish when complete

Database Schema

The store uses a single table with indexes:

CREATE TABLE sequences (
    gid       INTEGER PRIMARY KEY,  -- Global sequence ID
    len       INTEGER NOT NULL,     -- Sequence length (amino acids)
    hash8     INTEGER,               -- 8-byte SHA1 hash for dedup
    seq       BLOB NOT NULL,         -- Compressed sequence
    shard     INTEGER,               -- Source shard ID
    local_idx INTEGER,               -- Index within shard
    header    BLOB                   -- Compressed header (optional)
);

CREATE INDEX idx_len ON sequences(len);      -- Length-based filtering
CREATE INDEX idx_hash8 ON sequences(hash8);  -- Hash-based dedup
CREATE INDEX idx_header ON sequences(header);

Basic Usage

Writing (Build Time):

>>> from starling.search import SequenceStore
>>>
>>> # Open writer (builds in temp file)
>>> store = SequenceStore.open_writer("sequences.sqlite")
>>>
>>> # Prepare rows: (gid, len, hash8, seq_blob, shard, local_idx, header_blob)
>>> rows = []
>>> for gid, seq in enumerate(sequences):
...     seq_blob = SequenceStore.encode_seq(seq, compress=True)
...     header_blob = SequenceStore.encode_header(header, compress=True)
...     hash8 = SequenceStore.hash8(seq)
...     rows.append((gid, len(seq), hash8, seq_blob, shard_id, local_idx, header_blob))
>>>
>>> # Bulk insert (very fast)
>>> store.insert_rows(rows)
>>>
>>> # Commit and atomically publish
>>> store.close_publish()

Reading (Query Time):

>>> # Open reader (immutable, never locks)
>>> store = SequenceStore.open_reader("sequences.sqlite")
>>>
>>> # Get single sequence
>>> seq = store.get_seq(gid=12345)
>>>
>>> # Get header and length
>>> header, length = store.get_header_len(gid=12345)
>>>
>>> # Batch fetch (efficient)
>>> gids = [100, 200, 300, 400, 500]
>>> metadata = store.get_many_meta(gids)
>>> for gid, header, length, hash8 in metadata:
...     print(f"{gid}: {header} (len={length})")
>>>
>>> # Length-based filtering (uses index)
>>> gids_in_range = store.get_gids_by_length_range(min_len=50, max_len=500)

Compression

The store supports optional zstd compression if available.

Blob Encoding Format

Sequences and headers are stored as BLOBs with a 1-byte flag:

[flag: 1 byte][payload: N bytes]

flag = 0x00: Plain UTF-8
flag = 0x01: zstd compressed UTF-8

This allows mixing compressed and uncompressed data in the same database.

Writer Workflow

The writer uses a safe build-and-publish pattern:

  1. Build in temp file: Unique temp path prevents conflicts

  2. Write-optimized PRAGMAs: journal_mode=OFF, synchronous=OFF

  3. Bulk inserts: Use executemany() for batching

  4. Atomic publish: os.replace() ensures all-or-nothing

This prevents corrupting existing databases during builds.

Reader Workflow

Readers use immutable connections:

  1. mode=ro: Read-only access

  2. immutable=1: Never checks for schema changes

  3. cache=private: Per-connection page cache

  4. Never blocks: Multiple concurrent readers

Methods Reference

Class Methods:

  • open_writer(path): Create writer for building database

  • open_reader(path): Open immutable reader connection

Writer Methods:

  • insert_rows(rows): Bulk insert rows

  • close_publish(): Commit, optimize, and atomically publish

  • close(): Close connection without publishing

Reader Methods:

  • get_seq(gid): Get sequence string by GID

  • get_header_len(gid): Get (header, length) tuple

  • get_many_header_len(gids): Batch fetch header+length

  • get_many_meta(gids): Batch fetch header+length+hash8

  • get_gids_by_length_range(min_len, max_len): Find GIDs by length

Static Methods:

  • hash8(seq): Compute 8-byte hash of sequence

  • encode_seq(seq, compress): Encode sequence to BLOB

  • decode_seq(blob): Decode BLOB to sequence

  • encode_header(header, compress): Encode header to BLOB

  • decode_header(blob): Decode BLOB to header

Threading and Concurrency

Readers: Thread-safe and lock-free

  • Multiple threads can share one reader

  • Multiple processes can open separate readers

  • Never blocks other readers or writers

Writers: Single-threaded

  • One writer at a time per database

  • Builds in isolated temp file

  • Atomically publishes when complete

Common Patterns

Pattern 1: Build during index creation

>>> store = SequenceStore.open_writer("index.faiss.seqs.sqlite")
>>> for shard_id, sequences in enumerate(shards):
...     rows = [(gid, len(s), hash8(s), encode_seq(s, True), ...)
...             for gid, s in sequences]
...     store.insert_rows(rows)
>>> store.close_publish()

Pattern 2: Query sequences during search

>>> store = SequenceStore.open_reader("index.faiss.seqs.sqlite")
>>> gids = [result[1] for result in search_results]  # Extract GIDs
>>> metadata = store.get_many_meta(gids)
>>> for gid, header, length, hash8 in metadata:
...     seq = store.get_seq(gid)  # Lazy fetch if needed

Pattern 3: Length-based pre-filtering

>>> # Find all sequences of exact length (exact match search)
>>> gids = store.get_gids_by_length_range(min_len=68, max_len=68)
>>> # Use as selector for FAISS search (huge speedup!)

See also

  • IndexBuilder: Uses SequenceStore during build

  • SearchEngine: Uses SequenceStore for filtering/metadata

  • starling.search.cli: Command-line interface

Notes

  • Batch operations are orders of magnitude faster than single gets

  • Length-based filtering is extremely fast due to indexing

  • Writer temp files are automatically cleaned up on publish

Classes

SequenceStore

SQLite-backed per-gid sequence metadata store.