starling.search.store
SQLite Sequence Store
High-performance SQLite-backed storage for sequence metadata with compression support.
Overview
The SequenceStore provides fast, indexed access to sequence data:
Indexed Lookups: O(log N) access by GID, length, or hash
Compression: Optional zstd compression
Batch Operations: Efficient bulk insert and multi-get operations
Thread-Safe Reads: Immutable read-only connections never lock
Atomic Writes: Build in temp file, atomically publish when complete
Database Schema
The store uses a single table with indexes:
CREATE TABLE sequences (
gid INTEGER PRIMARY KEY, -- Global sequence ID
len INTEGER NOT NULL, -- Sequence length (amino acids)
hash8 INTEGER, -- 8-byte SHA1 hash for dedup
seq BLOB NOT NULL, -- Compressed sequence
shard INTEGER, -- Source shard ID
local_idx INTEGER, -- Index within shard
header BLOB -- Compressed header (optional)
);
CREATE INDEX idx_len ON sequences(len); -- Length-based filtering
CREATE INDEX idx_hash8 ON sequences(hash8); -- Hash-based dedup
CREATE INDEX idx_header ON sequences(header);
Basic Usage
Writing (Build Time):
>>> from starling.search import SequenceStore
>>>
>>> # Open writer (builds in temp file)
>>> store = SequenceStore.open_writer("sequences.sqlite")
>>>
>>> # Prepare rows: (gid, len, hash8, seq_blob, shard, local_idx, header_blob)
>>> rows = []
>>> for gid, seq in enumerate(sequences):
... seq_blob = SequenceStore.encode_seq(seq, compress=True)
... header_blob = SequenceStore.encode_header(header, compress=True)
... hash8 = SequenceStore.hash8(seq)
... rows.append((gid, len(seq), hash8, seq_blob, shard_id, local_idx, header_blob))
>>>
>>> # Bulk insert (very fast)
>>> store.insert_rows(rows)
>>>
>>> # Commit and atomically publish
>>> store.close_publish()
Reading (Query Time):
>>> # Open reader (immutable, never locks)
>>> store = SequenceStore.open_reader("sequences.sqlite")
>>>
>>> # Get single sequence
>>> seq = store.get_seq(gid=12345)
>>>
>>> # Get header and length
>>> header, length = store.get_header_len(gid=12345)
>>>
>>> # Batch fetch (efficient)
>>> gids = [100, 200, 300, 400, 500]
>>> metadata = store.get_many_meta(gids)
>>> for gid, header, length, hash8 in metadata:
... print(f"{gid}: {header} (len={length})")
>>>
>>> # Length-based filtering (uses index)
>>> gids_in_range = store.get_gids_by_length_range(min_len=50, max_len=500)
Compression
The store supports optional zstd compression if available.
Blob Encoding Format
Sequences and headers are stored as BLOBs with a 1-byte flag:
[flag: 1 byte][payload: N bytes]
flag = 0x00: Plain UTF-8
flag = 0x01: zstd compressed UTF-8
This allows mixing compressed and uncompressed data in the same database.
Writer Workflow
The writer uses a safe build-and-publish pattern:
Build in temp file: Unique temp path prevents conflicts
Write-optimized PRAGMAs: journal_mode=OFF, synchronous=OFF
Bulk inserts: Use executemany() for batching
Atomic publish: os.replace() ensures all-or-nothing
This prevents corrupting existing databases during builds.
Reader Workflow
Readers use immutable connections:
mode=ro: Read-only access
immutable=1: Never checks for schema changes
cache=private: Per-connection page cache
Never blocks: Multiple concurrent readers
Methods Reference
Class Methods:
open_writer(path): Create writer for building databaseopen_reader(path): Open immutable reader connection
Writer Methods:
insert_rows(rows): Bulk insert rowsclose_publish(): Commit, optimize, and atomically publishclose(): Close connection without publishing
Reader Methods:
get_seq(gid): Get sequence string by GIDget_header_len(gid): Get (header, length) tupleget_many_header_len(gids): Batch fetch header+lengthget_many_meta(gids): Batch fetch header+length+hash8get_gids_by_length_range(min_len, max_len): Find GIDs by length
Static Methods:
hash8(seq): Compute 8-byte hash of sequenceencode_seq(seq, compress): Encode sequence to BLOBdecode_seq(blob): Decode BLOB to sequenceencode_header(header, compress): Encode header to BLOBdecode_header(blob): Decode BLOB to header
Threading and Concurrency
Readers: Thread-safe and lock-free
Multiple threads can share one reader
Multiple processes can open separate readers
Never blocks other readers or writers
Writers: Single-threaded
One writer at a time per database
Builds in isolated temp file
Atomically publishes when complete
Common Patterns
Pattern 1: Build during index creation
>>> store = SequenceStore.open_writer("index.faiss.seqs.sqlite")
>>> for shard_id, sequences in enumerate(shards):
... rows = [(gid, len(s), hash8(s), encode_seq(s, True), ...)
... for gid, s in sequences]
... store.insert_rows(rows)
>>> store.close_publish()
Pattern 2: Query sequences during search
>>> store = SequenceStore.open_reader("index.faiss.seqs.sqlite")
>>> gids = [result[1] for result in search_results] # Extract GIDs
>>> metadata = store.get_many_meta(gids)
>>> for gid, header, length, hash8 in metadata:
... seq = store.get_seq(gid) # Lazy fetch if needed
Pattern 3: Length-based pre-filtering
>>> # Find all sequences of exact length (exact match search)
>>> gids = store.get_gids_by_length_range(min_len=68, max_len=68)
>>> # Use as selector for FAISS search (huge speedup!)
See also
IndexBuilder: Uses SequenceStore during buildSearchEngine: Uses SequenceStore for filtering/metadatastarling.search.cli: Command-line interface
Notes
Batch operations are orders of magnitude faster than single gets
Length-based filtering is extremely fast due to indexing
Writer temp files are automatically cleaned up on publish
Classes
SQLite-backed per-gid sequence metadata store. |