starling.sequence_encoder
- sequence_encoder(sequence_dict, ionic_strength=150, batch_size=32, aggregate=False, device=None, output_directory=None, encoder_path=None, ddpm_path=None, pretokenized: bool = False, bucket: bool = False, bucket_size: int = 32, free_cuda_cache: bool = False, return_on_cpu: bool = True)
Embed sequences with the STARLING encoder.
- Parameters:
sequence_dict (str | Sequence[str] | dict[str, str]) – Input sequences to encode. Accepts a FASTA/TSV path, a single sequence, a list of sequences, or a mapping of identifiers to sequences. The helper handle_input() normalizes the value into a {name: sequence} dictionary and validates residue alphabets.
ionic_strength (int, optional) – Ionic strength (in mM) to condition the encoder. Valid values are typically 20, 150, or 300, matching the training regimes. Defaults to configs.DEFAULT_IONIC_STRENGTH.
batch_size (int, optional) – Number of sequences to process per batch.
aggregate (bool, optional) – When True the function returns a single embedding vector per sequence using mean pooling. When False (default) residue-level embeddings are returned.
device (str | torch.device | None, optional) – Device hint forwarded to utilities.check_device(). None lets STARLING pick the best available accelerator.
output_directory (str | pathlib.Path | None, optional) – Directory for optional on-disk exports. Leave None to keep embeddings in memory only.
encoder_path (str | None, optional) – Override the default encoder checkpoint.
ddpm_path (str | None, optional) – Override the default diffusion checkpoint used by the shared model manager.
pretokenized (bool, optional) – Set to True when sequence_dict already contains cached tokens to skip preprocessing.
bucket (bool, optional) – Enable adaptive bucketing by sequence length to reduce padding waste in large batches.
bucket_size (int, optional) – Maximum number of unique lengths per bucket when bucket is True. Ignored otherwise.
free_cuda_cache (bool, optional) – Release CUDA memory after each batch for long inference jobs.
return_on_cpu (bool, optional) – Convert embeddings to CPU tensors before returning. Set to False to keep them on the selected device for downstream GPU workflows.
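
The bucket and bucket_size parameters describe grouping by sequence length so that each batch needs less padding. The sketch below illustrates that idea in plain Python. It is not STARLING's implementation: bucket_by_length and padding_waste are hypothetical helpers, and the grouping rule (sort the distinct lengths, then take at most bucket_size of them per bucket) is an assumption based only on the parameter description.

```python
from collections import defaultdict


def bucket_by_length(sequences, bucket_size=32):
    """Group names so each bucket spans at most `bucket_size` distinct lengths.

    Hypothetical helper: an assumed reading of the `bucket_size` parameter,
    not the library's own code.
    """
    if bucket_size < 1:
        raise ValueError("bucket_size must be >= 1")

    # Collect the names that share each sequence length.
    by_length = defaultdict(list)
    for name, seq in sequences.items():
        by_length[len(seq)].append(name)

    # Walk the sorted distinct lengths in chunks of `bucket_size`.
    lengths = sorted(by_length)
    buckets = []
    for start in range(0, len(lengths), bucket_size):
        chunk = lengths[start:start + bucket_size]
        buckets.append([name for length in chunk for name in by_length[length]])
    return buckets


def padding_waste(sequences, groups):
    """Count pad positions if every group is padded to its longest member."""
    waste = 0
    for names in groups:
        longest = max(len(sequences[n]) for n in names)
        waste += sum(longest - len(sequences[n]) for n in names)
    return waste


seqs = {"a": "MK", "b": "MKT", "c": "MKTAYIAKQR", "d": "MKTAYIAKQRQ"}
buckets = bucket_by_length(seqs, bucket_size=2)
bucketed = padding_waste(seqs, buckets)
unbucketed = padding_waste(seqs, [list(seqs)])
```

Comparing bucketed with unbucketed shows why grouping similar lengths helps: short sequences are no longer padded up to the longest sequence in the whole input.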
- Returns:
Mapping from sequence identifiers to embedding tensors. The trailing tensor shape is (L, D) for residue-level embeddings or (D,) for aggregated embeddings, where L is sequence length and D is the latent dimension.
- Return type:
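
The relationship between the two output shapes can be illustrated with a small mean-pooling sketch. It uses nested Python lists in place of tensors so it runs without PyTorch. The helper mean_pool is hypothetical, and whether STARLING excludes padded positions before averaging is not stated here, so the sketch assumes an unpadded (L, D) input.

```python
def mean_pool(residue_embeddings):
    """Average an (L, D) nested list over the length axis, giving a (D,) list.

    Hypothetical helper for illustration; assumes no padded positions.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    # One average per latent dimension, taken across all residues.
    return [sum(row[d] for row in residue_embeddings) / length for d in range(dim)]


per_residue = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # L = 3, D = 2
pooled = mean_pool(per_residue)                      # D = 2
```

With a PyTorch tensor whose trailing shape is (L, D), the same reduction would be a mean over the second-to-last dimension, for example embedding.mean(dim=-2).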
Notes
The encoder shares weights with the ensemble generator, so successive calls reuse cached models through generation.model_manager. Use encoder_path and ddpm_path to experiment with fine-tuned weights without mutating global configuration.
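
A call might be assembled as in the following sketch. The keyword names and defaults are copied from the signature documented above. Everything else is an assumption: encoder_kwargs and embed are hypothetical wrappers, and the import path from starling import sequence_encoder is inferred from the page heading rather than confirmed. The import is deferred into embed so the rest of the sketch runs without STARLING installed.

```python
# Keyword names and defaults copied from the documented signature.
SIGNATURE_DEFAULTS = {
    "ionic_strength": 150,
    "batch_size": 32,
    "aggregate": False,
    "device": None,
    "output_directory": None,
    "encoder_path": None,
    "ddpm_path": None,
    "pretokenized": False,
    "bucket": False,
    "bucket_size": 32,
    "free_cuda_cache": False,
    "return_on_cpu": True,
}


def encoder_kwargs(**overrides):
    """Merge overrides into the documented defaults (hypothetical helper)."""
    unknown = sorted(set(overrides) - set(SIGNATURE_DEFAULTS))
    if unknown:
        raise TypeError(f"unexpected keyword(s): {', '.join(unknown)}")
    return {**SIGNATURE_DEFAULTS, **overrides}


def embed(sequences, **overrides):
    """Call sequence_encoder with merged keywords; requires STARLING."""
    # Assumption: the function is importable from the top-level package,
    # as the heading `starling.sequence_encoder` suggests.
    from starling import sequence_encoder

    return sequence_encoder(sequences, **encoder_kwargs(**overrides))


# Settings for one pooled vector per sequence, with length bucketing enabled.
pooled_settings = encoder_kwargs(aggregate=True, bucket=True)
```

With STARLING installed, embed({"seq1": "MKTAYIAKQR"}, aggregate=True) would be expected to return a mapping from "seq1" to a (D,) tensor. Passing encoder_path or ddpm_path through the same wrapper keeps a fine-tuned checkpoint scoped to that call.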