starling.sequence_encoder

sequence_encoder(sequence_dict, ionic_strength=150, batch_size=32, aggregate=False, device=None, output_directory=None, encoder_path=None, ddpm_path=None, pretokenized: bool = False, bucket: bool = False, bucket_size: int = 32, free_cuda_cache: bool = False, return_on_cpu: bool = True)

Embed sequences with the STARLING encoder.

Parameters:

sequence_dict (str | Sequence[str] | dict[str, str]) – Input sequences to encode. Accepts a FASTA/TSV path, a single sequence, a list of sequences, or a mapping of identifiers to sequences. The helper handle_input() normalizes the value into a {name: sequence} dictionary and validates residue alphabets.
ionic_strength (int, optional) – Ionic strength, in mM, on which the encoder is conditioned. Typical values are 20, 150, or 300, matching the training regimes. Defaults to configs.DEFAULT_IONIC_STRENGTH.
batch_size (int, optional) – Number of sequences to process per batch.
aggregate (bool, optional) – When True, the function returns a single mean-pooled embedding vector per sequence. When False (default), residue-level embeddings are returned.
device (str | torch.device | None, optional) – Device hint forwarded to utilities.check_device(). None lets STARLING pick the best available accelerator.
output_directory (str | pathlib.Path | None, optional) – Directory for optional on-disk exports. Leave None to keep embeddings in memory only.
encoder_path (str | None, optional) – Override the default encoder checkpoint.
ddpm_path (str | None, optional) – Override the default diffusion checkpoint used by the shared model manager.
pretokenized (bool, optional) – Set to True when sequence_dict already contains cached tokens to skip preprocessing.
bucket (bool, optional) – Enable adaptive bucketing by sequence length to reduce padding waste in large batches.
bucket_size (int, optional) – Maximum number of unique lengths per bucket when bucket is True. Ignored otherwise.
free_cuda_cache (bool, optional) – Release cached CUDA memory after each batch, which can help during long inference jobs.
return_on_cpu (bool, optional) – Convert embeddings to CPU tensors before returning. Set to False to keep them on the selected device for downstream GPU workflows.
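
To make the bucket and bucket_size parameters more concrete, the following is a minimal sketch of length-based bucketing in which each bucket holds at most bucket_size distinct sequence lengths. The helper bucket_by_length() is hypothetical and is not part of STARLING; the library's own bucketing strategy may differ in its details.

```python
from typing import Dict, List, Set


def bucket_by_length(sequences: Dict[str, str], bucket_size: int = 32) -> List[List[str]]:
    """Group sequence names so each bucket spans at most `bucket_size` distinct lengths.

    Hypothetical helper for illustration only; not the STARLING implementation.
    """
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")

    # Sort names by sequence length so similar lengths end up adjacent.
    ordered = sorted(sequences, key=lambda name: len(sequences[name]))

    buckets: List[List[str]] = []
    lengths_in_bucket: Set[int] = set()
    for name in ordered:
        length = len(sequences[name])
        # Open a new bucket when this length would push the bucket past the cap.
        is_new_length = length not in lengths_in_bucket
        if not buckets or (is_new_length and len(lengths_in_bucket) >= bucket_size):
            buckets.append([])
            lengths_in_bucket = set()
        buckets[-1].append(name)
        lengths_in_bucket.add(length)
    return buckets


# Toy input: lengths 4, 5, 4, 8, and 9.
toy_sequences = {
    "a": "AAAA",
    "b": "AAAAA",
    "c": "AAAA",
    "d": "AAAAAAAA",
    "e": "AAAAAAAAA",
}
toy_buckets = bucket_by_length(toy_sequences, bucket_size=2)
```

With bucket_size=2 the short sequences (lengths 4 and 5) land in one bucket and the long ones (lengths 8 and 9) in another, so padding within a batch is bounded by the length spread of its bucket.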

Returns:

Mapping from sequence identifiers to embedding tensors. The trailing tensor shape is (L, D) for residue-level embeddings or (D,) for aggregated embeddings, where L is sequence length and D is the latent dimension.
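
The relationship between the two output shapes can be illustrated with a small sketch of mean pooling over the residue axis. Plain nested lists stand in for tensors here so the example runs without PyTorch; mean_pool() is a hypothetical helper, and whether STARLING masks padding or special tokens before pooling is not stated above.

```python
from typing import List


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average an (L, D) nested list over the residue axis, giving a (D,) vector.

    Hypothetical helper for illustration only; not the STARLING implementation.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue")
    length = len(residue_embeddings)
    # zip(*rows) walks the columns, i.e. one latent dimension at a time.
    return [sum(column) / length for column in zip(*residue_embeddings)]


# Toy residue-level embedding: L = 3 residues, D = 2 latent dimensions.
toy_embedding = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy_embedding)
```

For a torch.Tensor of shape (L, D), the equivalent operation would be a mean over the first dimension, for example tensor.mean(dim=0).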

Return type:

dict[str, torch.Tensor]

Notes

The encoder shares weights with the ensemble generator, so successive calls reuse cached models through generation.model_manager. Use encoder_path and ddpm_path to experiment with fine-tuned weights without mutating global configuration.
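
Example

A minimal usage sketch based on the signature above. It assumes STARLING is installed and that sequence_encoder is importable from the package top level, as the heading starling.sequence_encoder suggests. The identifiers and sequences are made up for illustration, and run_example() is a hypothetical wrapper rather than part of the library.

```python
# Hypothetical identifiers and sequences, used only for illustration.
SEQUENCES = {
    "seq_a": "MSDNGSGSEEKKRRSSPE",
    "seq_b": "GSGSGSPQQQPEEKKSGS",
}


def run_example():
    """Encode SEQUENCES twice: residue-level first, then mean-pooled."""
    # Imported lazily so this module loads even where STARLING is absent.
    # Assumption: the function is exposed at the package top level.
    from starling import sequence_encoder

    # Residue-level embeddings: one (L, D) tensor per identifier.
    per_residue = sequence_encoder(SEQUENCES, ionic_strength=150, batch_size=32)

    # Aggregated embeddings: one (D,) tensor per identifier.
    pooled = sequence_encoder(SEQUENCES, ionic_strength=150, aggregate=True)

    for name in SEQUENCES:
        print(name, tuple(per_residue[name].shape), tuple(pooled[name].shape))
    return per_residue, pooled
```

Call run_example() in an environment where STARLING and its model weights are available. Because cached models are reused across calls, the second call should not need to reload the encoder.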