starling.data.tokenizer.StarlingTokenizer

class StarlingTokenizer

Bases: object

Lightweight amino-acid tokenizer used across STARLING.

The tokenizer exposes fast byte-level encode/decode helpers that map between protein sequences and integer vocabulary IDs. It is optimized for bulk processing inside the sequence encoder and the FAISS tokenization pipeline.

Examples

>>> from starling.data.tokenizer import StarlingTokenizer
>>> tok = StarlingTokenizer()
>>> ids = tok.encode("ACDE")
>>> ids
[1, 2, 3, 4]
>>> tok.decode(ids)
'ACDE'

Notes

  • Unknown characters raise ValueError during encoding.

  • Padding/reserved 0 tokens are stripped during decode.

Methods

__init__

decode

Decode a sequence of ints into a string, dropping zeros (padding).

encode

Encode a string of amino acids into a list[int].

Attributes

aa_to_int

int_to_aa

aa_to_int = {'0': 0, 'A': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'K': 9, 'L': 10, 'M': 11, 'N': 12, 'P': 13, 'Q': 14, 'R': 15, 'S': 16, 'T': 17, 'V': 18, 'W': 19, 'Y': 20}
int_to_aa = {0: '0', 1: 'A', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'K', 10: 'L', 11: 'M', 12: 'N', 13: 'P', 14: 'Q', 15: 'R', 16: 'S', 17: 'T', 18: 'V', 19: 'W', 20: 'Y'}
__init__()

encode(sequence: str)

Encode a string of amino acids into a list[int]. Raises ValueError if an unknown character is present.
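The encoding behavior described above can be sketched with a small stand-in. This is an illustration under stated assumptions, not the STARLING source: `encode_bytes_sketch`, `ENCODE_TABLE`, and the `UNKNOWN` sentinel are hypothetical names, and using `bytes.translate` is only one plausible way to do a byte-level lookup. The vocabulary order mirrors the documented `aa_to_int` attribute.

```python
# Hypothetical stand-in for illustration only; not the STARLING implementation.
# The vocabulary order mirrors the documented aa_to_int attribute ('0' -> 0, 'A' -> 1, ...).
ALPHABET = "0ACDEFGHIKLMNPQRSTVWY"
UNKNOWN = 255  # assumed sentinel for bytes outside the vocabulary

# Build a 256-entry table mapping each byte value to a vocab ID or to UNKNOWN.
_table = bytearray([UNKNOWN] * 256)
for idx, aa in enumerate(ALPHABET):
    _table[ord(aa)] = idx
ENCODE_TABLE = bytes(_table)


def encode_bytes_sketch(sequence: str) -> list[int]:
    """Map residues to integer IDs, raising ValueError on unknown characters."""
    # Non-ASCII input raises UnicodeEncodeError, which is a ValueError subclass.
    raw = sequence.encode("ascii")
    ids = raw.translate(ENCODE_TABLE)  # one table lookup per byte
    if UNKNOWN in ids:
        raise ValueError("sequence contains a character outside the vocabulary")
    return list(ids)


print(encode_bytes_sketch("ACDE"))  # [1, 2, 3, 4]

try:
    encode_bytes_sketch("ACXE")  # 'X' is not in the vocabulary
except ValueError as err:
    print(f"rejected: {err}")
```

A translation table keeps the per-residue work inside a single built-in call, which is why it suits bulk processing; a plain dictionary lookup over `aa_to_int` would produce the same IDs.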

decode(sequence)

Decode a sequence of ints into a string, dropping zeros (padding). Accepts list/tuple[int], bytes, or bytearray.
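The accepted input types and the padding rule can be illustrated with a minimal sketch. As before, this is an assumption-laden stand-in rather than the library's code: `decode_sketch` and `INT_TO_AA` are hypothetical names, with the mapping rebuilt from the documented `int_to_aa` attribute.

```python
# Hypothetical stand-in for illustration only; not the STARLING implementation.
# The mapping mirrors the documented int_to_aa attribute (0 -> '0', 1 -> 'A', ...).
ALPHABET = "0ACDEFGHIKLMNPQRSTVWY"
INT_TO_AA = dict(enumerate(ALPHABET))


def decode_sketch(ids) -> str:
    """Turn integer IDs back into residues, skipping 0 (padding)."""
    # Iterating over bytes or bytearray yields ints, so a single loop
    # covers list, tuple, bytes, and bytearray inputs alike.
    return "".join(INT_TO_AA[i] for i in ids if i != 0)


print(decode_sketch([1, 2, 3, 4, 0, 0]))      # ACDE
print(decode_sketch(bytes([11, 9, 0])))       # MK
print(decode_sketch(bytearray([0, 20, 19])))  # YW
```

Because zeros are dropped wherever they occur, sequences padded to a common length decode back to their original residues without any explicit trimming step.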