starling.data.tokenizer.StarlingTokenizer
- class StarlingTokenizer
Bases: object
Lightweight amino-acid tokenizer used across STARLING.
The tokenizer exposes fast byte-level encode/decode helpers that map between protein sequences and integer vocab IDs. It is optimized for bulk processing inside the sequence encoder and FAISS tokenization pipeline.
Examples
>>> from starling.data.tokenizer import StarlingTokenizer
>>> tok = StarlingTokenizer()
>>> ids = tok.encode("ACDE")
>>> ids
[1, 2, 3, 4]
>>> tok.decode(ids)
'ACDE'
Notes
Unknown characters raise KeyError during encoding.
Padding/reserved 0 tokens are stripped during decode.
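The two notes above can be illustrated with a minimal stand-in sketch. This is not the STARLING source: the module-level functions `encode` and `decode` and the names `AA_TO_INT` and `INT_TO_AA` are hypothetical, and only the lookup table itself is taken from the documented `aa_to_int` attribute. It assumes encoding is a plain dictionary lookup, which is one way the documented KeyError behaviour could arise.

```python
# Stand-in sketch, not the library implementation.
# The table reproduces the documented aa_to_int mapping ('0' -> 0, 'A' -> 1, ..., 'Y' -> 20).
AA_TO_INT = {aa: i for i, aa in enumerate("0ACDEFGHIKLMNPQRSTVWY")}
INT_TO_AA = {i: aa for aa, i in AA_TO_INT.items()}


def encode(sequence):
    """Map each residue letter to its integer ID (hypothetical helper)."""
    # Plain dict indexing: a character outside the table raises KeyError.
    return [AA_TO_INT[aa] for aa in sequence]


def decode(ids):
    """Map integer IDs back to letters, skipping ID 0 (hypothetical helper)."""
    return "".join(INT_TO_AA[i] for i in ids if i != 0)


# Trailing zeros stand in for padding added by a batching step.
padded = encode("ACDE") + [0, 0, 0]
print(decode(padded))  # ACDE

# 'X' is not one of the 20 mapped residues, so the lookup fails.
try:
    encode("ACXE")
except KeyError as err:
    print("unknown residue:", err)
```

Under this assumption, lowercase letters and ambiguity codes such as X or B also fail, so callers that may receive such input would need to normalise or filter sequences before encoding.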
Methods
decode: Decode a sequence of ints into a string, dropping zeros (padding).
encode: Encode a string of amino acids into a list[int].
Attributes
- aa_to_int = {'0': 0, 'A': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'K': 9, 'L': 10, 'M': 11, 'N': 12, 'P': 13, 'Q': 14, 'R': 15, 'S': 16, 'T': 17, 'V': 18, 'W': 19, 'Y': 20}
- int_to_aa = {0: '0', 1: 'A', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'K', 10: 'L', 11: 'M', 12: 'N', 13: 'P', 14: 'Q', 15: 'R', 16: 'S', 17: 'T', 18: 'V', 19: 'W', 20: 'Y'}
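As a rough sketch of how the two tables relate, the snippet below rebuilds them from the 20 one-letter residue codes in alphabetical order and checks that one is the inverse of the other. The names `AMINO_ACIDS` and `vocab_size` are illustrative and not part of the documented API; how STARLING actually constructs these attributes is not shown on this page.

```python
# Illustrative reconstruction of the documented tables; names are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by letter

# ID 0 is reserved for the padding symbol '0'; residues take IDs 1..20.
aa_to_int = {"0": 0}
aa_to_int.update({aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)})

# The reverse table is the exact inverse of the forward one.
int_to_aa = {i: aa for aa, i in aa_to_int.items()}

# Number of distinct IDs, including padding.
vocab_size = len(aa_to_int)
print(vocab_size)  # 21

# Round-trip check: every symbol maps to an ID and back to itself.
assert all(int_to_aa[aa_to_int[aa]] == aa for aa in aa_to_int)
```

Because ID 0 is part of the table, anything sized by the vocabulary (for example an embedding lookup) would need 21 entries rather than 20.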