Ensemble Generation
STARLING generates conformational ensembles of intrinsically disordered proteins (IDPs). This guide covers basic generation, environment control, performance tuning, guided sampling, and saving, loading, and converting ensembles.
See also
- Command Line Interface for command-line generation and conversion helpers.
- Guided Sampling with Constraints to steer sampling with experimental restraints and enable Torch compilation.
Getting Started
When to Use STARLING
STARLING is designed for:
- Generating structural ensembles of intrinsically disordered proteins (IDPs)
- Predicting conformational properties of disordered regions
- Exploring the conformational space of proteins with significant disorder
Basic Ensemble Generation
Command Line Interface
The simplest way to generate an ensemble is using the command-line interface:
starling MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK -c 100 --outname example_ensemble
Common Parameters:
- -c: Number of conformations to generate (default: 200)
- --outname: Base name for output files
- --ionic_strength: Ionic strength in mM (default: 150)
- --steps: Number of diffusion steps (default: 30)
- --sampler: Sampling algorithm to use (default: "ddim")
For a complete list of options, run:
starling --help
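When driving the CLI from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags listed above. `build_starling_command` is a hypothetical helper, not part of STARLING, and actually running the command assumes the `starling` executable is on your PATH.

```python
def build_starling_command(sequence, conformations=200, outname=None,
                           ionic_strength=None, steps=None, sampler=None):
    """Assemble a `starling` CLI call from the documented flags.

    Hypothetical helper, not part of STARLING. Options left as None are
    omitted so the CLI falls back to its own defaults.
    """
    cmd = ["starling", sequence, "-c", str(conformations)]
    optional_flags = {
        "--outname": outname,
        "--ionic_strength": ionic_strength,
        "--steps": steps,
        "--sampler": sampler,
    }
    for flag, value in optional_flags.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_starling_command("GSGSGSGSGSGS", conformations=100,
                             outname="example_ensemble")
print(" ".join(cmd))

# To execute the command (requires STARLING to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with long sequences and output names.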
Python API
You can also generate ensembles programmatically:
from starling import generate
# Basic usage with a single sequence
sequence = "MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK"
ensemble = generate(sequence, conformations=100)
ensemble.save("example_ensemble.starling")
# Process multiple sequences at once
sequences = [
    "GSGSGSGSGSGS",
    "ACDEFGHIKLMNPQRSTVWY",
]
ensembles = generate(sequences, conformations=50)
# Access individual ensembles from the returned dictionary
for name, ens in ensembles.items():
    print(f"Ensemble {name}: {len(ens)} conformations")
    ens.save(f"{name}_ensemble.starling")
Working with Multiple Input Formats
STARLING accepts various input formats:
# From a dictionary with custom names
sequence_dict = {
    "protein_A": "GSGSGSGSGSGS",
    "protein_B": "ACDEFGHIKLMNPQRSTVWY",
}
ensembles = generate(sequence_dict, conformations=50)
# From a FASTA file
ensembles = generate("path/to/sequences.fasta", conformations=50)
# From a TSV file (name, sequence format)
ensembles = generate("path/to/sequences.tsv", conformations=50)
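If you want to inspect or filter sequences before generation, you can read a FASTA file yourself and pass the resulting dictionary to `generate`. This is a minimal sketch, assuming a plain FASTA layout (header lines starting with `>`, sequences possibly wrapped over several lines). `read_fasta` is a hypothetical helper, not a STARLING function, and STARLING's own parser may treat headers differently.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name;
            # fall back to a generated name if the header is empty.
            header = line[1:].split()
            name = header[0] if header else f"sequence_{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            # Sequence lines are concatenated until the next header
            records[name] += line
    return records


example = """>protein_A test linker
GSGSGS
GSGSGS
>protein_B
ACDEFGHIKLMNPQRSTVWY
""".splitlines()

parsed = read_fasta(example)
print(parsed)
```

The returned dictionary has the same shape as `sequence_dict` above, so it can be passed along as `generate(parsed, conformations=50)`. To read from disk, pass an open file handle, which iterates line by line.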
Environment Control
Ionic Strength Control
STARLING was trained on ensembles generated at three ionic strengths (20 mM, 150 mM, and 300 mM). You can set the ionic strength to model different environments:
Command Line Interface:
starling SEQUENCE -c 100 --ionic_strength 20 --outname low_ionic_strength_ensemble
Python API:
# Generate at physiological ionic strength (150 mM)
ensemble_150 = generate(sequence, conformations=100, ionic_strength=150)
# Generate at low ionic strength (20 mM)
ensemble_20 = generate(sequence, conformations=100, ionic_strength=20)
# Generate at high ionic strength (300 mM)
ensemble_300 = generate(sequence, conformations=100, ionic_strength=300)
# Calculate and compare properties at different ionic strengths
for label, ens in (("20", ensemble_20), ("150", ensemble_150), ("300", ensemble_300)):
    mean_rg = ens.radius_of_gyration(return_mean=True)
    print(f"Mean Rg at {label} mM: {mean_rg:.2f} Å")
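For reference, the radius of gyration of a single conformation is the root-mean-square distance of its beads from their centroid. The sketch below computes it with equal weights from raw coordinates. It is an illustration of the quantity only: the function and the toy coordinates are my own, and STARLING's `radius_of_gyration` may differ in details such as which atoms it uses.

```python
import math


def radius_of_gyration(coords):
    """Rg of one conformation from a list of (x, y, z) positions, equal weights."""
    n = len(coords)
    # Centroid of all beads
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    # Mean squared distance of the beads from the centroid
    mean_sq = sum(
        sum((c[i] - center[i]) ** 2 for i in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


# Two toy "conformations" of a 3-bead chain (coordinates in angstroms)
compact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 3.8, 0.0)]
extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

rg_compact = radius_of_gyration(compact)
rg_extended = radius_of_gyration(extended)
print(f"compact: {rg_compact:.2f}, extended: {rg_extended:.2f}")
```

A more expanded chain gives a larger Rg, which is why the mean Rg is a convenient single number for comparing ensembles across ionic strengths.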
Controlling Ensemble Size
Balance quality and performance by adjusting ensemble size:
# Small ensemble for quick analysis
small_ensemble = generate(sequence, conformations=20)
# Medium ensemble for standard analysis
medium_ensemble = generate(sequence, conformations=100)
# Large ensemble for detailed statistical analysis
large_ensemble = generate(sequence, conformations=500)
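One way to reason about ensemble size is through the standard error of an ensemble-averaged property, which shrinks with the square root of the number of conformations. The sketch below assumes the conformations are independent samples and uses an illustrative spread of 5 angstroms in Rg; both the value and the helper name are assumptions, not STARLING outputs.

```python
import math


def standard_error(sample_std, n_conformations):
    """Standard error of a mean over n independent conformations."""
    return sample_std / math.sqrt(n_conformations)


# Assumed spread of Rg across conformations (illustrative value only)
rg_std = 5.0  # angstroms

for n in (20, 100, 500):
    sem = standard_error(rg_std, n)
    print(f"{n:4d} conformations -> SEM of mean Rg ~ {sem:.2f} angstroms")
```

Under these assumptions, going from 100 to 500 conformations costs five times the sampling but narrows the uncertainty on a mean only by a factor of about 2.2, so large ensembles pay off mainly when you need distributions or rare states rather than averages.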
Performance Tuning
Batch and Device Strategies
Balance throughput and memory use by adjusting hardware-related options:
ensemble = generate(
    sequences,
    conformations=100,
    device="cuda:0",         # Pin generation to a specific accelerator
    batch_size=64,           # Increase to improve GPU utilisation
    num_cpus_mds=8,          # Allocate more CPUs for 3D reconstruction
    show_progress_bar=True,
    verbose=False,
)
Remember that batch_size cannot exceed conformations, and that larger values
increase peak memory usage. For CPU-only runs, set device to "cpu" and reduce
batch_size for predictable performance.
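The batch-size constraint can be checked before launching a run. In this small sketch, `plan_batches` is a hypothetical helper, and the assumption that conformations are produced in consecutive batches of at most `batch_size` is mine rather than a documented detail of STARLING's internals.

```python
import math


def plan_batches(conformations, batch_size):
    """Clamp batch_size to conformations and count the resulting batches."""
    if conformations < 1 or batch_size < 1:
        raise ValueError("conformations and batch_size must be positive")
    effective = min(batch_size, conformations)
    n_batches = math.ceil(conformations / effective)
    return effective, n_batches


# 100 conformations at batch_size=64: one full batch plus a partial one
print(plan_batches(100, 64))
# 20 conformations at batch_size=64: batch size is clamped to 20
print(plan_batches(20, 64))
```

If you hit out-of-memory errors, halving `batch_size` roughly doubles the number of batches while lowering peak memory per batch.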
Sampler Selection
STARLING supports multiple diffusion samplers so you can trade accuracy for latency:
# Deterministic DDIM sampling – faster, deterministic trajectories
ddim_ensemble = generate(sequence, conformations=100, sampler="ddim", steps=20)
# Stochastic DDPM sampling – higher fidelity at the cost of runtime
ddpm_ensemble = generate(sequence, conformations=100, sampler="ddpm", steps=50)
Model Compilation
For repeated predictions, compile the underlying PyTorch models once per process:
import starling
starling.set_compilation_options(enabled=True, mode="reduce-overhead")
ensemble = generate(sequence, conformations=100)
The first invocation warms up kernels; subsequent calls reuse compiled graphs and can reduce runtime by ~40% on supported GPUs. See Guided Sampling with Constraints for advanced compilation options.
Guided Sampling
Constraint-driven Sampling
STARLING can enforce experimental restraints during diffusion. Pass any
constraint (or list of constraints) from
starling.inference.constraints to the constraint argument:
from starling.inference.constraints import DistanceConstraint
constraint = DistanceConstraint(
    resid1=10,
    resid2=200,  # both residues must exist: needs a sequence longer than 200 residues
    target=50.0,
    tolerance=2.0,
    force_constant=2.5,
)
ensemble = generate(sequence, conformations=200, constraint=constraint)
Combine multiple constraints or tune force_constant/guidance settings to
steer sampling toward experimental observables. Visit Guided Sampling with Constraints
for a catalogue of available restraints and tuning advice.
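The exact energy function STARLING applies is not described in this section. Purely as an illustration of how a target, a tolerance, and a force constant commonly interact in distance restraints, the sketch below implements a flat-bottom harmonic penalty. The functional form and the function name are assumptions, not STARLING's implementation.

```python
def flat_bottom_penalty(distance, target, tolerance, force_constant):
    """Zero inside target +/- tolerance, harmonic outside it.

    Illustrative form only; not taken from STARLING's source.
    """
    # How far the distance lies outside the tolerated window (0 if inside)
    excess = max(0.0, abs(distance - target) - tolerance)
    return force_constant * excess ** 2


# Same parameters as the DistanceConstraint example above
for d in (51.0, 55.0, 40.0):
    penalty = flat_bottom_penalty(d, target=50.0, tolerance=2.0, force_constant=2.5)
    print(f"distance {d:5.1f} -> penalty {penalty:.1f}")
```

In a scheme like this, widening the tolerance leaves more conformations unpenalised, while a larger force constant pulls violating conformations back toward the window more strongly.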
Saving and Loading Ensembles
Saving Ensembles
Save ensembles in STARLING format for later use:
# Save with default options
ensemble.save("my_ensemble")
# Save with compression for smaller file size
ensemble.save("my_ensemble_compressed", compress=True)
# Auto-save during generation
ensemble = generate(
    sequence,
    conformations=100,
    output_directory="results",
)
Loading Ensembles
Load previously generated ensembles:
from starling.structure.ensemble import load_ensemble
# Load an ensemble
ensemble = load_ensemble("my_ensemble.starling")
# Load without 3D structures for faster loading
ensemble = load_ensemble("my_ensemble.starling", ignore_structures=True)
print(f"Loaded ensemble with {len(ensemble)} conformations")
Output Files and Conversion
STARLING generates output in its native format, which can be converted to common molecular formats:
# Convert to PDB trajectory
starling2pdb example_ensemble.starling
# Convert to XTC/PDB for molecular dynamics software
starling2xtc example_ensemble.starling
# Drop physically impossible reconstructed frames before writing
starling2xtc example_ensemble.starling --remove-errors
Both starling2pdb and starling2xtc accept --remove-errors, which
scans the reconstructed trajectory and discards any frames with physically
impossible inter-residue distances before the trajectory is written. See
Command Line Interface for details.
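The criterion that --remove-errors applies is not specified here. To make the idea of frame filtering concrete, the sketch below drops frames in which two sequence-adjacent beads sit further apart than a cutoff. The cutoff of 4.5 angstroms, the one-bead-per-residue representation, and both function names are assumptions chosen for illustration.

```python
import math


def has_impossible_distance(frame, max_neighbor_distance=4.5):
    """True if any pair of sequence-adjacent beads exceeds the cutoff (angstroms)."""
    return any(
        math.dist(a, b) > max_neighbor_distance
        for a, b in zip(frame, frame[1:])
    )


def remove_error_frames(frames, max_neighbor_distance=4.5):
    """Keep only frames that pass the neighbor-distance check."""
    return [
        f for f in frames
        if not has_impossible_distance(f, max_neighbor_distance)
    ]


# Toy trajectory: one plausible frame and one with a stretched bond
good = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
bad = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]

kept = remove_error_frames([good, bad])
print(f"kept {len(kept)} of 2 frames")
```

Filtering changes the number of frames in the written trajectory, so compare frame counts before and after if downstream analysis expects a fixed ensemble size.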
From Python:
# Save directly to PDB trajectory format
ensemble.save_trajectory("my_structures", pdb_trajectory=True)
# Save as PDB/XTC combination
ensemble.save_trajectory("my_structures")
Tips and Troubleshooting
Common Issues
- Memory errors: Reduce batch_size or conformations if you encounter CUDA out-of-memory errors.
- Long sequences: STARLING has a maximum sequence length; consider dividing long proteins into domains.
- Performance: Use GPU acceleration when available for significantly faster generation.
- Invalid amino acids: Only the 20 standard amino acids are supported; other characters will be rejected.
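Two of the issues above, invalid characters and overly long sequences, can be caught before any model is loaded. The sketch below is a hypothetical pre-flight check (`check_sequence` is not a STARLING function). The `max_length` argument is something you supply yourself; no specific STARLING limit is assumed.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequence(sequence, max_length=None):
    """Return a list of problems found in a sequence (empty list means OK)."""
    problems = []
    if not sequence:
        problems.append("sequence is empty")
    # Anything outside the 20 standard one-letter codes is reported,
    # including lowercase letters, whitespace, and gap characters.
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        problems.append(f"non-standard characters: {', '.join(invalid)}")
    if max_length is not None and len(sequence) > max_length:
        problems.append(f"length {len(sequence)} exceeds max_length={max_length}")
    return problems


for seq in ("ACDEFGHIKLMNPQRSTVWY", "ACDXB"):
    issues = check_sequence(seq)
    print(seq, "->", issues if issues else "OK")
```

Running a check like this over a whole FASTA file first gives one report of every problem sequence, rather than a failure partway through a long generation job.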