Ensemble Generation

STARLING provides powerful tools for generating conformational ensembles of intrinsically disordered proteins (IDPs). This guide covers everything from basic usage to advanced options.

Getting Started

When to Use STARLING

STARLING is designed for:

Generating structural ensembles of intrinsically disordered proteins (IDPs)
Predicting conformational properties of disordered regions
Exploring the conformational space of proteins with significant disorder

Basic Ensemble Generation

Command Line Interface

The simplest way to generate an ensemble is with the command-line interface:

starling MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK -c 100 --outname example_ensemble

Common Parameters:

-c: Number of conformations to generate (default: 200)
--outname: Base name for output files
--ionic_strength: Ionic strength in mM (default: 150)
--steps: Number of diffusion steps (default: 30)
--sampler: Sampling algorithm to use (default: "ddim")

For a complete list of options, run:

starling --help

Python API

You can also generate ensembles programmatically:

from starling import generate

# Basic usage with a single sequence
sequence = "MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK"
ensemble = generate(sequence, conformations=100)
ensemble.save("example_ensemble.starling")

# Process multiple sequences at once
sequences = [
    "GSGSGSGSGSGS",
    "ACDEFGHIKLMNPQRSTVWY"
]
ensembles = generate(sequences, conformations=50)

# Access individual ensembles from the returned dictionary
for name, ens in ensembles.items():
    print(f"Ensemble {name}: {len(ens)} conformations")
    ens.save(f"{name}_ensemble.starling")

Working with Multiple Input Formats

STARLING accepts various input formats:

# From a dictionary with custom names
sequence_dict = {
    "protein_A": "GSGSGSGSGSGS",
    "protein_B": "ACDEFGHIKLMNPQRSTVWY"
}
ensembles = generate(sequence_dict, conformations=50)

# From a FASTA file
ensembles = generate("path/to/sequences.fasta", conformations=50)

# From a TSV file (name, sequence format)
ensembles = generate("path/to/sequences.tsv", conformations=50)
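
To make the file-based inputs concrete, the sketch below parses FASTA text into the same name-to-sequence dictionary shape used above. Note that parse_fasta is a hypothetical helper written for illustration and is not part of STARLING; how STARLING reads FASTA files internally is not described in this guide.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary.

    Hypothetical helper for illustration only; not part of STARLING.
    """
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name
            tokens = line[1:].split()
            name = tokens[0] if tokens else f"seq_{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so concatenate them
            records[name] += line.upper()
    return records


fasta_text = """>protein_A example linker
GSGSGS
GSGSGS
>protein_B
ACDEFGHIKLMNPQRSTVWY
"""

sequence_dict = parse_fasta(fasta_text)
print(sequence_dict)
```

A dictionary built this way could then be passed to generate in place of the file path, which is convenient when sequences need filtering or renaming first.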

Environment Control

Ionic Strength Control

STARLING was trained on ensembles generated at three ionic strengths (20 mM, 150 mM, and 300 mM). Adjust the ionic strength to model different solution conditions:

Command Line Interface:

starling SEQUENCE -c 100 --ionic_strength 20 --outname low_ionic_strength_ensemble

Python API:

# Generate at physiological ionic strength (150 mM)
ensemble_150 = generate(sequence, conformations=100, ionic_strength=150)

# Generate at low ionic strength (20 mM)
ensemble_20 = generate(sequence, conformations=100, ionic_strength=20)

# Generate at high ionic strength (300 mM)
ensemble_300 = generate(sequence, conformations=100, ionic_strength=300)

# Calculate and compare the mean Rg at each ionic strength
for label, ens in [("20 mM", ensemble_20), ("150 mM", ensemble_150), ("300 mM", ensemble_300)]:
    rg = ens.radius_of_gyration(return_mean=True)
    print(f"Mean Rg at {label}: {rg:.2f} Å")
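
For readers unfamiliar with the quantity being compared, the radius of gyration is the root-mean-square distance of the residues from their centroid. The sketch below computes it for toy coordinates using only the standard library. It assumes one equal-mass bead per residue; how STARLING computes Rg internally is not stated in this guide, and the coordinates are made up for illustration.

```python
import math


def radius_of_gyration(coords):
    """Equal-mass radius of gyration for a list of (x, y, z) positions."""
    n = len(coords)
    # Centroid of all positions
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    # Mean squared distance of each position from the centroid
    mean_sq = sum(
        sum((c[i] - centroid[i]) ** 2 for i in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


# Toy conformations: four beads on a line with tight and wide spacing
compact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
extended = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (9.0, 0.0, 0.0)]

rg_compact = radius_of_gyration(compact)
rg_extended = radius_of_gyration(extended)
print(f"compact: {rg_compact:.2f}, extended: {rg_extended:.2f}")
```

Scaling every coordinate by a factor scales Rg by the same factor, which is why a more expanded ensemble reports a larger mean Rg.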

Controlling Ensemble Size

Larger ensembles give better-converged averages but take longer to generate and use more memory. Choose the ensemble size to match the analysis:

# Small ensemble for quick analysis
small_ensemble = generate(sequence, conformations=20)

# Medium ensemble for standard analysis
medium_ensemble = generate(sequence, conformations=100)

# Large ensemble for detailed statistical analysis
large_ensemble = generate(sequence, conformations=500)
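
One way to reason about ensemble size: for independent conformations, the uncertainty of an ensemble-averaged quantity shrinks roughly as 1/sqrt(N). The sketch below illustrates this with synthetic Rg values drawn from a Gaussian. The numbers are made up for illustration and are not STARLING output; standard_error is a helper defined here, not a STARLING function.

```python
import math
import random
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(N)."""
    return statistics.stdev(values) / math.sqrt(len(values))


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic per-conformation Rg values in Å (illustrative numbers only)
for n in (20, 100, 500):
    rg_values = [rng.gauss(25.0, 4.0) for _ in range(n)]
    print(
        f"N={n:>3}: mean={statistics.mean(rg_values):.2f} Å, "
        f"standard error={standard_error(rg_values):.2f} Å"
    )
```

Because of the square-root scaling, going from 20 to 500 conformations reduces the standard error by about a factor of five, not twenty-five.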

Performance Tuning

Batch and Device Strategies

Balance throughput and memory use by adjusting hardware-related options:

ensemble = generate(
    sequences,
    conformations=100,
    device="cuda:0",         # Pin generation to a specific accelerator
    batch_size=64,           # Increase to improve GPU utilisation
    num_cpus_mds=8,          # Allocate more CPUs for 3D reconstruction
    show_progress_bar=True,
    verbose=False,
)

Note that batch_size cannot exceed conformations, and larger values increase peak memory usage. For CPU-only runs, set device to "cpu" and reduce batch_size for more predictable performance.
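
The relationship between conformations and batch_size can be sketched as a simple split into per-batch counts. Here plan_batches is a hypothetical helper for illustration; STARLING's internal batching may differ in detail.

```python
def plan_batches(conformations, batch_size):
    """Split a conformation count into per-batch sizes (hypothetical helper)."""
    if conformations < 1 or batch_size < 1:
        raise ValueError("conformations and batch_size must be positive")
    if batch_size > conformations:
        raise ValueError("batch_size cannot exceed conformations")
    full, remainder = divmod(conformations, batch_size)
    # The final batch is smaller when the count does not divide evenly
    return [batch_size] * full + ([remainder] if remainder else [])


print(plan_batches(100, 64))  # [64, 36]
print(plan_batches(100, 25))  # [25, 25, 25, 25]
```

Peak memory is set by the largest batch, so halving batch_size roughly halves the per-batch memory footprint at the cost of more batches.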

Sampler Selection

STARLING supports multiple diffusion samplers so you can trade accuracy for latency:

# DDIM sampling – deterministic and faster
ddim_ensemble = generate(sequence, conformations=100, sampler="ddim", steps=20)

# Stochastic DDPM sampling – higher fidelity at the cost of runtime
ddpm_ensemble = generate(sequence, conformations=100, sampler="ddpm", steps=50)
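
To see why fewer steps means lower latency, it helps to know that DDIM-style samplers generally denoise along a subset of the timesteps used in training. The sketch below shows a generic evenly spaced schedule. The value of 1000 training timesteps and the ddim_timesteps helper are assumptions for illustration; the schedule STARLING actually uses is not described in this guide.

```python
def ddim_timesteps(train_timesteps, steps):
    """Evenly spaced, descending subset of training timesteps.

    Generic DDIM-style schedule for illustration; not STARLING's code.
    """
    if not 1 <= steps <= train_timesteps:
        raise ValueError("steps must be between 1 and train_timesteps")
    stride = train_timesteps // steps
    # Take every stride-th timestep, keep exactly `steps` of them,
    # then reverse so sampling runs from high noise to low noise
    return list(range(0, train_timesteps, stride))[:steps][::-1]


schedule = ddim_timesteps(train_timesteps=1000, steps=20)
print(len(schedule), schedule[:3], schedule[-3:])
```

Each entry in the schedule corresponds to one model evaluation, so runtime grows roughly linearly with steps.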

Model Compilation

For repeated predictions, compile the underlying PyTorch models once per process:

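import starling
from starling import generate

# Enable compilation once per process, before the first generate() call
starling.set_compilation_options(enabled=True, mode="reduce-overhead")
ensemble = generate(sequence, conformations=100)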

The first invocation warms up kernels; subsequent calls reuse compiled graphs and can reduce runtime by ~40% on supported GPUs. See Guided Sampling with Constraints for advanced compilation options.

Guided Sampling

Constraint-driven Sampling

STARLING can enforce experimental restraints during diffusion. Pass any constraint (or list of constraints) from starling.inference.constraints to the constraint argument:

from starling.inference.constraints import DistanceConstraint

constraint = DistanceConstraint(
    resid1=10,
    resid2=200,  # both residue indices must fall within the sequence (here, at least 200 residues)
    target=50.0,
    tolerance=2.0,
    force_constant=2.5,
)
ensemble = generate(sequence, conformations=200, constraint=constraint)

Combine multiple constraints or tune force_constant/guidance settings to steer sampling toward experimental observables. Visit Guided Sampling with Constraints for a catalogue of available restraints and tuning advice.
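
To build intuition for how target, tolerance, and force_constant interact, the sketch below implements a flat-bottomed harmonic penalty, a common form for distance restraints. Whether DistanceConstraint uses exactly this functional form is an assumption, and restraint_penalty is a hypothetical function written for illustration.

```python
def restraint_penalty(distance, target, tolerance, force_constant):
    """Flat-bottomed harmonic penalty: zero within target ± tolerance.

    Illustrative form only; not taken from STARLING's implementation.
    """
    # Amount by which the distance falls outside the tolerated window
    violation = max(0.0, abs(distance - target) - tolerance)
    return force_constant * violation ** 2


for d in (45.0, 49.0, 50.0, 52.0, 56.0):
    penalty = restraint_penalty(d, target=50.0, tolerance=2.0, force_constant=2.5)
    print(f"distance={d:.1f} penalty={penalty:.1f}")
```

Under this form, a larger force_constant steepens the penalty outside the window, while a larger tolerance widens the region that incurs no penalty at all.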

Saving and Loading Ensembles

Saving Ensembles

Save ensembles in STARLING format for later use:

# Save with default options
ensemble.save("my_ensemble")

# Save with compression for smaller file size
ensemble.save("my_ensemble_compressed", compress=True)

# Auto-save during generation
ensemble = generate(
    sequence,
    conformations=100,
    output_directory="results"
)

Loading Ensembles

Load previously generated ensembles:

from starling.structure.ensemble import load_ensemble

# Load an ensemble
ensemble = load_ensemble("my_ensemble.starling")

# Load without 3D structures for faster loading
ensemble = load_ensemble("my_ensemble.starling", ignore_structures=True)

print(f"Loaded ensemble with {len(ensemble)} conformations")

Output Files and Conversion

STARLING generates output in its native format, which can be converted to common molecular formats:

# Convert to PDB trajectory
starling2pdb example_ensemble.starling

# Convert to XTC/PDB for molecular dynamics software
starling2xtc example_ensemble.starling

# Drop physically impossible reconstructed frames before writing
starling2xtc example_ensemble.starling --remove-errors

Both starling2pdb and starling2xtc accept --remove-errors, which scans the reconstructed trajectory and discards any frames with physically impossible inter-residue distances before the trajectory is written. See Command Line Interface for details.

From Python:

# Save directly to PDB trajectory format
ensemble.save_trajectory("my_structures", pdb_trajectory=True)

# Save as PDB/XTC combination
ensemble.save_trajectory("my_structures")

Tips and Troubleshooting

Common Issues

Memory errors: Reduce batch_size or conformations if you encounter CUDA out-of-memory errors
Long sequences: STARLING has a maximum sequence length; consider splitting longer proteins into domains
Performance: Use GPU acceleration when available for significantly faster generation
Invalid amino acids: Only the 20 standard amino acids are supported; other characters are rejected
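
Checking sequences before generation can catch invalid characters early. The sketch below is one minimal way to do that. Note that find_invalid_residues is a hypothetical helper and not part of STARLING; it normalizes case for convenience, and whether STARLING itself accepts lowercase input is not stated in this guide.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence):
    """Return (position, character) pairs for non-standard characters.

    Positions are 1-based. Hypothetical helper; not part of STARLING.
    """
    return [
        (position, residue)
        for position, residue in enumerate(sequence.upper(), start=1)
        if residue not in STANDARD_AMINO_ACIDS
    ]


print(find_invalid_residues("ACDEFGHIKLMNPQRSTVWY"))
print(find_invalid_residues("GSXGSB"))
```

An empty list means the sequence contains only the 20 standard residues; otherwise each pair pinpoints a character to fix before calling generate.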