Ensemble Generation
====================

STARLING generates conformational ensembles of intrinsically disordered proteins (IDPs). This guide covers basic command-line and Python usage, environment control, performance tuning, guided sampling, and saving and converting ensembles.

.. seealso::

     * :doc:`cli` for command-line generation and conversion helpers.
     * :doc:`constraints` to steer sampling with experimental restraints and
       enable Torch compilation.

Getting Started
----------------

When to Use STARLING
~~~~~~~~~~~~~~~~~~~~

STARLING is designed for:

* Generating structural ensembles of intrinsically disordered proteins (IDPs)
* Predicting conformational properties of disordered regions
* Exploring the conformational space of proteins with significant disorder

Basic Ensemble Generation
--------------------------

Command Line Interface
~~~~~~~~~~~~~~~~~~~~~~

The simplest way to generate an ensemble is using the command-line interface:

.. code-block:: bash

    starling MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK -c 100 --outname example_ensemble

Common Parameters:

* ``-c``: Number of conformations to generate (default: 200)
* ``--outname``: Base name for output files
* ``--ionic_strength``: Ionic strength in mM (default: 150)
* ``--steps``: Number of diffusion steps (default: 30)
* ``--sampler``: Sampling algorithm to use (default: "ddim")

For a complete list of options, run:

.. code-block:: bash

    starling --help

Python API
~~~~~~~~~~

You can also generate ensembles programmatically:

.. code-block:: python

    from starling import generate
    
    # Basic usage with a single sequence
    sequence = "MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK"
    ensemble = generate(sequence, conformations=100)
    ensemble.save("example_ensemble.starling")
    
    # Process multiple sequences at once
    sequences = [
        "GSGSGSGSGSGS",
        "ACDEFGHIKLMNPQRSTVWY"
    ]
    ensembles = generate(sequences, conformations=50)
    
    # Access individual ensembles from the returned dictionary
    for name, ens in ensembles.items():
        print(f"Ensemble {name}: {len(ens)} conformations")
        ens.save(f"{name}_ensemble.starling")

Working with Multiple Input Formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

STARLING accepts various input formats:

.. code-block:: python

    # From a dictionary with custom names
    sequence_dict = {
        "protein_A": "GSGSGSGSGSGS",
        "protein_B": "ACDEFGHIKLMNPQRSTVWY"
    }
    ensembles = generate(sequence_dict, conformations=50)
    
    # From a FASTA file
    ensembles = generate("path/to/sequences.fasta", conformations=50)
    
    # From a TSV file (name, sequence format)
    ensembles = generate("path/to/sequences.tsv", conformations=50)
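
If the sequences already live in Python, a FASTA input can be written with a few lines of standard-library code. The sketch below is illustrative only: ``write_fasta`` is a hypothetical helper, not part of STARLING, and it assumes the plain FASTA layout of a ``>name`` header line followed by the sequence.

.. code-block:: python

    from pathlib import Path


    def write_fasta(sequences, path):
        """Write a {name: sequence} mapping to a FASTA file and return its path."""
        lines = []
        for name, seq in sequences.items():
            lines.append(f">{name}")  # header line
            lines.append(seq)         # sequence on a single line
        out = Path(path)
        out.write_text("\n".join(lines) + "\n")
        return out


    sequence_dict = {
        "protein_A": "GSGSGSGSGSGS",
        "protein_B": "ACDEFGHIKLMNPQRSTVWY",
    }
    fasta_path = write_fasta(sequence_dict, "sequences.fasta")

    # The file can then be passed to generate() as in the example above:
    # ensembles = generate(str(fasta_path), conformations=50)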

Environment Control
--------------------

Ionic Strength Control
~~~~~~~~~~~~~~~~~~~~~~

STARLING was trained on ensembles generated at three ionic strengths (20 mM, 150 mM, and 300 mM).
You can adjust the ionic strength to model different environments:

Command Line Interface:

.. code-block:: bash

    starling SEQUENCE -c 100 --ionic_strength 20 --outname low_ionic_strength_ensemble

Python API:

.. code-block:: python

    # Generate at physiological ionic strength (150 mM)
    ensemble_150 = generate(sequence, conformations=100, ionic_strength=150)

    # Generate at low ionic strength (20 mM)
    ensemble_20 = generate(sequence, conformations=100, ionic_strength=20)

    # Generate at high ionic strength (300 mM)
    ensemble_300 = generate(sequence, conformations=100, ionic_strength=300)

    # Calculate and compare the mean Rg at each ionic strength
    for ionic_strength, ens in [(20, ensemble_20), (150, ensemble_150), (300, ensemble_300)]:
        mean_rg = ens.radius_of_gyration(return_mean=True)
        print(f"Mean Rg at {ionic_strength} mM: {mean_rg:.2f} Å")

Controlling Ensemble Size
~~~~~~~~~~~~~~~~~~~~~~~~~~

Larger ensembles give better-converged statistics but take longer to generate:

.. code-block:: python

    # Small ensemble for quick analysis
    small_ensemble = generate(sequence, conformations=20)
    
    # Medium ensemble for standard analysis
    medium_ensemble = generate(sequence, conformations=100)
    
    # Large ensemble for detailed statistical analysis
    large_ensemble = generate(sequence, conformations=500)
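
As a rough guide to choosing a size: if the conformations are treated as independent samples, the standard error of an ensemble-averaged property falls as one over the square root of the number of conformations. The sketch below only illustrates that scaling. ``standard_error`` is a hypothetical helper, and the 5.0 Å spread is an arbitrary illustrative value, not a STARLING result.

.. code-block:: python

    import math


    def standard_error(std_dev, n_conformations):
        """Standard error of a mean over n independent samples."""
        if n_conformations < 1:
            raise ValueError("n_conformations must be at least 1")
        return std_dev / math.sqrt(n_conformations)


    # Illustrative spread in Rg across conformations (assumed value)
    spread = 5.0
    for n in (20, 100, 500):
        print(f"{n:>4} conformations: SEM = {standard_error(spread, n):.2f} Å")

Under this assumption, going from 100 to 500 conformations shrinks the error bar by a factor of about 2.2, so very large ensembles give diminishing returns.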

Performance Tuning
------------------

Batch and Device Strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Balance throughput and memory use by adjusting hardware-related options:

.. code-block:: python

    ensemble = generate(
        sequences,
        conformations=100,
        device="cuda:0",         # Pin generation to a specific accelerator
        batch_size=64,           # Increase to improve GPU utilisation
        num_cpus_mds=8,          # Allocate more CPUs for 3D reconstruction
        show_progress_bar=True,
        verbose=False,
    )

``batch_size`` cannot exceed ``conformations``, and larger values increase peak
memory usage. For CPU-only runs, set ``device`` to ``"cpu"`` and use a smaller
``batch_size`` for predictable performance.
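
When the ensemble size varies between runs, the batch size can be derived from it instead of being hard-coded. This is a minimal sketch: ``clamp_batch_size`` is a hypothetical helper, not a STARLING function, and it simply enforces the rule that the batch size must not exceed the number of conformations.

.. code-block:: python

    def clamp_batch_size(conformations, requested_batch_size):
        """Return a batch size of at least 1 that never exceeds `conformations`."""
        if conformations < 1:
            raise ValueError("conformations must be at least 1")
        return max(1, min(requested_batch_size, conformations))


    batch_size = clamp_batch_size(conformations=20, requested_batch_size=64)
    print(batch_size)

    # The clamped value can then be passed on, for example:
    # ensemble = generate(sequence, conformations=20, batch_size=batch_size)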

Sampler Selection
~~~~~~~~~~~~~~~~~

STARLING supports multiple diffusion samplers so you can trade accuracy for
latency:

.. code-block:: python

    # DDIM sampling: deterministic and faster
    ddim_ensemble = generate(sequence, conformations=100, sampler="ddim", steps=20)

    # DDPM sampling: stochastic, higher fidelity at the cost of runtime
    ddpm_ensemble = generate(sequence, conformations=100, sampler="ddpm", steps=50)

Model Compilation
~~~~~~~~~~~~~~~~~

For repeated predictions, compile the underlying PyTorch models once per
process:

.. code-block:: python

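    import starling
    from starling import generate

    # Enable compilation once per process, before the first generate() call
    starling.set_compilation_options(enabled=True, mode="reduce-overhead")
    ensemble = generate(sequence, conformations=100)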

The first invocation warms up kernels; subsequent calls reuse compiled graphs
and can reduce runtime by ~40% on supported GPUs. See :doc:`constraints`
for advanced compilation options.
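
To see the warm-up effect on your own hardware, time several consecutive calls and compare the first with the rest. The sketch below uses only the standard library. ``time_calls`` is a hypothetical helper, and the workload shown is a stand-in so the example runs anywhere; swap in a callable that wraps ``generate`` to measure STARLING itself.

.. code-block:: python

    import time


    def time_calls(fn, repeats=3):
        """Call fn() `repeats` times and return each wall-clock duration in seconds."""
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            durations.append(time.perf_counter() - start)
        return durations


    # Stand-in workload; replace with, for example:
    # lambda: generate(sequence, conformations=100)
    durations = time_calls(lambda: sum(range(100_000)), repeats=3)
    print(f"first call: {durations[0]:.4f} s")
    print(f"later calls: {[round(d, 4) for d in durations[1:]]}")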

Guided Sampling
---------------

Constraint-driven Sampling
~~~~~~~~~~~~~~~~~~~~~~~~~~

STARLING can enforce experimental restraints during diffusion. Pass any
constraint (or list of constraints) from
:mod:`starling.inference.constraints` to the ``constraint`` argument:

.. code-block:: python

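    from starling import generate
    from starling.inference.constraints import DistanceConstraint

    # Both residue indices must fall within the sequence being sampled
    # (the example sequence used in this guide has 85 residues)
    constraint = DistanceConstraint(
        resid1=10,
        resid2=80,
        target=50.0,
        tolerance=2.0,
        force_constant=2.5,
    )
    ensemble = generate(sequence, conformations=200, constraint=constraint)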

Combine multiple constraints or tune ``force_constant``/``guidance`` settings to
steer sampling toward experimental observables. Visit :doc:`constraints`
for a catalogue of available restraints and tuning advice.

Saving and Loading Ensembles
------------------------------

Saving Ensembles
~~~~~~~~~~~~~~~~~

Save ensembles in STARLING format for later use:

.. code-block:: python

    # Save with default options
    ensemble.save("my_ensemble")
    
    # Save with compression for smaller file size
    ensemble.save("my_ensemble_compressed", compress=True)
    
    # Auto-save during generation
    ensemble = generate(
        sequence, 
        conformations=100, 
        output_directory="results"
    )

Loading Ensembles
~~~~~~~~~~~~~~~~~

Load previously generated ensembles:

.. code-block:: python

    from starling.structure.ensemble import load_ensemble
    
    # Load an ensemble
    ensemble = load_ensemble("my_ensemble.starling")
    
    # Load without 3D structures for faster loading
    ensemble = load_ensemble("my_ensemble.starling", ignore_structures=True)
    
    print(f"Loaded ensemble with {len(ensemble)} conformations")

Output Files and Conversion
---------------------------

STARLING generates output in its native format, which can be converted to common molecular formats:

.. code-block:: bash

    # Convert to PDB trajectory
    starling2pdb example_ensemble.starling

    # Convert to XTC/PDB for molecular dynamics software
    starling2xtc example_ensemble.starling

    # Drop physically impossible reconstructed frames before writing
    starling2xtc example_ensemble.starling --remove-errors

Both ``starling2pdb`` and ``starling2xtc`` accept ``--remove-errors``, which
scans the reconstructed trajectory and discards any frames with physically
impossible inter-residue distances before the trajectory is written. See
:doc:`cli` for details.

From Python:

.. code-block:: python

    # Save directly to PDB trajectory format
    ensemble.save_trajectory("my_structures", pdb_trajectory=True)
    
    # Save as PDB/XTC combination
    ensemble.save_trajectory("my_structures")

Tips and Troubleshooting
-------------------------

Common Issues
~~~~~~~~~~~~~~

* **Memory errors**: Reduce ``batch_size`` or ``conformations`` if you encounter CUDA out-of-memory errors
* **Long sequences**: STARLING has a maximum sequence length; consider dividing long proteins into domains
* **Performance**: Use GPU acceleration when available for significantly faster generation
* **Invalid amino acids**: Only the 20 standard amino acids are supported; other characters are rejected
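
To catch invalid characters before starting a long run, sequences can be checked up front. This is a minimal sketch: ``find_invalid_residues`` is a hypothetical helper, not part of STARLING, and it assumes upper-case one-letter codes for the 20 standard amino acids.

.. code-block:: python

    STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


    def find_invalid_residues(sequence):
        """Return (position, character) pairs for non-standard characters.

        Positions are 0-based indices into the sequence string.
        """
        return [
            (i, ch)
            for i, ch in enumerate(sequence)
            if ch not in STANDARD_AMINO_ACIDS
        ]


    print(find_invalid_residues("ACDEFGHIKLMNPQRSTVWY"))  # []
    print(find_invalid_residues("GSGXGSB"))               # [(3, 'X'), (6, 'B')]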