starling.structure.ensemble.Ensemble

class Ensemble[source]

Bases: object

Distance-map backed representation of conformational ensembles.

An Ensemble stores pairwise residue distance maps for one or more sampled conformations and exposes convenience methods to analyse, reweight, and serialize the data.

Parameters:
  • distance_maps (np.ndarray) – Array of shape (n_conformations, n_residues, n_residues) containing symmetrised distance maps.

  • sequence (str) – Amino-acid sequence corresponding to the ensemble.

  • ssprot_ensemble (soursop.ssprotein.SSProtein, optional) – Existing SOURSOP trajectory to attach (used when coordinates already exist on disk).

Variables:
  • sequence (str) – Underlying amino-acid sequence.

  • number_of_conformations (int) – Total number of conformations stored in the ensemble.

  • sequence_length (int) – Number of residues in the sequence.

Notes

Most heavy operations (trajectory reconstruction, radius calculations, BME reweighting) are computed lazily and cached for reuse. Use build_ensemble_trajectory() or save_trajectory() to materialise 3D coordinates on demand.

Methods

__init__

Initialize the ensemble with a list of distance maps and the sequence of the protein chain.

build_ensemble_trajectory

Function that explicitly reconstructs a 3D ensemble of conformations using the distance maps.

check_for_errors

Function which scans the ensemble and finds any frames which may be erroneous based on impossible intermolecular distances.

check_for_errors_trajectory

Scan the reconstructed 3D ensemble (the SSProtein trajectory) for frames with physically impossible inter-residue distances.

contact_map

Return the collection of contact maps for the ensemble.

distance_maps

Return the collection of distance maps for the ensemble.

end_to_end_distance

Compute the end-to-end distance of the protein chain for each conformation in the ensemble.

hydrodynamic_radius

Compute the hydrodynamic radius of each conformation using either the Kirkwood-Riseman (mode='kr') or Nygaard (mode='nygaard') equation (default) is Nygaard.

local_radius_of_gyration

Return the local radius of gyration of the protein chain based on a subregion of the chain.

radius_of_gyration

Compute the radius of gyration of the protein chain for each conformation in the ensemble.

reweight_bme

Perform Bayesian Maximum Entropy (BME) reweighting of the ensemble to better match experimental observables while minimizing bias.

rij

Compute the distance between residues i and j for each conformation in the ensemble.

save

Save the ensemble to a file in the STARLING format.

save_trajectory

Save the ensemble trajectory to a file.

view_theta_scan

Visualize the most recent theta scan diagnostics for this Ensemble.

Attributes

bme_result

Get the cached BME result.

has_bme_weights

Check if BME reweighting has been performed.

has_structures

Check if the ensemble has 3D structures.

theta_scan_result

Theta scan result from the most recent BME fit (if auto-theta was used).

trajectory

Return the ensemble trajectory.

__init__(distance_maps, sequence, ssprot_ensemble=None)[source]

Initialize the ensemble with a list of distance maps and the sequence of the protein chain.

Parameters:
  • distance_maps (np.ndarray) – 3D Numpy array of shape (n_conformations, n_residues, n_residues). Note this this expects symmetrized distance maps.

  • sequence (str) – Amino acid sequence of the protein chain.

  • ssprot_ensemble (soursop.ssprotein.SSProtein) – SOURSOP SSProtein object. If provided, the ensemble will be initialized using this.

check_for_errors(remove_errors=False, verbose=True, rebuild_trajectory=False)[source]

Function which scans the ensemble and finds any frames which may be erroneous based on impossible intermolecular distances.

Note if the ensemble has an SSProtein object associated with it and remove_errors is set to true, this will either rebuild the trajectory object from scratch (if rebuild_trajectory=True) or delete the SSProtein object (if rebuild_trajectory=False).

Parameters:
  • remove_errors (bool) – If set to True, the erroneous frames are removed from the ensemble.

  • verbose (bool) – If set to True, the function will print out the indices of the erroneous frames.

  • rebuild_trajectory (bool) – If True, and if remove_errors set to True, AND if the ensemble has trajectory (SSProtein) object associated with it, this will trigger the reconstruction of this trajectory object with the removed frames. If set to False, and if remove_errors set to True, AND if the ensemble has a trajectory (SSProtein) object associated with it, this will delete that object.

  • list – List of indices of the erroneous frames (note if they have been removed) these indices no longer make sense…

check_for_errors_trajectory(remove_errors=False, verbose=True)[source]

Scan the reconstructed 3D ensemble (the SSProtein trajectory) for frames with physically impossible inter-residue distances.

This mirrors check_for_errors(), but instead of inspecting the raw STARLING distance maps it inspects the per-frame CA-CA distance maps derived from the reconstructed 3D coordinates. It is therefore useful for catching reconstruction artefacts (e.g. a SMACOF embedding that introduces unphysical geometry even when the source distance map was well-behaved).

Detection uses the same physical bound as check_for_errors() (utilities.check_distance_map_for_error): a pair of residues separated by |i - j| positions can be at most |i - j| bond lengths apart.

Note that the trajectory frames correspond one-to-one (and in order) to the distance maps. When remove_errors is True, the flagged frames are therefore removed from both the trajectory and the distance maps so the two representations stay in sync, and any cached derived values (Rg, Rh) are invalidated.

Parameters:
  • remove_errors (bool) – If True, the erroneous frames are removed from both the trajectory and the distance maps. Default is False.

  • verbose (bool) – If True, print the number and indices of the erroneous frames (and a message when frames are removed). Default is True.

Returns:

List of indices of the erroneous frames (note that once they have been removed these indices no longer make sense).

Return type:

list

Raises:

RuntimeError – If no SSProtein trajectory is associated with this ensemble. Build one first with build_ensemble_trajectory(), or use check_for_errors() to check the distance maps directly.

rij(i, j, return_mean=False, use_bme_weights=False)[source]

Compute the distance between residues i and j for each conformation in the ensemble.

Parameters:
  • i (int) – Index of the first residue.

  • j (int) – Index of the second residue.

  • return_mean (bool) – If True, returns the mean distance. Default is False.

  • use_bme_weights (bool) – If True, use BME-optimized weights for computing mean. Only applicable if return_mean=True. Default is False.

Returns:

If return_mean=True, returns weighted mean distance. Otherwise, returns array of distances for each conformation.

Return type:

float or np.ndarray

end_to_end_distance(return_mean=False, use_bme_weights=False)[source]

Compute the end-to-end distance of the protein chain for each conformation in the ensemble.

Parameters:
  • return_mean (bool) – If True, returns the mean end-to-end distance of the ensemble.

  • use_bme_weights (bool) – If True, use BME-optimized weights for computing mean. Only applicable if return_mean=True. Default is False.

Returns:

If return_mean=True, returns weighted mean distance. Otherwise, returns array of end-to-end distances for each conformation.

Return type:

float or np.ndarray

distance_maps(return_mean=False, use_bme_weights=False)[source]

Return the collection of distance maps for the ensemble.

Parameters:
  • return_mean (bool) – If True, returns the mean distance map will be returned. Default is False.

  • use_bme_weights (bool) – If True, use BME-optimized weights for computing mean. Only applicable if return_mean=True. Default is False.

Returns:

If return_mean is set to True, returns the mean distance map, otherwise returns the list of distance maps. Each distance map is a 2D numpy array.

Return type:

np.array or list of np.array

contact_map(contact_thresh=11, return_mean=False, return_summed=False)[source]

Return the collection of contact maps for the ensemble. Contacts are defined when residues are within a certain distance. Note only one of return_mean and return_summed can be set to True. If both are set to False, returns the array of instantaneous of contact maps. Else returns the averaged or the sum,ed of the contact maps.

Parameters:
  • contact_thresh (float) – Distance threshold for defining contacts. Default is 11 Angstroms.

  • return_mean (bool) – If True, the average contact map will be returned, meaning each element is between 0 and 1. Default is False.

  • return_summed (bool) – If True, the summed contact map will be returned, meaning each element is an integer. Default is False.

Returns:

If return_mean is set to True, returns the mean distance map, otherwise returns the list of distance maps. Each distance map is a 2D numpy array.

Return type:

np.array or list of np.array

radius_of_gyration(return_mean=False, force_recompute=False, use_bme_weights=False)[source]

Compute the radius of gyration of the protein chain for each conformation in the ensemble.

Parameters:
  • return_mean (bool) – If True, returns the mean radius of gyration of the ensemble. Default is False.

  • force_recompute (bool) – If True, forces recomputation of the radius of gyration, otherwise uses the cached value if previously computed. Default is False.

  • use_bme_weights (bool) – If True, use BME-optimized weights for computing mean. Only applicable if return_mean=True. Default is False.

Returns:

Array of radii of gyration for each conformation in the ensemble. If return_mean is set to true returns the mean value as a float

Return type:

np.array or float

local_radius_of_gyration(start, end, return_mean=False, use_bme_weights=False)[source]

Return the local radius of gyration of the protein chain based on a subregion of the chain.

Parameters:
  • start (int) – The starting residue index.

  • end (int) – The ending residue index.

  • return_mean (bool) – If True, returns the mean radius of gyration of the ensemble. Default is False.

  • use_bme_weights (bool) – If True, use BME-optimized weights for computing mean. Only applicable if return_mean=True. Default is False.

Returns:

Array of radii of gyration for each conformation in the ensemble. If return_mean is set to true returns the mean value as a float

Return type:

np.array or float

hydrodynamic_radius(return_mean=False, force_recompute=False, mode='nygaard', alpha1=0.216, alpha2=4.06, alpha3=0.821)[source]

Compute the hydrodynamic radius of each conformation using either the Kirkwood-Riseman (mode=’kr’) or Nygaard (mode=’nygaard’) equation (default) is Nygaard.

The Kirkwood-Riseman [1] equation may be more accurate when computing the Rh for comparison with NMR-derived Rh values, as reported by Pesce et al. [2]. The Nygaard equation is a more general form of the Kirkwood-Riseman equation, and may offer better agreement with values obtained from dynamic light scattering (DLS) experiments [3].

For ‘kr’ (Kirkwood-Riseman mode), the alpha1/2/3 arguments are ignored, as these are only used with the Nygaard mode.

For ‘nygaard’ mode, the arguments (alpha1/2/3) are used, and should not be altered to recapitulate behaviour defined by Nygaard et al. Default values here are alpha1=0.216, alpha2=4.06 and alpha3=0.821.

NB: If an Rh value is computed and then re-requested with the same mode, the cached value is returned. This is to avoid recomputing the Rh value if it has already been computed. If you want to recompute the Rh value, set force_recompute=True. Also, if Rh with a different mode is requested, the cached value is recomputed automatically. This is to avoid a situation where you request Rh for one mode but actually get a (cached) value for another mode.

[1] Kirkwood & Riseman,(1948). The Intrinsic Viscosities and Diffusion Constants of Flexible Macromolecules in Solution. The Journal of Chemical Physics, 16(6), 565-573.

[2] Pesce et al. Assessment of models for calculating the hydrodynamic radius of intrinsically disordered proteins. Biophys. J. 122, 310-321 (2023).

[3] Nygaard et al. An Efficient Method for Estimating the Hydrodynamic Radius of Disordered Protein Conformations. Biophys J. 2017;113: 550-557.

Parameters:
  • return_mean (bool) – If True, returns the mean hydrodynamic radius of the ensemble. Default is False.

  • force_recompute (bool) – If True, forces recomputation of the hydrodynamic radius, otherwise uses the cached value if previously computed. Default is False.

  • mode (str) – The mode to use for computing the hydrodynamic radius. Options are ‘kr’ (Kirkwood-Riseman) or ‘nygaard’ (Nygaard). Default is ‘nygaard’.

  • alpha1 (float) – First parameter in equation (7) from Nygaard et al. Default = 0.216

  • alpha2 (float) – Second parameter in equation (7) from Nygaard et al. Default = 4.06

  • alpha3 (float) – Third parameter in equation (7) from Nygaard et al. Default = 0.821

Returns:

Array of hydrodynamic radii for each conformation in the ensemble. If return_mean is set to true returns the mean value as a float

Return type:

np.array or float

build_ensemble_trajectory(batch_size=100, num_cpus_mds=2, num_mds_init=4, device=None, force_recompute=False, progress_bar=True)[source]

Function that explicitly reconstructs a 3D ensemble of conformations using the distance maps. This happens automatically if the trajectory property is called, but this function allows for more control over the process. Specifically it allows you to specify the method used to generate the 3D structures, the number of CPUs to use, and the device to use for predictions. Note that if the 3D ensemble has already been reconstructed this function will NOT reconstructed the 3D ensemble unless force_recompute is set to True.

Parameters:
  • num_cpus_mds (int) – The number of CPUs to use for MDS. Default is 4 (set by configs.DEFAULT_CPU_COUNT_MDS)

  • num_mds_init (int) – Number of independent MDS jobs to execute. NB: if this is increased this in principle means there are more chances of finding a good solution, but there is a performance hit unless num_cpus_mds >= num_mds_init. Default is 4 (set by configs.DEFAULT_MDS_NUM_INIT).

  • device (str) – The device to use for predictions. Default is None. If None, the default device will be used. Default device is ‘gpu’. This is MPS for apple silicon and CUDA for all other devices. If MPS and CUDA are not available, automatically falls back to CPU.

  • force_recompute (bool) – If True, forces recomputation of the ensemble trajectory, otherwise uses the cached trajectory if previously computed. Default is False.

  • progress_bar (bool) – If True, displays a progress bar when generating the ensemble trajectory. Default is True.

Returns:

The ensemble trajectory as a SOURSOP Trajectory object. Note that this object

Return type:

soursop.sstrajectory.SSTrajectory

property has_structures

Check if the ensemble has 3D structures.

Returns:

True if the ensemble has 3D structures, False otherwise.

Return type:

bool

property trajectory

Return the ensemble trajectory.

Returns:

The ensemble trajectory as a SOURSOP Trajectory object.

Return type:

soursop.sstrajectory.SSTrajectory

save(filename_prefix, compress=False, reduce_precision=None, compression_algorithm='lzma', verbose=True)[source]

Save the ensemble to a file in the STARLING format. Note this will add the .starling extension to the filename if not provided and will strip of any existing extension passed.

Parameters:
  • filename_prefix (str) – The name of the file to save the ensemble to, excluding file extensions which are added automatically. Note that if you provide a file extension it will be stripped off.

  • compress (bool) – Whether to compress the file or not. Default is False.

  • reduce_precision (bool) – Whether to reduce the precision of the distance map to a single decimal point and cast to float16 if possible. Default is None, and then sets to False if compression is False, but True if compression is True. However it can be manually over-ridden.

  • compression_algorithm (str) – The compression algorithm to use. Options are ‘gzip’ and ‘lzma’. lzma gives better compression if reduce_precision is set to True, but actually ‘gzip’ is better if reduce_precision is False. ‘lzma’ is also slower than ‘gzip’. Default is ‘lzma’.

  • verbose (bool) – Flag to define how noisy we should be

save_trajectory(filename_prefix, pdb_trajectory=False)[source]

Save the ensemble trajectory to a file. This ONLY saves the 3D structural ensemble but does not save the STARLING-generated distance maps. We recommend using save() instead to save the full STARLING object.

Parameters:
  • filename (str) – The name of the file to save the trajectory to, excluding file extensions which are added automatically.

  • pdb_trajectory (bool) – If set to True, the output trajectory is ONLY saved as a PDB file. If set to false, it is saved as a single PDB structure for topology and then the actual trajectory as an XTC file.

reweight_bme(observables, calculated_values, theta=None, theta_range=(0.01, 10.0), theta_n_points=15, max_iterations=50000, optimizer='L-BFGS-B', initial_weights=None, force_recompute=False, verbose=True, show=False, save_theta_scan_plot=None, theta_method: str = 'perpendicular')[source]

Perform Bayesian Maximum Entropy (BME) reweighting of the ensemble to better match experimental observables while minimizing bias.

This method creates optimized weights for each frame that balance fitting experimental data with maintaining ensemble diversity. The result is cached and can be used with other ensemble methods by setting use_bme_weights=True.

Parameters:
  • observables (list[ExperimentalObservable]) – List of experimental observables to fit. Each observable should be an ExperimentalObservable object with value, uncertainty, and constraint type.

  • calculated_values (np.ndarray) – Array of calculated observable values for each frame. Shape: (n_frames, n_observables)

  • theta (float or None, optional) –

    Regularization parameter controlling the trade-off between fitting experimental data and maintaining ensemble diversity. Higher values prefer more diverse ensembles.

    • If None (default): Automatically determine optimal theta using L-curve analysis

    • If float: Use the specified theta value directly

    Default is None (automatic selection).

  • theta_range (tuple, optional) – Range for automatic theta scan: (min_theta, max_theta). Only used when theta=None. Default is (0.01, 10.0).

  • theta_n_points (int, optional) – Number of theta values to test during automatic scan. Only used when theta=None. Default is 30.

  • max_iterations (int, optional) – Maximum number of optimization iterations. Default is 50000.

  • optimizer (str, optional) – Optimization method to use. Default is ‘L-BFGS-B’.

  • initial_weights (np.ndarray, optional) – Initial weights for ensemble frames. If None, uniform weights are used.

  • force_recompute (bool, optional) – If True, forces recomputation even if BME has been previously performed. Default is False.

  • verbose (bool, optional) – If True, print optimization progress. Default is True.

  • save_theta_scan_plot (str, optional) – Path to save theta scan plot (only used when theta=None / auto mode). If None, plot is not saved. Default is None.

  • show (bool, optional) – If True and theta=None (auto mode), display the theta-scan plot. Default is False.

  • theta_method (str, optional) –

    Method to select optimal theta during the scan (only used when theta=None). Options:

    • ’perpendicular’ (default): knee by max perpendicular distance to chord

    • ’curvature’: knee by Menger curvature

Returns:

Object containing optimization results including weights, chi-squared, and convergence information. When theta=None, the result also includes theta_scan_result attribute with details of the automatic selection.

Return type:

BMEResult

Examples

>>> from starling.structure.bme import ExperimentalObservable
>>>
>>> # Example 1: Automatic theta selection (recommended)
>>> obs1 = ExperimentalObservable(25.0, 2.0, name="Rg")
>>> obs2 = ExperimentalObservable(70.0, 5.0, constraint="upper", name="ETE")
>>>
>>> rg_values = ensemble.radius_of_gyration()
>>> ete_values = ensemble.end_to_end_distance()
>>> calculated = np.column_stack([rg_values, ete_values])
>>>
>>> # Automatically finds and uses optimal theta
>>> result = ensemble.reweight_bme([obs1, obs2], calculated)
>>> print(f"Auto-selected theta: {result.metadata['theta_used']}")
>>>
>>> # Example 2: Manual theta specification
>>> result = ensemble.reweight_bme([obs1, obs2], calculated, theta=0.5)
>>>
>>> # Example 3: Custom theta scan range
>>> result = ensemble.reweight_bme(
...     [obs1, obs2], calculated,
...     theta=None,  # auto mode
...     theta_range=(0.1, 5.0),
...     theta_n_points=20,
...     save_theta_scan_plot="my_theta_scan.png"
... )
>>>
>>> # Now use BME weights in other calculations
>>> reweighted_rg = ensemble.radius_of_gyration(use_bme_weights=True)

Notes

Automatic Theta Selection (theta=None)

When theta=None, the method performs an L-curve analysis to find the optimal theta that balances fit quality (Chi squared) and ensemble diversity (phi). This uses Menger curvature to find the “knee” point of the L-curve.

The automatic selection: 1. Tests multiple theta values across the specified range 2. Computes Chi squared and phi for each theta 3. Finds the optimal balance using L-curve analysis 4. Uses the optimal theta for final reweighting 5. Stores scan results in result.metadata[‘theta_scan_result’]

Manual Theta Selection (theta=<value>)

When you specify theta explicitly, no scan is performed and the method uses your value directly. This is faster but requires you to choose theta appropriately for your system.

Which Should You Use?

  • Use theta=None (auto) for most applications, especially when: * You’re unsure what theta value to use * You want reproducible, principled parameter selection * You’re willing to wait ~30x longer (30 BME fits instead of 1)

  • Use theta=<value> (manual) when: * You’ve already determined an appropriate theta value * You need fast reweighting (e.g., in a loop) * You’re doing sensitivity analysis with specific theta values

view_theta_scan(save_path: str | None = None, show: bool = True, figsize: Tuple[int, int] = (18, 10))[source]

Visualize the most recent theta scan diagnostics for this Ensemble.

This wraps ThetaScanResult.plot() from the last BME fit that used automatic theta selection.

Parameters:
  • save_path (str, optional) – Path to save the figure. If None, the figure is not written to disk.

  • show (bool) – If True, display the figure. Default is True.

  • figsize (tuple) – Figure size (width, height). Default is (18, 10).

Returns:

The created figure.

Return type:

matplotlib.figure.Figure

Raises:

ValueError – If no theta scan has been performed yet.

property has_bme_weights

Check if BME reweighting has been performed.

Returns:

True if BME reweighting has been performed, False otherwise.

Return type:

bool

property bme_result

Get the cached BME result.

Returns:

The BME optimization result, or None if reweight_bme() has not been called.

Return type:

BMEResult or None

property theta_scan_result

Theta scan result from the most recent BME fit (if auto-theta was used).

Return type:

ThetaScanResult or None