starling.structure.ensemble.Ensemble
- class Ensemble[source]
Bases:
objectDistance-map backed representation of conformational ensembles.
An
Ensemblestores pairwise residue distance maps for one or more sampled conformations and exposes convenience methods to analyse, reweight, and serialize the data.- Parameters:
distance_maps (np.ndarray) – Array of shape
(n_conformations, n_residues, n_residues)containing symmetrised distance maps.sequence (str) – Amino-acid sequence corresponding to the ensemble.
ssprot_ensemble (soursop.ssprotein.SSProtein, optional) – Existing SOURSOP trajectory to attach (used when coordinates already exist on disk).
- Variables:
Notes
Most heavy operations (trajectory reconstruction, radius calculations, BME reweighting) are computed lazily and cached for reuse. Use
build_ensemble_trajectory()orsave_trajectory()to materialise 3D coordinates on demand.Methods
Initialize the ensemble with a list of distance maps and the sequence of the protein chain.
Function that explicitly reconstructs a 3D ensemble of conformations using the distance maps.
Function which scans the ensemble and finds any frames which may be erroneous based on impossible intermolecular distances.
Scan the reconstructed 3D ensemble (the SSProtein trajectory) for frames with physically impossible inter-residue distances.
Return the collection of contact maps for the ensemble.
Return the collection of distance maps for the ensemble.
Compute the end-to-end distance of the protein chain for each conformation in the ensemble.
Compute the hydrodynamic radius of each conformation using either the Kirkwood-Riseman (mode='kr') or Nygaard (mode='nygaard') equation (default) is Nygaard.
Return the local radius of gyration of the protein chain based on a subregion of the chain.
Compute the radius of gyration of the protein chain for each conformation in the ensemble.
Perform Bayesian Maximum Entropy (BME) reweighting of the ensemble to better match experimental observables while minimizing bias.
Compute the distance between residues i and j for each conformation in the ensemble.
Save the ensemble to a file in the STARLING format.
Save the ensemble trajectory to a file.
Visualize the most recent theta scan diagnostics for this Ensemble.
Attributes
Get the cached BME result.
Check if BME reweighting has been performed.
Check if the ensemble has 3D structures.
Theta scan result from the most recent BME fit (if auto-theta was used).
Return the ensemble trajectory.
- __init__(distance_maps, sequence, ssprot_ensemble=None)[source]
Initialize the ensemble with a list of distance maps and the sequence of the protein chain.
- Parameters:
distance_maps (np.ndarray) – 3D Numpy array of shape (n_conformations, n_residues, n_residues). Note this this expects symmetrized distance maps.
sequence (str) – Amino acid sequence of the protein chain.
ssprot_ensemble (soursop.ssprotein.SSProtein) – SOURSOP SSProtein object. If provided, the ensemble will be initialized using this.
- check_for_errors(remove_errors=False, verbose=True, rebuild_trajectory=False)[source]
Function which scans the ensemble and finds any frames which may be erroneous based on impossible intermolecular distances.
Note if the ensemble has an SSProtein object associated with it and remove_errors is set to true, this will either rebuild the trajectory object from scratch (if rebuild_trajectory=True) or delete the SSProtein object (if rebuild_trajectory=False).
- Parameters:
remove_errors (bool) – If set to True, the erroneous frames are removed from the ensemble.
verbose (bool) – If set to True, the function will print out the indices of the erroneous frames.
rebuild_trajectory (bool) – If True, and if remove_errors set to True, AND if the ensemble has trajectory (SSProtein) object associated with it, this will trigger the reconstruction of this trajectory object with the removed frames. If set to False, and if remove_errors set to True, AND if the ensemble has a trajectory (SSProtein) object associated with it, this will delete that object.
list – List of indices of the erroneous frames (note if they have been removed) these indices no longer make sense…
- check_for_errors_trajectory(remove_errors=False, verbose=True)[source]
Scan the reconstructed 3D ensemble (the SSProtein trajectory) for frames with physically impossible inter-residue distances.
This mirrors
check_for_errors(), but instead of inspecting the raw STARLING distance maps it inspects the per-frame CA-CA distance maps derived from the reconstructed 3D coordinates. It is therefore useful for catching reconstruction artefacts (e.g. a SMACOF embedding that introduces unphysical geometry even when the source distance map was well-behaved).Detection uses the same physical bound as
check_for_errors()(utilities.check_distance_map_for_error): a pair of residues separated by|i - j|positions can be at most|i - j|bond lengths apart.Note that the trajectory frames correspond one-to-one (and in order) to the distance maps. When
remove_errorsis True, the flagged frames are therefore removed from both the trajectory and the distance maps so the two representations stay in sync, and any cached derived values (Rg, Rh) are invalidated.- Parameters:
- Returns:
List of indices of the erroneous frames (note that once they have been removed these indices no longer make sense).
- Return type:
- Raises:
RuntimeError – If no SSProtein trajectory is associated with this ensemble. Build one first with
build_ensemble_trajectory(), or usecheck_for_errors()to check the distance maps directly.
- rij(i, j, return_mean=False, use_bme_weights=False)[source]
Compute the distance between residues i and j for each conformation in the ensemble.
- Parameters:
- Returns:
If return_mean=True, returns weighted mean distance. Otherwise, returns array of distances for each conformation.
- Return type:
float or np.ndarray
- end_to_end_distance(return_mean=False, use_bme_weights=False)[source]
Compute the end-to-end distance of the protein chain for each conformation in the ensemble.
- Parameters:
- Returns:
If return_mean=True, returns weighted mean distance. Otherwise, returns array of end-to-end distances for each conformation.
- Return type:
float or np.ndarray
- distance_maps(return_mean=False, use_bme_weights=False)[source]
Return the collection of distance maps for the ensemble.
- Parameters:
- Returns:
If return_mean is set to True, returns the mean distance map, otherwise returns the list of distance maps. Each distance map is a 2D numpy array.
- Return type:
np.array or list of np.array
- contact_map(contact_thresh=11, return_mean=False, return_summed=False)[source]
Return the collection of contact maps for the ensemble. Contacts are defined when residues are within a certain distance. Note only one of return_mean and return_summed can be set to True. If both are set to False, returns the array of instantaneous of contact maps. Else returns the averaged or the sum,ed of the contact maps.
- Parameters:
contact_thresh (float) – Distance threshold for defining contacts. Default is 11 Angstroms.
return_mean (bool) – If True, the average contact map will be returned, meaning each element is between 0 and 1. Default is False.
return_summed (bool) – If True, the summed contact map will be returned, meaning each element is an integer. Default is False.
- Returns:
If return_mean is set to True, returns the mean distance map, otherwise returns the list of distance maps. Each distance map is a 2D numpy array.
- Return type:
np.array or list of np.array
- radius_of_gyration(return_mean=False, force_recompute=False, use_bme_weights=False)[source]
Compute the radius of gyration of the protein chain for each conformation in the ensemble.
- Parameters:
return_mean (bool) – If True, returns the mean radius of gyration of the ensemble. Default is False.
force_recompute (bool) – If True, forces recomputation of the radius of gyration, otherwise uses the cached value if previously computed. Default is False.
use_bme_weights (bool) – If True, use BME-optimized weights for computing mean. Only applicable if return_mean=True. Default is False.
- Returns:
Array of radii of gyration for each conformation in the ensemble. If return_mean is set to true returns the mean value as a float
- Return type:
np.array or float
- local_radius_of_gyration(start, end, return_mean=False, use_bme_weights=False)[source]
Return the local radius of gyration of the protein chain based on a subregion of the chain.
- Parameters:
start (int) – The starting residue index.
end (int) – The ending residue index.
return_mean (bool) – If True, returns the mean radius of gyration of the ensemble. Default is False.
use_bme_weights (bool) – If True, use BME-optimized weights for computing mean. Only applicable if return_mean=True. Default is False.
- Returns:
Array of radii of gyration for each conformation in the ensemble. If return_mean is set to true returns the mean value as a float
- Return type:
np.array or float
- hydrodynamic_radius(return_mean=False, force_recompute=False, mode='nygaard', alpha1=0.216, alpha2=4.06, alpha3=0.821)[source]
Compute the hydrodynamic radius of each conformation using either the Kirkwood-Riseman (mode=’kr’) or Nygaard (mode=’nygaard’) equation (default) is Nygaard.
The Kirkwood-Riseman [1] equation may be more accurate when computing the Rh for comparison with NMR-derived Rh values, as reported by Pesce et al. [2]. The Nygaard equation is a more general form of the Kirkwood-Riseman equation, and may offer better agreement with values obtained from dynamic light scattering (DLS) experiments [3].
For ‘kr’ (Kirkwood-Riseman mode), the alpha1/2/3 arguments are ignored, as these are only used with the Nygaard mode.
For ‘nygaard’ mode, the arguments (alpha1/2/3) are used, and should not be altered to recapitulate behaviour defined by Nygaard et al. Default values here are alpha1=0.216, alpha2=4.06 and alpha3=0.821.
NB: If an Rh value is computed and then re-requested with the same mode, the cached value is returned. This is to avoid recomputing the Rh value if it has already been computed. If you want to recompute the Rh value, set force_recompute=True. Also, if Rh with a different mode is requested, the cached value is recomputed automatically. This is to avoid a situation where you request Rh for one mode but actually get a (cached) value for another mode.
[1] Kirkwood & Riseman,(1948). The Intrinsic Viscosities and Diffusion Constants of Flexible Macromolecules in Solution. The Journal of Chemical Physics, 16(6), 565-573.
[2] Pesce et al. Assessment of models for calculating the hydrodynamic radius of intrinsically disordered proteins. Biophys. J. 122, 310-321 (2023).
[3] Nygaard et al. An Efficient Method for Estimating the Hydrodynamic Radius of Disordered Protein Conformations. Biophys J. 2017;113: 550-557.
- Parameters:
return_mean (bool) – If True, returns the mean hydrodynamic radius of the ensemble. Default is False.
force_recompute (bool) – If True, forces recomputation of the hydrodynamic radius, otherwise uses the cached value if previously computed. Default is False.
mode (str) – The mode to use for computing the hydrodynamic radius. Options are ‘kr’ (Kirkwood-Riseman) or ‘nygaard’ (Nygaard). Default is ‘nygaard’.
alpha1 (float) – First parameter in equation (7) from Nygaard et al. Default = 0.216
alpha2 (float) – Second parameter in equation (7) from Nygaard et al. Default = 4.06
alpha3 (float) – Third parameter in equation (7) from Nygaard et al. Default = 0.821
- Returns:
Array of hydrodynamic radii for each conformation in the ensemble. If return_mean is set to true returns the mean value as a float
- Return type:
np.array or float
- build_ensemble_trajectory(batch_size=100, num_cpus_mds=2, num_mds_init=4, device=None, force_recompute=False, progress_bar=True)[source]
Function that explicitly reconstructs a 3D ensemble of conformations using the distance maps. This happens automatically if the trajectory property is called, but this function allows for more control over the process. Specifically it allows you to specify the method used to generate the 3D structures, the number of CPUs to use, and the device to use for predictions. Note that if the 3D ensemble has already been reconstructed this function will NOT reconstructed the 3D ensemble unless force_recompute is set to True.
- Parameters:
num_cpus_mds (int) – The number of CPUs to use for MDS. Default is 4 (set by configs.DEFAULT_CPU_COUNT_MDS)
num_mds_init (int) – Number of independent MDS jobs to execute. NB: if this is increased this in principle means there are more chances of finding a good solution, but there is a performance hit unless num_cpus_mds >= num_mds_init. Default is 4 (set by configs.DEFAULT_MDS_NUM_INIT).
device (str) – The device to use for predictions. Default is None. If None, the default device will be used. Default device is ‘gpu’. This is MPS for apple silicon and CUDA for all other devices. If MPS and CUDA are not available, automatically falls back to CPU.
force_recompute (bool) – If True, forces recomputation of the ensemble trajectory, otherwise uses the cached trajectory if previously computed. Default is False.
progress_bar (bool) – If True, displays a progress bar when generating the ensemble trajectory. Default is True.
- Returns:
The ensemble trajectory as a SOURSOP Trajectory object. Note that this object
- Return type:
soursop.sstrajectory.SSTrajectory
- property has_structures
Check if the ensemble has 3D structures.
- Returns:
True if the ensemble has 3D structures, False otherwise.
- Return type:
- property trajectory
Return the ensemble trajectory.
- Returns:
The ensemble trajectory as a SOURSOP Trajectory object.
- Return type:
soursop.sstrajectory.SSTrajectory
- save(filename_prefix, compress=False, reduce_precision=None, compression_algorithm='lzma', verbose=True)[source]
Save the ensemble to a file in the STARLING format. Note this will add the .starling extension to the filename if not provided and will strip of any existing extension passed.
- Parameters:
filename_prefix (str) – The name of the file to save the ensemble to, excluding file extensions which are added automatically. Note that if you provide a file extension it will be stripped off.
compress (bool) – Whether to compress the file or not. Default is False.
reduce_precision (bool) – Whether to reduce the precision of the distance map to a single decimal point and cast to float16 if possible. Default is None, and then sets to False if compression is False, but True if compression is True. However it can be manually over-ridden.
compression_algorithm (str) – The compression algorithm to use. Options are ‘gzip’ and ‘lzma’. lzma gives better compression if reduce_precision is set to True, but actually ‘gzip’ is better if reduce_precision is False. ‘lzma’ is also slower than ‘gzip’. Default is ‘lzma’.
verbose (bool) – Flag to define how noisy we should be
- save_trajectory(filename_prefix, pdb_trajectory=False)[source]
Save the ensemble trajectory to a file. This ONLY saves the 3D structural ensemble but does not save the STARLING-generated distance maps. We recommend using save() instead to save the full STARLING object.
- Parameters:
filename (str) – The name of the file to save the trajectory to, excluding file extensions which are added automatically.
pdb_trajectory (bool) – If set to True, the output trajectory is ONLY saved as a PDB file. If set to false, it is saved as a single PDB structure for topology and then the actual trajectory as an XTC file.
- reweight_bme(observables, calculated_values, theta=None, theta_range=(0.01, 10.0), theta_n_points=15, max_iterations=50000, optimizer='L-BFGS-B', initial_weights=None, force_recompute=False, verbose=True, show=False, save_theta_scan_plot=None, theta_method: str = 'perpendicular')[source]
Perform Bayesian Maximum Entropy (BME) reweighting of the ensemble to better match experimental observables while minimizing bias.
This method creates optimized weights for each frame that balance fitting experimental data with maintaining ensemble diversity. The result is cached and can be used with other ensemble methods by setting use_bme_weights=True.
- Parameters:
observables (list[ExperimentalObservable]) – List of experimental observables to fit. Each observable should be an ExperimentalObservable object with value, uncertainty, and constraint type.
calculated_values (np.ndarray) – Array of calculated observable values for each frame. Shape: (n_frames, n_observables)
theta (float or None, optional) –
Regularization parameter controlling the trade-off between fitting experimental data and maintaining ensemble diversity. Higher values prefer more diverse ensembles.
If None (default): Automatically determine optimal theta using L-curve analysis
If float: Use the specified theta value directly
Default is None (automatic selection).
theta_range (tuple, optional) – Range for automatic theta scan: (min_theta, max_theta). Only used when theta=None. Default is (0.01, 10.0).
theta_n_points (int, optional) – Number of theta values to test during automatic scan. Only used when theta=None. Default is 30.
max_iterations (int, optional) – Maximum number of optimization iterations. Default is 50000.
optimizer (str, optional) – Optimization method to use. Default is ‘L-BFGS-B’.
initial_weights (np.ndarray, optional) – Initial weights for ensemble frames. If None, uniform weights are used.
force_recompute (bool, optional) – If True, forces recomputation even if BME has been previously performed. Default is False.
verbose (bool, optional) – If True, print optimization progress. Default is True.
save_theta_scan_plot (str, optional) – Path to save theta scan plot (only used when theta=None / auto mode). If None, plot is not saved. Default is None.
show (bool, optional) – If True and theta=None (auto mode), display the theta-scan plot. Default is False.
theta_method (str, optional) –
Method to select optimal theta during the scan (only used when theta=None). Options:
’perpendicular’ (default): knee by max perpendicular distance to chord
’curvature’: knee by Menger curvature
- Returns:
Object containing optimization results including weights, chi-squared, and convergence information. When theta=None, the result also includes theta_scan_result attribute with details of the automatic selection.
- Return type:
Examples
>>> from starling.structure.bme import ExperimentalObservable >>> >>> # Example 1: Automatic theta selection (recommended) >>> obs1 = ExperimentalObservable(25.0, 2.0, name="Rg") >>> obs2 = ExperimentalObservable(70.0, 5.0, constraint="upper", name="ETE") >>> >>> rg_values = ensemble.radius_of_gyration() >>> ete_values = ensemble.end_to_end_distance() >>> calculated = np.column_stack([rg_values, ete_values]) >>> >>> # Automatically finds and uses optimal theta >>> result = ensemble.reweight_bme([obs1, obs2], calculated) >>> print(f"Auto-selected theta: {result.metadata['theta_used']}") >>> >>> # Example 2: Manual theta specification >>> result = ensemble.reweight_bme([obs1, obs2], calculated, theta=0.5) >>> >>> # Example 3: Custom theta scan range >>> result = ensemble.reweight_bme( ... [obs1, obs2], calculated, ... theta=None, # auto mode ... theta_range=(0.1, 5.0), ... theta_n_points=20, ... save_theta_scan_plot="my_theta_scan.png" ... ) >>> >>> # Now use BME weights in other calculations >>> reweighted_rg = ensemble.radius_of_gyration(use_bme_weights=True)
Notes
Automatic Theta Selection (theta=None)
When theta=None, the method performs an L-curve analysis to find the optimal theta that balances fit quality (Chi squared) and ensemble diversity (phi). This uses Menger curvature to find the “knee” point of the L-curve.
The automatic selection: 1. Tests multiple theta values across the specified range 2. Computes Chi squared and phi for each theta 3. Finds the optimal balance using L-curve analysis 4. Uses the optimal theta for final reweighting 5. Stores scan results in result.metadata[‘theta_scan_result’]
Manual Theta Selection (theta=<value>)
When you specify theta explicitly, no scan is performed and the method uses your value directly. This is faster but requires you to choose theta appropriately for your system.
Which Should You Use?
Use theta=None (auto) for most applications, especially when: * You’re unsure what theta value to use * You want reproducible, principled parameter selection * You’re willing to wait ~30x longer (30 BME fits instead of 1)
Use theta=<value> (manual) when: * You’ve already determined an appropriate theta value * You need fast reweighting (e.g., in a loop) * You’re doing sensitivity analysis with specific theta values
- view_theta_scan(save_path: str | None = None, show: bool = True, figsize: Tuple[int, int] = (18, 10))[source]
Visualize the most recent theta scan diagnostics for this Ensemble.
This wraps ThetaScanResult.plot() from the last BME fit that used automatic theta selection.
- Parameters:
- Returns:
The created figure.
- Return type:
matplotlib.figure.Figure
- Raises:
ValueError – If no theta scan has been performed yet.
- property has_bme_weights
Check if BME reweighting has been performed.
- Returns:
True if BME reweighting has been performed, False otherwise.
- Return type:
- property bme_result
Get the cached BME result.
- Returns:
The BME optimization result, or None if reweight_bme() has not been called.
- Return type:
BMEResult or None
- property theta_scan_result
Theta scan result from the most recent BME fit (if auto-theta was used).
- Return type:
ThetaScanResult or None