Performance Optimization
STARLING can be accelerated through PyTorch compilation to achieve higher sampling throughput on repeated runs. Use this page to understand the compilation options and tune your workflow for high-throughput applications.
Overview
PyTorch’s torch.compile infrastructure can speed up ensemble generation by:
Optimizing computation graphs – reducing Python overhead
Fusing operations – combining multiple kernels into efficient sequences
Caching compiled models – amortizing compilation cost across runs
STARLING caches compiled models between calls, so the compilation overhead is paid once and benefits all subsequent sampling jobs.
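To make the amortization concrete, the following back-of-the-envelope sketch estimates how many sampling calls are needed before compilation pays for itself. The function name break_even_calls and any timings passed to it are illustrative assumptions, not STARLING measurements.

```python
import math


def break_even_calls(compile_overhead_s, eager_call_s, compiled_call_s):
    """Return the smallest number of calls at which compiled mode is faster overall.

    Total eager time for n calls:    n * eager_call_s
    Total compiled time for n calls: compile_overhead_s + n * compiled_call_s
    Returns None when compiled calls are not faster than eager calls.
    """
    saving_per_call = eager_call_s - compiled_call_s
    if saving_per_call <= 0:
        return None  # compilation never pays off
    # Smallest integer n with n * saving_per_call > compile_overhead_s
    return math.floor(compile_overhead_s / saving_per_call) + 1
```

With a hypothetical 60 s compilation cost, 5 s per eager call and 3 s per compiled call, break_even_calls(60.0, 5.0, 3.0) returns 31. Measure the three inputs on your own hardware before relying on the estimate.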
Basic Usage
Enable compilation with starling.set_compilation_options():
import starling
# Enable compilation with default settings
starling.set_compilation_options(enabled=True)
# Generate ensembles - first call compiles, subsequent calls are faster
# (sequences is an iterable of amino-acid sequence strings defined elsewhere)
for sequence in sequences:
    ensemble = starling.generate(sequence, conformations=200)
The first call will take longer due to compilation, but subsequent calls will be significantly faster.
Compilation Modes
PyTorch supports several compilation modes that trade off compilation time for runtime performance:
"default": Balanced mode suitable for most cases. Good speedup with reasonable compilation time.
"reduce-overhead": Recommended for STARLING. Optimizes for minimal Python overhead and fast execution. Best for repeated sampling runs.
starling.set_compilation_options(enabled=True, mode="reduce-overhead")
"max-autotune": Extensive tuning for maximum performance. Takes longer to compile but produces the fastest code. Use for production workloads with fixed sequences.
starling.set_compilation_options(enabled=True, mode="max-autotune")
Backend Selection
The compilation backend determines how PyTorch optimizes and executes your models:
"inductor" (default): Modern TorchInductor backend with excellent performance on both CPU and GPU. Supports most PyTorch operations and provides strong speedups.
starling.set_compilation_options(enabled=True, backend="inductor")
"cudagraphs" (GPU only): Captures and replays entire CUDA execution graphs. Can provide additional speedup on GPUs for fixed-shape workloads.
Advanced Options
Full Configuration Example
import starling
starling.set_compilation_options(
    enabled=True,
    mode="reduce-overhead",
    backend="inductor",
    fullgraph=False,  # Allow graph breaks
    dynamic=False,  # Fixed tensor shapes
    options={
        "triton.cudagraphs": True,  # Backend-specific options
    },
)
Common Options
fullgraph (bool, default False): If True, requires the entire model to compile as a single graph. Compilation may fail if the model contains unsupported operations. Set to False to allow graph breaks.
dynamic (bool, default None): Controls dynamic shape support. Set to False for fixed-shape workloads (faster) or True for variable-shape inputs.
options (dict, optional): Backend-specific configuration. See PyTorch documentation for details.
Disabling Compilation
To restore eager execution mode:
starling.set_compilation_options(enabled=False)
This is useful for debugging or when compilation is causing issues.
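When compilation causes issues in an unattended job, one option is to fall back to eager execution automatically. The sketch below shows one way to do that. generate_with_fallback is a hypothetical helper, not part of STARLING. It assumes compilation was enabled earlier and that compilation failures surface as exceptions from the generate call.

```python
def generate_with_fallback(generate, set_options, sequence, **kwargs):
    """Try a compiled run; on failure, switch to eager mode and retry once.

    generate:    callable such as starling.generate
    set_options: callable such as starling.set_compilation_options
    Returns (result, used_compilation).
    """
    try:
        return generate(sequence, **kwargs), True
    except Exception as exc:  # compilation errors can surface as several types
        print(f"Compiled run failed ({type(exc).__name__}: {exc}); retrying in eager mode")
        set_options(enabled=False)
        # A genuine (non-compilation) error will be raised again here
        return generate(sequence, **kwargs), False
```

A call might look like generate_with_fallback(starling.generate, starling.set_compilation_options, sequence, conformations=200). Because the retry runs in eager mode, errors unrelated to compilation still propagate instead of being silently swallowed.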
Performance Tips
Warm-up runs: The first generation after enabling compilation will be slower due to compilation overhead. Consider a warm-up run before timing.
Batch similar sequences: Compilation is most effective when processing sequences of similar length in succession.
Fixed conformations count: Keeping the number of conformations constant across runs improves cache hits.
GPU utilization: Compilation benefits are most pronounced on GPUs where kernel fusion and memory access optimization provide significant gains.
Profile first: Use PyTorch profiling tools to identify bottlenecks before enabling compilation:
import torch.profiler
with torch.profiler.profile() as prof:
    ensemble = starling.generate(sequence, conformations=200)
# Sort by total CUDA time; use sort_by="cpu_time_total" on CPU-only machines
print(prof.key_averages().table(sort_by="cuda_time_total"))
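The tip about batching similar sequences can be applied with a small reordering step before the sampling loop. The following is a minimal sketch; order_by_length and its bucket_width parameter are hypothetical names introduced here. How strongly sequence length affects recompilation in STARLING is an assumption based on the tip above.

```python
from collections import defaultdict


def order_by_length(sequences, bucket_width=1):
    """Return sequences reordered so that similar lengths are adjacent.

    bucket_width=1 groups identical lengths only; larger values place
    lengths within the same window (e.g. 0-9, 10-19 for width 10) together.
    Original order is preserved inside each bucket.
    """
    if bucket_width < 1:
        raise ValueError("bucket_width must be >= 1")
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq) // bucket_width].append(seq)
    ordered = []
    for key in sorted(buckets):
        ordered.extend(buckets[key])
    return ordered
```

Iterating over order_by_length(sequences) and calling starling.generate with a constant conformations value keeps consecutive jobs as similar as possible.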
Benchmarking Example
Compare performance with and without compilation:
import time
import starling
sequence = "MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK"
# Baseline: eager mode
starling.set_compilation_options(enabled=False)
start = time.time()
for _ in range(10):
    ensemble = starling.generate(sequence, conformations=100)
eager_time = time.time() - start
# Compiled mode
starling.set_compilation_options(enabled=True, mode="reduce-overhead")
# Warm-up call so the one-time compilation cost is not included in the timing
starling.generate(sequence, conformations=100)
start = time.time()
for _ in range(10):
    ensemble = starling.generate(sequence, conformations=100)
compiled_time = time.time() - start
print(f"Eager mode: {eager_time:.2f}s")
print(f"Compiled mode: {compiled_time:.2f}s")
print(f"Speedup: {eager_time/compiled_time:.2f}x")
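For repeated measurements, the timing logic can be factored into a reusable helper. The sketch below is one possible shape. time_calls is a hypothetical function, and the sync hook exists because GPU work can be asynchronous. Whether starling.generate already waits for the GPU before returning is not stated here, so passing torch.cuda.synchronize is a precaution rather than a requirement.

```python
import statistics
import time


def time_calls(fn, repeats=10, warmup=1, sync=None):
    """Time fn() and return (mean_seconds, stdev_seconds) over `repeats` calls.

    warmup: untimed calls made first (absorbs one-time compilation).
    sync:   optional zero-argument callable run before each clock read,
            e.g. torch.cuda.synchronize when sampling on a GPU.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return statistics.mean(durations), spread
```

For example, time_calls(lambda: starling.generate(sequence, conformations=100)) could be run once in eager mode and once after enabling compilation, then the two means compared.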
Troubleshooting
Compilation Failures
If you encounter compilation errors:
Try disabling fullgraph:
starling.set_compilation_options(enabled=True, fullgraph=False)
Use "default" mode instead of "reduce-overhead"
Check PyTorch version – compilation support improves in newer releases
Slower Than Expected
If compilation doesn’t improve performance:
Ensure you’re running multiple iterations (compilation overhead is paid once)
Check that you’re using a GPU (CPU compilation benefits are smaller)
Verify tensor shapes are consistent across runs
Profile to identify non-compiled bottlenecks
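One way to check whether inconsistent shapes are triggering repeated compilation is to turn on PyTorch's own compile logging. The sketch below sets the TORCH_LOGS environment variable from Python. The variable and the recompiles and graph_breaks artifact names come from PyTorch's logging documentation, so confirm that your installed version supports them. enable_compile_diagnostics is a hypothetical helper name.

```python
import os


def enable_compile_diagnostics(artifacts=("recompiles", "graph_breaks")):
    """Ask PyTorch to log recompilations and graph breaks via TORCH_LOGS.

    Call this before importing torch (or starling) so the setting is
    picked up when PyTorch initialises its logging.
    """
    value = ",".join(artifacts)
    os.environ["TORCH_LOGS"] = value
    return value
```

If the log shows a recompilation for most calls, revisit sequence ordering, the conformations count, and the dynamic setting.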
Memory Issues
Compilation can increase memory usage:
Reduce batch size or conformations count
Use mode="default" instead of "max-autotune"
Monitor GPU memory with nvidia-smi
See Also
Ensemble Generation – Core sampling workflows and options
Guided Sampling with Constraints – Physics-based guidance during sampling
starling.set_compilation_options() – API reference