Performance Optimization

STARLING can be accelerated through PyTorch compilation to achieve faster sampling throughput on repeated runs. Use this page to understand compilation options and optimize your workflow for high-throughput applications.

Overview

PyTorch’s torch.compile infrastructure can speed up ensemble generation by:

Optimizing computation graphs – reducing Python overhead
Fusing operations – combining multiple kernels into efficient sequences
Caching compiled models – amortizing compilation cost across runs

STARLING caches compiled models between calls, so the compilation overhead is paid once and benefits all subsequent sampling jobs.
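To make the amortization concrete, the sketch below estimates how many sampling calls are needed before a one-time compilation cost pays for itself. The function name and all timings are hypothetical placeholders, not measured STARLING numbers; substitute values from your own benchmark.

```python
import math


def break_even_calls(compile_s: float, eager_s: float, compiled_s: float) -> int:
    """Smallest number of calls at which compiled mode is at least as fast overall.

    compile_s:  one-time compilation overhead in seconds
    eager_s:    time per call in eager mode
    compiled_s: time per call once the model is compiled
    """
    saving_per_call = eager_s - compiled_s
    if saving_per_call <= 0:
        raise ValueError("compiled mode is not faster per call; it never breaks even")
    # Compiled total: compile_s + n * compiled_s. Eager total: n * eager_s.
    return math.ceil(compile_s / saving_per_call)


# Hypothetical timings, for illustration only.
calls = break_even_calls(compile_s=60.0, eager_s=5.0, compiled_s=3.0)
print(calls)
```

If the number of sampling calls you plan to make is well below this estimate, eager mode is likely the simpler choice.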

Basic Usage

Enable compilation with starling.set_compilation_options():

import starling

# Enable compilation with default settings
starling.set_compilation_options(enabled=True)

# `sequences` is assumed to be a list of amino-acid sequence strings.
# The first call compiles the model; subsequent calls reuse it and run faster.
for sequence in sequences:
    ensemble = starling.generate(sequence, conformations=200)

The first call will take longer due to compilation, but subsequent calls will be significantly faster.

Compilation Modes

PyTorch supports several compilation modes that trade off compilation time for runtime performance:

"default"

Balanced mode suitable for most cases. Good speedup with reasonable compilation time.

"reduce-overhead"

Recommended for STARLING. Reduces per-call Python overhead (on GPUs, PyTorch does this with CUDA graphs), which suits repeated sampling runs. It can use more memory than "default".

starling.set_compilation_options(
    enabled=True,
    mode="reduce-overhead"
)

"max-autotune"

Extensive tuning for maximum performance. Takes longer to compile but produces the fastest code. Use for production workloads with fixed sequences.

starling.set_compilation_options(
    enabled=True,
    mode="max-autotune"
)

Backend Selection

The compilation backend determines how PyTorch optimizes and executes your models:

"inductor" (default)

Modern TorchInductor backend with excellent performance on both CPU and GPU. Supports most PyTorch operations and provides strong speedups.

starling.set_compilation_options(
    enabled=True,
    backend="inductor"
)

"cudagraphs" (GPU only)

Captures and replays entire CUDA execution graphs. Can provide additional speedup on GPUs for fixed-shape workloads.

Advanced Options

Full Configuration Example

import starling

starling.set_compilation_options(
    enabled=True,
    mode="reduce-overhead",
    backend="inductor",
    fullgraph=False,          # Allow graph breaks
    dynamic=False,            # Fixed tensor shapes
    options={
        "triton.cudagraphs": True,  # Backend-specific options
    }
)

Common Options

fullgraph (bool, default False): If True, requires the entire model to compile as a single graph. Compilation may fail if the model contains unsupported operations. Set to False to allow graph breaks.
dynamic (bool, default None): Controls dynamic shape support. Set to False for fixed-shape workloads (faster) or True for variable-shape inputs.
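options (dict, optional): Backend-specific configuration. See the PyTorch torch.compile documentation for the available keys. Note that torch.compile itself raises an error when both mode and options are passed, so if STARLING forwards these arguments unchanged, set only one of them.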

Disabling Compilation

To restore eager execution mode:

starling.set_compilation_options(enabled=False)

This is useful for debugging or when compilation is causing issues.

Performance Tips

Warm-up runs: The first generation after enabling compilation will be slower due to compilation overhead. Consider a warm-up run before timing.
Batch similar sequences: Compilation is most effective when processing sequences of similar length in succession.
Fixed conformations count: Keeping the number of conformations constant across runs improves cache hits.
GPU utilization: Compilation benefits are most pronounced on GPUs where kernel fusion and memory access optimization provide significant gains.

Profile first: Use PyTorch profiling tools to identify bottlenecks before enabling compilation:

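import torch.profiler
import starling

# `sequence` is any amino-acid sequence string you want to profile
with torch.profiler.profile() as prof:
    ensemble = starling.generate(sequence, conformations=200)

# Sort by total GPU time; use sort_by="cpu_time_total" on CPU-only machines
print(prof.key_averages().table(sort_by="cuda_time_total"))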
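The tip about batching similar sequences can be sketched in plain Python: sort the work list by sequence length so that sequences of similar length run back to back. The helper name order_by_length and the example sequences are hypothetical; only the starling.generate call in the usage comment comes from this page.

```python
from typing import Iterable, List


def order_by_length(sequences: Iterable[str]) -> List[str]:
    """Return sequences sorted by length (ties keep their input order)."""
    return sorted(sequences, key=len)


# Made-up sequences, for illustration only.
sequences = ["MKTAYIAKQR", "GS", "MSEQ", "GSGSGSGSGS", "AG"]
ordered = order_by_length(sequences)
print(ordered)

# Usage with STARLING, following the loop shown under Basic Usage:
# for sequence in ordered:
#     ensemble = starling.generate(sequence, conformations=200)
```

Sorting changes the order of the results, so keep a mapping back to the original indices if downstream code depends on input order.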

Benchmarking Example

Compare performance with and without compilation:

import time
import starling

sequence = "MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK"

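# Baseline: eager mode
starling.set_compilation_options(enabled=False)
# Warm-up run: keeps one-time setup cost out of the timing
starling.generate(sequence, conformations=100)
start = time.perf_counter()
for _ in range(10):
    ensemble = starling.generate(sequence, conformations=100)
eager_time = time.perf_counter() - start

# Compiled mode
starling.set_compilation_options(enabled=True, mode="reduce-overhead")
# Warm-up run: triggers compilation so it is excluded from the timing
starling.generate(sequence, conformations=100)
start = time.perf_counter()
for _ in range(10):
    ensemble = starling.generate(sequence, conformations=100)
compiled_time = time.perf_counter() - start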

print(f"Eager mode: {eager_time:.2f}s")
print(f"Compiled mode: {compiled_time:.2f}s")
print(f"Speedup: {eager_time/compiled_time:.2f}x")
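A single timed loop is sensitive to one-off costs such as compilation and to run-to-run noise. As a sketch, the hypothetical helper below separates untimed warm-up calls from timed calls and reports the median per-call time. The name time_calls and its parameters are illustrative and are not part of STARLING.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], warmup: int = 1, repeats: int = 5) -> float:
    """Return the median wall-clock seconds per call of fn, after warm-up calls."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time costs such as compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without STARLING installed.
median_s = time_calls(lambda: sum(range(100_000)), warmup=1, repeats=5)
print(f"median: {median_s:.6f}s")

# Usage with STARLING, using the calls shown on this page:
# time_calls(lambda: starling.generate(sequence, conformations=100), warmup=1, repeats=10)
```

The median is used instead of the mean so that a single slow call, for example one that triggers a recompilation, does not dominate the result.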

Troubleshooting

Compilation Failures

If you encounter compilation errors:

If you enabled fullgraph, set it back to False so that graph breaks are allowed:

starling.set_compilation_options(
    enabled=True,
    fullgraph=False
)

Use "default" mode instead of "reduce-overhead"
Check PyTorch version – compilation support improves in newer releases
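The checklist above amounts to trying progressively more conservative settings. One way to automate that is sketched below. The helper first_working_config and the candidate list are hypothetical, and the sketch assumes a failed compilation surfaces as a Python exception during a trial call, which this page does not confirm.

```python
from typing import Any, Callable, Dict, List, Optional

# Candidates ordered from most to least aggressive; the last one is eager mode.
CANDIDATES: List[Dict[str, Any]] = [
    {"enabled": True, "mode": "reduce-overhead", "fullgraph": False},
    {"enabled": True, "mode": "default", "fullgraph": False},
    {"enabled": False},
]


def first_working_config(
    candidates: List[Dict[str, Any]],
    trial: Callable[[Dict[str, Any]], None],
) -> Optional[Dict[str, Any]]:
    """Return the first config for which trial(config) does not raise."""
    for config in candidates:
        try:
            trial(config)
        except Exception as exc:  # compilation errors vary by backend
            print(f"config {config} failed: {exc}")
            continue
        return config
    return None


# Stand-in trial that rejects "reduce-overhead", so the sketch runs anywhere.
def fake_trial(config: Dict[str, Any]) -> None:
    if config.get("mode") == "reduce-overhead":
        raise RuntimeError("simulated compilation failure")


chosen = first_working_config(CANDIDATES, fake_trial)
print(chosen)
```

With STARLING, a real trial function could call starling.set_compilation_options(**config) and then run one short starling.generate(...) call, since compilation happens on the first generation rather than when the options are set.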

Slower Than Expected

If compilation doesn’t improve performance:

Ensure you’re running multiple iterations (compilation overhead is paid once)
Check that you’re using a GPU (CPU compilation benefits are smaller)
Verify tensor shapes are consistent across runs
Profile to identify non-compiled bottlenecks

Memory Issues

Compilation can increase memory usage. To reduce it:

Reduce batch size or conformations count
Use mode="default" instead of "max-autotune"
Monitor GPU memory with nvidia-smi
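For monitoring from inside a script rather than a separate terminal, the sketch below queries nvidia-smi and parses its CSV output. The helper names are hypothetical. The query flags are standard nvidia-smi options, and the values are reported in MiB. The sketch assumes every field is numeric, which may not hold on devices that report "[N/A]".

```python
import subprocess
from typing import List, Tuple

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(output: str) -> List[Tuple[int, int]]:
    """Parse nvidia-smi CSV output into (used_MiB, total_MiB) per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi and return per-GPU memory; requires an NVIDIA driver."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)


# Sample text in the format nvidia-smi prints for two GPUs (values are made up).
sample = "1024, 24576\n512, 24576\n"
print(parse_memory(sample))
```

Calling gpu_memory() before and after a compiled sampling run gives a rough picture of how much extra memory a given mode uses.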