# Performance Guide

Tips for getting the best performance from omeinsum-rs.
## Contraction Order

The most important optimization is the contraction order:
```rust
// Always optimize for networks with 3+ tensors
let mut ein = Einsum::new(ixs, iy, sizes);
ein.optimize_greedy(); // or optimize_treesa() for large networks
```
A bad contraction order can be exponentially slower than an optimized one.
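For intuition, consider three chained matrix products. The arithmetic below is a standalone sketch with illustrative shapes (plain arithmetic, no omeinsum-rs APIs), counting multiply-adds for the two possible orders:

```rust
fn main() {
    // A: 1000x10, B: 10x1000, C: 1000x10 (illustrative shapes)
    // (A*B)*C: the intermediate A*B is a large 1000x1000 matrix
    let cost_left = 1000 * 10 * 1000 + 1000 * 1000 * 10; // 20_000_000
    // A*(B*C): the intermediate B*C is a tiny 10x10 matrix
    let cost_right = 10 * 1000 * 10 + 1000 * 10 * 10; // 200_000
    println!("naive: {cost_left}, optimized: {cost_right}"); // a 100x gap
}
```

With more tensors, the gap between the worst and best order can grow exponentially, which is why the optimization pass matters most for large networks.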
## Memory Layout

### Keep Tensors Contiguous

Non-contiguous tensors require copies before GEMM:
```rust
// After permute, the tensor may be non-contiguous
let t_permuted = t.permute(&[1, 0]);

// Make it contiguous if you'll use it multiple times
let t_contig = t_permuted.contiguous();
```
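If the same permuted tensor feeds several contractions, making it contiguous once up front amortizes the copy. A sketch of that pattern, reusing the `gemm::<Standard<f32>>` call shown later in this guide (the loop and `rhs_tensors` are illustrative):

```rust
// Pay for one copy, then every GEMM sees a contiguous left operand
let t_contig = t.permute(&[1, 0]).contiguous();
for rhs in &rhs_tensors {
    let c = t_contig.gemm::<Standard<f32>>(rhs);
    // ... use c ...
}
```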
### Avoid Unnecessary Copies
```rust
// Good: zero-copy view
let view = t.permute(&[1, 0]);

// Avoid: only call contiguous() when the result is actually reused
let bad = t.permute(&[1, 0]).contiguous();
```
## Parallelization

Enable the `parallel` feature (on by default):

```toml
[dependencies]
omeinsum = "0.1" # parallel enabled by default
```

Disable it for single-threaded workloads:

```toml
[dependencies]
omeinsum = { version = "0.1", default-features = false }
```
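The `parallel` feature is presumably built on rayon (an assumption; check the crate's `Cargo.toml` to confirm). If so, rayon's standard `RAYON_NUM_THREADS` environment variable caps the worker count without any code changes:

```bash
RAYON_NUM_THREADS=4 cargo run --release --example basic_einsum
```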
## Data Types

### Use f32 When Possible

`f32` is typically faster than `f64` because it needs:

- Half the memory bandwidth
- Half the vector-register width per element, so better SIMD utilization
```rust
// Prefer f32
let t = Tensor::<f32, Cpu>::from_data(&data, &shape);

// Use f64 only when precision is critical
let t = Tensor::<f64, Cpu>::from_data(&data, &shape);
```
## Benchmarking

Always benchmark in release mode:

```bash
cargo run --release --example basic_einsum
```

Profile with `perf`:

```bash
cargo build --release
perf record ./target/release/examples/basic_einsum
perf report
```
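For statistically robust measurements, the `criterion` crate is the usual choice in the Rust ecosystem. A minimal sketch (the bench file, function names, and the omitted contraction body are illustrative, not part of omeinsum-rs):

```rust
// benches/einsum_bench.rs
// Needs criterion as a dev-dependency and a
// `[[bench]] name = "einsum_bench", harness = false` entry in Cargo.toml.
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_contraction(c: &mut Criterion) {
    c.bench_function("contract_3_tensors", |b| {
        // Build inputs here, outside the timed closure, so setup is excluded
        b.iter(|| {
            // ... run the einsum contraction under test ...
        });
    });
}

criterion_group!(benches, bench_contraction);
criterion_main!(benches);
```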
## Common Pitfalls

### 1. Forgetting to Optimize
```rust
// Bad: no optimization pass
let ein = Einsum::new(ixs, iy, sizes);
let result = ein.execute::<A, T, B>(&tensors);

// Good: optimize the contraction order first
let mut ein = Einsum::new(ixs, iy, sizes);
ein.optimize_greedy();
let result = ein.execute::<A, T, B>(&tensors);
```
### 2. Redundant Contiguous Calls
```rust
// Bad: unnecessary explicit copies
let c = a.contiguous().gemm::<Standard<f32>>(&b.contiguous());

// Good: gemm handles non-contiguous inputs internally
let c = a.gemm::<Standard<f32>>(&b);
```
### 3. Running in Debug Mode

Debug builds are roughly 10-50x slower than release builds:
```bash
# Bad: debug mode
cargo run --example benchmark

# Good: release mode
cargo run --release --example benchmark
```
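As a middle ground, standard Cargo profile settings (not omeinsum-specific) keep your own code fast to compile while still optimizing dependency code:

```toml
# Cargo.toml: optimize all dependencies even in dev builds
[profile.dev.package."*"]
opt-level = 3
```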
## Future Optimizations

Planned performance improvements:

- CUDA backend for GPU acceleration
- Optimized tropical-gemm kernel integration
- Batched GEMM support
- Cache-aware blocking