# Troubleshooting

Common issues and solutions for tropical-gemm.
## Installation Issues

### Rust Compilation Errors

**Error: "missing SIMD intrinsics"**

```text
error[E0433]: failed to resolve: use of undeclared crate or module `core_arch`
```

**Solution:** Update Rust to the latest stable toolchain:

```bash
rustup update stable
```
**Error: "target feature avx2 is not enabled"**

This is expected on non-x86 platforms; the portable fallback will be used automatically.
### CUDA Issues

**Error: "CUDA driver not found"**

```text
CudaError: CUDA driver version is insufficient
```

**Solution:**

- Install or update the NVIDIA drivers
- Verify the installation with `nvidia-smi`
- Install the CUDA Toolkit
**Error: "nvcc not found"**

```text
CudaError: Failed to compile kernel: nvcc not found
```

**Solution:**

```bash
# Add CUDA to PATH
export PATH=/usr/local/cuda/bin:$PATH

# Verify
nvcc --version
```
**Error: "Kernel compilation failed"**

```text
CudaError: CompilationFailed: ...
```

**Solution:**

- Check CUDA version compatibility (11.0 or newer is required)
- Ensure the CUDA headers are installed
- Try reinstalling the CUDA Toolkit
## Python Binding Issues

**Error: "module 'tropical_gemm' not found"**

```python
>>> import tropical_gemm
ModuleNotFoundError: No module named 'tropical_gemm'
```

**Solution:** Build and install the bindings with maturin:

```bash
cd crates/tropical-gemm-python
pip install maturin
maturin develop --release
```
**Error: "symbol not found in flat namespace" (macOS)**

```text
ImportError: dlopen(...): symbol not found in flat namespace
```

**Solution:** Rebuild against the correct Python version:

```bash
# Ensure the correct Python is active
which python
python --version

# Rebuild
maturin develop --release
```
**Error: "dtype mismatch"**

```text
TypeError: Expected float32 array, got float64
```

**Solution:** Explicitly cast to float32:

```python
import numpy as np

a = a.astype(np.float32)
b = b.astype(np.float32)
c = tropical_gemm.maxplus_matmul(a, b)
```
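If your pipeline mixes dtypes, a small helper can normalize every input before the call. This is a sketch: it assumes the bindings want C-contiguous float32 arrays, which is what the error message above suggests. `np.ascontiguousarray` is a no-op when the array already has the right dtype and layout.

```python
import numpy as np

def as_f32(x):
    """Return x as a C-contiguous float32 array (no copy if already so)."""
    return np.ascontiguousarray(x, dtype=np.float32)

a = as_f32(np.random.rand(4, 5))                # float64 -> float32 copy
b = as_f32(np.ones((5, 3), dtype=np.float32))   # already float32: cheap no-op
```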
## Runtime Issues

### Incorrect Results

**Symptom: all outputs are `-inf` or `inf`**

This typically means the input contains NaN or infinite values:

```rust
// Check for invalid values
for &x in data.iter() {
    if x.is_nan() || x.is_infinite() {
        panic!("Invalid input value: {}", x);
    }
}
```
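From the Python bindings, the same check is a one-liner with NumPy: `np.isfinite` is false for NaN as well as for ±inf.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
# Fails loudly if any entry is NaN or +/-inf
assert np.isfinite(a).all(), "input contains NaN or inf"
```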
**Symptom: results differ between CPU and GPU**

For MaxPlus/MinPlus, results should be bit-identical, since the semiring "addition" is only a comparison. For MaxMul, small differences may occur because floating-point multiplication reductions are not associative:

```rust
// Allow a small tolerance
let diff = (cpu_result - gpu_result).abs();
assert!(diff < 1e-5, "Results differ by {}", diff);
```
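When debugging a mismatch it helps to have an independent reference. Below is a minimal pure-NumPy MaxPlus matmul for cross-checking either backend; it is a sketch using broadcasting, not part of the library (the `maxplus_matmul` binding from the Python section is the function it mirrors).

```python
import numpy as np

def maxplus_matmul_ref(a, b):
    """Reference MaxPlus matmul: C[i, j] = max_k (A[i, k] + B[k, j])."""
    # (m, k, 1) + (1, k, n) -> (m, k, n), then reduce over k
    return (a[:, :, None] + b[None, :, :]).max(axis=1)

a = np.array([[0.0, 1.0], [2.0, 3.0]], dtype=np.float32)
b = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
c = maxplus_matmul_ref(a, b)  # [[1, 2], [3, 4]]
```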
## Performance Issues

**Symptom: GPU slower than CPU**

For small matrices, host-device transfer overhead dominates:

```rust
// Rule of thumb: GPU beneficial for N > 256
if n < 256 {
    // Use CPU
    tropical_matmul::<MaxPlus<f32>>(&a, m, k, &b, n)
} else {
    // Use GPU
    tropical_matmul_gpu::<MaxPlus<f32>>(&a, m, k, &b, n)?
}
```
**Symptom: CPU slower than expected**

Check SIMD detection:

```rust
use tropical_gemm::simd_level;

println!("SIMD level: {:?}", simd_level());
// Should be Avx2 or Avx512 on modern x86
```
## Memory Issues

**Error: "out of memory" (GPU)**

```text
CudaError: Out of memory
```

**Solution:**

- Use smaller batch sizes
- Process matrices sequentially
- Free unused GPU memory

```rust
// Process in chunks
for chunk in matrices.chunks(batch_size) {
    let result = process_batch(&ctx, chunk)?;
    // Results are downloaded, GPU memory freed
}
```
**Error: "allocation failed" (CPU)**

Large matrices may exceed available RAM. Remember that a GEMM needs all three operands resident: A is m×k, B is k×n, and C is m×n.

```rust
// Estimate memory needed for the output matrix alone
let bytes = m * n * std::mem::size_of::<f32>();
println!("Matrix requires {} MB", bytes / 1024 / 1024);
```
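The same back-of-envelope estimate for all three operands together, from the Python side (a sketch; it assumes float32, i.e. 4 bytes per element):

```python
def gemm_bytes(m, k, n, itemsize=4):
    """Total bytes for A (m x k), B (k x n), and C (m x n)."""
    return (m * k + k * n + m * n) * itemsize

mb = gemm_bytes(8192, 8192, 8192) / 1024**2
print(f"An 8192-cubed GEMM needs about {mb:.0f} MB")  # 768 MB in float32
```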
## PyTorch Issues

### Gradient Issues

**Symptom: gradients are all zeros**

Check that the input tensors require gradients:

```python
a = torch.randn(4, 5, requires_grad=True)  # Must be True
b = torch.randn(5, 3, requires_grad=True)

c = TropicalMaxPlusMatmul.apply(a, b)
loss = c.sum()
loss.backward()
print(a.grad)  # Should not be None
```
**Symptom: "RuntimeError: element 0 of tensors does not require grad"**

Ensure the input tensors have `requires_grad=True`:

```python
a = torch.tensor([[1.0, 2.0]], requires_grad=True)
# Not: a = torch.tensor([[1.0, 2.0]])  # No gradients!
```
### Device Mismatch

**Error: "Expected all tensors on same device"**

```python
# Ensure both inputs are on the same device
a = a.to('cuda')
b = b.to('cuda')
c = TropicalMaxPlusMatmul.apply(a, b)
```
## Getting Help

If you encounter issues not covered here:

- Check the GitHub issues: https://github.com/TensorBFS/tropical-gemm/issues
- Open a new issue including:
  - The error message
  - Rust/Python version
  - OS and hardware
  - Minimal reproduction code
### Diagnostic Information

Include this in bug reports:

```bash
# Rust version
rustc --version
cargo --version

# CUDA (if applicable)
nvcc --version
nvidia-smi

# Python (if applicable)
python --version
pip show tropical_gemm
```
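The Python-side details can also be collected in one short script to paste into a report. This is a sketch; the `tropical_gemm` import is wrapped in try/except so the script still runs when the bindings are not installed.

```python
import platform
import sys

def diagnostics():
    """Collect Python-side version info for a bug report."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import tropical_gemm
        info["tropical_gemm"] = getattr(tropical_gemm, "__version__", "unknown")
    except ImportError:
        info["tropical_gemm"] = "not installed"
    return info

for key, value in diagnostics().items():
    print(f"{key}: {value}")
```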