Compilation failures

xcrun: error: unable to find utility "metal"

The Metal compiler is not in your PATH. This almost always means Xcode Command Line Tools are not installed or need to be re-selected.
# Install or reinstall
xcode-select --install

# Verify
xcrun -sdk macosx metal -v
If xcrun is present but the metal tool is missing, try:
sudo xcode-select --reset
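If a build script should detect this condition before invoking the compiler, the check above can be scripted. The following is a minimal sketch, assuming only that `xcrun --find` exits non-zero when a tool cannot be located; the helper name `metal_available` is hypothetical and not part of Enigma.

```python
import shutil
import subprocess


def metal_available() -> bool:
    """Return True if xcrun can locate the metal compiler (hypothetical helper)."""
    # xcrun itself is absent on non-macOS hosts or when no toolchain is installed.
    if shutil.which("xcrun") is None:
        return False
    # `xcrun --find <tool>` exits non-zero when the tool cannot be located.
    result = subprocess.run(
        ["xcrun", "-sdk", "macosx", "--find", "metal"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("metal found" if metal_available() else "metal missing")
```

A `False` result points back at the `xcode-select` commands above.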

Tracer TypeError

TypeError: unsupported operand type(s) for +: 'int' and 'IRValue'
You are mixing a Python value with an IRValue in a way the tracer cannot handle. Make sure all kernel-body computation goes through IRValue objects. Integer and float literals are fine, as are Python variables holding IRValue nodes. NumPy arrays, Python lists, and Python conditionals on IRValue objects are not.
Wrong:
@enigma.kernel
def bad(A: enigma.f32, C: enigma.f32):
    n = 1024            # Python int
    tid = enigma.thread_position_in_grid
    if tid < n:         # Python `if` on an IRValue comparison: not allowed
        C[tid] = A[tid]
Correct:
@enigma.kernel
def good(A: enigma.f32, C: enigma.f32):
    tid = enigma.thread_position_in_grid
    with enigma.if_(enigma.cmp_lt(tid, 1024)):
        C[tid] = A[tid]
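The shape of the error message itself comes from Python's operator dispatch. The sketch below uses a toy class, `ToyValue`, which is a hypothetical stand-in and not Enigma's `IRValue`; whether `IRValue` defines reflected operators is an assumption this example does not settle. It only shows how Python produces a message of the form `'int' and '<type>'` when the plain Python value sits on the left-hand side.

```python
class ToyValue:
    """Toy stand-in for a traced value; NOT Enigma's IRValue."""

    def __init__(self, expr: str):
        self.expr = expr

    def __add__(self, other):
        # Handles ToyValue + x by recording the operation instead of computing it.
        rhs = other.expr if isinstance(other, ToyValue) else repr(other)
        return ToyValue(f"({self.expr} + {rhs})")

    # No __radd__ is defined, so `1 + ToyValue(...)` has no handler.


tid = ToyValue("tid")
traced = tid + 1  # traced value on the left: recorded as an expression
print(traced.expr)

try:
    1 + tid  # Python int on the left: int.__add__ gives up, no __radd__ fallback
except TypeError as exc:
    message = str(exc)
    print(message)
```

If the traceback points at an expression with a Python number on the left, swapping the operand order is a quick thing to try.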

EnigmaError: tiler exceeds tensor in mode N

A layout division produced a tiler dimension larger than the corresponding tensor dimension in mode N. Print the tensor layout (shape and stride) and the tiler before each division to find the step where the sizes stop fitting:
print("tensor:", mA.layout)
print("tiler:", tiler_mn)
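Once the shapes are printed, the comparison can be done mechanically. This is a minimal sketch that assumes you have copied the tensor and tiler extents into plain tuples of ints; `oversized_modes` is a hypothetical helper and does not read Enigma layout objects directly.

```python
def oversized_modes(tensor_shape, tiler_shape):
    """Return the modes where the tiler extent exceeds the tensor extent."""
    if len(tensor_shape) != len(tiler_shape):
        raise ValueError("tensor and tiler must have the same rank")
    return [
        mode
        for mode, (extent, tile) in enumerate(zip(tensor_shape, tiler_shape))
        if tile > extent
    ]


# Hypothetical shapes: a 64x32 tensor tiled by 16x64.
bad = oversized_modes((64, 32), (16, 64))
print(bad)  # mode 1 is too large: tile extent 64 > tensor extent 32
```

An empty list means every tiler extent fits within its tensor mode.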

Dispatch failures

Wrong output shape or garbage values

The most common causes:
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Last N elements are zero | Grid too small | Set grid=(n, 1, 1) to cover all elements |
| Values land at 2× or 4× the expected offset | Wrong stride in index math | Check row/column ordering in kernel index |
| Values look like memory garbage | output_size too small | Use elements * sizeof(dtype) in bytes |
| Values are consistently off by a constant | Buffer bound incorrectly | Verify input array ordering and dtype |

output_size gotcha

output_size is in bytes, not elements:
# Wrong
rt.execute(compiled, [a, b], output_size=n, ...)

# Correct (f32 = 4 bytes)
rt.execute(compiled, [a, b], output_size=n * 4, ...)

# Correct (f16 = 2 bytes)
rt.execute(compiled, [a, b], output_size=n * 2, ...)
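To avoid hard-coding the multiplier at each call site, the byte count can be derived from the dtype. This is a minimal sketch; `ITEMSIZE` and `output_size_bytes` are hypothetical names, and the table covers only the two dtypes mentioned above.

```python
# Bytes per element for the two dtypes mentioned above.
ITEMSIZE = {"f32": 4, "f16": 2}


def output_size_bytes(elements: int, dtype: str) -> int:
    """Return the output buffer size in bytes (hypothetical helper)."""
    if dtype not in ITEMSIZE:
        raise ValueError(f"unknown dtype: {dtype!r}")
    return elements * ITEMSIZE[dtype]


n = 1024
print(output_size_bytes(n, "f32"))  # 4096
print(output_size_bytes(n, "f16"))  # 2048
```

The result is what you would pass as `output_size`.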

Runtime dispatch exception

When rt.execute() raises, the error message includes the kernel name, grid, threads, and the Metal return code. Check:
  1. Kernel name matches what you compiled
  2. Grid × threadgroup size covers your data range
  3. Input dtypes match kernel parameter types
Try dispatching with a minimal input (e.g. n=32) to isolate the failure.

Correctness issues

Values differ between GPU and CPU

The GPU evaluates floating-point operations in a different order than the CPU, so results rarely match bit for bit. Compare with loose tolerances:
np.testing.assert_allclose(gpu_result, cpu_result, rtol=1e-4, atol=1e-4)
Reductions over large inputs accumulate the most rounding error, so compare relative error rather than absolute error.
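The effect of evaluation order is easy to reproduce on the CPU alone. The sketch below uses plain Python floats; `max_relative_error` is a hypothetical helper, and the two result lists are made-up stand-ins for GPU and CPU outputs.

```python
# Same three numbers, two groupings: the sums differ in the last bit.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False


def max_relative_error(actual, expected, eps=1e-12):
    """Largest |a - e| / max(|e|, eps) over paired elements."""
    return max(
        abs(a - e) / max(abs(e), eps) for a, e in zip(actual, expected)
    )


# Hypothetical values standing in for GPU and CPU outputs.
gpu_result = [left_to_right, 1000.0001]
cpu_result = [right_to_left, 1000.0]
err = max_relative_error(gpu_result, cpu_result)
print(err < 1e-4)  # True
```

The second pair differs by 1e-4 in absolute terms but only about 1e-7 in relative terms, which is why relative error is the more useful measure when values are large.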

Vectorized kernel produces wrong results

When using vec_width=4, the grid must be divided by the vector width:
# Scalar grid
rt.execute(compiled_scalar, ..., grid=(n, 1, 1), ...)

# vec_width=4 grid — 4× fewer threads, each handles 4 elements
rt.execute(compiled_vec4, ..., grid=(n // 4, 1, 1), ...)
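Note that `n // 4` silently drops the tail when `n` is not a multiple of the vector width. The following sketch guards against that; `vec_grid` is a hypothetical helper, and it assumes the kernel has no tail handling of its own.

```python
def vec_grid(n: int, vec_width: int = 4):
    """Return a 1D grid for a vectorized kernel (hypothetical helper)."""
    if n % vec_width != 0:
        # n // vec_width would silently skip the last n % vec_width elements.
        raise ValueError(f"n={n} is not a multiple of vec_width={vec_width}")
    return (n // vec_width, 1, 1)


print(vec_grid(1024))  # (256, 1, 1)
```

Padding the input to a multiple of the vector width is one way to satisfy the check.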

2D kernel has transposed output

Enigma uses the Metal convention: x is the fast (column) dimension and y is the slow (row) dimension. Ensure grid and kernel indexing agree:
# Metal convention: grid=(cols, rows, 1)
row = enigma.thread_position_in_grid_xyz("y")
col = enigma.thread_position_in_grid_xyz("x")
idx = row * N + col   # row-major
rt.execute(..., grid=(N, M, 1), threads=(16, 16, 1))
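The mapping can be checked on the CPU by enumerating the flat indices that a grid=(N, M, 1) dispatch would touch. This is a pure-Python emulation under the convention described above, not Enigma code; `written_indices` is a hypothetical helper.

```python
def written_indices(M, N, swap=False):
    """Flat indices touched by an emulated grid=(N, M, 1) dispatch."""
    indices = []
    for y in range(M):        # slow dimension (rows)
        for x in range(N):    # fast dimension (columns)
            # swap=True reproduces the bug: x is treated as the row.
            row, col = (x, y) if swap else (y, x)
            indices.append(row * N + col)  # row-major
    return indices


M, N = 2, 3  # rows, cols
correct = written_indices(M, N)
swapped = written_indices(M, N, swap=True)
print(correct)  # [0, 1, 2, 3, 4, 5]
print(swapped)  # [0, 3, 6, 1, 4, 7]
```

With the roles of x and y swapped, a non-square problem skips some elements and indexes past the end of the buffer. For a square problem the same bug shows up as a transposed result.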