When a kernel misbehaves, inspect the pipeline stage by stage rather than guessing from runtime output. The full pipeline is:
Python source → traced IR → MLIR → Metal source → .metallib → GPU output

Enable debug output at compile time

compiled = enigma.compile(
    my_kernel,
    dump_ir=True,           # print traced IR
    dump_mlir=True,         # print MLIR
    keep_metal_source=True, # save .metal source
    work_dir="./build/debug",
)

What to look for in the IR (dump_ir=True)

The IR lists every operation the tracer recorded. Verify:
  • Loads and stores appear in the right order
  • Thread index ops (thread_position_in_grid, etc.) are present
  • Arithmetic ops match your kernel logic
  • No unexpected extra ops (can indicate a scoping issue in for_range)

What to look for in MLIR (dump_mlir=True)

The MLIR representation shows the dialect ops before Metal emission. Verify:
  • Buffer arguments have the correct types (!enigma.buffer<f32>)
  • Control-flow regions (scf.for, scf.if) have the correct structure
  • Thread index ops map to the correct Dimension attribute (x/y/z)

Inspecting the Metal source

print(compiled.metal_source)
Key things to check:
  • Kernel function signature: buffer count and types match kernel parameters
  • Index arithmetic: confirm the generated expressions match your intent
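  • [[thread_position_in_grid]] vs [[thread_position_in_threadgroup]]: the first is the global index across the whole grid, the second is the index within one threadgroup; confirm the kernel uses the one you intended
  • Barrier placement: a threadgroup_barrier appears between each write to threadgroup (shared) memory and the reads that depend on it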
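The checklist above can be partly automated with plain string matching on the Metal source. The following is a minimal sketch, not part of enigma: summarize_metal_kernel is a hypothetical helper, and the sample source is hand-written for illustration rather than real compiler output. It relies only on standard Metal Shading Language syntax ([[buffer(N)]], [[thread_position_in_*]], threadgroup_barrier).

```python
import re

def summarize_metal_kernel(source: str) -> dict:
    """Collect a few facts from Metal source text that are worth eyeballing.

    Hypothetical helper: it only pattern-matches text and does not parse Metal.
    """
    # Each buffer argument in a kernel signature carries a [[buffer(N)]] attribute.
    slots = re.findall(r"\[\[\s*buffer\((\d+)\)\s*\]\]", source)
    # Thread-index attributes such as [[thread_position_in_grid]].
    attrs = re.findall(r"\[\[\s*(thread_position_in_\w+)\s*\]\]", source)
    return {
        "buffer_slots": sorted(int(s) for s in slots),
        "index_attrs": sorted(set(attrs)),
        # A declaration in the threadgroup address space, e.g. "threadgroup float tile[32];"
        "uses_threadgroup_memory": re.search(r"\bthreadgroup\s+\w+", source) is not None,
        "barrier_count": source.count("threadgroup_barrier("),
    }

# Hand-written sample standing in for compiled.metal_source.
SAMPLE = """
#include <metal_stdlib>
using namespace metal;

kernel void add(device const float* a [[buffer(0)]],
                device const float* b [[buffer(1)]],
                device float* out     [[buffer(2)]],
                uint gid [[thread_position_in_grid]]) {
    out[gid] = a[gid] + b[gid];
}
"""

summary = summarize_metal_kernel(SAMPLE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A kernel that reports uses_threadgroup_memory as True with a barrier_count of 0 deserves a closer look, although the count alone cannot tell you whether a barrier sits in the right place.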

Saving Metal source for diffing

compiled_before = enigma.compile(kernel_v1, keep_metal_source=True, work_dir="./build/v1")
compiled_after  = enigma.compile(kernel_v2, keep_metal_source=True, work_dir="./build/v2")
compiled_before.export_metal("./build/v1/kernel.metal")
compiled_after.export_metal("./build/v2/kernel.metal")
Then diff:
diff ./build/v1/kernel.metal ./build/v2/kernel.metal
Diffing the generated source is usually the quickest way to catch regressions in codegen.
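If you prefer to stay in Python, the same comparison can be done in-process with the standard library's difflib. This is a sketch under assumptions: diff_metal is a hypothetical helper, and the two strings below are stand-ins for what compiled_before.metal_source and compiled_after.metal_source would hold.

```python
import difflib

def diff_metal(before: str, after: str) -> str:
    """Return a unified diff of two Metal source strings (empty if identical)."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="v1/kernel.metal",
        tofile="v2/kernel.metal",
    )
    return "".join(lines)

# Stand-in sources; in practice pass the metal_source of each compiled kernel.
before = "out[gid] = a[gid] + b[gid];\n"
after = "out[gid] = a[gid] * b[gid];\n"

diff_text = diff_metal(before, after)
print(diff_text)
```

An empty result means the two versions generated identical source, which is a quick check that a refactor of the Python kernel did not change codegen.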

Minimal reproducible inputs

When diagnosing a correctness issue:
  1. Use the smallest input that reproduces the problem (e.g. n=32)
  2. Use np.arange(n, dtype=np.float32) as input so each value equals its index and a wrong result can be traced back to the element it came from
  3. Check each element individually with out[i] rather than aggregate assertions
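import numpy as np

# `rt` and `compiled` are the runtime and compiled kernel from your own setup
n = 32
a = np.arange(n, dtype=np.float32)  # a[i] == i
b = np.ones(n, dtype=np.float32)

# n * 4 is the size in bytes of n float32 values
raw = rt.execute(compiled, [a, b], n * 4, grid=(n, 1, 1), threads=(32, 1, 1))
out = np.frombuffer(raw, dtype=np.float32)

# Report every mismatching element rather than a single pass/fail
for i in range(n):
    expected = a[i] + b[i]
    if abs(out[i] - expected) > 1e-5:
        print(f"[{i}] got {out[i]:.4f}, expected {expected:.4f}")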
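For larger inputs, the per-element loop can be replaced by a vectorized comparison that still reports individual indices. The sketch below uses only NumPy; mismatches is a hypothetical helper, and the out array is simulated with two deliberately corrupted elements instead of coming from a real dispatch.

```python
import numpy as np

def mismatches(out, expected, atol=1e-5):
    """Return (index, got, expected) for every element outside the tolerance."""
    out = np.asarray(out)
    expected = np.asarray(expected)
    # rtol=0.0 keeps this an absolute-error check, like the loop above.
    bad = np.flatnonzero(~np.isclose(out, expected, rtol=0.0, atol=atol))
    return [(int(i), float(out[i]), float(expected[i])) for i in bad]

n = 32
a = np.arange(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
expected = a + b

# Simulated GPU output with two bad elements (no real dispatch happens here).
out = expected.copy()
out[5] = 0.0
out[17] = 0.0

for i, got, want in mismatches(out, expected):
    print(f"[{i}] got {got:.4f}, expected {want:.4f}")
```

The list of failing indices is often more informative than the values themselves: failures that start at a fixed offset or repeat at a regular stride tend to point at index arithmetic rather than at the math.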

Debugging layout algebra

When a tiling produces unexpected output, print shape and stride at each transform:
thr = enigma.make_ordered_layout((4, 64), order=(1, 0))
val = enigma.make_ordered_layout((4, 4), order=(1, 0))
tiler_mn, tv_layout = enigma.make_layout_tv(thr, val)

print("thr layout:", thr)
print("val layout:", val)
print("tiler_mn:  ", tiler_mn)
print("tv_layout: ", tv_layout)
print("threads:   ", enigma.size(tv_layout, mode=[0]))
print("vals/thr:  ", enigma.size(tv_layout, mode=[1]))
Confirm:
  • threads × vals/thr = total elements in one tile
  • Tiler dimensions fit within the tensor dimensions
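Both checks are simple arithmetic once the sizes have been printed. The sketch below is plain Python and does not call enigma; check_tv_tiling is a hypothetical helper. The example numbers (256 threads, 16 values per thread, a 16 × 256 tile) assume the tile is the elementwise product of the thread and value shapes shown above, so substitute whatever your own printout reports.

```python
from math import prod

def check_tv_tiling(num_threads, vals_per_thread, tiler_mn, tensor_shape):
    """Return a list of human-readable problems (empty if both checks pass)."""
    problems = []
    tile_elems = prod(tiler_mn)
    covered = num_threads * vals_per_thread
    # Check 1: the threads together must cover exactly one tile.
    if covered != tile_elems:
        problems.append(
            f"threads x vals/thr = {covered}, but the tile holds {tile_elems} elements"
        )
    # Check 2: each tile dimension must fit within the matching tensor dimension.
    for axis, (tile_dim, tensor_dim) in enumerate(zip(tiler_mn, tensor_shape)):
        if tile_dim > tensor_dim:
            problems.append(
                f"tile dim {axis} is {tile_dim}, larger than tensor dim {tensor_dim}"
            )
    return problems

# Assumed values for thr=(4, 64) and val=(4, 4); replace with your printed sizes.
ok = check_tv_tiling(256, 16, (16, 256), tensor_shape=(64, 1024))
too_small = check_tv_tiling(256, 16, (16, 256), tensor_shape=(8, 1024))

print("ok:", ok)
print("too_small:", too_small)
```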

Fast debugging loop

  1. Lock a minimal reproducible input (small n, predictable values)
  2. Compile with dump_ir=True, dump_mlir=True, keep_metal_source=True
  3. Read the Metal source; the bug is usually visible there
  4. Make one change at a time and re-check the Metal source before dispatching
  5. Only run dispatch once the Metal source looks correct