Inspection and Tracing

When a kernel produces wrong results, inspect the pipeline stage by stage rather than guessing from runtime output. The full pipeline is:

Python source → traced IR → MLIR → Metal source → .metallib → GPU output

Enable debug output at compile time

compiled = enigma.compile(
    my_kernel,
    dump_ir=True,           # print traced IR
    dump_mlir=True,         # print MLIR
    keep_metal_source=True, # save .metal source
    work_dir="./build/debug",
)

What to look for in the IR (`dump_ir=True`)

The IR lists every operation the tracer recorded. Verify:

Loads and stores appear in the right order
Thread index ops (thread_position_in_grid, etc.) are present
Arithmetic ops match your kernel logic
No unexpected extra ops (can indicate a scoping issue in for_range)

What to look for in MLIR (`dump_mlir=True`)

The MLIR representation shows the dialect ops before Metal emission. Verify:

Buffer arguments have the correct types (e.g. !enigma.buffer<f32> for a float32 buffer)
Control-flow regions (scf.for, scf.if) have the correct structure
Thread index ops map to the correct Dimension attribute (x/y/z)

Inspecting the Metal source

print(compiled.metal_source)

Key things to check:

Kernel function signature: buffer count and types match kernel parameters
Index arithmetic: confirm the generated expressions match your intent
[[thread_position_in_grid]] vs [[thread_position_in_threadgroup]]: each index uses the attribute you intended
Barrier placement: threadgroup_barrier appears between each write to shared (threadgroup) memory and the reads that depend on it
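
Some of these checks can be scripted. The sketch below is an illustration only: the helper name check_metal_source and the sample Metal text are made up for this example and are not part of enigma. In practice you would pass compiled.metal_source instead of the sample string. It collects the [[buffer(n)]] slots, reports which thread-index attributes appear, and notes whether a threadgroup_barrier call is present.

```python
import re

# Hypothetical Metal source, standing in for compiled.metal_source.
SAMPLE_METAL = """
#include <metal_stdlib>
using namespace metal;

kernel void add_kernel(device const float* a [[buffer(0)]],
                       device const float* b [[buffer(1)]],
                       device float* out [[buffer(2)]],
                       uint gid [[thread_position_in_grid]]) {
    out[gid] = a[gid] + b[gid];
}
"""

def check_metal_source(source, expected_buffers):
    """Summarize a few properties of Metal source text (illustrative helper)."""
    # Collect the slot numbers from every [[buffer(n)]] attribute.
    slots = sorted(int(n) for n in re.findall(r"\[\[buffer\((\d+)\)\]\]", source))
    return {
        "buffer_slots": slots,
        # Slots should be exactly 0 .. expected_buffers - 1.
        "buffer_count_ok": slots == list(range(expected_buffers)),
        "uses_grid_index": "[[thread_position_in_grid]]" in source,
        "uses_threadgroup_index": "[[thread_position_in_threadgroup]]" in source,
        "has_barrier": "threadgroup_barrier(" in source,
    }

report = check_metal_source(SAMPLE_METAL, expected_buffers=3)
for key, value in report.items():
    print(f"{key}: {value}")
```

A text scan like this only confirms that the expected pieces are present. It cannot tell whether a barrier sits in the right place, so still read the source around every shared-memory access.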

Saving Metal source for diffing

compiled_before = enigma.compile(kernel_v1, keep_metal_source=True, work_dir="./build/v1")
compiled_after  = enigma.compile(kernel_v2, keep_metal_source=True, work_dir="./build/v2")
compiled_before.export_metal("./build/v1/kernel.metal")
compiled_after.export_metal("./build/v2/kernel.metal")

Then diff:

diff ./build/v1/kernel.metal ./build/v2/kernel.metal

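Diffing the generated source is usually the quickest way to catch codegen regressions, because every change in the emitted Metal shows up directly in the diff.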
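
If you would rather stay in Python, the same comparison can be done with the standard-library difflib module. This is a minimal sketch: metal_diff is a hypothetical helper, and the two short strings stand in for compiled_before.metal_source and compiled_after.metal_source.

```python
import difflib

def metal_diff(before, after, before_name="v1/kernel.metal", after_name="v2/kernel.metal"):
    """Return a unified diff of two Metal source strings as a list of lines."""
    return list(difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=before_name,
        tofile=after_name,
        lineterm="",
    ))

# Hypothetical snippets standing in for the two generated sources.
before_src = "uint i = gid.x;\nout[i] = a[i] + b[i];"
after_src = "uint i = gid.x;\nout[i] = a[i] * b[i];"

diff_lines = metal_diff(before_src, after_src)
print("\n".join(diff_lines))
```

An empty list means the generated source did not change, which makes this easy to turn into a regression check.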

Minimal reproducible inputs

When diagnosing a correctness issue:

Use the smallest input that reproduces the problem (e.g. n=32)
Use np.arange(n, dtype=np.float32) as input so every expected value is predictable from its index
Check each element individually with out[i] rather than aggregate assertions

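import numpy as np

# `rt` and `compiled` come from your existing runtime setup and enigma.compile() call.
n = 32
a = np.arange(n, dtype=np.float32)  # 0, 1, 2, ... so each expected value is obvious
b = np.ones(n, dtype=np.float32)

# n * 4 is the size in bytes of n float32 values
raw = rt.execute(compiled, [a, b], n * 4, grid=(n, 1, 1), threads=(32, 1, 1))
out = np.frombuffer(raw, dtype=np.float32)

# Compare element by element; the negated <= also flags NaN outputs
for i in range(n):
    expected = a[i] + b[i]
    if not abs(out[i] - expected) <= 1e-5:
        print(f"[{i}] got {out[i]:.4f}, expected {expected:.4f}")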
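
The per-element check can be wrapped in a small reusable helper. The sketch below is plain Python with no enigma or GPU dependency; find_mismatches is a hypothetical name, and the output list is simulated with one deliberately wrong element. It uses math.isclose so that NaN outputs are reported as mismatches rather than silently passing.

```python
import math

def find_mismatches(out, expected, tol=1e-5, limit=8):
    """Return up to `limit` (index, got, want) tuples that differ by more than tol."""
    if len(out) != len(expected):
        raise ValueError(f"length mismatch: {len(out)} vs {len(expected)}")
    mismatches = []
    for i, (got, want) in enumerate(zip(out, expected)):
        # isclose returns False for NaN, so NaN outputs are reported too.
        if not math.isclose(got, want, rel_tol=0.0, abs_tol=tol):
            mismatches.append((i, got, want))
            if len(mismatches) == limit:
                break
    return mismatches

# Simulated kernel output: element 5 is wrong on purpose.
expected = [float(i) + 1.0 for i in range(32)]
out = list(expected)
out[5] = 0.0

for i, got, want in find_mismatches(out, expected):
    print(f"[{i}] got {got:.4f}, expected {want:.4f}")
```

The helper iterates over any sequence of floats, so a NumPy array from np.frombuffer can be passed in directly. Capping the report with limit keeps the output readable when an indexing bug corrupts every element.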

Debugging layout algebra

When a tiling produces unexpected output, print shape and stride at each transform:

thr = enigma.make_ordered_layout((4, 64), order=(1, 0))
val = enigma.make_ordered_layout((4, 4), order=(1, 0))
tiler_mn, tv_layout = enigma.make_layout_tv(thr, val)

print("thr layout:", thr)
print("val layout:", val)
print("tiler_mn:  ", tiler_mn)
print("tv_layout: ", tv_layout)
print("threads:   ", enigma.size(tv_layout, mode=[0]))
print("vals/thr:  ", enigma.size(tv_layout, mode=[1]))

Confirm:

threads × vals/thr = total elements in one tile
Tiler dimensions fit within the tensor dimensions
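
Both checks are simple arithmetic and can be asserted once you have read the numbers off the printed layouts. The sketch below is plain Python and does not call enigma. check_tiling is a hypothetical helper, and the tiler shape (16, 256) and tensor shape (64, 512) are assumed values for illustration, so substitute the ones from your own run.

```python
from math import prod

def check_tiling(threads, vals_per_thread, tiler_shape, tensor_shape):
    """Check thread/value counts against a tile shape (illustrative helper)."""
    return {
        # Every element of the tile should be owned by exactly one (thread, value) pair.
        "counts_match": threads * vals_per_thread == prod(tiler_shape),
        # The tile must not be larger than the tensor in any dimension.
        "fits_tensor": all(t <= n for t, n in zip(tiler_shape, tensor_shape)),
        # If a dimension is not divisible, the last tile is partial and needs bounds handling.
        "divides_evenly": all(n % t == 0 for t, n in zip(tiler_shape, tensor_shape)),
    }

# 4 * 64 = 256 threads and 4 * 4 = 16 values per thread, as in the layouts above.
result = check_tiling(
    threads=256,
    vals_per_thread=16,
    tiler_shape=(16, 256),    # assumed; use the printed tiler_mn
    tensor_shape=(64, 512),   # hypothetical tensor
)
print(result)
```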

Fast debugging loop

Lock a minimal reproducible input (small n, predictable values)
Compile with dump_ir=True, dump_mlir=True, keep_metal_source=True
Read the Metal source first; the bug is usually visible there
Make one change at a time and re-check the Metal source before dispatching
Only run dispatch once the Metal source looks correct
