Enigma uses Metal’s explicit grid-and-threadgroup execution model. You control the full launch geometry at dispatch time — Enigma never infers it from data shapes.

Grid and threadgroups

A kernel launch is defined by two parameters:
  • grid=(gx, gy, gz) — total number of threads in each dimension across the entire dispatch
  • threads=(tx, ty, tz) — number of threads per threadgroup in each dimension
Metal schedules grid / threads threadgroups in each dimension; in the example below, 8192 / 256 = 32 threadgroups along x. The threads of a threadgroup execute together on a single GPU core and share threadgroup (shared) memory.
Grid (gx=8192, gy=1, gz=1)
├── Threadgroup 0 (tx=256 threads)
├── Threadgroup 1 (tx=256 threads)
├── ...
└── Threadgroup 31 (tx=256 threads)
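
The threadgroup count in the diagram can be checked with plain arithmetic. The sketch below is pure Python and does not call Enigma; `threadgroups_per_grid` is a hypothetical helper name. It rounds up when a grid dimension is not a multiple of the threadgroup dimension, which is an assumption: this page does not say how Enigma treats a partial trailing threadgroup.

```python
def threadgroups_per_grid(grid, threads):
    """Number of threadgroups along x, y, z (hypothetical helper, not an Enigma API)."""
    counts = []
    for g, t in zip(grid, threads):
        if g <= 0 or t <= 0:
            raise ValueError("grid and threads dimensions must be positive")
        # Ceiling division: a partial trailing threadgroup still counts as one.
        counts.append((g + t - 1) // t)
    return tuple(counts)


print(threadgroups_per_grid((8192, 1, 1), (256, 1, 1)))  # (32, 1, 1)
```

For the 2D launch used later on this page, `threadgroups_per_grid((512, 256, 1), (16, 16, 1))` gives `(32, 16, 1)`.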

Thread index queries

Inside a function decorated with @enigma.kernel, use these queries to obtain thread coordinates as IRValue objects:

Global position (most common)

tid = enigma.thread_position_in_grid                    # x-dimension shorthand
x   = enigma.thread_position_in_grid_xyz("x")           # explicit x
y   = enigma.thread_position_in_grid_xyz("y")
z   = enigma.thread_position_in_grid_xyz("z")

Threadgroup-relative position

local_id  = enigma.thread_position_in_threadgroup("x")  # position within threadgroup
group_id  = enigma.threadgroup_position_in_grid("x")    # which threadgroup
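
The global and threadgroup-relative coordinates are related by simple arithmetic. The following pure-Python sketch shows that relationship for one dimension, assuming every threadgroup has the same size; the function names are hypothetical and nothing here calls Enigma.

```python
def global_position(group_id, local_id, threads_per_group):
    """Global index of a thread from its threadgroup index and local index."""
    return group_id * threads_per_group + local_id


def split_position(global_id, threads_per_group):
    """Inverse mapping: returns (group_id, local_id) for a global index."""
    return divmod(global_id, threads_per_group)


# Thread 10 of threadgroup 3, with 256 threads per threadgroup.
print(global_position(3, 10, 256))  # 778
print(split_position(778, 256))     # (3, 10)
```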

SIMD group queries

simd_idx  = enigma.thread_index_in_simdgroup()          # lane index [0, 31]
simd_gid  = enigma.simdgroup_index_in_threadgroup()     # which simd group [0, N-1]
simd_size = enigma.threads_per_simdgroup()              # typically 32
tg_size   = enigma.threads_per_threadgroup()            # tx * ty * tz
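
As a rough illustration of how these values relate, the pure-Python sketch below derives the SIMD group index and lane from a thread's linear index within a 1D threadgroup. It assumes lanes are assigned consecutively by linear thread index and that the SIMD width is 32; both are assumptions to verify on the target device, and the helper names are hypothetical.

```python
def simd_coordinates(local_id, simd_size=32):
    """Return (simd_group_index, lane_index) for a linear thread index."""
    simd_gid, lane = divmod(local_id, simd_size)
    return simd_gid, lane


def simdgroups_per_threadgroup(threads, simd_size=32):
    """Number of SIMD groups needed to hold tx * ty * tz threads."""
    tx, ty, tz = threads
    total = tx * ty * tz
    return (total + simd_size - 1) // simd_size


print(simd_coordinates(70))                     # (2, 6)
print(simdgroups_per_threadgroup((256, 1, 1)))  # 8
```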

Launch patterns

1D — elementwise

For kernels where each thread handles one element:
@enigma.kernel
def elementwise(A: enigma.f32, B: enigma.f32, C: enigma.f32):
    tid = enigma.thread_position_in_grid
    C[tid] = A[tid] + B[tid]
Dispatch:
n = 65536
rt.execute(compiled, [a, b], output_size=n * 4, grid=(n, 1, 1), threads=(256, 1, 1))
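
When checking results, a CPU reference that plays the role of the grid can be handy. The sketch below is pure Python with hypothetical helper names and does not call Enigma: each loop iteration stands in for one GPU thread, and `output_size_bytes` reproduces the `n * 4` byte count used in the dispatch above.

```python
def elementwise_reference(a, b):
    """CPU reference: iteration `tid` does the work of GPU thread `tid`."""
    n = len(a)
    c = [0.0] * n
    for tid in range(n):
        c[tid] = a[tid] + b[tid]
    return c


def output_size_bytes(n, bytes_per_element=4):
    """Output buffer size in bytes; f32 uses 4 bytes per element."""
    return n * bytes_per_element


print(elementwise_reference([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
print(output_size_bytes(65536))                       # 262144
```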

2D — matrix kernels

For kernels operating on 2D data, use y for rows and x for columns (standard convention):
@enigma.kernel
def matscale(A: enigma.f32, C: enigma.f32):
    row = enigma.thread_position_in_grid_xyz("y")
    col = enigma.thread_position_in_grid_xyz("x")
    idx = row * 512 + col   # row-major; 512 must equal cols at dispatch
    C[idx] = A[idx] * 2.0
Dispatch:
rows, cols = 256, 512
rt.execute(compiled, [a], output_size=rows * cols * 4,
           grid=(cols, rows, 1), threads=(16, 16, 1))
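
To confirm that a 2D grid covers each element exactly once, the index mapping can be enumerated on the CPU. This is a pure-Python sketch with hypothetical helper names; it assumes x indexes columns and y indexes rows, as in the kernel above, and uses a small grid so the result is easy to inspect.

```python
def row_major_index(row, col, cols):
    """Flat index of (row, col) in a row-major matrix with `cols` columns."""
    return row * cols + col


def covered_indices(grid, cols):
    """Flat indices touched when one thread runs per (x, y) grid position."""
    gx, gy, _ = grid
    return [row_major_index(y, x, cols) for y in range(gy) for x in range(gx)]


rows, cols = 3, 4
indices = covered_indices((cols, rows, 1), cols)
# Every element index appears exactly once.
print(sorted(indices) == list(range(rows * cols)))  # True
```

Passing `grid=(rows, cols, 1)` by mistake would leave some indices uncovered and others outside the matrix, which this check would expose.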

Vectorized — vec_width

When you compile with vec_width=4, each thread processes 4 elements. Divide the grid's x dimension by the vector width:
compiled = enigma.compile(add, vec_width=4)
rt.execute(compiled, [a, b], output_size=n * 4,
           grid=(n // 4, 1, 1), threads=(256, 1, 1))
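
Since `n // 4` discards any remainder, it may be worth guarding the grid computation. The sketch below is pure Python with a hypothetical helper name; it assumes the element count must be an exact multiple of the vector width, which this page does not state explicitly.

```python
def vectorized_grid(n, vec_width):
    """1D grid for a kernel in which each thread handles `vec_width` elements."""
    if vec_width <= 0:
        raise ValueError("vec_width must be positive")
    if n % vec_width != 0:
        # Integer division would silently leave the last n % vec_width
        # elements without a thread.
        raise ValueError(f"n={n} is not a multiple of vec_width={vec_width}")
    return (n // vec_width, 1, 1)


print(vectorized_grid(65536, 4))  # (16384, 1, 1)
```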

Launch configuration rules

  • Every logical element must be covered by exactly one thread.
  • output_size is in bytes, not elements. For f32: elements × 4.
  • Threadgroup size should be a multiple of the SIMD group size (32) for best occupancy.
  • For padded domains, use predicated stores (store_if) so that threads outside the logical domain do not write out of bounds.
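
The padding rule can be illustrated without a GPU. In the pure-Python sketch below, the grid is rounded up to a whole number of threadgroups and an `if tid < n` check plays the role of a predicated store. The helper names are hypothetical, and the actual signature of `store_if` is not shown on this page, so it is not used here.

```python
def padded_grid_1d(n, tx=256):
    """Round n up to a whole number of threadgroups of size tx."""
    groups = (n + tx - 1) // tx
    return (groups * tx, 1, 1)


def simulate_predicated_add(a, b, tx=256):
    """Run one simulated thread per padded grid position; only in-range threads write."""
    n = len(a)
    gx, _, _ = padded_grid_1d(n, tx)
    c = [0.0] * n
    writes = 0
    for tid in range(gx):
        if tid < n:  # predicate: padding threads skip both the loads and the store
            c[tid] = a[tid] + b[tid]
            writes += 1
    return c, writes


c, writes = simulate_predicated_add([1.0] * 1000, [2.0] * 1000)
print(padded_grid_1d(1000), writes)  # (1024, 1, 1) 1000
```

The 24 padding threads do no work, so each of the 1000 logical elements is still written exactly once.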

Threadgroup size guidelines

Pattern               Recommended threads
1D elementwise        (256, 1, 1)
2D tiled              (16, 16, 1)
SIMD reduction        (256, 1, 1), i.e. 8 SIMD groups
Shared memory kernel  Tune based on max_threads_per_threadgroup from device caps

Device limits

Query capabilities at runtime:
caps = rt.device_capabilities()
print(caps.max_threads_per_threadgroup)   # typically 1024
print(caps.simdgroup_size)               # typically 32
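
One possible use of these values is to check a threadgroup shape before dispatch. The sketch below is pure Python with a hypothetical helper name; its default limits of 1024 and 32 are the typical values quoted above and would normally be replaced by the numbers reported by `rt.device_capabilities()`.

```python
def check_threadgroup(threads, max_threads=1024, simd_size=32):
    """Return a list of problems with a threadgroup shape (empty if none)."""
    tx, ty, tz = threads
    total = tx * ty * tz
    problems = []
    if total > max_threads:
        problems.append(f"{total} threads exceeds the limit of {max_threads}")
    if total % simd_size != 0:
        problems.append(f"{total} threads is not a multiple of {simd_size}")
    return problems


print(check_threadgroup((256, 1, 1)))  # []
print(check_threadgroup((64, 32, 1)))  # one problem: 2048 exceeds 1024
```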