Execution Model

Enigma uses Metal’s explicit grid-and-threadgroup execution model. You control the full launch geometry at dispatch time — Enigma never infers it from data shapes.

Grid and threadgroups

A kernel launch is defined by two parameters:

grid=(gx, gy, gz) — total number of threads in each dimension across the entire dispatch
threads=(tx, ty, tz) — number of threads per threadgroup in each dimension

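Metal schedules grid / threads threadgroups in each dimension (gx / tx along x, and likewise for y and z). All threads of a threadgroup run on a single GPU core and share threadgroup (shared) memory. For example, a grid of 8192 threads launched in threadgroups of 256 yields 32 threadgroups: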

Grid (gx=8192, gy=1, gz=1)
├── Threadgroup 0 (tx=256 threads)
├── Threadgroup 1 (tx=256 threads)
├── ...
└── Threadgroup 31 (tx=256 threads)
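The threadgroup count in the diagram comes from dividing each grid extent by the matching threadgroup extent. The following is a minimal plain-Python sketch of that arithmetic. `threadgroups_per_grid` is a hypothetical helper, not part of Enigma, and it assumes every grid extent is an exact multiple of the threadgroup extent.

```python
def threadgroups_per_grid(grid, threads):
    """Return the number of threadgroups per dimension for a launch.

    Hypothetical helper (not an Enigma API). Assumes every grid extent
    is an exact multiple of the matching threadgroup extent.
    """
    counts = []
    for g, t in zip(grid, threads):
        if t <= 0 or g % t != 0:
            raise ValueError(f"grid extent {g} is not a multiple of {t}")
        counts.append(g // t)
    return tuple(counts)


# The launch from the diagram: 8192 threads in threadgroups of 256.
groups = threadgroups_per_grid((8192, 1, 1), (256, 1, 1))
print(groups)  # (32, 1, 1)
```

Rejecting non-multiples up front is a design choice for this sketch; it keeps the "exactly one thread per element" rule easy to reason about.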

Thread index queries

Inside a function decorated with @enigma.kernel, use these queries to obtain thread coordinates as IRValue objects:

Global position (most common)

tid = enigma.thread_position_in_grid                    # x-dimension shorthand
x   = enigma.thread_position_in_grid_xyz("x")           # explicit x
y   = enigma.thread_position_in_grid_xyz("y")
z   = enigma.thread_position_in_grid_xyz("z")

Threadgroup-relative position

local_id  = enigma.thread_position_in_threadgroup("x")  # position within threadgroup
group_id  = enigma.threadgroup_position_in_grid("x")    # which threadgroup
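When the grid is an exact multiple of the threadgroup size, the x-coordinates are related by global = group × threads-per-group + local. The plain-Python sketch below only emulates that relationship on the CPU. `split_global_index` is a hypothetical name, not an Enigma function.

```python
def split_global_index(global_id, threads_per_group):
    """Split a global x index into (group_id, local_id).

    CPU-side emulation only; hypothetical helper, not an Enigma API.
    Assumes the grid extent is a multiple of threads_per_group.
    """
    group_id, local_id = divmod(global_id, threads_per_group)
    return group_id, local_id


group_id, local_id = split_global_index(1000, 256)
# Recombining the two parts gives back the global index.
assert group_id * 256 + local_id == 1000
```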

SIMD group queries

simd_idx  = enigma.thread_index_in_simdgroup()          # lane index [0, 31]
simd_gid  = enigma.simdgroup_index_in_threadgroup()     # which simd group [0, N-1]
simd_size = enigma.threads_per_simdgroup()              # typically 32
tg_size   = enigma.threads_per_threadgroup()            # tx * ty * tz
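The upper bound N for the SIMD group index follows from the threadgroup size and the SIMD width. The sketch below shows that arithmetic, assuming a SIMD width of 32 as the default. `simdgroups_per_threadgroup` is a hypothetical helper, and real code should take the width from the device rather than hard-coding it.

```python
import math


def simdgroups_per_threadgroup(threads, simd_size=32):
    """Number of SIMD groups needed to hold one threadgroup.

    Hypothetical helper (not an Enigma API). A partially filled SIMD
    group still counts as one group, hence the ceiling division.
    """
    tg_size = math.prod(threads)          # tx * ty * tz
    return -(-tg_size // simd_size)       # ceiling division on integers


print(simdgroups_per_threadgroup((256, 1, 1)))  # 8
print(simdgroups_per_threadgroup((16, 16, 1)))  # 8
```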

Launch patterns

1D — elementwise

For kernels where each thread handles one element:

@enigma.kernel
def elementwise(A: enigma.f32, B: enigma.f32, C: enigma.f32):
    tid = enigma.thread_position_in_grid
    C[tid] = A[tid] + B[tid]

Dispatch:

n = 65536
rt.execute(compiled, [a, b], output_size=n * 4, grid=(n, 1, 1), threads=(256, 1, 1))
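Since output_size is given in bytes, it can help to derive the launch arguments from the element count in one place. This is a sketch under stated assumptions: `launch_args_1d` is a hypothetical helper, and the only Enigma names it relies on are the `output_size`, `grid`, and `threads` keywords shown in the dispatch above.

```python
import struct

# Size of one 32-bit float in bytes ("<f" uses the standard size, 4).
F32_BYTES = struct.calcsize("<f")


def launch_args_1d(n, threads_per_group=256, itemsize=F32_BYTES):
    """Build keyword arguments for an n-element 1D launch.

    Hypothetical helper (not an Enigma API): one thread per element,
    output buffer sized in bytes.
    """
    return {
        "output_size": n * itemsize,
        "grid": (n, 1, 1),
        "threads": (threads_per_group, 1, 1),
    }


args = launch_args_1d(65536)
print(args["output_size"])  # 262144
```

The resulting dictionary could then be forwarded with `**args` to a call such as the `rt.execute` dispatch shown above.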

2D — matrix kernels

For kernels operating on 2D data, use y for rows and x for columns (standard convention):

@enigma.kernel
def matscale(A: enigma.f32, C: enigma.f32):
    row = enigma.thread_position_in_grid_xyz("y")
    col = enigma.thread_position_in_grid_xyz("x")
    idx = row * 512 + col   # row-major, cols=512
    C[idx] = A[idx] * 2.0

Dispatch:

rows, cols = 256, 512
rt.execute(compiled, [a], output_size=rows * cols * 4,
           grid=(cols, rows, 1), threads=(16, 16, 1))
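A CPU reference is useful for checking the output of a 2D kernel like this one. The sketch below emulates the grid in plain Python, with one loop iteration standing in for one GPU thread. `matscale_reference` is a hypothetical name, and the column count is passed in rather than hard-coded as 512.

```python
def matscale_reference(a, rows, cols):
    """CPU emulation of the matscale kernel over a (cols, rows, 1) grid.

    Hypothetical reference helper (not an Enigma API). `a` is a flat,
    row-major list of rows * cols values.
    """
    c = [0.0] * (rows * cols)
    for row in range(rows):           # grid y
        for col in range(cols):       # grid x
            idx = row * cols + col    # row-major flattening
            c[idx] = a[idx] * 2.0
    return c


# 2 x 3 example: every element is visited exactly once and doubled.
print(matscale_reference([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], rows=2, cols=3))
# [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
```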

Vectorized — `vec_width`

When you compile with vec_width=4, each thread processes 4 elements, so the grid needs only n / 4 threads along x. The integer division below covers every element only when n is a multiple of 4:

compiled = enigma.compile(add, vec_width=4)
rt.execute(compiled, [a, b], output_size=n * 4,
           grid=(n // 4, 1, 1), threads=(256, 1, 1))
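The grid arithmetic for a vectorized launch can be checked before dispatch. This is a minimal sketch: `vectorized_grid` is a hypothetical helper, not an Enigma function, and it rejects element counts that the integer division would silently truncate.

```python
def vectorized_grid(n, vec_width):
    """Return the 1D grid for n elements at vec_width elements per thread.

    Hypothetical helper (not an Enigma API). Raises if n is not a
    multiple of vec_width, because n // vec_width would then leave the
    trailing elements without a thread.
    """
    if vec_width <= 0 or n % vec_width != 0:
        raise ValueError(f"n={n} is not a multiple of vec_width={vec_width}")
    return (n // vec_width, 1, 1)


grid = vectorized_grid(65536, 4)
print(grid)  # (16384, 1, 1)
# Threads times elements per thread equals the element count.
assert grid[0] * 4 == 65536
```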

Launch configuration rules

Every logical element must be covered by exactly one thread.
output_size is in bytes, not elements. For f32: elements × 4.
Threadgroup size should be a multiple of the SIMD group size (32) for best occupancy.
For padded domains, where the grid is rounded up beyond the element count, use predicated stores (store_if) to avoid writing out of bounds.
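The padded-domain rule can be illustrated on the CPU. The sketch below rounds the grid up to a whole number of threadgroups and emulates the bounds predicate with an ordinary `if`. It does not call `store_if`, whose signature is not shown here, and both helper names are hypothetical.

```python
def padded_grid_1d(n, threads_per_group):
    """Round n up to a whole number of threadgroups (hypothetical helper)."""
    groups = -(-n // threads_per_group)   # ceiling division
    return (groups * threads_per_group, 1, 1)


def emulate_guarded_add(a, b, threads_per_group=256):
    """CPU emulation of an elementwise add over a padded grid.

    Threads whose index falls in the padding skip the store, which is
    the condition a predicated store would test on the GPU.
    """
    n = len(a)
    gx, _, _ = padded_grid_1d(n, threads_per_group)
    c = [0.0] * n
    for tid in range(gx):
        if tid < n:                       # bounds predicate
            c[tid] = a[tid] + b[tid]
    return c


print(padded_grid_1d(1000, 256))  # (1024, 1, 1)
```

With 1000 elements and threadgroups of 256, the 24 threads in the padding do no work, and each real element is still written exactly once.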

Threadgroup size guidelines

Pattern	Recommended `threads`
1D elementwise	`(256, 1, 1)`
2D tiled	`(16, 16, 1)`
SIMD reduction	`(256, 1, 1)` — 8 SIMD groups
Shared memory kernel	Tune based on `max_threads_per_threadgroup` from device caps

Device limits

Query capabilities at runtime:

caps = rt.device_capabilities()
print(caps.max_threads_per_threadgroup)   # typically 1024
print(caps.simdgroup_size)               # typically 32
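These limits can be used to sanity-check a threadgroup shape before launching. In the sketch below, `Caps` is a stand-in dataclass that reuses the two field names printed above, with the typical values as assumed defaults, and `check_threads` is a hypothetical helper rather than an Enigma function.

```python
import math
from dataclasses import dataclass


@dataclass
class Caps:
    """Stand-in for the capabilities object; defaults are assumptions."""
    max_threads_per_threadgroup: int = 1024
    simdgroup_size: int = 32


def check_threads(threads, caps):
    """Return a list of problems with a threadgroup shape (empty if none)."""
    problems = []
    total = math.prod(threads)            # tx * ty * tz
    if total > caps.max_threads_per_threadgroup:
        problems.append(
            f"{total} threads exceeds the limit of "
            f"{caps.max_threads_per_threadgroup}"
        )
    if total % caps.simdgroup_size != 0:
        problems.append(
            f"{total} threads is not a multiple of the SIMD group size "
            f"{caps.simdgroup_size}"
        )
    return problems


print(check_threads((256, 1, 1), Caps()))  # []
```

Returning a list instead of raising lets a caller report every problem at once; the second check mirrors the occupancy guideline and is advisory rather than a hard limit.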
