# Execution Model

> How Enigma maps kernel launches to Metal grid geometry, threadgroups, and thread indices.

Enigma uses Metal's explicit grid-and-threadgroup execution model. You control the full launch geometry at dispatch time — Enigma never infers it from data shapes.

## Grid and threadgroups

A kernel launch is defined by two parameters:

* **`grid=(gx, gy, gz)`** — total number of threads in each dimension across the entire dispatch
* **`threads=(tx, ty, tz)`** — number of threads per threadgroup in each dimension

Metal divides the grid into threadgroups, `grid / threads` along each dimension; the example below has 8192 / 256 = 32 of them. All threads in a threadgroup run on the same GPU core and can share threadgroup (shared) memory.

```
Grid (gx=8192, gy=1, gz=1)
├── Threadgroup 0 (tx=256 threads)
├── Threadgroup 1 (tx=256 threads)
├── ...
└── Threadgroup 31 (tx=256 threads)
```
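The threadgroup count in the diagram can be checked with a few lines of plain Python. This is a minimal sketch, not part of Enigma: `threadgroup_counts` is a hypothetical helper, and it assumes a partial threadgroup is rounded up when `grid` is not a multiple of `threads`, which this page does not specify.

```python
def threadgroup_counts(grid, threads):
    """Per-dimension threadgroup counts, rounding up for partial groups (assumption)."""
    return tuple((g + t - 1) // t for g, t in zip(grid, threads))

# The launch from the diagram: 8192 threads in x, 256 threads per threadgroup.
print(threadgroup_counts((8192, 1, 1), (256, 1, 1)))  # (32, 1, 1)
```
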

## Thread index queries

Inside an `@enigma.kernel` function, use the following queries to obtain thread coordinates as `IRValue` objects:

### Global position (most common)

```python
tid = enigma.thread_position_in_grid                    # x-dimension shorthand
x   = enigma.thread_position_in_grid_xyz("x")           # explicit x
y   = enigma.thread_position_in_grid_xyz("y")
z   = enigma.thread_position_in_grid_xyz("z")
```

### Threadgroup-relative position

```python
local_id  = enigma.thread_position_in_threadgroup("x")  # position within threadgroup
group_id  = enigma.threadgroup_position_in_grid("x")    # which threadgroup
```
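The two threadgroup-relative values combine to give the global position. The sketch below models that relationship on the CPU with plain integers. `global_x` is a hypothetical helper, and it assumes every threadgroup has the full `tx` threads.

```python
def global_x(group_id, local_id, tx):
    """Global x position from threadgroup index and position within the threadgroup."""
    return group_id * tx + local_id

tx = 256
# Last thread of the last threadgroup in an 8192-wide grid.
print(global_x(31, 255, tx))  # 8191
```
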

### SIMD group queries

```python
simd_idx  = enigma.thread_index_in_simdgroup()          # lane index [0, 31]
simd_gid  = enigma.simdgroup_index_in_threadgroup()     # which simd group [0, N-1]
simd_size = enigma.threads_per_simdgroup()              # typically 32
tg_size   = enigma.threads_per_threadgroup()            # tx * ty * tz
```
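To see how these four values relate, the sketch below splits a thread's linear index within its threadgroup into a SIMD group index and a lane index. `simd_coords` is a hypothetical helper. It assumes lanes are assigned in linear thread order, which is typical on Metal but not stated on this page.

```python
def simd_coords(local_linear_index, simd_size=32):
    """Return (simd group index, lane index) for a thread within its threadgroup."""
    return divmod(local_linear_index, simd_size)

tg_size = 256                 # threads per threadgroup
print(tg_size // 32)          # 8 SIMD groups per threadgroup
print(simd_coords(70))        # (2, 6)
```
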

## Launch patterns

### 1D — elementwise

For kernels where each thread handles one element:

```python
@enigma.kernel
def elementwise(A: enigma.f32, B: enigma.f32, C: enigma.f32):
    tid = enigma.thread_position_in_grid
    C[tid] = A[tid] + B[tid]
```

Dispatch:

```python
n = 65536
rt.execute(compiled, [a, b], output_size=n * 4, grid=(n, 1, 1), threads=(256, 1, 1))
```
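A CPU stand-in can help when checking GPU results. The following sketch is plain Python and does not call Enigma: `emulate_elementwise` is a hypothetical helper that runs one loop iteration per thread in the grid, mirroring the kernel body above.

```python
def emulate_elementwise(a, b, grid):
    """CPU stand-in for the kernel: one loop iteration per GPU thread."""
    c = [0.0] * grid[0]
    for tid in range(grid[0]):      # tid plays the role of thread_position_in_grid
        c[tid] = a[tid] + b[tid]
    return c

n = 8
a = [float(i) for i in range(n)]
b = [10.0] * n
c = emulate_elementwise(a, b, grid=(n, 1, 1))
print(c[0], c[-1])   # 10.0 17.0
print(n * 4)         # 32 bytes of f32 output for eight elements
```
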

### 2D — matrix kernels

For kernels operating on 2D data, use `y` for rows and `x` for columns (standard convention):

```python
@enigma.kernel
def matscale(A: enigma.f32, C: enigma.f32):
    row = enigma.thread_position_in_grid_xyz("y")
    col = enigma.thread_position_in_grid_xyz("x")
    idx = row * 512 + col   # row-major, cols=512
    C[idx] = A[idx] * 2.0
```

Dispatch:

```python
rows, cols = 256, 512
rt.execute(compiled, [a], output_size=rows * cols * 4,
           grid=(cols, rows, 1), threads=(16, 16, 1))
```
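The following sketch checks, on the CPU, that a `grid=(cols, rows, 1)` launch with row-major indexing touches every element exactly once. It is an illustration only: `covered_indices` is a hypothetical helper, not an Enigma function.

```python
def covered_indices(rows, cols):
    """Flat indices touched by a grid=(cols, rows, 1) launch."""
    indices = []
    for row in range(rows):                  # y coordinate
        for col in range(cols):              # x coordinate
            indices.append(row * cols + col) # row-major index
    return indices

rows, cols = 256, 512
idx = covered_indices(rows, cols)
print(len(idx) == len(set(idx)) == rows * cols)  # True
print(rows * cols * 4)                           # 524288
```

The second value is the `output_size` in bytes passed in the dispatch above.
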

### Vectorized — `vec_width`

When you compile with `vec_width=4`, each thread processes 4 elements. Divide the grid accordingly:

```python
# `add` is an elementwise @enigma.kernel defined elsewhere; n is the element count
compiled = enigma.compile(add, vec_width=4)
rt.execute(compiled, [a, b], output_size=n * 4,
           grid=(n // 4, 1, 1), threads=(256, 1, 1))
```
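The grid arithmetic for a vectorized launch can be wrapped in a small helper. This is a sketch under assumptions: `vectorized_grid` is hypothetical, and it rejects element counts that are not a multiple of `vec_width` because this page does not describe how Enigma handles a remainder.

```python
def vectorized_grid(n, vec_width):
    """1D grid for a kernel in which each thread processes vec_width elements."""
    if n % vec_width != 0:
        raise ValueError(f"n={n} is not a multiple of vec_width={vec_width}")
    return (n // vec_width, 1, 1)

print(vectorized_grid(65536, 4))  # (16384, 1, 1)
```
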

## Launch configuration rules

* Every logical element must be covered by exactly one thread.
* `output_size` is in **bytes**, not elements. For `f32`: `elements × 4`.
* Threadgroup size should be a multiple of the SIMD group size (typically 32) for best occupancy.
* For padded domains, use predicated stores (`store_if`) to avoid writing out of bounds.
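The padding rule can be illustrated on the CPU. In this sketch `padded_grid_1d` is a hypothetical helper that rounds the grid up to a whole number of threadgroups, and the `if tid < n` guard stands in for the predicate a `store_if` would evaluate. The signature of `store_if` is not shown on this page, so it is not called here.

```python
def padded_grid_1d(n, tx):
    """Round n up to the next multiple of the threadgroup width tx."""
    return ((n + tx - 1) // tx * tx, 1, 1)

n, tx = 1000, 256
grid = padded_grid_1d(n, tx)
print(grid)  # (1024, 1, 1)

out = [0] * n
writes = 0
for tid in range(grid[0]):   # 1024 threads are launched
    if tid < n:              # predicate: only in-bounds threads store
        out[tid] = 1
        writes += 1
print(writes)  # 1000
```
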

## Threadgroup size guidelines

| Pattern              | Recommended `threads`                                        |
| -------------------- | ------------------------------------------------------------ |
| 1D elementwise       | `(256, 1, 1)`                                                |
| 2D tiled             | `(16, 16, 1)`                                                |
| SIMD reduction       | `(256, 1, 1)` — 8 SIMD groups                                |
| Shared memory kernel | Tune based on `max_threads_per_threadgroup` from device caps |

## Device limits

Query capabilities at runtime:

```python
caps = rt.device_capabilities()
print(caps.max_threads_per_threadgroup)   # typically 1024
print(caps.simdgroup_size)               # typically 32
```
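These limits can be used to validate a `threads` tuple before dispatch. The sketch below is a hypothetical check, not an Enigma API. It takes the two limits as plain integers, so the values printed above can be passed in directly.

```python
def check_threads(threads, max_threads, simd_size):
    """Raise if the threadgroup is too large; return True if it is SIMD-aligned."""
    tx, ty, tz = threads
    total = tx * ty * tz
    if total > max_threads:
        raise ValueError(f"{total} threads exceeds the device limit of {max_threads}")
    return total % simd_size == 0

print(check_threads((16, 16, 1), 1024, 32))  # True  (256 threads)
print(check_threads((10, 10, 1), 1024, 32))  # False (100 threads)
```
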
