Grid and threadgroups
A kernel launch is defined by two parameters:
- grid=(gx, gy, gz): total number of threads in each dimension across the entire dispatch
- threads=(tx, ty, tz): number of threads per threadgroup in each dimension
Dividing the two gives grid / threads threadgroups in each dimension. The threads of a threadgroup run concurrently on a single GPU core and share threadgroup (shared) memory.
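As a quick check of that arithmetic, the sketch below computes the threadgroup count per dimension in plain Python. The helper threadgroup_counts is hypothetical and not part of enigma. It assumes that a grid which is not an exact multiple of the threadgroup size rounds up to one extra, partially filled threadgroup, which this page does not state.

```python
def threadgroup_counts(grid, threads):
    """Return the number of threadgroups per dimension for a launch.

    Hypothetical helper, not part of enigma. Assumes a partial
    threadgroup is rounded up (ceiling division).
    """
    return tuple((g + t - 1) // t for g, t in zip(grid, threads))


# 1D: 1024 threads in threadgroups of 256 -> 4 threadgroups along x
counts_1d = threadgroup_counts((1024, 1, 1), (256, 1, 1))

# 2D: 64 x 32 threads in 16 x 16 threadgroups -> 4 along x, 2 along y
counts_2d = threadgroup_counts((64, 32, 1), (16, 16, 1))

print(counts_1d, counts_2d)
```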
Thread index queries
Inside an @enigma.kernel function, use these functions to obtain thread coordinates as IRValue objects:
Global position (most common)
Threadgroup-relative position
SIMD group queries
Launch patterns
1D — elementwise
For kernels where each thread handles one element:
2D — matrix kernels
For kernels operating on 2D data, use y for rows and x for columns (standard convention):
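A minimal sketch of that convention, assuming a rows × cols matrix: the grid's x extent covers the columns and its y extent covers the rows, each rounded up to a whole number of (16, 16, 1) threadgroups. The helpers round_up and grid_2d are hypothetical, not enigma functions. The round-up padding is also an assumption, and it means the kernel has to guard threads that fall outside the matrix.

```python
def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def grid_2d(rows, cols, threads=(16, 16, 1)):
    """Build a 2D grid with x covering columns and y covering rows.

    Hypothetical helper, not part of enigma.
    """
    tx, ty, _ = threads
    return (round_up(cols, tx), round_up(rows, ty), 1)


# A 100 x 250 matrix: x is padded from 250 to 256, y from 100 to 112
grid = grid_2d(rows=100, cols=250)
print(grid)
```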
Vectorized — vec_width
When you compile with vec_width=4, each thread processes 4 elements. Divide the grid accordingly:
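For example, with vec_width=4 a buffer of n elements needs about n / 4 threads along x. The sketch below is plain Python and does not call enigma. The helper vectorized_grid is hypothetical, and rounding up to one extra thread for leftover elements is an assumption that this page does not confirm.

```python
def vectorized_grid(n_elements, vec_width=4):
    """Return a 1D grid where each thread covers vec_width elements.

    Hypothetical helper, not part of enigma. Assumes a trailing
    partial vector is handled by one extra thread.
    """
    n_threads = (n_elements + vec_width - 1) // vec_width
    return (n_threads, 1, 1)


# 4096 elements at 4 elements per thread -> 1024 threads
grid = vectorized_grid(4096, vec_width=4)
print(grid)
```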
Launch configuration rules
- Every logical element must be covered by exactly one thread.
- output_size is in bytes, not elements. For f32: elements × 4.
- Threadgroup size should be a multiple of the SIMD group size (32) for best occupancy.
- For padded domains, use predicated stores (store_if) to avoid writing out of bounds.
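The rules above can be checked on the host before launching. The following sketch is plain Python under stated assumptions: launch_plan and its dictionary keys are hypothetical names, the element type is f32, and the list comprehension only emulates the effect of a predicated store; it does not call store_if.

```python
F32_BYTES = 4    # size of one f32 element in bytes
SIMD_WIDTH = 32  # SIMD group size stated in the rules above


def launch_plan(n_elements, threads_x=256):
    """Compute a padded 1D launch and the output size in bytes.

    Hypothetical helper, not part of enigma.
    """
    if threads_x % SIMD_WIDTH != 0:
        raise ValueError("threadgroup size should be a multiple of 32")
    # Pad the grid up to a whole number of threadgroups.
    padded = ((n_elements + threads_x - 1) // threads_x) * threads_x
    return {
        "grid": (padded, 1, 1),
        "threads": (threads_x, 1, 1),
        "output_size": n_elements * F32_BYTES,  # bytes, not elements
    }


plan = launch_plan(1000)

# Emulate the predicate: only threads whose index is in range write,
# so each of the 1000 elements is written exactly once.
writes = [i for i in range(plan["grid"][0]) if i < 1000]
print(plan, len(writes))
```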
Threadgroup size guidelines
| Pattern | Recommended threads |
|---|---|
| 1D elementwise | (256, 1, 1) |
| 2D tiled | (16, 16, 1) |
| SIMD reduction | (256, 1, 1) — 8 SIMD groups |
| Shared memory kernel | Tune based on max_threads_per_threadgroup from device caps |
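One way to apply these guidelines is to clamp a preferred size to the device limit and keep it a multiple of the SIMD group size. This is a sketch only: pick_threads_1d and simd_groups are hypothetical helpers, and the device limit is passed in as a plain integer because this page does not show how enigma exposes max_threads_per_threadgroup.

```python
SIMD_WIDTH = 32  # SIMD group size used throughout this page


def pick_threads_1d(preferred, max_threads_per_threadgroup):
    """Largest multiple of 32 not exceeding the preferred size or the limit.

    Hypothetical helper, not part of enigma.
    """
    limit = min(preferred, max_threads_per_threadgroup)
    size = (limit // SIMD_WIDTH) * SIMD_WIDTH
    if size == 0:
        raise ValueError("limit is smaller than one SIMD group")
    return (size, 1, 1)


def simd_groups(threads):
    """Number of SIMD groups in one threadgroup."""
    tx, ty, tz = threads
    return (tx * ty * tz) // SIMD_WIDTH


threads = pick_threads_1d(256, max_threads_per_threadgroup=1024)
groups = simd_groups(threads)  # 256 / 32 = 8, matching the table row
print(threads, groups)
```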
