SIMD Group Operations

Apple Metal groups 32 threads into a SIMD group (the counterpart of a warp in CUDA terminology). SIMD group operations exchange data across all 32 lanes directly, without going through threadgroup memory or requiring barriers.

When to use SIMD group ops

Parallel reductions (sum, min, max) across the lanes of a SIMD group, for example as the first stage of a threadgroup-wide reduction
Prefix scan algorithms
Exchanging values between lanes without shared memory

All SIMD group ops are valid inside @enigma.kernel bodies.

Reductions

Each reduction takes one value per lane and returns the same result to all 32 lanes.

Function	Operation	Types
`enigma.simd_sum(x)`	Sum across 32 lanes	`f32`, `i32`, `u32`
`enigma.simd_product(x)`	Product across 32 lanes	`f32`, `i32`, `u32`
`enigma.simd_min(x)`	Minimum across 32 lanes	`f32`, `i32`, `u32`
`enigma.simd_max(x)`	Maximum across 32 lanes	`f32`, `i32`, `u32`
`enigma.simd_and(x)`	Bitwise AND across 32 lanes	`i32`, `u32`
`enigma.simd_or(x)`	Bitwise OR across 32 lanes	`i32`, `u32`
`enigma.simd_xor(x)`	Bitwise XOR across 32 lanes	`i32`, `u32`
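
To make the "same result to all 32 lanes" behavior concrete, the sketch below models a SIMD group as a plain Python list of 32 values. It is a host-side model only and does not call enigma; `simd_reduce` and `SIMD_WIDTH` are hypothetical names introduced for this illustration.

```python
# Pure-Python model of SIMD group reductions (not enigma code).
from functools import reduce
import operator

SIMD_WIDTH = 32  # lanes per SIMD group, as stated above


def simd_reduce(lanes, op):
    """Combine one value per lane with `op` and return the result for every lane."""
    assert len(lanes) == SIMD_WIDTH
    result = reduce(op, lanes)
    return [result] * SIMD_WIDTH  # every lane observes the same value


lane_values = list(range(1, SIMD_WIDTH + 1))   # lane i holds i + 1

sums = simd_reduce(lane_values, operator.add)  # models the behavior of simd_sum
maxes = simd_reduce(lane_values, max)          # models the behavior of simd_max
ors = simd_reduce(lane_values, operator.or_)   # models the behavior of simd_or

print(sums[0], sums[31])   # 528 528
print(maxes[0])            # 32
print(ors[0])              # 63
```

The model ignores floating-point rounding order, which on real hardware may differ from a left-to-right sum.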

Example: sum reduction

@enigma.kernel
def simd_reduce_sum(A: enigma.f32, Out: enigma.f32):
    tid = enigma.thread_position_in_grid
    simd_idx = enigma.thread_index_in_simdgroup()
    simd_gid = enigma.simdgroup_index_in_threadgroup()

    val = A[tid]
    total = enigma.simd_sum(val)   # every lane gets the total

    # Lane 0 of each SIMD group writes to output
    with enigma.if_(enigma.cmp_eq(simd_idx, 0)):
        Out[simd_gid] = total
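
When checking the kernel's output on the host, a reference model can help. The sketch below assumes a single threadgroup whose threads map to lanes in order (thread 0..31 form SIMD group 0, thread 32..63 form SIMD group 1, and so on); that mapping and the name `reference_simd_reduce_sum` are assumptions made for this example, not part of enigma.

```python
# Host-side reference for the per-SIMD-group sums (not enigma code).
SIMD_WIDTH = 32


def reference_simd_reduce_sum(a):
    """Return one partial sum per group of 32 consecutive input elements."""
    assert len(a) % SIMD_WIDTH == 0, "model assumes whole SIMD groups"
    return [
        sum(a[start:start + SIMD_WIDTH])
        for start in range(0, len(a), SIMD_WIDTH)
    ]


a = [float(i) for i in range(64)]   # two SIMD groups' worth of input
out = reference_simd_reduce_sum(a)
print(out)                          # [496.0, 1520.0]
```

Each output element is still a partial result; a full reduction over the input would need a second pass over these per-group sums.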

Prefix scans

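Prefix scans return partial results: lane i gets the reduction of lanes [0, i) (exclusive) or [0, i] (inclusive). In an exclusive scan, lane 0 has no preceding lanes and receives the identity of the operation (0 for a sum, 1 for a product).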

Function	Description
`enigma.simd_prefix_exclusive_sum(x)`	Exclusive prefix sum
`enigma.simd_prefix_inclusive_sum(x)`	Inclusive prefix sum
`enigma.simd_prefix_exclusive_product(x)`	Exclusive prefix product
`enigma.simd_prefix_inclusive_product(x)`	Inclusive prefix product

Example: exclusive prefix sum

lane_id = enigma.thread_index_in_simdgroup()   # 0..31
val = enigma.f32(1.0)
prefix = enigma.simd_prefix_exclusive_sum(val)
# Lane 0 → 0.0, Lane 1 → 1.0, ..., Lane 31 → 31.0
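
The difference between the exclusive and inclusive variants can be sketched on the host with `itertools.accumulate`. This is a pure-Python model of the semantics described above, not enigma code; `prefix_inclusive` and `prefix_exclusive` are hypothetical helper names.

```python
# Pure-Python model of SIMD prefix scans (not enigma code).
from itertools import accumulate
import operator

SIMD_WIDTH = 32


def prefix_inclusive(lanes, op):
    """Lane i receives the reduction of lanes [0, i]."""
    return list(accumulate(lanes, op))


def prefix_exclusive(lanes, op, identity):
    """Lane i receives the reduction of lanes [0, i); lane 0 gets the identity."""
    return [identity] + list(accumulate(lanes, op))[:-1]


ones = [1.0] * SIMD_WIDTH
excl_sum = prefix_exclusive(ones, operator.add, 0.0)
incl_sum = prefix_inclusive(ones, operator.add)
print(excl_sum[0], excl_sum[1], excl_sum[31])   # 0.0 1.0 31.0
print(incl_sum[0], incl_sum[31])                # 1.0 32.0

twos = [2] * SIMD_WIDTH
excl_prod = prefix_exclusive(twos, operator.mul, 1)
print(excl_prod[0], excl_prod[5])               # 1 32
```

A common use of the exclusive sum is stream compaction: each lane contributes 1 if it keeps its element and 0 otherwise, and the scan result is that lane's output offset.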

Shuffle operations

Shuffle operations let each lane read a value held by another lane, without a round trip through threadgroup memory.

Function	Description
`enigma.simd_shuffle(value, lane)`	Each lane receives value from the lane index it passes as `lane`; passing the same index in every lane broadcasts that lane's value
`enigma.simd_shuffle_up(value, delta)`	Each lane receives value from `lane - delta`
`enigma.simd_shuffle_down(value, delta)`	Each lane receives value from `lane + delta`
`enigma.simd_shuffle_xor(value, mask)`	Each lane receives value from `lane XOR mask`
`enigma.simd_broadcast(value, lane)`	Broadcast value from one lane, selected by a constant lane index, to all lanes
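
The lane arithmetic in the table can be modeled on the host as list indexing. The sketch below is not enigma code, and its helper names are hypothetical. One point is an assumption rather than something stated above: lanes whose source index falls outside 0..31 (the lowest `delta` lanes for shuffle_up, the highest for shuffle_down) are modeled as keeping their own value, which is how Metal's non-wrapping shuffles behave.

```python
# Pure-Python model of SIMD shuffles (not enigma code).
SIMD_WIDTH = 32


def shuffle(values, lane_of):
    """Lane i reads the value held by lane lane_of[i]."""
    return [values[lane_of[i]] for i in range(SIMD_WIDTH)]


def shuffle_up(values, delta):
    """Lane i reads lane i - delta; lanes with no source keep their value (assumption)."""
    return [values[i - delta] if i - delta >= 0 else values[i]
            for i in range(SIMD_WIDTH)]


def shuffle_down(values, delta):
    """Lane i reads lane i + delta; lanes with no source keep their value (assumption)."""
    return [values[i + delta] if i + delta < SIMD_WIDTH else values[i]
            for i in range(SIMD_WIDTH)]


def shuffle_xor(values, mask):
    """Lane i reads lane i XOR mask; mask must be in 0..31."""
    return [values[i ^ mask] for i in range(SIMD_WIDTH)]


values = [10 * i for i in range(SIMD_WIDTH)]      # lane i holds 10 * i

print(shuffle_up(values, 1)[:4])                  # [0, 0, 10, 20]
print(shuffle_down(values, 1)[-3:])               # [300, 310, 310]
print(shuffle_xor(values, 1)[:4])                 # [10, 0, 30, 20]
print(set(shuffle(values, [5] * SIMD_WIDTH)))     # {50}
```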

Example: butterfly reduction using shuffle_xor

val = A[tid]                                   # one value per lane
val = val + enigma.simd_shuffle_xor(val, 1)    # pairs
val = val + enigma.simd_shuffle_xor(val, 2)    # quads
val = val + enigma.simd_shuffle_xor(val, 4)    # groups of 8
val = val + enigma.simd_shuffle_xor(val, 8)    # groups of 16
val = val + enigma.simd_shuffle_xor(val, 16)   # full SIMD group
Out[tid] = val                                 # every lane now holds the same total
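
To see why five XOR steps are enough for 32 lanes, the butterfly can be simulated on the host. The sketch below is a pure-Python model, not enigma code; `shuffle_xor` and `butterfly_sum` are hypothetical helpers. After the step with mask m, each lane holds the sum of its aligned block of 2m lanes, so the block size doubles from 2 to 32 over the five steps.

```python
# Pure-Python simulation of the XOR butterfly reduction (not enigma code).
SIMD_WIDTH = 32


def shuffle_xor(values, mask):
    """Lane i reads the value held by lane i XOR mask."""
    return [values[i ^ mask] for i in range(SIMD_WIDTH)]


def butterfly_sum(values):
    """Apply masks 1, 2, 4, 8, 16; every lane ends with the full sum."""
    vals = list(values)
    mask = 1
    while mask < SIMD_WIDTH:
        partner = shuffle_xor(vals, mask)
        vals = [own + other for own, other in zip(vals, partner)]
        mask *= 2
    return vals


lanes = list(range(SIMD_WIDTH))     # lane i holds i
result = butterfly_sum(lanes)
print(result[0], result[31])        # 496 496
print(len(set(result)))             # 1
```

Unlike a shuffle_down tree, where only lane 0 ends with the complete sum, the XOR pattern leaves the total in every lane, which matches the behavior of `enigma.simd_sum` described earlier.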

Quad group operations

Quad groups are 4-thread subgroups within a SIMD group, so a 32-lane SIMD group contains 8 quads. The quad operations listed below mirror their SIMD group counterparts at this finer granularity.

Function	Description
`enigma.quad_sum(x)`	Sum across 4-lane quad
`enigma.quad_product(x)`	Product across 4-lane quad
`enigma.quad_min(x)`	Minimum across 4-lane quad
`enigma.quad_max(x)`	Maximum across 4-lane quad
`enigma.quad_and(x)` / `enigma.quad_or(x)` / `enigma.quad_xor(x)`	Bitwise AND / OR / XOR across 4-lane quad
`enigma.quad_prefix_exclusive_sum(x)`	Exclusive prefix sum within quad
`enigma.quad_prefix_inclusive_sum(x)`	Inclusive prefix sum within quad
`enigma.quad_shuffle(value, lane)`	Shuffle within quad
`enigma.quad_shuffle_up(value, delta)`	Shift up within quad
`enigma.quad_shuffle_down(value, delta)`	Shift down within quad
`enigma.quad_shuffle_xor(value, mask)`	XOR shuffle within quad

When to prefer quad ops

Quad ops are useful in texture-domain compute (2×2 pixel quads) and when you need sub-SIMD-group granularity in reduction trees.
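
The sub-SIMD-group granularity can be sketched on the host as follows. This is a pure-Python model rather than enigma code, with hypothetical helper names, and it assumes that quad q consists of the consecutive lanes 4q..4q+3.

```python
# Pure-Python model of quad-level operations (not enigma code).
SIMD_WIDTH = 32
QUAD_WIDTH = 4


def quad_sum(lanes):
    """Each lane receives the sum of the 4 lanes in its own quad."""
    out = []
    for start in range(0, len(lanes), QUAD_WIDTH):
        total = sum(lanes[start:start + QUAD_WIDTH])
        out.extend([total] * QUAD_WIDTH)
    return out


def quad_shuffle_xor(lanes, mask):
    """Lane i reads lane i XOR mask; masks 0..3 never leave the quad."""
    assert 0 <= mask < QUAD_WIDTH
    return [lanes[i ^ mask] for i in range(len(lanes))]


lanes = list(range(SIMD_WIDTH))     # lane i holds i
sums = quad_sum(lanes)
print(sums[0], sums[4], sums[31])   # 6 22 118

# A two-step XOR butterfly inside each quad gives the same result.
vals = lanes
for mask in (1, 2):
    partner = quad_shuffle_xor(vals, mask)
    vals = [own + other for own, other in zip(vals, partner)]
print(vals == sums)                 # True
```

The butterfly needs only two steps here because XOR masks 1 and 2 change only the low two bits of the lane index, so data never crosses a quad boundary.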
