Apple Metal groups 32 threads into a SIMD group (the counterpart of a warp in CUDA). SIMD group operations exchange data across all 32 lanes without going through threadgroup memory and without barriers.

When to use SIMD group ops

  • Parallel reductions (sum, min, max) across a SIMD group, for example as the first stage of a threadgroup-wide reduction
  • Prefix scan algorithms
  • Exchanging values between lanes without shared memory
All SIMD group ops are valid inside @enigma.kernel bodies.

Reductions

Each reduction takes one value per lane and returns the same result to all 32 lanes.
| Function | Operation | Types |
| --- | --- | --- |
| enigma.simd_sum(x) | Sum across 32 lanes | f32, i32, u32 |
| enigma.simd_product(x) | Product across 32 lanes | f32, i32, u32 |
| enigma.simd_min(x) | Minimum across 32 lanes | f32, i32, u32 |
| enigma.simd_max(x) | Maximum across 32 lanes | f32, i32, u32 |
| enigma.simd_and(x) | Bitwise AND across 32 lanes | i32, u32 |
| enigma.simd_or(x) | Bitwise OR across 32 lanes | i32, u32 |
| enigma.simd_xor(x) | Bitwise XOR across 32 lanes | i32, u32 |
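To make the "same result to all 32 lanes" behaviour concrete, here is a small CPU-side model in plain Python. It is only a sketch of the semantics described above, not the enigma API: `model_simd_reduce` and `SIMD_WIDTH` are hypothetical names introduced for illustration.

```python
from functools import reduce
import operator

SIMD_WIDTH = 32  # lanes per SIMD group, as stated in this document


def model_simd_reduce(lanes, op):
    """Hypothetical model: combine one value per lane, return the result to every lane."""
    if len(lanes) != SIMD_WIDTH:
        raise ValueError("expected exactly one value per lane")
    result = reduce(op, lanes)
    return [result] * SIMD_WIDTH


lanes = list(range(SIMD_WIDTH))                  # lane i holds the value i
sums = model_simd_reduce(lanes, operator.add)    # models simd_sum
maxs = model_simd_reduce(lanes, max)             # models simd_max
xors = model_simd_reduce(lanes, operator.xor)    # models simd_xor

print(sums[0], sums[31])  # 496 496
print(maxs[0], xors[0])   # 31 0
```

Every lane sees the same value, which is why a kernel usually lets a single lane write the result.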

Example: sum reduction

@enigma.kernel
def simd_reduce_sum(A: enigma.f32, Out: enigma.f32):
    tid = enigma.thread_position_in_grid
    simd_idx = enigma.thread_index_in_simdgroup()
    simd_gid = enigma.simdgroup_index_in_threadgroup()

    val = A[tid]
    total = enigma.simd_sum(val)   # every lane gets the total

    # Lane 0 of each SIMD group writes to output
    with enigma.if_(enigma.cmp_eq(simd_idx, 0)):
        Out[simd_gid] = total
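The kernel above produces one partial sum per SIMD group. The following plain-Python model sketches what `Out` would contain for a single one-dimensional threadgroup. It assumes that the lane index is `tid % 32` and the SIMD group index is `tid // 32`; that mapping and the name `model_simd_reduce_sum` are assumptions made for illustration. Because `Out` is indexed by the SIMD group's index within its threadgroup, the model covers one threadgroup only.

```python
SIMD_WIDTH = 32  # lanes per SIMD group, as stated in this document


def model_simd_reduce_sum(A):
    """Hypothetical CPU model of the kernel above for one 1D threadgroup."""
    if len(A) % SIMD_WIDTH != 0:
        raise ValueError("length must be a multiple of the SIMD width")
    out = [0.0] * (len(A) // SIMD_WIDTH)
    for tid in range(len(A)):
        simd_idx = tid % SIMD_WIDTH      # assumed lane index within the SIMD group
        simd_gid = tid // SIMD_WIDTH     # assumed SIMD group index in the threadgroup
        start = simd_gid * SIMD_WIDTH
        total = sum(A[start:start + SIMD_WIDTH])   # what simd_sum hands to every lane
        if simd_idx == 0:                # only lane 0 writes
            out[simd_gid] = total
    return out


A = [float(i) for i in range(64)]        # two SIMD groups' worth of input
partials = model_simd_reduce_sum(A)
print(partials)        # [496.0, 1520.0]
print(sum(partials))   # 2016.0
```

Summing the partials, on the host or in a second pass, completes the reduction.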

Prefix scans

Prefix scans return partial results: lane i gets the reduction of lanes [0, i) (exclusive) or [0, i] (inclusive).
| Function | Description |
| --- | --- |
| enigma.simd_prefix_exclusive_sum(x) | Exclusive prefix sum |
| enigma.simd_prefix_inclusive_sum(x) | Inclusive prefix sum |
| enigma.simd_prefix_exclusive_product(x) | Exclusive prefix product |
| enigma.simd_prefix_inclusive_product(x) | Inclusive prefix product |

Example: exclusive prefix sum

lane_id = enigma.thread_index_in_simdgroup()   # 0..31
val = enigma.f32(1.0)
prefix = enigma.simd_prefix_exclusive_sum(val)
# Lane 0 → 0.0, Lane 1 → 1.0, ..., Lane 31 → 31.0
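The difference between the exclusive and inclusive variants can be checked with a short plain-Python model built on `itertools.accumulate`. This is a sketch of the semantics only; `model_prefix_inclusive` and `model_prefix_exclusive` are hypothetical helpers, and the identity element passed to the exclusive scan (0 for sum, 1 for product) is an assumption about what lane 0 receives.

```python
from itertools import accumulate
import operator

SIMD_WIDTH = 32  # lanes per SIMD group, as stated in this document


def model_prefix_inclusive(lanes, op):
    """Lane i receives the reduction of lanes [0, i]."""
    return list(accumulate(lanes, op))


def model_prefix_exclusive(lanes, op, identity):
    """Lane i receives the reduction of lanes [0, i); lane 0 gets the identity."""
    return [identity] + list(accumulate(lanes, op))[:-1]


ones = [1.0] * SIMD_WIDTH
exclusive = model_prefix_exclusive(ones, operator.add, 0.0)
inclusive = model_prefix_inclusive(ones, operator.add)

print(exclusive[0], exclusive[1], exclusive[31])  # 0.0 1.0 31.0
print(inclusive[0], inclusive[1], inclusive[31])  # 1.0 2.0 32.0
```

With an input of all ones, the exclusive sum yields each lane's own index, a common way to compute write offsets for stream compaction.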

Shuffle operations

Shuffle operations move a value from one lane to another without a round trip through memory.
| Function | Description |
| --- | --- |
| enigma.simd_shuffle(value, lane) | Broadcast value from a specific lane to all lanes |
| enigma.simd_shuffle_up(value, delta) | Each lane receives value from lane - delta |
| enigma.simd_shuffle_down(value, delta) | Each lane receives value from lane + delta |
| enigma.simd_shuffle_xor(value, mask) | Each lane receives value from lane XOR mask |
| enigma.simd_broadcast(value, lane) | Broadcast from a constant lane index |
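The lane arithmetic in the table can be modelled on the CPU with list indexing. The sketch below is plain Python, not the enigma API, and all `model_*` names are hypothetical. The table does not say what happens when `lane - delta` or `lane + delta` falls outside 0..31; the model assumes the lane keeps its own value in that case, which should be verified against the actual implementation.

```python
SIMD_WIDTH = 32  # lanes per SIMD group, as stated in this document


def model_shuffle_up(lanes, delta):
    """Lane i reads lane i - delta; assumed to keep its own value if out of range."""
    return [lanes[i - delta] if i - delta >= 0 else lanes[i]
            for i in range(SIMD_WIDTH)]


def model_shuffle_down(lanes, delta):
    """Lane i reads lane i + delta; assumed to keep its own value if out of range."""
    return [lanes[i + delta] if i + delta < SIMD_WIDTH else lanes[i]
            for i in range(SIMD_WIDTH)]


def model_shuffle_xor(lanes, mask):
    """Lane i reads lane i XOR mask (mask must be below the SIMD width)."""
    return [lanes[i ^ mask] for i in range(SIMD_WIDTH)]


def model_broadcast(lanes, lane):
    """Every lane reads the value held by one fixed lane."""
    return [lanes[lane]] * SIMD_WIDTH


lanes = list(range(SIMD_WIDTH))   # lane i holds the value i
up = model_shuffle_up(lanes, 1)
down = model_shuffle_down(lanes, 1)
xor = model_shuffle_xor(lanes, 1)

print(up[:4])     # [0, 0, 1, 2]
print(down[-3:])  # [30, 31, 31]
print(xor[:4])    # [1, 0, 3, 2]
```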

Example: butterfly reduction using shuffle_xor

tid = enigma.thread_position_in_grid
val = A[tid]
val = val + enigma.simd_shuffle_xor(val, 1)    # pair-wise
val = val + enigma.simd_shuffle_xor(val, 2)    # quads
val = val + enigma.simd_shuffle_xor(val, 4)    # octets
val = val + enigma.simd_shuffle_xor(val, 8)    # 16-lane halves
val = val + enigma.simd_shuffle_xor(val, 16)   # full SIMD group
Out[tid] = val                                 # every lane now holds the full sum
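A plain-Python model of the five XOR steps shows why every lane ends up with the full sum: after the step with mask m, each lane holds the sum of its aligned block of 2m lanes. This is a CPU sketch of the technique, not enigma code, and `model_butterfly_sum` is a hypothetical name.

```python
SIMD_WIDTH = 32  # lanes per SIMD group, as stated in this document


def model_butterfly_sum(lanes):
    """Model of the butterfly: add the partner lane's value for masks 1, 2, 4, 8, 16."""
    vals = list(lanes)
    mask = 1
    while mask < SIMD_WIDTH:
        # All lanes read their partner's old value, then add it to their own.
        vals = [vals[i] + vals[i ^ mask] for i in range(SIMD_WIDTH)]
        mask <<= 1
    return vals


lanes = list(range(SIMD_WIDTH))   # lane i holds the value i
result = model_butterfly_sum(lanes)
print(result[0], result[17], result[31])  # 496 496 496
```

Integers are used here so the check is exact. With floating-point inputs, a butterfly may round differently from a sequential sum because the additions happen in a different order.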

Quad group operations

Quad groups are 4-thread subgroups within a SIMD group. They provide the same operations at finer granularity.
| Function | Description |
| --- | --- |
| enigma.quad_sum(x) | Sum across 4-lane quad |
| enigma.quad_product(x) | Product across 4-lane quad |
| enigma.quad_min(x) | Minimum across 4-lane quad |
| enigma.quad_max(x) | Maximum across 4-lane quad |
| enigma.quad_and(x) / enigma.quad_or(x) / enigma.quad_xor(x) | Bitwise ops |
| enigma.quad_prefix_exclusive_sum(x) | Exclusive prefix sum within quad |
| enigma.quad_prefix_inclusive_sum(x) | Inclusive prefix sum within quad |
| enigma.quad_shuffle(value, lane) | Shuffle within quad |
| enigma.quad_shuffle_up(value, delta) | Shift up within quad |
| enigma.quad_shuffle_down(value, delta) | Shift down within quad |
| enigma.quad_shuffle_xor(value, mask) | XOR shuffle within quad |

When to prefer quad ops

Quad ops are useful in texture-domain compute (2×2 pixel quads) and when you need sub-SIMD-group granularity in reduction trees.
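As a rough illustration of the finer granularity, the plain-Python model below reduces a 32-lane SIMD group in quads. It assumes each quad is made of four consecutive lanes (lanes 0..3, 4..7, and so on); the source does not state the lane-to-quad mapping, so treat that and the name `model_quad_sum` as assumptions.

```python
SIMD_WIDTH = 32  # lanes per SIMD group, as stated in this document
QUAD_WIDTH = 4   # lanes per quad group, as stated in this document


def model_quad_sum(lanes):
    """Hypothetical model: each lane receives the sum of its own 4-lane quad."""
    out = []
    for start in range(0, len(lanes), QUAD_WIDTH):
        quad_total = sum(lanes[start:start + QUAD_WIDTH])
        out.extend([quad_total] * QUAD_WIDTH)   # same result to all 4 lanes
    return out


lanes = list(range(SIMD_WIDTH))   # lane i holds the value i
quad_sums = model_quad_sum(lanes)
print(quad_sums[:8])   # [6, 6, 6, 6, 22, 22, 22, 22]
print(quad_sums[-1])   # 118
```

One SIMD group therefore yields eight independent quad results, where a SIMD-wide reduction would yield one.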