When to use SIMD group ops
- Parallel reductions across the lanes of a SIMD group (sum, min, max), for example as a building block for threadgroup-wide reductions
- Prefix scan algorithms
- Exchanging values between lanes without shared memory
These operations are called from inside @enigma.kernel bodies.
Reductions
Each reduction takes one value per lane and returns the same result to all 32 lanes.

| Function | Operation | Types |
|---|---|---|
| enigma.simd_sum(x) | Sum across 32 lanes | f32, i32, u32 |
| enigma.simd_product(x) | Product across 32 lanes | f32, i32, u32 |
| enigma.simd_min(x) | Minimum across 32 lanes | f32, i32, u32 |
| enigma.simd_max(x) | Maximum across 32 lanes | f32, i32, u32 |
| enigma.simd_and(x) | Bitwise AND across 32 lanes | i32, u32 |
| enigma.simd_or(x) | Bitwise OR across 32 lanes | i32, u32 |
| enigma.simd_xor(x) | Bitwise XOR across 32 lanes | i32, u32 |
Example: sum reduction
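The reduction semantics in the table above can be modeled on the CPU. The sketch below is plain Python and does not call enigma: `simd_reduce` and `SIMD_WIDTH` are hypothetical names introduced only for illustration, and a Python list stands in for the 32 lanes.

```python
from functools import reduce
import operator

SIMD_WIDTH = 32  # lane count stated in this section

def simd_reduce(op, lanes):
    """Hypothetical model of a SIMD reduction (not the enigma API).

    Combines the value from every lane with `op` and hands the same
    result back to all lanes.
    """
    assert len(lanes) == SIMD_WIDTH
    total = reduce(op, lanes)
    return [total] * SIMD_WIDTH

lanes = list(range(SIMD_WIDTH))          # lane i holds the value i
sums = simd_reduce(operator.add, lanes)  # models the behavior of simd_sum
maxs = simd_reduce(max, lanes)           # models the behavior of simd_max

print(sums[0], sums[31], maxs[0])  # 496 496 31
```

Every lane ends up holding the same value, so any lane can write the result without a further broadcast step.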
Prefix scans
Prefix scans return partial results: lane i gets the reduction of lanes [0, i) (exclusive) or [0, i] (inclusive).

| Function | Description |
|---|---|
| enigma.simd_prefix_exclusive_sum(x) | Exclusive prefix sum |
| enigma.simd_prefix_inclusive_sum(x) | Inclusive prefix sum |
| enigma.simd_prefix_exclusive_product(x) | Exclusive prefix product |
| enigma.simd_prefix_inclusive_product(x) | Inclusive prefix product |
Example: exclusive prefix sum
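The difference between the exclusive and inclusive forms is easiest to see on concrete values. This is a plain-Python reference model, not enigma code: `prefix_exclusive_sum` and `prefix_inclusive_sum` are hypothetical helpers, and a list stands in for the 32 lanes.

```python
from itertools import accumulate

SIMD_WIDTH = 32  # lane count stated in this section

def prefix_inclusive_sum(lanes):
    """Lane i receives the sum of lanes [0, i]."""
    return list(accumulate(lanes))

def prefix_exclusive_sum(lanes):
    """Lane i receives the sum of lanes [0, i); lane 0 receives 0."""
    inclusive = list(accumulate(lanes))
    return [0] + inclusive[:-1]

lanes = [1] * SIMD_WIDTH                 # every lane contributes 1
exclusive = prefix_exclusive_sum(lanes)  # lane i receives i
inclusive = prefix_inclusive_sum(lanes)  # lane i receives i + 1

print(exclusive[:4], inclusive[:4])  # [0, 1, 2, 3] [1, 2, 3, 4]
```

With 0/1 flags as input, the exclusive result gives each flagged lane its output slot, which is the usual basis for stream compaction.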
Shuffle operations
Shuffle operations exchange values directly between specific lanes, without a round trip through shared memory.

| Function | Description |
|---|---|
| enigma.simd_shuffle(value, lane) | Broadcast value from a specific lane to all lanes |
| enigma.simd_shuffle_up(value, delta) | Each lane receives value from lane - delta |
| enigma.simd_shuffle_down(value, delta) | Each lane receives value from lane + delta |
| enigma.simd_shuffle_xor(value, mask) | Each lane receives value from lane XOR mask |
| enigma.simd_broadcast(value, lane) | Broadcast from a constant lane index |
Example: butterfly reduction using shuffle_xor
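A butterfly reduction pairs each lane with the lane whose index differs in one bit, halving the distance each round until every lane holds the full result. The sketch below models this in plain Python and does not call enigma: `shuffle_xor` and `butterfly_sum` are hypothetical helpers that follow the "lane XOR mask" behavior described in the table above.

```python
SIMD_WIDTH = 32  # lane count stated in this section

def shuffle_xor(values, mask):
    """Hypothetical model: lane i receives the value held by lane i XOR mask."""
    return [values[i ^ mask] for i in range(len(values))]

def butterfly_sum(values):
    """Sum across all lanes in log2(width) exchange rounds.

    Assumes the number of lanes is a power of two.
    """
    vals = list(values)
    offset = len(vals) // 2
    while offset > 0:
        partner = shuffle_xor(vals, offset)            # fetch the partner lane's value
        vals = [a + b for a, b in zip(vals, partner)]  # every lane adds it in
        offset //= 2
    return vals

lanes = list(range(SIMD_WIDTH))  # lane i holds the value i
result = butterfly_sum(lanes)

print(result[0], result[17])  # 496 496
```

For 32 lanes the loop runs five rounds (offsets 16, 8, 4, 2, 1), and unlike a one-directional shuffle-down tree, every lane finishes with the total.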
Quad group operations
Quad groups are 4-thread subgroups within a SIMD group. They provide the same operations at finer granularity.

| Function | Description |
|---|---|
| enigma.quad_sum(x) | Sum across 4-lane quad |
| enigma.quad_product(x) | Product across 4-lane quad |
| enigma.quad_min(x) | Minimum across 4-lane quad |
| enigma.quad_max(x) | Maximum across 4-lane quad |
| enigma.quad_and(x) / enigma.quad_or(x) / enigma.quad_xor(x) | Bitwise AND / OR / XOR across 4-lane quad |
| enigma.quad_prefix_exclusive_sum(x) | Exclusive prefix sum within quad |
| enigma.quad_prefix_inclusive_sum(x) | Inclusive prefix sum within quad |
| enigma.quad_shuffle(value, lane) | Shuffle within quad |
| enigma.quad_shuffle_up(value, delta) | Shift up within quad |
| enigma.quad_shuffle_down(value, delta) | Shift down within quad |
| enigma.quad_shuffle_xor(value, mask) | XOR shuffle within quad |
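
To illustrate the finer granularity, the following plain-Python model applies a sum per quad instead of per SIMD group. It does not call enigma. `quad_sum_model` is a hypothetical helper, and the mapping of quads to consecutive lanes (0-3, 4-7, and so on) is an assumption made for this sketch; this section does not state it.

```python
SIMD_WIDTH = 32  # lane count stated in this section
QUAD_WIDTH = 4   # quad size stated in this section

def quad_sum_model(lanes):
    """Hypothetical model: each lane receives the sum of its own quad.

    Assumes quad k covers consecutive lanes 4k .. 4k + 3.
    """
    out = []
    for start in range(0, len(lanes), QUAD_WIDTH):
        quad = lanes[start:start + QUAD_WIDTH]
        out.extend([sum(quad)] * len(quad))
    return out

lanes = list(range(SIMD_WIDTH))  # lane i holds the value i
sums = quad_sum_model(lanes)

print(sums[:8])  # [6, 6, 6, 6, 22, 22, 22, 22]
```

A single 32-lane SIMD group therefore yields eight independent quad results, one per group of four lanes.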
