> ## Documentation Index
> Fetch the complete documentation index at: https://klyne-research.mintlify.site/llms.txt
> Use this file to discover all available pages before exploring further.

# SIMD Group Operations

> Reductions, prefix scans, and shuffle operations across a SIMD group (32 threads).

Apple Metal groups 32 threads into a SIMD group (also called a warp). SIMD group operations communicate data across all 32 lanes without going through memory or barriers.

## When to use SIMD group ops

* Parallel reductions within a threadgroup (sum, min, max)
* Prefix scan algorithms
* Exchanging values between lanes without shared memory

All SIMD group ops are valid inside `@enigma.kernel` bodies.

## Reductions

Each reduction takes one value per lane and returns the same result to all 32 lanes.

| Function                 | Operation                   | Types               |
| ------------------------ | --------------------------- | ------------------- |
| `enigma.simd_sum(x)`     | Sum across 32 lanes         | `f32`, `i32`, `u32` |
| `enigma.simd_product(x)` | Product across 32 lanes     | `f32`, `i32`, `u32` |
| `enigma.simd_min(x)`     | Minimum across 32 lanes     | `f32`, `i32`, `u32` |
| `enigma.simd_max(x)`     | Maximum across 32 lanes     | `f32`, `i32`, `u32` |
| `enigma.simd_and(x)`     | Bitwise AND across 32 lanes | `i32`, `u32`        |
| `enigma.simd_or(x)`      | Bitwise OR across 32 lanes  | `i32`, `u32`        |
| `enigma.simd_xor(x)`     | Bitwise XOR across 32 lanes | `i32`, `u32`        |

### Example: sum reduction

```python theme={null}
@enigma.kernel
def simd_reduce_sum(A: enigma.f32, Out: enigma.f32):
    tid = enigma.thread_position_in_grid
    simd_idx = enigma.thread_index_in_simdgroup()
    simd_gid = enigma.simdgroup_index_in_threadgroup()

    val = A[tid]
    total = enigma.simd_sum(val)   # every lane gets the total

    # Lane 0 of each SIMD group writes to output
    with enigma.if_(enigma.cmp_eq(simd_idx, 0)):
        Out[simd_gid] = total
```

## Prefix scans

Prefix scans return partial results: lane `i` gets the reduction of lanes `[0, i)` (exclusive) or `[0, i]` (inclusive).

| Function                                  | Description              |
| ----------------------------------------- | ------------------------ |
| `enigma.simd_prefix_exclusive_sum(x)`     | Exclusive prefix sum     |
| `enigma.simd_prefix_inclusive_sum(x)`     | Inclusive prefix sum     |
| `enigma.simd_prefix_exclusive_product(x)` | Exclusive prefix product |
| `enigma.simd_prefix_inclusive_product(x)` | Inclusive prefix product |

### Example: exclusive prefix sum

```python theme={null}
lane_id = enigma.thread_index_in_simdgroup()   # 0..31
val = enigma.f32(1.0)
prefix = enigma.simd_prefix_exclusive_sum(val)
# Lane 0 → 0.0, Lane 1 → 1.0, ..., Lane 31 → 31.0
```

## Shuffle operations

Shuffle operations exchange values between specific lanes at hardware speed.

| Function                                 | Description                                       |
| ---------------------------------------- | ------------------------------------------------- |
| `enigma.simd_shuffle(value, lane)`       | Broadcast value from a specific lane to all lanes |
| `enigma.simd_shuffle_up(value, delta)`   | Each lane receives value from `lane - delta`      |
| `enigma.simd_shuffle_down(value, delta)` | Each lane receives value from `lane + delta`      |
| `enigma.simd_shuffle_xor(value, mask)`   | Each lane receives value from `lane XOR mask`     |
| `enigma.simd_broadcast(value, lane)`     | Broadcast from a constant lane index              |

### Example: butterfly reduction using shuffle\_xor

```python theme={null}
val = A[tid]
val = val + enigma.simd_shuffle_xor(val, 1)    # pair-wise
val = val + enigma.simd_shuffle_xor(val, 2)    # quads
val = val + enigma.simd_shuffle_xor(val, 4)    # octets
val = val + enigma.simd_shuffle_xor(val, 8)
val = val + enigma.simd_shuffle_xor(val, 16)   # full SIMD group
Out[tid] = val
```

***

## Quad group operations

Quad groups are 4-thread subgroups within a SIMD group. They provide the same operations at finer granularity.

| Function                                                          | Description                      |
| ----------------------------------------------------------------- | -------------------------------- |
| `enigma.quad_sum(x)`                                              | Sum across 4-lane quad           |
| `enigma.quad_product(x)`                                          | Product across 4-lane quad       |
| `enigma.quad_min(x)`                                              | Minimum across 4-lane quad       |
| `enigma.quad_max(x)`                                              | Maximum across 4-lane quad       |
| `enigma.quad_and(x)` / `enigma.quad_or(x)` / `enigma.quad_xor(x)` | Bitwise ops                      |
| `enigma.quad_prefix_exclusive_sum(x)`                             | Exclusive prefix sum within quad |
| `enigma.quad_prefix_inclusive_sum(x)`                             | Inclusive prefix sum within quad |
| `enigma.quad_shuffle(value, lane)`                                | Shuffle within quad              |
| `enigma.quad_shuffle_up(value, delta)`                            | Shift up within quad             |
| `enigma.quad_shuffle_down(value, delta)`                          | Shift down within quad           |
| `enigma.quad_shuffle_xor(value, mask)`                            | XOR shuffle within quad          |

### When to prefer quad ops

Quad ops are useful in texture-domain compute (2×2 pixel quads) and when you need sub-SIMD-group granularity in reduction trees.
