# Threadgroup Memory & Synchronization

> Barriers, threadgroup shared memory allocation, and the buffer/threadgroup API surface.

This page documents the API for threadgroup-local shared memory and the
synchronization primitives that order accesses to it.

For the conceptual model, see
[Memory Model](/concepts/memory-model). For atomic operations, see
[Atomic Operations](/programming-guide/atomics).

## Barriers

### `enigma.barrier(mem_flags="mem_threadgroup") -> None`

Threadgroup barrier: every thread in the threadgroup must reach this call
before any thread continues past it. `mem_flags` selects which memory
accesses the barrier also fences.

| Parameter   | Type | Default             | Description        |
| ----------- | ---- | ------------------- | ------------------ |
| `mem_flags` | str  | `"mem_threadgroup"` | Memory fence scope |

Accepted `mem_flags` values:

| Value                          | Fences                                  |
| ------------------------------ | --------------------------------------- |
| `"mem_none"`                   | Execution barrier only, no memory fence |
| `"mem_device"`                 | Device memory                           |
| `"mem_threadgroup"`            | Threadgroup memory (default)            |
| `"mem_device_and_threadgroup"` | Both                                    |
| `"mem_texture"`                | Texture memory                          |
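The write-then-read pattern that a threadgroup barrier protects can be mimicked on the CPU. The sketch below is not enigma code: it uses Python's standard `threading.Barrier` as a stand-in, with one Python thread per simulated GPU thread, and the helper name `simulate_neighbor_read` is hypothetical. It only illustrates why every thread must finish writing before any thread reads a slot owned by another thread.

```python
import threading

def simulate_neighbor_read(values):
    """CPU stand-in for a threadgroup: one Python thread per GPU thread.

    Each thread writes its value to a shared list, waits at a barrier,
    then reads the slot written by its right-hand neighbor.
    """
    n = len(values)
    shared = [None] * n              # plays the role of threadgroup memory
    out = [None] * n
    barrier = threading.Barrier(n)   # stand-in for enigma.barrier()

    def worker(lid):
        shared[lid] = values[lid]          # write phase
        barrier.wait()                     # nobody reads until all have written
        out[lid] = shared[(lid + 1) % n]   # read phase

    threads = [threading.Thread(target=worker, args=(lid,)) for lid in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

print(simulate_neighbor_read([10, 20, 30, 40]))  # [20, 30, 40, 10]
```

Without the `barrier.wait()` call, a thread could read its neighbor's slot before that neighbor has written it, which is the same hazard the GPU barrier removes.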

### `enigma.simd_barrier(mem_flags="mem_threadgroup") -> None`

SIMD-group barrier. Synchronizes only the 32 threads of one SIMD group.
Cheaper than a full threadgroup barrier when ordering is only needed within
a single SIMD group (the Metal counterpart of a CUDA warp).

Same `mem_flags` values as `enigma.barrier()`.
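To judge whether a SIMD-group barrier is enough, it helps to know which threads share a SIMD group. The following pure-Python sketch assumes a one-dimensional threadgroup in which consecutive runs of 32 threads form one SIMD group; the helper names are hypothetical and are not part of the enigma API.

```python
SIMD_WIDTH = 32  # threads per SIMD group, as stated above

def simd_coords(lid):
    """Return (simd_group_index, lane_index) for a linear thread index.

    Assumes a 1D threadgroup where threads 0..31 form SIMD group 0,
    threads 32..63 form SIMD group 1, and so on.
    """
    return lid // SIMD_WIDTH, lid % SIMD_WIDTH

def simd_groups_per_threadgroup(threads_per_threadgroup):
    """Number of SIMD groups needed to cover a threadgroup (rounded up)."""
    return (threads_per_threadgroup + SIMD_WIDTH - 1) // SIMD_WIDTH

def same_simd_group(lid_a, lid_b):
    """True when two threads could be ordered by a SIMD-group barrier alone."""
    return simd_coords(lid_a)[0] == simd_coords(lid_b)[0]

print(simd_coords(33))                   # (1, 1)
print(simd_groups_per_threadgroup(256))  # 8
print(same_simd_group(31, 32))           # False
```

If a producer and its consumer can land in different SIMD groups, as threads 31 and 32 do here, the full `enigma.barrier()` is the one to use.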

***

## Threadgroup shared memory

### `enigma.threadgroup_alloc(dtype, size) -> TracingTensor`

Allocates a `threadgroup T[size]` array, where `T` is the element type named
by `dtype`. The storage is shared by all threads of the threadgroup.

| Parameter | Type | Description                                               |
| --------- | ---- | --------------------------------------------------------- |
| `dtype`   | str  | Element type: `"float"`, `"half"`, `"int"`, `"uint"`, ... |
| `size`    | int  | Number of elements (compile-time constant)                |

**Returns:** A `TracingTensor` supporting:

* Indexed access — `shared[i]` for load, `shared[i] = v` for store
* Atomic methods — `shared.atomic_fetch_add(i, v)`, `shared.atomic_load(i)`, etc.
  (see [Atomic Operations](/programming-guide/atomics))
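Because `size` is a compile-time constant, the footprint of an allocation can be estimated before a kernel is built. The sketch below is a plain-Python calculation, not an enigma feature. The element sizes are the standard Metal sizes for the dtypes listed above; the 32 KiB budget is an assumption based on the figure commonly reported for Apple GPUs and should be replaced by the limit queried from the actual device.

```python
# Element sizes in bytes for the dtypes named above (standard Metal sizes).
DTYPE_BYTES = {"float": 4, "half": 2, "int": 4, "uint": 4}

# Assumed budget: 32 KiB is typical of Apple GPUs. Query the real limit
# from the device rather than relying on this constant.
ASSUMED_LIMIT_BYTES = 32 * 1024

def threadgroup_bytes(dtype, size):
    """Bytes occupied by one threadgroup allocation of `size` elements."""
    return DTYPE_BYTES[dtype] * size

def fits(allocs, limit=ASSUMED_LIMIT_BYTES):
    """True when all (dtype, size) allocations together stay within `limit`."""
    return sum(threadgroup_bytes(d, n) for d, n in allocs) <= limit

print(threadgroup_bytes("float", 256))                # 1024
print(fits([("float", 4096), ("half", 8192)]))        # True  (exactly 32768 bytes)
print(fits([("float", 8192), ("half", 1)]))           # False (32770 bytes)
```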

### Example: reverse via shared memory

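```python
import enigma

@enigma.kernel
def reverse(A: enigma.f32, B: enigma.f32):
    tid = enigma.thread_position_in_grid()         # global thread index
    lid = enigma.thread_position_in_threadgroup()  # index within the threadgroup
    bsize = enigma.threads_per_threadgroup()

    # One slot per thread, so the threadgroup may hold at most 256 threads.
    shared = enigma.threadgroup_alloc("float", 256)
    shared[lid] = A[tid]
    enigma.barrier()  # every write to `shared` completes before any read
    B[tid] = shared[bsize - lid - 1]
```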
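The kernel reverses each threadgroup's slice of `A` independently rather than the whole buffer. A CPU reference such as the following can be used to check the result; it is a hypothetical helper written in plain Python, and it assumes the buffer length is a multiple of the threadgroup size.

```python
def reverse_reference(a, bsize):
    """Expected output of the `reverse` kernel for threadgroups of `bsize` threads."""
    assert len(a) % bsize == 0, "buffer length must be a multiple of bsize"
    out = [None] * len(a)
    for tid in range(len(a)):
        group_start = (tid // bsize) * bsize   # first element owned by this threadgroup
        lid = tid - group_start                # thread index within the threadgroup
        out[tid] = a[group_start + (bsize - lid - 1)]
    return out

print(reverse_reference(list(range(8)), 4))  # [3, 2, 1, 0, 7, 6, 5, 4]
```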

***

## Buffer indexing

Inside a kernel body, kernel-parameter buffers and threadgroup allocations
both support index access:

```python
val = A[index]      # load
A[index] = val      # store
```

`index` may be:

* An `IRValue` (from a grid query, arithmetic op, etc.)
* A Python `int` — auto-wrapped as a `uint` constant

Multi-dimensional indexing uses tuple form (for register/tile tensors):

```python
acc[i, j] = enigma.fma(a[i, k], b[k, j], acc[i, j])
```
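Placed inside a loop nest over `i`, `j`, and `k`, the tuple-indexed line above accumulates a matrix product. The plain-Python model below shows the arithmetic only. It uses nested lists in place of register/tile tensors and an ordinary multiply-add in place of `enigma.fma`; the function name is hypothetical.

```python
def matmul_accumulate(a, b, acc):
    """Compute acc += a @ b one element at a time, mirroring acc[i, j] = fma(...)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # Same update as: acc[i, j] = fma(a[i, k], b[k, j], acc[i, j])
                acc[i][j] = a[i][k] * b[k][j] + acc[i][j]
    return acc

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_accumulate(a, b, [[0, 0], [0, 0]]))  # [[19, 22], [43, 50]]
```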

***

## Masked load / store

Predicated memory access for boundary handling without a full `enigma.if_`.

### `enigma.load_if(buf, index, mask, default=0) -> IRValue`

Loads `buf[index]` when `mask` is true; otherwise returns `default`. The
lowering is always an unconditional load followed by a select, so no branch
is emitted. This suits unrolled inner loops where branching would inhibit
vectorization.

| Parameter | Type                   | Description                       |
| --------- | ---------------------- | --------------------------------- |
| `buf`     | TracingTensor          | Source buffer                     |
| `index`   | IRValue or int         | Element index                     |
| `mask`    | IRValue (i1)           | Predicate                         |
| `default` | IRValue, int, or float | Value returned when mask is false |

### `enigma.store_if(buf, index, value, mask) -> None`

Stores `value` to `buf[index]` only when `mask` is true. Lowered as
`scf.if { store }`.

```python
# mask is true only for in-range threads (tid < N)
mask = enigma.cmp_ult(tid, enigma.metal_cast(N, "uint"))
val  = enigma.load_if(A, tid, mask, default=0.0)  # 0.0 where mask is false
enigma.store_if(C, tid, val, mask)                # no store where mask is false
```
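The values produced by the masked pair can be modeled on the CPU as follows. This is a sketch in plain Python with hypothetical `*_ref` helper names. It reproduces only the results of `load_if` and `store_if`, not their lowering: in particular, the model skips the read when the mask is false, whereas the real `load_if` is described above as loading unconditionally.

```python
def load_if_ref(buf, index, mask, default=0):
    """Value model of load_if: buf[index] when mask is true, else default."""
    return buf[index] if mask else default

def store_if_ref(buf, index, value, mask):
    """Value model of store_if: write only when mask is true."""
    if mask:
        buf[index] = value

def guarded_copy(a, c, n, grid_size):
    """Run the three-line snippet above once per simulated thread."""
    for tid in range(grid_size):
        mask = tid < n                               # cmp_ult(tid, N)
        val = load_if_ref(a, tid, mask, default=0.0)
        store_if_ref(c, tid, val, mask)
    return c

# 8 threads are launched, but only the first 3 elements are valid.
print(guarded_copy([1.0, 2.0, 3.0], [-1.0] * 4, n=3, grid_size=8))
# [1.0, 2.0, 3.0, -1.0]
```

The untouched `-1.0` in the last slot shows that out-of-range threads leave `C` unchanged.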

***

## Async copy (experimental)

Non-blocking device ↔ threadgroup data movement via AIR intrinsics. Requires
an Apple M3 or newer GPU.

| Function                                                                                       | Description                            |
| ---------------------------------------------------------------------------------------------- | -------------------------------------- |
| `enigma.async_copy_1d_d2t(dst, dst_off, src, src_off, count)`                                  | 1D device → threadgroup. Returns event |
| `enigma.async_copy_1d_t2d(dst, dst_off, src, src_off, count)`                                  | 1D threadgroup → device. Returns event |
| `enigma.async_copy_2d_d2t(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows)` | 2D tile device → threadgroup           |
| `enigma.async_copy_2d_t2d(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows)` | 2D tile threadgroup → device           |
| `enigma.async_copy_wait(*events)`                                                              | Block until events complete            |

See [Async Copy](/programming-guide/async-copy) for parameter details and
caveats.
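As a rough guide to the 2D signatures, the index arithmetic of a tile copy can be sketched in plain Python. This model rests on assumptions that this page does not confirm: that `dst_epr` and `src_epr` mean "elements per row" (the row stride), that offsets are counted in elements, and that tiles are row-major. It is synchronous and says nothing about events; the function name is hypothetical.

```python
def copy_2d_ref(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows):
    """Synchronous model of a 2D tile copy over flat, row-major buffers.

    Assumption: *_epr is the number of elements per row of each buffer,
    and *_off is the element offset of the tile's top-left corner.
    """
    for r in range(tile_rows):
        for c in range(tile_cols):
            dst[dst_off + r * dst_epr + c] = src[src_off + r * src_epr + c]
    return dst

# Copy the 2x2 tile starting at row 1, column 1 of a 4x4 source matrix.
src = list(range(16))   # row-major 4x4, so src_epr = 4
tile = [0] * 4          # row-major 2x2, so dst_epr = 2
print(copy_2d_ref(tile, 0, 2, src, 5, 4, tile_cols=2, tile_rows=2))  # [5, 6, 9, 10]
```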

***

## See also

* [Atomic Operations](/programming-guide/atomics) — full atomic API including CAS
* [SIMD Group Operations](/programming-guide/simd-group-ops) — communication without barriers
* [Registers, Copy & Pipeline](/programming-guide/register-and-pipeline) — `register_tensor`, `copy`, `pipeline`
