Threadgroup Memory & Synchronization

This page documents the API for threadgroup-local shared memory and the synchronization primitives that order accesses to it. For the conceptual model, see Memory Model. For atomic operations, see Atomic Operations.

Barriers

`enigma.barrier(mem_flags="mem_threadgroup") -> None`

Threadgroup barrier — every thread in the threadgroup must reach this point before any can continue.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `mem_flags` | str | `"mem_threadgroup"` | Memory fence scope |

Accepted mem_flags values:

| Value | Fences |
| --- | --- |
| `"mem_none"` | Execution barrier only, no memory fence |
| `"mem_device"` | Device memory |
| `"mem_threadgroup"` | Threadgroup memory (default) |
| `"mem_device_and_threadgroup"` | Both |
| `"mem_texture"` | Texture memory |
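The execution half of the barrier can be pictured with a pure-Python model. This is only a sketch: `run_threadgroup` is a hypothetical helper, Python's `threading.Barrier` stands in for `enigma.barrier()`, and memory-fence scope (`mem_flags`) is not modeled at all.

```python
import threading

def run_threadgroup(n_threads):
    """Model: each thread writes its own slot, then reads its neighbour's slot."""
    shared = [None] * n_threads               # stands in for threadgroup memory
    out = [None] * n_threads
    barrier = threading.Barrier(n_threads)    # stands in for enigma.barrier()

    def body(lid):
        shared[lid] = lid * 10                # phase 1: write own slot
        barrier.wait()                        # no thread proceeds until all have written
        out[lid] = shared[(lid + 1) % n_threads]  # phase 2: read a peer's slot

    threads = [threading.Thread(target=body, args=(lid,)) for lid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

result = run_threadgroup(8)
```

Without the `barrier.wait()` call, a thread could read its neighbour's slot before that neighbour has written it, which is the race the threadgroup barrier exists to prevent.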

`enigma.simd_barrier(mem_flags="mem_threadgroup") -> None`

SIMD-group barrier. Synchronizes only the 32 threads of one SIMD group. Cheaper than a full threadgroup barrier when ordering is needed only within a single SIMD group (a "warp" in CUDA terminology). Accepts the same `mem_flags` values as `enigma.barrier()`.
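To see which threads a SIMD-group barrier covers, the following sketch partitions a threadgroup into groups of 32. It assumes a 1D threadgroup in which SIMD groups are formed from consecutive thread indices; `simd_group_of` and `simd_peers` are hypothetical helpers, not part of the API.

```python
SIMD_WIDTH = 32  # SIMD-group size stated above

def simd_group_of(lid):
    """Index of the SIMD group that thread `lid` belongs to (assumed layout)."""
    return lid // SIMD_WIDTH

def simd_peers(lid, threads_per_threadgroup):
    """Thread indices that share a SIMD group with `lid`."""
    start = simd_group_of(lid) * SIMD_WIDTH
    return range(start, min(start + SIMD_WIDTH, threads_per_threadgroup))

# Thread 40 in a 256-thread threadgroup shares a SIMD group with threads 32..63.
peers = list(simd_peers(40, 256))
```

Under this assumption, data exchanged only among threads 32..63 needs a SIMD-group barrier, while data exchanged between thread 40 and thread 70 needs the full threadgroup barrier.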

Threadgroup shared memory

`enigma.threadgroup_alloc(dtype, size) -> TracingTensor`

Allocates a `threadgroup T[size]` array, where `T` is the element type given by `dtype`. The storage is shared by all threads in the threadgroup.

| Parameter | Type | Description |
| --- | --- | --- |
| `dtype` | str | Element type: `"float"`, `"half"`, `"int"`, `"uint"`, … |
| `size` | int | Number of elements (compile-time constant) |

Returns: A TracingTensor supporting:

Indexed access — shared[i] for load, shared[i] = v for store
Atomic methods — shared.atomic_fetch_add(i, v), shared.atomic_load(i), etc. (see Atomic Operations)
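Because `size` is a compile-time constant, the total footprint of a kernel's allocations can be estimated up front. The sketch below is a hypothetical host-side check, not an `enigma` function. The element sizes are the standard Metal scalar sizes; the 32 KiB budget is an assumption, and the real limit should be read from the device (`MTLDevice.maxThreadgroupMemoryLength`).

```python
# Element sizes in bytes for the dtypes listed above.
DTYPE_BYTES = {"float": 4, "half": 2, "int": 4, "uint": 4}

# Assumed budget; the actual limit depends on the device.
ASSUMED_LIMIT_BYTES = 32 * 1024

def threadgroup_bytes(allocs):
    """Total bytes for a list of (dtype, size) allocations."""
    return sum(DTYPE_BYTES[dtype] * size for dtype, size in allocs)

def fits(allocs, limit=ASSUMED_LIMIT_BYTES):
    """True if the allocations fit within the assumed threadgroup-memory budget."""
    return threadgroup_bytes(allocs) <= limit

# A 256-element float array uses 1024 bytes.
usage = threadgroup_bytes([("float", 256)])
```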

Example: reverse via shared memory

@enigma.kernel
def reverse(A: enigma.f32, B: enigma.f32):
    tid = enigma.thread_position_in_grid()
    lid = enigma.thread_position_in_threadgroup()
    bsize = enigma.threads_per_threadgroup()

    # One slot per thread; assumes bsize <= 256.
    shared = enigma.threadgroup_alloc("float", 256)
    shared[lid] = A[tid]
    # Every thread's store must land before any thread reads a peer's slot.
    enigma.barrier()
    B[tid] = shared[bsize - lid - 1]
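Note that the kernel reverses each threadgroup's slice of `A` independently, not the whole buffer. A host-side reference for checking results could look like the sketch below; `reverse_reference` is a hypothetical helper, and it assumes a 1D dispatch whose grid size is a multiple of the threadgroup size.

```python
def reverse_reference(a, bsize):
    """Expected output of the kernel: each block of `bsize` elements is reversed."""
    if len(a) % bsize != 0:
        raise ValueError("grid size must be a multiple of the threadgroup size")
    b = [None] * len(a)
    for tid in range(len(a)):
        group_start = (tid // bsize) * bsize   # first element owned by this threadgroup
        lid = tid - group_start                # position within the threadgroup
        b[tid] = a[group_start + (bsize - lid - 1)]
    return b

# Two threadgroups of four threads each.
expected = reverse_reference([0, 1, 2, 3, 4, 5, 6, 7], 4)
```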

Buffer indexing

Inside a kernel body, kernel-parameter buffers and threadgroup allocations both support index access:

val = A[index]      # load
A[index] = val      # store

index may be:

An IRValue (from a grid query, arithmetic op, etc.)
A Python int — auto-wrapped as a uint constant

Multi-dimensional indexing uses tuple form (for register/tile tensors):

acc[i, j] = enigma.fma(a[i, k], b[k, j], acc[i, j])
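The accumulate step above is the inner statement of a tile matrix multiply. As a rough pure-Python model of the same indexing pattern (the `Tile` class, its row-major layout, and the unfused `fma` are all assumptions made for illustration, not the library's implementation):

```python
class Tile:
    """Hypothetical stand-in for a register/tile tensor with tuple indexing."""

    def __init__(self, rows, cols, fill=0.0):
        self.rows, self.cols = rows, cols
        self.data = [fill] * (rows * cols)   # row-major storage (assumed)

    def __getitem__(self, idx):
        i, j = idx
        return self.data[i * self.cols + j]

    def __setitem__(self, idx, value):
        i, j = idx
        self.data[i * self.cols + j] = value

def fma(x, y, z):
    """x * y + z, computed here without fused rounding."""
    return x * y + z

def tile_matmul(a, b):
    """acc = a @ b, built from the accumulate step shown above."""
    acc = Tile(a.rows, b.cols)
    for i in range(a.rows):
        for j in range(b.cols):
            for k in range(a.cols):
                acc[i, j] = fma(a[i, k], b[k, j], acc[i, j])
    return acc
```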

Masked load / store

Predicated memory access for boundary handling without a full enigma.if_.

`enigma.load_if(buf, index, mask, default=0) -> IRValue`

Loads `buf[index]` when `mask` is true; otherwise returns `default`. The lowering always emits an unconditional load followed by a select, so it introduces no branch. This suits unrolled inner loops where branching would inhibit vectorization.

| Parameter | Type | Description |
| --- | --- | --- |
| `buf` | TracingTensor | Source buffer |
| `index` | IRValue or int | Element index |
| `mask` | IRValue (i1) | Predicate |
| `default` | IRValue, int, or float | Value returned when mask is false |

`enigma.store_if(buf, index, value, mask) -> None`

Stores value to buf[index] only when mask is true. Lowered as scf.if { store }.

mask = enigma.cmp_ult(tid, enigma.metal_cast(N, "uint"))
val  = enigma.load_if(A, tid, mask, default=0.0)
enigma.store_if(C, tid, val, mask)
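The snippet above guards a grid that is larger than the `N` valid elements. Its intended result can be modeled in plain Python as follows. The `_ref` functions are hypothetical stand-ins that capture only the value semantics; in particular, this model skips the load when the mask is false, so it does not reproduce the load-then-select lowering described for `load_if`.

```python
def load_if_ref(buf, index, mask, default=0):
    """Value semantics of a masked load: buf[index] if mask, else default."""
    return buf[index] if mask else default

def store_if_ref(buf, index, value, mask):
    """Value semantics of a masked store: write only when mask is true."""
    if mask:
        buf[index] = value

def masked_copy(a, c, n, grid_size):
    """Run the masked-copy pattern once per thread in a 1D grid."""
    for tid in range(grid_size):
        mask = tid < n                      # mirrors cmp_ult(tid, N)
        val = load_if_ref(a, tid, mask, default=0.0)
        store_if_ref(c, tid, val, mask)

# Three valid elements, eight threads: threads 3..7 are masked off.
a = [1.0, 2.0, 3.0]
c = [-1.0, -1.0, -1.0]
masked_copy(a, c, n=3, grid_size=8)
```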

Async copy (experimental)

Non-blocking device ↔ threadgroup data movement via AIR intrinsics. Requires M3 or newer.

| Function | Description |
| --- | --- |
| `enigma.async_copy_1d_d2t(dst, dst_off, src, src_off, count)` | 1D device → threadgroup. Returns event |
| `enigma.async_copy_1d_t2d(dst, dst_off, src, src_off, count)` | 1D threadgroup → device. Returns event |
| `enigma.async_copy_2d_d2t(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows)` | 2D tile device → threadgroup |
| `enigma.async_copy_2d_t2d(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows)` | 2D tile threadgroup → device |
| `enigma.async_copy_wait(*events)` | Block until events complete |

See Async Copy for parameter details and caveats.
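As an aid to reading the 2D signatures, here is a synchronous pure-Python model of which elements a tile copy would move. It rests on assumptions this page does not confirm: that `*_off` are element offsets and that `*_epr` means elements per row (the row stride). `copy_2d_reference` is a hypothetical helper; the Async Copy page is the authority on the real semantics.

```python
def copy_2d_reference(dst, dst_off, dst_epr, src, src_off, src_epr,
                      tile_cols, tile_rows):
    """Copy a tile_rows x tile_cols tile between two flat, strided buffers."""
    for r in range(tile_rows):
        for c in range(tile_cols):
            # Row r starts r * epr elements past the base offset (assumed meaning).
            dst[dst_off + r * dst_epr + c] = src[src_off + r * src_epr + c]

# Copy the 2x2 tile at row 1, column 1 of a 4x4 source into a packed 2x2 destination.
src = list(range(16))
dst = [None] * 4
copy_2d_reference(dst, 0, 2, src, 5, 4, tile_cols=2, tile_rows=2)
```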
