This page documents the API for threadgroup-local shared memory and the synchronization primitives that order accesses to it. For the conceptual model, see Memory Model. For atomic operations, see Atomic Operations.

Barriers

enigma.barrier(mem_flags="mem_threadgroup") -> None

Threadgroup barrier — every thread in the threadgroup must reach this point before any can continue.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mem_flags | str | "mem_threadgroup" | Memory fence scope |

Accepted mem_flags values:

| Value | Fences |
| --- | --- |
| "mem_none" | Execution barrier only, no memory fence |
| "mem_device" | Device memory |
| "mem_threadgroup" | Threadgroup memory (default) |
| "mem_device_and_threadgroup" | Both |
| "mem_texture" | Texture memory |

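To make the barrier's role concrete, here is a minimal CPU-side sketch that stands in for one threadgroup using Python threads. It is not how enigma executes kernels: run_threadgroup is a hypothetical helper, threading.Barrier plays the part of enigma.barrier(), and only the execution-synchronization aspect is modeled, not the mem_flags fence scopes.

```python
import threading

def run_threadgroup(values):
    """Simulate one threadgroup: each thread stores one value, then reads its mirror slot."""
    n = len(values)
    shared = [None] * n               # stands in for threadgroup memory
    out = [None] * n
    barrier = threading.Barrier(n)    # stands in for enigma.barrier()

    def thread_body(lid):
        shared[lid] = values[lid]         # phase 1: each thread writes its own slot
        barrier.wait()                    # no thread continues until all have arrived
        out[lid] = shared[n - 1 - lid]    # phase 2: read a slot written by another thread

    threads = [threading.Thread(target=thread_body, args=(lid,)) for lid in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

print(run_threadgroup([10, 20, 30, 40]))  # [40, 30, 20, 10]
```

Without the wait between the two phases, a thread could read a slot that its peer has not written yet.
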
enigma.simd_barrier(mem_flags="mem_threadgroup") -> None

SIMD-group barrier. Synchronizes only the 32 threads of one SIMD group. Cheaper than a full threadgroup barrier when ordering is only needed within a single SIMD group. Accepts the same mem_flags values as enigma.barrier().
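
To see which threads a SIMD-group barrier covers, the sketch below maps a linear thread index within the threadgroup to a SIMD group and a lane. It assumes threads are assigned to SIMD groups in linear index order, which this page does not state; simd_coords and simd_peers are hypothetical helpers, not enigma functions.

```python
SIMD_WIDTH = 32  # SIMD-group size stated above

def simd_coords(lid, simd_width=SIMD_WIDTH):
    """Return (simd_group_index, lane) for a linear thread index in the threadgroup."""
    return divmod(lid, simd_width)

def simd_peers(lid, threads_per_threadgroup, simd_width=SIMD_WIDTH):
    """Linear indices of the threads sharing a SIMD group with `lid` (assumed linear layout)."""
    group, _ = simd_coords(lid, simd_width)
    start = group * simd_width
    return range(start, min(start + simd_width, threads_per_threadgroup))

print(simd_coords(37))           # (1, 5)
print(list(simd_peers(37, 64)))  # threads 32..63
```

Under this assumption, data exchanged only between threads returned by simd_peers needs a simd_barrier; anything crossing SIMD groups needs the full threadgroup barrier.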

Threadgroup shared memory

enigma.threadgroup_alloc(dtype, size) -> TracingTensor

Allocates an array of size elements of type dtype in threadgroup memory (threadgroup T[size]), shared by all threads of the threadgroup.
| Parameter | Type | Description |
| --- | --- | --- |
| dtype | str | Element type: "float", "half", "int", "uint", … |
| size | int | Number of elements (compile-time constant) |

Returns: A TracingTensor supporting:
  • Indexed access — shared[i] for load, shared[i] = v for store
  • Atomic methods — shared.atomic_fetch_add(i, v), shared.atomic_load(i), etc. (see Atomic Operations)

Example: reverse via shared memory

@enigma.kernel
def reverse(A: enigma.f32, B: enigma.f32):
    tid = enigma.thread_position_in_grid()
    lid = enigma.thread_position_in_threadgroup()
    bsize = enigma.threads_per_threadgroup()

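    # 256 elements are allocated, so this kernel assumes at most 256 threads per threadgroup.
    shared = enigma.threadgroup_alloc("float", 256)
    shared[lid] = A[tid]
    # All writes to shared must complete before any thread reads another thread's slot.
    enigma.barrier()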
    B[tid] = shared[bsize - lid - 1]
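
Because shared is local to a threadgroup, the kernel above reverses each threadgroup-sized block of A independently rather than the whole buffer. A plain-Python reference of the expected output can help when checking results; reverse_reference is a hypothetical helper, and the sketch assumes a 1D grid whose size is a whole multiple of the threadgroup size.

```python
def reverse_reference(A, bsize):
    """Expected contents of B: each threadgroup-sized block of A, reversed."""
    assert len(A) % bsize == 0, "sketch assumes a whole number of threadgroups"
    B = [None] * len(A)
    for tid in range(len(A)):
        group, lid = divmod(tid, bsize)               # threadgroup index, index within it
        B[tid] = A[group * bsize + (bsize - lid - 1)]  # mirror slot inside the same block
    return B

print(reverse_reference([0, 1, 2, 3, 4, 5, 6, 7], 4))  # [3, 2, 1, 0, 7, 6, 5, 4]
```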

Buffer indexing

Inside a kernel body, kernel-parameter buffers and threadgroup allocations both support index access:
val = A[index]      # load
A[index] = val      # store
index may be:
  • An IRValue (from a grid query, arithmetic op, etc.)
  • A Python int — auto-wrapped as a uint constant
Multi-dimensional indexing uses tuple form (for register/tile tensors):
acc[i, j] = enigma.fma(a[i, k], b[k, j], acc[i, j])
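
The tuple-indexed update above is the inner step of a tile matrix multiply. As a rough illustration in plain Python, the loop nest below applies the same update to nested lists. It assumes enigma.fma(x, y, z) computes x * y + z, which this page does not spell out, and it does not reproduce the single rounding of a hardware FMA; tile_matmul_accumulate is a hypothetical name.

```python
def tile_matmul_accumulate(a, b, acc):
    """In place: acc[i][j] += a[i][k] * b[k][j] for nested-list tiles."""
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # Mirrors acc[i, j] = fma(a[i, k], b[k, j], acc[i, j])
                acc[i][j] = a[i][k] * b[k][j] + acc[i][j]
    return acc

acc = [[0, 0], [0, 0]]
print(tile_matmul_accumulate([[1, 2], [3, 4]], [[5, 6], [7, 8]], acc))  # [[19, 22], [43, 50]]
```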

Masked load / store

Predicated memory access for boundary handling without a full enigma.if_.

enigma.load_if(buf, index, mask, default=0) -> IRValue

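Loads buf[index] when mask is true, otherwise returns default. Always lowered to an unconditional load followed by a select, so no branch is emitted; this suits unrolled inner loops where branching would inhibit vectorization. Because the load itself is not predicated, buf[index] is read even when mask is false.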
| Parameter | Type | Description |
| --- | --- | --- |
| buf | TracingTensor | Source buffer |
| index | IRValue or int | Element index |
| mask | IRValue (i1) | Predicate |
| default | IRValue, int, or float | Value returned when mask is false |

enigma.store_if(buf, index, value, mask) -> None

Stores value to buf[index] only when mask is true. Lowered as scf.if { store }.
mask = enigma.cmp_ult(tid, enigma.metal_cast(N, "uint"))
val  = enigma.load_if(A, tid, mask, default=0.0)
enigma.store_if(C, tid, val, mask)
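
The observable result of the snippet above can be modeled in plain Python. This is only a sketch of the value semantics: the functions below are stand-ins with the same names, not the enigma implementations, and unlike the lowering described above the Python load_if skips the read when mask is false. guarded_copy is a hypothetical driver that runs the three lines once per thread id.

```python
def load_if(buf, index, mask, default=0):
    """Value produced by a masked load: buf[index] if mask is true, else default."""
    return buf[index] if mask else default

def store_if(buf, index, value, mask):
    """Masked store: write buf[index] only when mask is true."""
    if mask:
        buf[index] = value

def guarded_copy(A, C, n, grid_size):
    """Run the masked copy once per thread id, sequentially."""
    for tid in range(grid_size):
        mask = tid < n                           # stands in for cmp_ult(tid, N)
        val = load_if(A, tid, mask, default=0.0)
        store_if(C, tid, val, mask)
    return C

# 8 threads launched, only the first 3 are in bounds.
print(guarded_copy([1.0, 2.0, 3.0], [-1.0] * 5, n=3, grid_size=8))
# [1.0, 2.0, 3.0, -1.0, -1.0]
```

Threads with tid >= n leave C untouched, which is the boundary handling the masked pair is meant to provide.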

Async copy (experimental)

Non-blocking device ↔ threadgroup data movement via AIR intrinsics. Requires M3 or newer.
| Function | Description |
| --- | --- |
| enigma.async_copy_1d_d2t(dst, dst_off, src, src_off, count) | 1D device → threadgroup. Returns event |
| enigma.async_copy_1d_t2d(dst, dst_off, src, src_off, count) | 1D threadgroup → device. Returns event |
| enigma.async_copy_2d_d2t(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows) | 2D tile device → threadgroup |
| enigma.async_copy_2d_t2d(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows) | 2D tile threadgroup → device |
| enigma.async_copy_wait(*events) | Block until events complete |

See Async Copy for parameter details and caveats.
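
As a rough mental model of the 2D variants, the synchronous Python sketch below copies a tile between two flat, row-major buffers. It rests on assumptions not confirmed on this page: that dst_epr and src_epr mean "elements per row" of each buffer, and that offsets are in elements. copy_2d_reference is a hypothetical helper; the Async Copy page remains the authority on the actual parameters.

```python
def copy_2d_reference(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows):
    """Synchronous model of a 2D tile copy between flat row-major buffers.

    Assumes *_epr is the row pitch in elements and *_off is an element offset.
    """
    for r in range(tile_rows):
        for c in range(tile_cols):
            dst[dst_off + r * dst_epr + c] = src[src_off + r * src_epr + c]
    return dst

# Copy the 2x2 tile starting at row 1, column 1 of a 4x4 matrix into a packed 2x2 buffer.
src = list(range(16))   # 4x4, row-major, 4 elements per row
dst = [0] * 4           # 2x2, row-major, 2 elements per row
print(copy_2d_reference(dst, 0, 2, src, 1 * 4 + 1, 4, tile_cols=2, tile_rows=2))  # [5, 6, 9, 10]
```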

See also