Barriers
enigma.barrier(mem_flags="mem_threadgroup") -> None
Threadgroup barrier — every thread in the threadgroup must reach this point
before any can continue.
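The rendezvous behaviour described above can be illustrated on the CPU. The following is a plain-Python sketch, not enigma kernel code: `threading.Barrier` stands in for the threadgroup barrier, and `NUM_THREADS`, `arrive_leave_log`, and `worker` are names made up for this illustration. It models only the execution barrier, not the memory-fence scope selected by mem_flags.

```python
import threading

NUM_THREADS = 8                          # hypothetical threadgroup size
gate = threading.Barrier(NUM_THREADS)    # stand-in for the threadgroup barrier
arrive_leave_log = []                    # ordered record of ("arrive" | "leave", tid)
log_lock = threading.Lock()

def worker(tid):
    with log_lock:
        arrive_leave_log.append(("arrive", tid))
    gate.wait()                          # blocks until all NUM_THREADS threads have arrived
    with log_lock:
        arrive_leave_log.append(("leave", tid))

threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

kinds = [kind for kind, _ in arrive_leave_log]
last_arrive = max(i for i, k in enumerate(kinds) if k == "arrive")
first_leave = kinds.index("leave")
print(last_arrive < first_leave)
```

In the log, every "arrive" entry precedes every "leave" entry, which is the ordering guarantee a threadgroup barrier gives to the code on either side of it.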
| Parameter | Type | Default | Description |
|---|---|---|---|
| mem_flags | str | "mem_threadgroup" | Memory fence scope |
mem_flags values:
| Value | Fences |
|---|---|
| "mem_none" | Execution barrier only, no memory fence |
| "mem_device" | Device memory |
| "mem_threadgroup" | Threadgroup memory (default) |
| "mem_device_and_threadgroup" | Both device and threadgroup memory |
| "mem_texture" | Texture memory |
enigma.simd_barrier(mem_flags="mem_threadgroup") -> None
SIMD-group barrier. Synchronizes only the 32 threads of one SIMD group.
Cheaper than a full threadgroup barrier when ordering is only needed within
a single SIMD group (the counterpart of a CUDA warp).
Same mem_flags values as enigma.barrier().
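To make the difference in scope concrete, here is a plain-Python sketch (not enigma code) in which each SIMD group has its own barrier. It assumes threads are assigned to SIMD groups by `tid // 32`, which is an assumption of this illustration and not something stated above; `THREADGROUP_SIZE`, `simd_barriers`, and `run_threads` are likewise hypothetical names.

```python
import threading

SIMD_WIDTH = 32          # threads per SIMD group, as stated above
THREADGROUP_SIZE = 64    # hypothetical threadgroup made of two SIMD groups

# One barrier per SIMD group: a thread only ever waits on its own group's barrier.
simd_barriers = [threading.Barrier(SIMD_WIDTH)
                 for _ in range(THREADGROUP_SIZE // SIMD_WIDTH)]
passed = []
passed_lock = threading.Lock()

def simd_worker(tid):
    simd_group = tid // SIMD_WIDTH           # assumed linear thread-to-group mapping
    simd_barriers[simd_group].wait(timeout=5)
    with passed_lock:
        passed.append(tid)

def run_threads(tids):
    threads = [threading.Thread(target=simd_worker, args=(tid,)) for tid in tids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# SIMD group 0 gets through its barrier before group 1 has even started.
run_threads(range(0, SIMD_WIDTH))
after_first_group = sorted(passed)

run_threads(range(SIMD_WIDTH, THREADGROUP_SIZE))
print(len(after_first_group), len(passed))
```

With a single threadgroup-wide barrier of 64 parties, the first `run_threads` call would never complete, because the other 32 threads have not arrived. That is the cost the SIMD-scoped barrier avoids when no cross-group ordering is needed.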
Threadgroup shared memory
enigma.threadgroup_alloc(dtype, size) -> TracingTensor
Allocates an array of size elements of type dtype in threadgroup memory (threadgroup T[size] in Metal terms), shared by all threads of the threadgroup.
| Parameter | Type | Description |
|---|---|---|
| dtype | str | Element type: "float", "half", "int", "uint", … |
| size | int | Number of elements (compile-time constant) |
Returns a TracingTensor supporting:
- Indexed access: shared[i] for load, shared[i] = v for store
- Atomic methods: shared.atomic_fetch_add(i, v), shared.atomic_load(i), etc. (see Atomic Operations)
Example: reverse via shared memory
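A CPU-side sketch of this pattern, written in plain Python and not as enigma kernel code: each simulated thread stages its element in a shared array, waits at a barrier, then reads the mirrored slot. `reverse_via_shared` and its internals are hypothetical names; the list `shared` stands in for a threadgroup allocation and `threading.Barrier` for the threadgroup barrier.

```python
import threading

def reverse_via_shared(data):
    """Simulate one threadgroup of len(data) threads reversing data."""
    n = len(data)
    if n == 0:
        return []
    shared = [None] * n              # stand-in for threadgroup storage
    out = [None] * n                 # stand-in for the device output buffer
    gate = threading.Barrier(n)      # stand-in for the threadgroup barrier

    def thread_body(tid):
        shared[tid] = data[tid]          # phase 1: stage this thread's element
        gate.wait()                      # all stores complete before any load below
        out[tid] = shared[n - 1 - tid]   # phase 2: read the mirrored slot

    threads = [threading.Thread(target=thread_body, args=(tid,)) for tid in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

result = reverse_via_shared([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(result)
```

Without the barrier between the two phases, a thread could read a slot of `shared` that its partner has not written yet. That race is what the threadgroup barrier rules out in the real kernel.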
Buffer indexing
Inside a kernel body, kernel-parameter buffers and threadgroup allocations both support index access: buf[index] loads an element and buf[index] = value stores one.
index may be:
- An IRValue (from a grid query, arithmetic op, etc.)
- A Python int, auto-wrapped as a uint constant
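The two accepted index forms can be pictured with a small normalization step. This is a hypothetical model only: `IRValueModel` and `normalize_index` are invented for illustration and do not describe enigma's actual classes or internals. The sketch assumes nothing beyond what is stated above, namely that an existing IR value is used as-is and a Python int becomes a uint constant.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IRValueModel:
    """Hypothetical stand-in for an IRValue: a type name plus an expression."""
    dtype: str
    expr: str

def normalize_index(index):
    """Return an IRValueModel for either accepted index form."""
    if isinstance(index, IRValueModel):
        return index                                    # already an IR value: use as-is
    if isinstance(index, int):
        return IRValueModel(dtype="uint", expr=str(index))  # wrap int as a uint constant
    raise TypeError(f"unsupported index type: {type(index).__name__}")

# An index that would come from a grid query (the expression text is illustrative).
tid = IRValueModel(dtype="uint", expr="thread_position_in_grid")

print(normalize_index(tid))
print(normalize_index(3))
```

How the real library treats negative ints or other numeric types is not stated above, so the model simply rejects anything that is neither form.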
Masked load / store
Predicated memory access for boundary handling without a full enigma.if_.
enigma.load_if(buf, index, mask, default=0) -> IRValue
Loads buf[index] when mask is true, otherwise returns default. Always
emits an unconditional load followed by a select — safe for unrolled inner
loops where branching would inhibit vectorization.
| Parameter | Type | Description |
|---|---|---|
| buf | TracingTensor | Source buffer |
| index | IRValue or int | Element index |
| mask | IRValue (i1) | Predicate |
| default | IRValue, int, or float | Value returned when mask is false |
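The load-then-select lowering described above can be modelled in a few lines of plain Python. This is a semantic sketch, not the enigma implementation: `load_if_model`, `padded`, and `n_valid` are hypothetical names, and a Python list stands in for the buffer.

```python
def load_if_model(buf, index, mask, default=0):
    """Model of load_if: unconditional load, then select on the mask."""
    loaded = buf[index]                  # the load happens regardless of mask
    return loaded if mask else default   # select between loaded value and default

# Hypothetical 8-slot allocation whose last two slots are padding.
padded = [10, 11, 12, 13, 14, 15, -1, -1]
n_valid = 6

lanes = [load_if_model(padded, i, i < n_valid) for i in range(len(padded))]
print(lanes)
```

Because the load in this model is unconditional, the index is dereferenced even when the mask is false. The example therefore keeps every index inside the padded allocation; how an out-of-range index behaves in a real kernel is not stated above.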
enigma.store_if(buf, index, value, mask) -> None
Stores value to buf[index] only when mask is true. Lowered as
scf.if { store }.
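For comparison with the select-based load, the guarded-store behaviour can be modelled as an ordinary branch around the write. This is a plain-Python sketch under the description above, with `store_if_model` and `squares` as hypothetical names, not the library's code.

```python
def store_if_model(buf, index, value, mask):
    """Model of store_if: the store executes only inside the taken branch."""
    if mask:
        buf[index] = value

# Hypothetical launch of 8 threads over a 6-element output buffer.
squares = [0] * 6
for tid in range(8):
    store_if_model(squares, tid, tid * tid, tid < len(squares))

print(squares)
```

In this model the index is never touched when the mask is false, so the two out-of-range threads (6 and 7) are harmless.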
Async copy (experimental)
Non-blocking device ↔ threadgroup data movement via AIR intrinsics. Requires M3 or newer.
| Function | Description |
|---|---|
| enigma.async_copy_1d_d2t(dst, dst_off, src, src_off, count) | 1D device → threadgroup. Returns event |
| enigma.async_copy_1d_t2d(dst, dst_off, src, src_off, count) | 1D threadgroup → device. Returns event |
| enigma.async_copy_2d_d2t(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows) | 2D tile device → threadgroup |
| enigma.async_copy_2d_t2d(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows) | 2D tile threadgroup → device |
| enigma.async_copy_wait(*events) | Block until events complete |
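The 2D parameter list suggests a strided tile copy. The sketch below is a synchronous plain-Python model of that addressing, assuming dst_epr and src_epr mean "elements per row" (the row stride, in elements) of the destination and source. That reading is an assumption, not something stated above. `copy_2d_model` is a hypothetical name, and events and asynchrony are not modelled.

```python
def copy_2d_model(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows):
    """Copy a tile_rows x tile_cols tile between flat row-major buffers.

    Assumes *_epr is the row stride in elements (an assumption of this sketch).
    """
    for r in range(tile_rows):
        for c in range(tile_cols):
            dst[dst_off + r * dst_epr + c] = src[src_off + r * src_epr + c]

# Source: a 4 x 6 row-major matrix holding 0..23.
src_cols = 6
src = list(range(4 * src_cols))

# Destination: a tightly packed 2 x 3 tile.
tile_rows, tile_cols = 2, 3
tile = [None] * (tile_rows * tile_cols)

# The tile starts at row 1, column 2 of the source.
src_off = 1 * src_cols + 2
copy_2d_model(tile, 0, tile_cols, src, src_off, src_cols, tile_cols, tile_rows)
print(tile)
```

Keeping separate row strides for source and destination is what lets a small, densely packed tile be cut out of a much wider matrix.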
See also
- Atomic Operations — full atomic API including CAS
- SIMD Group Operations — communication without barriers
- Registers, Copy & Pipeline: register_tensor, copy, pipeline
