Metal defines four address spaces that correspond to different physical memory locations on the GPU. Enigma maps these to distinct kernel operations.

Address spaces

| Space | Metal keyword | Scope | Use |
| --- | --- | --- | --- |
| device | device | GPU-wide | Input/output buffers, default for kernel parameters |
| constant | constant | Read-only, cached | Uniform data shared across all threads |
| threadgroup | threadgroup | One threadgroup | Shared memory for cooperation within a threadgroup |
| thread | thread | Single thread | Private register-level storage |

Device memory

All kernel buffer parameters (A: enigma.f32, etc.) are in device memory. Indexed access emits global loads and stores:
val = A[tid]        # device load
C[tid] = val        # device store
With vec_width=4, Enigma emits device float4* paths where alignment allows, improving memory bandwidth utilization.
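The alignment condition can be made concrete with a small pure-Python check. This is only a sketch of the general rule that a float4 access needs a 16-byte-aligned address. The helper name can_use_float4 and the exact conditions Enigma applies are assumptions, not part of the Enigma API.

```python
FLOAT_BYTES = 4
FLOAT4_BYTES = 4 * FLOAT_BYTES  # a float4 access moves 16 bytes at once


def can_use_float4(byte_offset: int, n_elements: int) -> bool:
    """Hypothetical check: is a vectorized float4 path safe for this buffer view?"""
    aligned = byte_offset % FLOAT4_BYTES == 0  # start address on a 16-byte boundary
    whole_vectors = n_elements % 4 == 0        # no scalar tail left over
    return aligned and whole_vectors


print(can_use_float4(0, 1024))  # True: aligned start, 256 whole float4 vectors
print(can_use_float4(4, 1024))  # False: start is offset by one float
print(can_use_float4(0, 1022))  # False: two elements would be left over
```

When either condition fails, a scalar path (one float per access) remains correct, only slower.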

Threadgroup (shared) memory

Threadgroup memory is a scratchpad shared by all threads in a threadgroup. It is fast, limited (typically 32 KB), and requires explicit synchronization. Allocate with enigma.threadgroup_alloc:
@enigma.kernel
def reduce(A: enigma.f32, Out: enigma.f32):
    local_id = enigma.thread_position_in_threadgroup("x")
    
    # Allocate 256 floats in shared memory
    scratch = enigma.threadgroup_alloc("float", 256)
    
    # Load from device into shared memory
    tid = enigma.thread_position_in_grid
    scratch[local_id] = A[tid]
    
    # Synchronize before reading neighbors
    enigma.barrier(mem_flags="mem_threadgroup")
    
    # Read neighbor (wrap around so the last thread stays inside the 256-element allocation)
    neighbor = scratch[(local_id + 1) % 256]
    Out[tid] = neighbor
Key rules:
  • Always call enigma.barrier() between a write and a cross-thread read.
  • Allocation size is fixed at compile time.
  • The mem_flags argument controls which memory types the barrier synchronizes.
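The first rule can be illustrated on the CPU. The sketch below does not use Enigma or Metal at all: it uses Python threads and threading.Barrier as a stand-in for a threadgroup and its barrier, with a plain list playing the role of threadgroup memory. The names THREADS and run_threadgroup are hypothetical.

```python
import threading

THREADS = 8  # stand-in for the threadgroup size


def run_threadgroup(values):
    """Each 'thread' writes its own slot, waits, then reads its neighbor's slot."""
    scratch = [None] * THREADS  # stand-in for threadgroup memory
    out = [None] * THREADS
    barrier = threading.Barrier(THREADS)

    def worker(local_id):
        scratch[local_id] = values[local_id]  # write own slot
        barrier.wait()                        # every write has completed past this point
        out[local_id] = scratch[(local_id + 1) % THREADS]  # cross-thread read

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


print(run_threadgroup(list(range(THREADS))))  # [1, 2, 3, 4, 5, 6, 7, 0]
```

Without the barrier.wait() call, a thread could read a neighbor slot that still holds None, which is the CPU analogue of reading uninitialized threadgroup memory.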

Barrier flags

| Flag | Synchronizes |
| --- | --- |
| "mem_none" | Execution order only |
| "mem_device" | Device (global) memory |
| "mem_threadgroup" | Threadgroup (shared) memory |
| "mem_device_and_threadgroup" | Both |
enigma.barrier(mem_flags="mem_threadgroup")      # threadgroup barrier
enigma.simd_barrier(mem_flags="mem_threadgroup") # SIMD group barrier
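One way to read the flag table is as a decision on which memory the threads wrote before the barrier. The helper below is a hypothetical convenience, not an Enigma function. It only returns the flag strings listed in the table.

```python
def pick_mem_flags(wrote_device: bool, wrote_threadgroup: bool) -> str:
    """Hypothetical helper: choose the narrowest flag covering the memory written."""
    if wrote_device and wrote_threadgroup:
        return "mem_device_and_threadgroup"
    if wrote_device:
        return "mem_device"
    if wrote_threadgroup:
        return "mem_threadgroup"
    return "mem_none"  # only execution order needs to be synchronized


print(pick_mem_flags(wrote_device=False, wrote_threadgroup=True))  # mem_threadgroup
```

The result could then be passed as the mem_flags argument, for example enigma.barrier(mem_flags=pick_mem_flags(False, True)).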

Scalar parameters

enigma.Scalar(dtype) parameters are lowered as 1-element device buffers and auto-loaded at kernel entry. Use them for per-dispatch constants that vary across calls:
@enigma.kernel
def scale(A: enigma.f32, factor: enigma.Scalar(enigma.f32), C: enigma.f32):
    tid = enigma.thread_position_in_grid
    C[tid] = A[tid] * factor
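A CPU reference makes the lowering easier to picture. The sketch below is plain Python, not Enigma code: it models the scalar as a 1-element list that is read once before the per-thread work, mirroring the description above. The name scale_reference is hypothetical.

```python
def scale_reference(A, factor_buf):
    """CPU model of the scale kernel: factor_buf is a 1-element buffer."""
    factor = factor_buf[0]           # loaded once, as at kernel entry
    return [a * factor for a in A]   # one multiply per "thread"


print(scale_reference([1.0, 2.0, 3.0], [2.0]))  # [2.0, 4.0, 6.0]
```

A reference like this is also handy for checking kernel output on small inputs.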

Higher-level memory primitives

The DSL provides register-and-pipeline building blocks for staged compute patterns:
| Primitive | Purpose |
| --- | --- |
| enigma.register_tensor(...) | Register-level tile allocation |
| enigma.copy(src, dst, count=...) | Explicit data movement |
| enigma.pipeline(dtype, size, stages=...) | Software pipeline staging |
| enigma.async_copy_to_threadgroup(...) | Async device → threadgroup copy |
These are building blocks for advanced tiled GEMM and attention kernels. See Layout Algebra for the tiling workflow.
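The general idea of software pipeline staging can be shown without a GPU. The sketch below is plain Python and does not call any Enigma primitive: it rotates through a fixed number of staging buffers, loading a later tile while the current one is consumed. The function name pipelined_sum and its parameters are hypothetical, and how Enigma implements stages internally is not specified here.

```python
def pipelined_sum(data, tile, stages=2):
    """Sum `data` tile by tile, keeping up to `stages` tiles staged at once."""
    tiles = [data[i:i + tile] for i in range(0, len(data), tile)]
    buffers = [None] * stages  # stand-in for staged threadgroup buffers

    # Prologue: stage the first (stages - 1) tiles before any compute starts.
    for s in range(min(stages - 1, len(tiles))):
        buffers[s] = tiles[s]

    total = 0
    for i in range(len(tiles)):
        nxt = i + stages - 1
        if nxt < len(tiles):
            # On a GPU this load would be asynchronous and overlap the compute below.
            buffers[nxt % stages] = tiles[nxt]
        total += sum(buffers[i % stages])  # consume the tile staged earlier
    return total


print(pipelined_sum(list(range(10)), tile=4, stages=2))  # 45
```

A slot is overwritten only after its tile has been consumed, which is the invariant a real pipeline enforces with barriers or async-copy waits.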

Device capabilities

Query hardware limits before using feature-gated operations:
caps = rt.device_capabilities()

Capability fields

| Field | Type | Description |
| --- | --- | --- |
| gpu_family | str | Human-readable family, e.g. "apple9" |
| gpu_family_raw | int | Raw Metal GPU family integer |
| is_m3_or_newer | bool | Whether device is M3 or newer |
| supports_async_copy | bool | Async threadgroup copy support |
| supports_simdgroup_matrix | bool | 8×8 simdgroup matrix multiply support |
| simdgroup_size | int | Typically 32 |
| max_threadgroup_memory | int | Bytes of threadgroup memory |
| max_threads_per_threadgroup | int | Typically 1024 |
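A typical use of max_threadgroup_memory is checking that a planned allocation fits before compiling a kernel. The sketch below uses a hypothetical Caps dataclass as a stand-in for the real capabilities object, and the DTYPE_BYTES table (including the "half" entry) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Caps:
    """Hypothetical stand-in exposing one of the documented fields."""
    max_threadgroup_memory: int  # bytes


DTYPE_BYTES = {"float": 4, "half": 2}  # assumed element sizes


def fits_in_threadgroup(caps: Caps, dtype: str, count: int) -> bool:
    """Return True if `count` elements of `dtype` fit in threadgroup memory."""
    return DTYPE_BYTES[dtype] * count <= caps.max_threadgroup_memory


caps = Caps(max_threadgroup_memory=32 * 1024)
print(fits_in_threadgroup(caps, "float", 256))    # True: 1024 bytes
print(fits_in_threadgroup(caps, "float", 16384))  # False: 65536 bytes
```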

Requiring capabilities

caps.require_m3("async_copy operations")
# raises RuntimeError if device is older than M3
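Because the guard raises RuntimeError, a caller can catch it and fall back to a path that works on older devices. The sketch below uses a hypothetical FakeCaps class in place of the real capabilities object. The error message text and the path names "async_copy" and "sync_copy" are assumptions.

```python
class FakeCaps:
    """Hypothetical stand-in for the object returned by rt.device_capabilities()."""

    def __init__(self, is_m3_or_newer: bool):
        self.is_m3_or_newer = is_m3_or_newer

    def require_m3(self, feature: str) -> None:
        # Assumed behavior, following the documented RuntimeError on older devices.
        if not self.is_m3_or_newer:
            raise RuntimeError(f"{feature} requires an M3 or newer device")


def choose_copy_path(caps) -> str:
    """Pick the async path when the guard passes, else a synchronous fallback."""
    try:
        caps.require_m3("async_copy operations")
    except RuntimeError:
        return "sync_copy"
    return "async_copy"


print(choose_copy_path(FakeCaps(is_m3_or_newer=True)))   # async_copy
print(choose_copy_path(FakeCaps(is_m3_or_newer=False)))  # sync_copy
```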

Gating kernel paths

caps = rt.device_capabilities()
if caps.supports_simdgroup_matrix:
    compiled = enigma.compile(fast_gemm_kernel)
else:
    compiled = enigma.compile(naive_gemm_kernel)