Memory Model - Enigma

Metal defines four address spaces that differ in scope, mutability, and access cost. Enigma maps each one to distinct kernel operations.

Address spaces

Space	Metal keyword	Scope	Use
`device`	`device`	GPU-wide	Input/output buffers, default for kernel parameters
`constant`	`constant`	GPU-wide (read-only, cached)	Uniform data shared across all threads
`threadgroup`	`threadgroup`	One threadgroup	Shared memory for cooperation within a threadgroup
`thread`	`thread`	Single thread	Private register-level storage

Device memory

All kernel buffer parameters (A: enigma.f32, etc.) are in device memory. Indexed access emits global loads and stores:

val = A[tid]        # device load
C[tid] = val        # device store

With vec_width=4, Enigma emits device float4* paths where alignment allows, improving memory bandwidth utilization.
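
The alignment condition can be made concrete with a small host-side check. This is a plain-Python sketch, not part of Enigma: `can_use_float4` and `split_vector_and_tail` are hypothetical helpers, and the 16-byte requirement simply reflects the size of a Metal `float4`. How Enigma itself decides between the vector and scalar paths is not described here.

```python
FLOAT_BYTES = 4   # size of one 32-bit float
VEC_WIDTH = 4     # elements per float4

def can_use_float4(byte_offset, num_elements):
    """Return True if a buffer region can be read entirely as float4 values."""
    aligned = byte_offset % (FLOAT_BYTES * VEC_WIDTH) == 0
    whole_vectors = num_elements % VEC_WIDTH == 0
    return aligned and whole_vectors

def split_vector_and_tail(num_elements):
    """Split a length into (float4 accesses, leftover scalar accesses)."""
    return num_elements // VEC_WIDTH, num_elements % VEC_WIDTH
```

For example, a region of 1027 floats splits into 256 vector accesses plus a 3-element scalar tail, and a region starting at byte offset 4 fails the alignment test.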

Threadgroup (shared) memory

Threadgroup memory is a scratchpad shared by all threads in a threadgroup. It is fast, limited (typically 32 KB), and requires explicit synchronization. Allocate with enigma.threadgroup_alloc:

@enigma.kernel
def reduce(A: enigma.f32, Out: enigma.f32):
    local_id = enigma.thread_position_in_threadgroup("x")
    
    # Allocate 256 floats in shared memory
    scratch = enigma.threadgroup_alloc("float", 256)
    
    # Load from device into shared memory
    tid = enigma.thread_position_in_grid
    scratch[local_id] = A[tid]
    
    # Synchronize before reading neighbors
    enigma.barrier(mem_flags="mem_threadgroup")
    
    # Read the next thread's slot; wrap so the last thread stays in bounds
    neighbor = scratch[(local_id + 1) % 256]
    Out[tid] = neighbor

Key rules:

Always call enigma.barrier() between a write and a cross-thread read.
Allocation size is fixed at compile time.
The mem_flags argument controls which memory types the barrier synchronizes.
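
The first rule can be illustrated on the CPU. The sketch below uses Python threads and `threading.Barrier` as a stand-in for a threadgroup and its barrier. It is an analogy only: `run_group` is a hypothetical helper, no Enigma API is involved, and Python threads do not model GPU scheduling. Each worker writes its own slot, waits, then reads its neighbor's slot, wrapping the index so the last thread stays in bounds.

```python
import threading

def run_group(values):
    """Emulate one threadgroup: write a slot, synchronize, read the neighbor."""
    n = len(values)
    scratch = [None] * n              # plays the role of threadgroup memory
    out = [None] * n                  # plays the role of the device output
    barrier = threading.Barrier(n)    # releases only when all n threads arrive

    def worker(local_id):
        scratch[local_id] = values[local_id]          # write own slot
        barrier.wait()                                # all writes done before any read
        out[local_id] = scratch[(local_id + 1) % n]   # cross-thread read

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Without the `barrier.wait()` call, a worker could read a neighbor slot that still holds `None`, which is the CPU analogue of reading uninitialized threadgroup memory.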

Barrier flags

Flag	Synchronizes
`"mem_none"`	Execution order only
`"mem_device"`	Device (global) memory
`"mem_threadgroup"`	Threadgroup (shared) memory
`"mem_device_and_threadgroup"`	Both

enigma.barrier(mem_flags="mem_threadgroup")      # threadgroup barrier
enigma.simd_barrier(mem_flags="mem_threadgroup") # SIMD group barrier

Scalar parameters

enigma.Scalar(dtype) parameters are lowered as 1-element device buffers and auto-loaded at kernel entry. Use them for per-dispatch constants that vary across calls:

@enigma.kernel
def scale(A: enigma.f32, factor: enigma.Scalar(enigma.f32), C: enigma.f32):
    tid = enigma.thread_position_in_grid
    C[tid] = A[tid] * factor
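
The lowering described above can be pictured with a host-side emulation. This is a sketch in plain Python, not Enigma's generated code: `pack_scalar` and `scale_emulated` are hypothetical names, and the `array` module stands in for device buffers.

```python
from array import array

def pack_scalar(value):
    """Wrap a Python float in a 1-element float32 buffer."""
    return array("f", [value])

def scale_emulated(a_buf, factor_buf, c_buf):
    """Emulate the scale kernel: load the scalar once, then run every thread."""
    factor = factor_buf[0]             # the load performed at kernel entry
    for tid in range(len(a_buf)):      # one iteration per grid thread
        c_buf[tid] = a_buf[tid] * factor

a = array("f", [1.0, 2.0, 3.0])
c = array("f", [0.0] * len(a))
scale_emulated(a, pack_scalar(0.5), c)
```

Because the scalar lives in a buffer rather than in the kernel source, changing it between dispatches only means writing a new value into that buffer.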

Higher-level memory primitives

The DSL provides register-and-pipeline building blocks for staged compute patterns:

Primitive	Purpose
`enigma.register_tensor(...)`	Register-level tile allocation
`enigma.copy(src, dst, count=...)`	Explicit data movement
`enigma.pipeline(dtype, size, stages=...)`	Software pipeline staging
`enigma.async_copy_to_threadgroup(...)`	Async device → threadgroup copy

These are building blocks for advanced tiled GEMM and attention kernels. See Layout Algebra for the tiling workflow.
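
To show what software pipeline staging means in general, here is a sequential Python sketch of a multi-stage ring of staging buffers. It does not use `enigma.pipeline`; `pipelined_sum` is a hypothetical function, and treating `stages` as the number of staging slots in flight is an assumption about the concept, not a statement about Enigma's semantics.

```python
def pipelined_sum(data, tile, stages=2):
    """Sum data tile by tile, staging future tiles ahead of the compute step."""
    tiles = [data[i:i + tile] for i in range(0, len(data), tile)]
    slots = [None] * stages                    # ring of staging buffers

    # Prologue: stage the first (stages - 1) tiles before any compute.
    for k in range(min(stages - 1, len(tiles))):
        slots[k % stages] = tiles[k]

    total = 0
    for k in range(len(tiles)):
        ahead = k + stages - 1
        if ahead < len(tiles):
            slots[ahead % stages] = tiles[ahead]   # stage a future tile
        total += sum(slots[k % stages])            # compute on the current tile
    return total
```

On a GPU the staging copy and the compute step would overlap in time; here they run back to back, so the sketch only demonstrates the slot bookkeeping.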

Device capabilities

Query hardware limits before using feature-gated operations:

caps = rt.device_capabilities()  # rt: an existing Enigma runtime object (construction not shown here)

Capability fields

Field	Type	Description
`gpu_family`	`str`	Human-readable family, e.g. `"apple9"`
`gpu_family_raw`	`int`	Raw Metal GPU family integer
`is_m3_or_newer`	`bool`	Whether device is M3 or newer
`supports_async_copy`	`bool`	Async threadgroup copy support
`supports_simdgroup_matrix`	`bool`	8×8 simdgroup matrix multiply support
`simdgroup_size`	`int`	Typically 32
`max_threadgroup_memory`	`int`	Bytes of threadgroup memory
`max_threads_per_threadgroup`	`int`	Typically 1024
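
The size fields are most useful for validating a launch configuration before compiling. The sketch below uses a stand-in dataclass, `FakeCaps`, that copies three of the field names from the table; it is not the object returned by `device_capabilities()`, and `fits` and `DTYPE_BYTES` are hypothetical helpers.

```python
from dataclasses import dataclass

@dataclass
class FakeCaps:
    """Stand-in with a subset of the documented fields (values are the typical ones)."""
    simdgroup_size: int = 32
    max_threadgroup_memory: int = 32 * 1024
    max_threads_per_threadgroup: int = 1024

DTYPE_BYTES = {"float": 4, "half": 2}   # element sizes in bytes

def threadgroup_bytes(dtype, count):
    """Bytes needed for a threadgroup allocation of `count` elements."""
    return DTYPE_BYTES[dtype] * count

def fits(caps, dtype, count, threads):
    """Check an allocation and threadgroup size against the device limits."""
    return (threadgroup_bytes(dtype, count) <= caps.max_threadgroup_memory
            and threads <= caps.max_threads_per_threadgroup)
```

With the typical limits, the 256-float scratch buffer used earlier needs 1024 bytes, well under the 32 KB budget, while 8193 floats would exceed it.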

Requiring capabilities

caps.require_m3("async_copy operations")
# raises RuntimeError if device is older than M3
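
Because the check raises `RuntimeError`, a caller can also catch it and fall back instead of aborting. The sketch below uses a stand-in class, `FakeCaps`, whose `require_m3` imitates the documented behavior; the error message text and the `choose_copy_path` helper are assumptions made for illustration.

```python
class FakeCaps:
    """Stand-in that imitates the documented require_m3 behavior."""

    def __init__(self, is_m3_or_newer):
        self.is_m3_or_newer = is_m3_or_newer

    def require_m3(self, feature):
        if not self.is_m3_or_newer:
            # Message wording is illustrative, not Enigma's actual text.
            raise RuntimeError(f"{feature} requires an M3 or newer GPU")

def choose_copy_path(caps):
    """Pick the async path when allowed, otherwise fall back to a plain copy."""
    try:
        caps.require_m3("async_copy operations")
    except RuntimeError:
        return "sync_copy"
    return "async_copy"
```

Letting the exception propagate is the better choice when the kernel has no fallback; catching it suits code that ships both paths.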

Gating kernel paths

caps = rt.device_capabilities()
if caps.supports_simdgroup_matrix:
    compiled = enigma.compile(fast_gemm_kernel)
else:
    compiled = enigma.compile(naive_gemm_kernel)
