Address spaces
| Space | Metal keyword | Scope | Use |
|---|---|---|---|
| device | device | GPU-wide | Input/output buffers, default for kernel parameters |
| constant | constant | Read-only, cached | Uniform data shared across all threads |
| threadgroup | threadgroup | One threadgroup | Shared memory for cooperation within a threadgroup |
| thread | thread | Single thread | Private register-level storage |
Device memory
All kernel buffer parameters (A: enigma.f32, etc.) are in device memory. Indexed access emits global loads and stores.
With vec_width=4, Enigma emits device float4* paths where alignment allows, improving memory bandwidth utilization.
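The alignment condition can be sketched on the host side. The helper below is hypothetical and not part of Enigma; it assumes a float4 access needs a 16-byte-aligned byte offset and an element count divisible by the vector width. How Enigma actually decides when to take the vector path is not stated here.

```python
FLOAT32_BYTES = 4

def can_vectorize(byte_offset: int, n_elements: int, vec_width: int = 4) -> bool:
    """Return True if a float32 span could be accessed as whole vectors.

    Hypothetical check: assumes the vector type must be aligned to its
    own size (16 bytes for float4) and that no scalar tail remains.
    """
    vector_bytes = vec_width * FLOAT32_BYTES      # 16 bytes for float4
    aligned = byte_offset % vector_bytes == 0     # start address is vector-aligned
    whole_vectors = n_elements % vec_width == 0   # no leftover scalar elements
    return aligned and whole_vectors

print(can_vectorize(0, 1024))   # True
print(can_vectorize(4, 1024))   # False: offset is not 16-byte aligned
print(can_vectorize(0, 1023))   # False: one element short of a whole vector
```

A buffer that fails the check would fall back to scalar loads and stores, which is why padding sizes to a multiple of the vector width is often worthwhile.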
Threadgroup (shared) memory
Threadgroup memory is a scratchpad shared by all threads in a threadgroup. It is fast, limited (typically 32 KB), and requires explicit synchronization. Allocate it with enigma.threadgroup_alloc.
- Always call enigma.barrier() between a write and a cross-thread read.
- Allocation size is fixed at compile time.
- The mem_flags argument controls which memory types the barrier synchronizes.
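The write, barrier, cross-thread read rule can be illustrated with a CPU analogy. The sketch below uses Python's standard threading.Barrier, not the Enigma API: the shared list stands in for threadgroup memory and each Python thread stands in for one GPU thread.

```python
import threading

N_THREADS = 4
shared = [0] * N_THREADS           # stands in for threadgroup memory
results = [0] * N_THREADS
barrier = threading.Barrier(N_THREADS)

def worker(tid: int) -> None:
    shared[tid] = tid * tid        # each thread writes its own slot
    barrier.wait()                 # every write completes before any cross-thread read
    results[tid] = shared[(tid + 1) % N_THREADS]  # read a neighbour's slot

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [1, 4, 9, 0]
```

Without the barrier.wait() call, a thread could read its neighbour's slot before the neighbour has written it, which is the same hazard the barrier guards against in a kernel.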
Barrier flags
| Flag | Synchronizes |
|---|---|
| "mem_none" | Execution order only |
| "mem_device" | Device (global) memory |
| "mem_threadgroup" | Threadgroup (shared) memory |
| "mem_device_and_threadgroup" | Both |
Scalar parameters
enigma.Scalar(dtype) parameters are lowered as 1-element device buffers and auto-loaded at kernel entry. Use them for per-dispatch constants that vary across calls.
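To make the lowering concrete, the sketch below models a scalar as a 4-byte, 1-element float32 buffer using Python's struct module. It is a host-side analogy only: the helper names are hypothetical, and the little-endian float32 layout is an assumption rather than a statement of how Enigma packs the buffer.

```python
import struct

def pack_scalar(value: float) -> bytes:
    """Pack one float32 into a 1-element buffer (assumed little-endian)."""
    return struct.pack("<f", value)

def load_scalar(buffer: bytes) -> float:
    """Read element 0 back, as a kernel prologue would load it once at entry."""
    return struct.unpack("<f", buffer)[0]

alpha_buffer = pack_scalar(0.5)
print(len(alpha_buffer))          # 4
print(load_scalar(alpha_buffer))  # 0.5
```

Because the value lives in a buffer rather than in the compiled kernel, changing it between dispatches only requires rewriting four bytes.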
Higher-level memory primitives
The DSL provides register-and-pipeline building blocks for staged compute patterns:
| Primitive | Purpose |
|---|---|
| enigma.register_tensor(...) | Register-level tile allocation |
| enigma.copy(src, dst, count=...) | Explicit data movement |
| enigma.pipeline(dtype, size, stages=...) | Software pipeline staging |
| enigma.async_copy_to_threadgroup(...) | Async device → threadgroup copy |
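The general idea behind software pipeline staging can be sketched in plain Python. This is a generic multi-buffering pattern, not Enigma's implementation: the function name is hypothetical, list(tile) simulates a copy into a stage buffer, and sum simulates the compute step.

```python
from collections import deque

def pipelined_sum(tiles, stages: int = 2) -> int:
    """Sum all tiles while keeping up to `stages` tiles in flight."""
    in_flight = deque()
    tile_iter = iter(tiles)
    total = 0

    # Prologue: pre-load stages - 1 tiles before any compute starts.
    for _ in range(stages - 1):
        tile = next(tile_iter, None)
        if tile is None:
            break
        in_flight.append(list(tile))       # simulated copy into a stage buffer

    # Steady state: issue the next copy, then compute on the oldest tile.
    for tile in tile_iter:
        in_flight.append(list(tile))
        total += sum(in_flight.popleft())

    # Epilogue: drain the tiles still in flight.
    while in_flight:
        total += sum(in_flight.popleft())
    return total

print(pipelined_sum([[1, 2], [3, 4], [5, 6]]))  # 21
```

On a GPU the copy and the compute would overlap in time; here they run sequentially, so the sketch shows only the ordering of loads and computes across stages.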
Device capabilities
Query hardware limits before using feature-gated operations.
Capability fields
| Field | Type | Description |
|---|---|---|
| gpu_family | str | Human-readable family, e.g. "apple9" |
| gpu_family_raw | int | Raw Metal GPU family integer |
| is_m3_or_newer | bool | Whether device is M3 or newer |
| supports_async_copy | bool | Async threadgroup copy support |
| supports_simdgroup_matrix | bool | 8×8 simdgroup matrix multiply support |
| simdgroup_size | int | Typically 32 |
| max_threadgroup_memory | int | Bytes of threadgroup memory |
| max_threads_per_threadgroup | int | Typically 1024 |
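A minimal sketch of capability gating, assuming the fields in the table above. The Caps dataclass is a hypothetical stand-in for whatever object the real query returns, and the helper names and path labels are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Caps:
    """Hypothetical stand-in that reuses field names from the capability table."""
    supports_async_copy: bool
    supports_simdgroup_matrix: bool
    max_threadgroup_memory: int   # bytes

def tile_fits(caps: Caps, rows: int, cols: int, dtype_bytes: int = 4) -> bool:
    """Check that a rows x cols tile fits in the threadgroup memory budget."""
    return rows * cols * dtype_bytes <= caps.max_threadgroup_memory

def pick_copy_path(caps: Caps) -> str:
    """Choose the async copy path only when the device reports support for it."""
    return "async_copy" if caps.supports_async_copy else "manual_copy"

caps = Caps(
    supports_async_copy=False,
    supports_simdgroup_matrix=True,
    max_threadgroup_memory=32 * 1024,
)
print(tile_fits(caps, 64, 64))     # True: 16384 bytes
print(tile_fits(caps, 128, 128))   # False: 65536 bytes
print(pick_copy_path(caps))        # manual_copy
```

Since threadgroup allocation size is fixed at compile time, a check like tile_fits belongs before kernel compilation rather than at dispatch.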
