# Memory Model

> Address spaces, shared memory, and device capability checks in Enigma.

Metal defines four address spaces that differ in scope, capacity, and access cost. Enigma maps each one to distinct kernel operations.

## Address spaces

| Space         | Metal keyword | Scope               | Use                                                 |
| ------------- | ------------- | ------------------- | --------------------------------------------------- |
| `device`      | `device`      | GPU-wide            | Input/output buffers, default for kernel parameters |
| `constant`    | `constant`    | GPU-wide, read-only | Cached uniform data shared across all threads       |
| `threadgroup` | `threadgroup` | One threadgroup     | Shared memory for cooperation within a threadgroup  |
| `thread`      | `thread`      | Single thread       | Private register-level storage                      |

### Device memory

All kernel buffer parameters (`A: enigma.f32`, etc.) are in device memory. Indexed access emits global loads and stores:

```python
val = A[tid]        # device load
C[tid] = val        # device store
```

With `vec_width=4`, Enigma emits `device float4*` paths where alignment allows, improving memory bandwidth utilization.
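To make "where alignment allows" concrete: a `float4` access needs a 16-byte-aligned address, and any elements that do not fill a full group of four need a scalar tail. The plain-Python sketch below models that split. The helper name `split_for_float4` is hypothetical and the all-scalar fallback is an assumption; it does not describe Enigma's actual lowering.

```python
FLOAT_BYTES = 4                       # size of one 32-bit float
VEC_WIDTH = 4                         # lanes in a float4
VEC_BYTES = FLOAT_BYTES * VEC_WIDTH   # float4 needs 16-byte alignment


def split_for_float4(byte_offset: int, n_elements: int) -> tuple[int, int]:
    """Return (vector_groups, scalar_tail) for a buffer slice.

    Hypothetical helper: assumes an all-scalar fallback when the
    starting offset is not 16-byte aligned.
    """
    if byte_offset % VEC_BYTES != 0:
        return 0, n_elements
    return n_elements // VEC_WIDTH, n_elements % VEC_WIDTH


print(split_for_float4(0, 1027))   # (256, 3): aligned start, 3-element tail
print(split_for_float4(4, 1024))   # (0, 1024): misaligned start, scalar only
```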

### Threadgroup (shared) memory

Threadgroup memory is a scratchpad shared by all threads in a threadgroup. It is fast, limited (typically 32 KB), and requires explicit synchronization.
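Because the budget is small, it helps to add up allocation sizes before writing a kernel. The sketch below is plain Python arithmetic, not part of Enigma: the helper names are hypothetical, and the 32 KB default is the typical figure quoted above rather than a queried device limit.

```python
# Element sizes in bytes for a few Metal scalar types.
DTYPE_BYTES = {"half": 2, "float": 4, "int": 4, "uint": 4}


def threadgroup_bytes(dtype: str, count: int) -> int:
    """Bytes consumed by one threadgroup allocation of `count` elements."""
    return DTYPE_BYTES[dtype] * count


def fits(allocations, limit_bytes: int = 32 * 1024) -> bool:
    """True if the combined allocations stay within the threadgroup limit."""
    total = sum(threadgroup_bytes(dtype, n) for dtype, n in allocations)
    return total <= limit_bytes


print(threadgroup_bytes("float", 256))   # 1024
print(fits([("float", 256)]))            # True
print(fits([("float", 8193)]))           # False: 32772 bytes exceeds 32768
```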

Allocate with `enigma.threadgroup_alloc`:

```python
@enigma.kernel
def reduce(A: enigma.f32, Out: enigma.f32):
    local_id = enigma.thread_position_in_threadgroup("x")
    
    # Allocate 256 floats in shared memory
    scratch = enigma.threadgroup_alloc("float", 256)
    
    # Load from device into shared memory
    tid = enigma.thread_position_in_grid
    scratch[local_id] = A[tid]
    
    # Synchronize before reading neighbors
    enigma.barrier(mem_flags="mem_threadgroup")
    
    # Read the next thread's value; wrap so the last thread stays in bounds
    neighbor = scratch[(local_id + 1) % 256]
    Out[tid] = neighbor
```

Key rules:

* Always call `enigma.barrier()` between a write and a cross-thread read.
* Allocation size is fixed at compile time.
* The `mem_flags` argument controls which memory types the barrier synchronizes.
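The first rule is easiest to see in a tree reduction, where every step reads slots that other threads wrote in the previous step. The following is a CPU model in plain Python, not Enigma code: the inner loop stands in for threads running in parallel, and the boundary between outer iterations marks where a real kernel would need a threadgroup barrier.

```python
def threadgroup_reduce(values):
    """Sum `values` the way a threadgroup tree reduction would.

    Assumes len(values) is a power of two.
    """
    scratch = list(values)           # stands in for threadgroup memory
    stride = len(scratch) // 2
    while stride > 0:
        # Each "thread" with local_id < stride reads a slot that a
        # different thread wrote during the previous phase.
        for local_id in range(stride):
            scratch[local_id] += scratch[local_id + stride]
        # A real kernel needs a threadgroup barrier here, before the
        # next phase reads the values just written.
        stride //= 2
    return scratch[0]


print(threadgroup_reduce(range(256)))   # 32640
```

Without the barrier between phases, a thread could read a slot before its neighbor has finished writing it, and the result would depend on scheduling.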

### Barrier flags

| Flag                           | Synchronizes                |
| ------------------------------ | --------------------------- |
| `"mem_none"`                   | Execution order only        |
| `"mem_device"`                 | Device (global) memory      |
| `"mem_threadgroup"`            | Threadgroup (shared) memory |
| `"mem_device_and_threadgroup"` | Both                        |

```python
enigma.barrier(mem_flags="mem_threadgroup")      # threadgroup barrier
enigma.simd_barrier(mem_flags="mem_threadgroup") # SIMD group barrier
```
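One way to pick a flag is to choose the narrowest one that covers the memory written before the barrier. The helper below is a hypothetical convenience written for illustration, not an Enigma function; it only returns the flag strings listed in the table.

```python
def choose_mem_flags(wrote_device: bool, wrote_threadgroup: bool) -> str:
    """Return the narrowest barrier flag covering the preceding writes."""
    if wrote_device and wrote_threadgroup:
        return "mem_device_and_threadgroup"
    if wrote_device:
        return "mem_device"
    if wrote_threadgroup:
        return "mem_threadgroup"
    return "mem_none"   # execution-order synchronization only


print(choose_mem_flags(wrote_device=False, wrote_threadgroup=True))
# mem_threadgroup
```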

### Scalar parameters

`enigma.Scalar(dtype)` parameters are lowered as 1-element device buffers and auto-loaded at kernel entry. Use them for per-dispatch constants that vary across calls:

```python
@enigma.kernel
def scale(A: enigma.f32, factor: enigma.Scalar(enigma.f32), C: enigma.f32):
    tid = enigma.thread_position_in_grid
    C[tid] = A[tid] * factor
```
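As a rough host-side picture of that lowering, the sketch below wraps a Python float in a 1-element 32-bit float buffer and reads it once before the element-wise work. It uses only the standard library; `pack_scalar_f32` and `scale_reference` are hypothetical names, and how Enigma actually builds the buffer is not specified here.

```python
from array import array


def pack_scalar_f32(value: float) -> array:
    """Wrap a Python float in a 1-element float32 buffer."""
    return array("f", [value])


def scale_reference(a, factor_buf):
    """CPU reference for the `scale` kernel."""
    factor = factor_buf[0]            # single load, as at kernel entry
    return [x * factor for x in a]


factor_buf = pack_scalar_f32(0.5)
print(len(factor_buf) * factor_buf.itemsize)          # 4 (bytes)
print(scale_reference([2.0, 4.0, 6.0], factor_buf))   # [1.0, 2.0, 3.0]
```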

## Higher-level memory primitives

The DSL also provides register-level and pipelining primitives for staged compute patterns:

| Primitive                                  | Purpose                         |
| ------------------------------------------ | ------------------------------- |
| `enigma.register_tensor(...)`              | Register-level tile allocation  |
| `enigma.copy(src, dst, count=...)`         | Explicit data movement          |
| `enigma.pipeline(dtype, size, stages=...)` | Software pipeline staging       |
| `enigma.async_copy_to_threadgroup(...)`    | Async device → threadgroup copy |

These are building blocks for advanced tiled GEMM and attention kernels. See [Layout Algebra](/programming-guide/layout-algebra) for the tiling workflow.
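For readers new to software pipelining, the idea is to keep a small ring of staging buffers so that a later tile is fetched while the current one is consumed. The sketch below is a sequential CPU model of that schedule in plain Python. It says nothing about the actual semantics of `enigma.pipeline` or `enigma.async_copy_to_threadgroup`, and `pipelined_sum` is a hypothetical name.

```python
def pipelined_sum(data, tile_size, stages=2):
    """Sum `data` tile by tile using a ring of `stages` staging buffers.

    CPU model only: the "prefetch" is an ordinary list copy.
    """
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    ring = [None] * stages

    # Prologue: fill all but one stage before any compute starts.
    for t in range(min(stages - 1, len(tiles))):
        ring[t % stages] = list(tiles[t])

    total = 0
    for t in range(len(tiles)):
        nxt = t + stages - 1
        if nxt < len(tiles):
            # Prefetch a later tile into the slot consumed last iteration.
            ring[nxt % stages] = list(tiles[nxt])
        total += sum(ring[t % stages])   # compute on the current tile
    return total


print(pipelined_sum(list(range(100)), tile_size=16))             # 4950
print(pipelined_sum(list(range(100)), tile_size=16, stages=3))   # 4950
```

On a GPU the prefetch and the compute overlap in time; here they run back to back, so the model shows only the buffer rotation, not the latency hiding.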

## Device capabilities

Query hardware limits before using feature-gated operations:

```python
caps = rt.device_capabilities()
```

### Capability fields

| Field                         | Type   | Description                                     |
| ----------------------------- | ------ | ----------------------------------------------- |
| `gpu_family`                  | `str`  | Human-readable family, e.g. `"apple9"`          |
| `gpu_family_raw`              | `int`  | Raw Metal GPU family integer                    |
| `is_m3_or_newer`              | `bool` | Whether the device is M3 or newer               |
| `supports_async_copy`         | `bool` | Async threadgroup copy support                  |
| `supports_simdgroup_matrix`   | `bool` | 8×8 simdgroup matrix multiply support           |
| `simdgroup_size`              | `int`  | Threads per SIMD group, typically 32            |
| `max_threadgroup_memory`      | `int`  | Maximum threadgroup memory, in bytes            |
| `max_threads_per_threadgroup` | `int`  | Maximum threads per threadgroup, typically 1024 |

### Requiring capabilities

```python
caps.require_m3("async_copy operations")
# raises RuntimeError if device is older than M3
```
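Since the failure surfaces as a `RuntimeError`, a caller can catch it and fall back instead of aborting. The sketch below uses a stub class so it runs without a GPU. `StubCaps`, `choose_copy_path`, and the error message text are hypothetical; only the `require_m3` name and the exception type come from the example above.

```python
class StubCaps:
    """Hypothetical stand-in for the object returned by rt.device_capabilities()."""

    def __init__(self, is_m3_or_newer: bool):
        self.is_m3_or_newer = is_m3_or_newer

    def require_m3(self, feature: str) -> None:
        # Assumed behavior: raise when the device predates M3.
        if not self.is_m3_or_newer:
            raise RuntimeError(f"{feature} requires an M3-or-newer device")


def choose_copy_path(caps) -> str:
    """Return "async" when the M3 requirement holds, else "sync"."""
    try:
        caps.require_m3("async_copy operations")
    except RuntimeError:
        return "sync"
    return "async"


print(choose_copy_path(StubCaps(is_m3_or_newer=True)))    # async
print(choose_copy_path(StubCaps(is_m3_or_newer=False)))   # sync
```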

### Gating kernel paths

```python
caps = rt.device_capabilities()
if caps.supports_simdgroup_matrix:
    compiled = enigma.compile(fast_gemm_kernel)
else:
    compiled = enigma.compile(naive_gemm_kernel)
```
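Gating often depends on more than one field, for example whether the fast path's tile also fits in threadgroup memory. The sketch below combines three of the fields from the table, using a stub dataclass so the selection logic can be tested off-device. `StubCaps` and `select_gemm_path` are hypothetical, and the decision rule is an illustration rather than Enigma's own policy.

```python
from dataclasses import dataclass


@dataclass
class StubCaps:
    """Hypothetical stand-in exposing a subset of the capability fields."""
    supports_simdgroup_matrix: bool
    max_threadgroup_memory: int
    max_threads_per_threadgroup: int


def select_gemm_path(caps, tile_bytes: int, threads: int) -> str:
    """Pick "fast" or "naive" from the device limits."""
    if threads > caps.max_threads_per_threadgroup:
        raise ValueError("threadgroup size exceeds the device limit")
    if caps.supports_simdgroup_matrix and tile_bytes <= caps.max_threadgroup_memory:
        return "fast"
    return "naive"


caps = StubCaps(
    supports_simdgroup_matrix=True,
    max_threadgroup_memory=32 * 1024,
    max_threads_per_threadgroup=1024,
)
print(select_gemm_path(caps, tile_bytes=16 * 1024, threads=256))   # fast
print(select_gemm_path(caps, tile_bytes=48 * 1024, threads=256))   # naive
```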
