# MetalRuntime

> Full reference for MetalRuntime, PreparedKernel, and DeviceCapabilities.

`MetalRuntime` manages the Metal device, command queue, and buffer lifecycle. It dispatches compiled kernels and returns output data.

## Constructor

```python theme={null}
rt = enigma.MetalRuntime(dylib_path: str | None = None)
```

`dylib_path` overrides the default path to the Swift runtime dylib. Leave it `None` in almost all cases.

***

## `execute()`

One-shot dispatch. Allocates buffers, runs the kernel, reads output, and returns raw bytes.

```python theme={null}
raw: bytes = rt.execute(
    compiled: CompiledKernel,
    inputs: list[np.ndarray],
    output_size: int,
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)
```

### Parameters

| Parameter     | Type                   | Description                                 |
| ------------- | ---------------------- | ------------------------------------------- |
| `compiled`    | `CompiledKernel`       | Compiled kernel from `enigma.compile()`     |
| `inputs`      | `list[np.ndarray]`     | Input buffers in kernel parameter order     |
| `output_size` | `int`                  | Output buffer size in **bytes**             |
| `grid`        | `tuple[int, int, int]` | `(gx, gy, gz)`: total threads per dimension |
| `threads`     | `tuple[int, int, int]` | `(tx, ty, tz)`: threads per threadgroup     |

### Returns

Raw `bytes` of the output buffer. Convert with `np.frombuffer`:

```python theme={null}
raw = rt.execute(compiled, [a, b], n * 4, (n, 1, 1), (256, 1, 1))
out = np.frombuffer(raw, dtype=np.float32)
```
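Because `output_size` is given in bytes rather than elements, it can help to compute it from the output shape and element width. The following is a minimal sketch using only the standard library; `output_size_bytes` and `element_count` are hypothetical helper names, not part of the `enigma` API, and the `bytes(size)` value merely stands in for the buffer that `rt.execute()` would return.

```python
import math


def output_size_bytes(shape, itemsize):
    """Return the byte size of a dense output with the given shape."""
    if itemsize <= 0 or any(dim < 0 for dim in shape):
        raise ValueError("shape dims must be >= 0 and itemsize must be > 0")
    return math.prod(shape) * itemsize


def element_count(raw, itemsize):
    """Return how many elements of `itemsize` bytes the buffer `raw` holds."""
    if len(raw) % itemsize != 0:
        raise ValueError(f"{len(raw)} bytes is not a multiple of {itemsize}")
    return len(raw) // itemsize


n = 1024
size = output_size_bytes((n,), 4)  # float32 uses 4 bytes per element

# On a real device (assumed call, as documented above):
# raw = rt.execute(compiled, [a, b], size, (n, 1, 1), (256, 1, 1))
raw = bytes(size)  # stand-in for the returned buffer

print(size, element_count(raw, 4))
```

Checking the element count before calling `np.frombuffer` catches a mismatched `output_size` or `dtype` early.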

***

## `prepare()`

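Pre-allocates Metal buffers for repeated dispatch and returns a `PreparedKernel`. Use this in hot loops to avoid per-call allocation overhead. The `compiled`, `inputs`, and `output_size` arguments mirror those of `execute()`.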

```python theme={null}
prepared: PreparedKernel = rt.prepare(
    compiled: CompiledKernel,
    inputs: list[np.ndarray],
    output_size: int,
)
```

***

## `device_capabilities()`

Query hardware features of the active Metal device.

```python theme={null}
caps: DeviceCapabilities = rt.device_capabilities()
```

See [`DeviceCapabilities`](#devicecapabilities) below.

***

## `PreparedKernel`

Returned by `rt.prepare(...)`. Holds pre-allocated Metal buffers.

### `dispatch()`

```python theme={null}
prepared.dispatch(
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)
```

Dispatches the kernel without returning output. Call `read_output()` afterward to retrieve the result.

### `dispatch_timed()`

```python theme={null}
elapsed_ms: float = prepared.dispatch_timed(
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)
```

Dispatches the kernel and returns wall-clock time in milliseconds.

### `read_output()`

```python theme={null}
raw: bytes = prepared.read_output()
```

Reads the output buffer back to host memory and returns it as raw `bytes`.

### Example: benchmarking loop

```python theme={null}
prepared = rt.prepare(compiled, [a, b], output_size=n * 4)

# Warm up
prepared.dispatch(grid=(n, 1, 1), threads=(256, 1, 1))

# Timed runs
times = [prepared.dispatch_timed(grid=(n, 1, 1), threads=(256, 1, 1)) for _ in range(10)]
print(f"mean: {sum(times)/len(times):.3f} ms")

result = np.frombuffer(prepared.read_output(), dtype=np.float32)
```
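A mean over ten runs can be skewed by a single slow dispatch, so reporting the minimum and median alongside it is often more informative. The sketch below wraps the loop above in a hypothetical `benchmark` helper (not part of `enigma`) that accepts any callable with the `dispatch_timed(grid=..., threads=...)` shape; `fake_dispatch_timed` is a stand-in so the snippet runs without a Metal device.

```python
import statistics


def benchmark(dispatch_timed, grid, threads, warmup=3, runs=10):
    """Run warm-up dispatches, then summarize `runs` timed dispatches (ms)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        dispatch_timed(grid=grid, threads=threads)  # result discarded
    times = [dispatch_timed(grid=grid, threads=threads) for _ in range(runs)]
    return {
        "min": min(times),
        "median": statistics.median(times),
        "mean": statistics.fmean(times),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }


def fake_dispatch_timed(grid, threads):
    """Stand-in for prepared.dispatch_timed; always reports 0.5 ms."""
    return 0.5


stats = benchmark(fake_dispatch_timed, grid=(1024, 1, 1), threads=(256, 1, 1))
print(f"min: {stats['min']:.3f} ms, median: {stats['median']:.3f} ms")
```

With a real kernel, pass `prepared.dispatch_timed` in place of the stand-in.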

***

## `DeviceCapabilities`

### Fields

| Field                         | Type   | Description                                  |
| ----------------------------- | ------ | -------------------------------------------- |
| `gpu_family`                  | `str`  | Human-readable GPU family, e.g. `"apple9"`   |
| `gpu_family_raw`              | `int`  | Metal GPU family integer                     |
| `is_m3_or_newer`              | `bool` | `True` if the device is an M3 or newer       |
| `supports_async_copy`         | `bool` | Async threadgroup copy support (M3+)         |
| `supports_simdgroup_matrix`   | `bool` | 8×8 simdgroup matrix multiply                |
| `simdgroup_size`              | `int`  | SIMD group size (typically 32)               |
| `max_threadgroup_memory`      | `int`  | Max threadgroup memory in bytes              |
| `max_threads_per_threadgroup` | `int`  | Max threads per threadgroup (typically 1024) |

### `require_m3()`

```python theme={null}
caps.require_m3(feature_description: str)
```

Raises `RuntimeError` if the device is older than M3. Pass a description of what you're gating:

```python theme={null}
caps.require_m3("async copy operations require M3 or newer")
```
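Since `require_m3()` raises rather than returning a flag, code that prefers a fallback over a hard failure can catch the `RuntimeError`. The sketch below illustrates that pattern. `StandInCaps` is a hypothetical stand-in that only imitates the documented behavior (its error message is an assumption), and `choose_copy_path` is an illustrative name.

```python
class StandInCaps:
    """Hypothetical stand-in for DeviceCapabilities, for illustration only."""

    def __init__(self, is_m3_or_newer):
        self.is_m3_or_newer = is_m3_or_newer

    def require_m3(self, feature_description):
        # Imitates the documented behavior: raise RuntimeError below M3.
        if not self.is_m3_or_newer:
            raise RuntimeError(feature_description)


def choose_copy_path(caps):
    """Return "async" when the M3 gate passes, otherwise "sync"."""
    try:
        caps.require_m3("async copy operations require M3 or newer")
    except RuntimeError:
        return "sync"
    return "async"


print(choose_copy_path(StandInCaps(is_m3_or_newer=True)))
print(choose_copy_path(StandInCaps(is_m3_or_newer=False)))
```

When no fallback exists, letting the `RuntimeError` propagate is usually the clearer choice.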

### Example

```python theme={null}
caps = rt.device_capabilities()
print(f"GPU: {caps.gpu_family}, SIMD size: {caps.simdgroup_size}")

if caps.supports_simdgroup_matrix:
    compiled = enigma.compile(fast_gemm)
else:
    compiled = enigma.compile(fallback_gemm)
```
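The capability fields can also inform launch geometry. A common Metal practice is to make the threadgroup width a multiple of the SIMD group size while staying under the per-threadgroup limit. The sketch below shows one way to derive such a width for a 1-D launch. `pick_threadgroup_width` is a hypothetical helper, and whether `enigma` requires or benefits from this rounding is an assumption.

```python
def pick_threadgroup_width(n, simdgroup_size, max_threads):
    """Pick a 1-D threadgroup width for `n` total threads.

    The result is a multiple of `simdgroup_size` and never exceeds
    `max_threads`.
    """
    if n < 1 or simdgroup_size < 1 or max_threads < simdgroup_size:
        raise ValueError("need n >= 1 and max_threads >= simdgroup_size >= 1")
    # Round n up to a whole number of SIMD groups.
    needed = -(-n // simdgroup_size) * simdgroup_size
    # Largest multiple of the SIMD group size within the device limit.
    cap = (max_threads // simdgroup_size) * simdgroup_size
    return min(needed, cap)


# With a real device, the last two arguments would come from
# caps.simdgroup_size and caps.max_threads_per_threadgroup.
width = pick_threadgroup_width(1_000_000, 32, 1024)
print(f"threads=({width}, 1, 1)")
```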

***

## Dispatch errors

When dispatch fails, `rt.execute()` raises a `RuntimeError` containing:

* Kernel name
* `grid` and `threads` used
* The underlying Metal error code and description

Check the kernel name and geometry first — most dispatch failures are launch-shape mismatches.
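Some launch-shape problems can be caught on the host before dispatching. The sketch below checks that both tuples have three positive integer dimensions and that the threadgroup does not exceed the device's thread limit. `validate_launch` is a hypothetical helper, not part of `enigma`, and the default of 1024 is the typical value from the `DeviceCapabilities` table rather than a guaranteed limit.

```python
def validate_launch(grid, threads, max_threads_per_threadgroup=1024):
    """Return a list of launch-shape problems; an empty list means none found."""
    problems = []
    for name, shape in (("grid", grid), ("threads", threads)):
        if len(shape) != 3:
            problems.append(f"{name} must have 3 dimensions, got {len(shape)}")
        elif any(not isinstance(d, int) or d < 1 for d in shape):
            problems.append(f"{name} dimensions must be positive integers: {shape}")
    if not problems:
        total = threads[0] * threads[1] * threads[2]
        if total > max_threads_per_threadgroup:
            problems.append(
                f"threads {threads} uses {total} threads per threadgroup, "
                f"limit is {max_threads_per_threadgroup}"
            )
    return problems


for issue in validate_launch((1024, 1, 1), (64, 64, 1)):
    print(issue)
```

On a real device, pass `caps.max_threads_per_threadgroup` as the limit instead of relying on the default.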
