MetalRuntime - Enigma

MetalRuntime manages the Metal device, command queue, and buffer lifecycle. It dispatches compiled kernels and returns output data.

Constructor

rt = enigma.MetalRuntime(dylib_path: str | None = None)

dylib_path overrides the default path to the Swift runtime dylib. Leave it None in almost all cases.

`execute()`

One-shot dispatch. Allocates buffers, runs the kernel, reads output, and returns raw bytes.

raw: bytes = rt.execute(
    compiled: CompiledKernel,
    inputs: list[np.ndarray],
    output_size: int,
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)

Parameters

Parameter	Type	Description
`compiled`	`CompiledKernel`	Compiled kernel from `enigma.compile()`
`inputs`	`list[np.ndarray]`	Input buffers in kernel parameter order
`output_size`	`int`	Output buffer size in bytes
`grid`	`tuple[int, int, int]`	`(gx, gy, gz)`: total threads along each axis
`threads`	`tuple[int, int, int]`	`(tx, ty, tz)`: threads per threadgroup along each axis
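The three sizing arguments are easy to get out of step with each other, so it can help to derive them from one place. The sketch below assumes a kernel that writes one output element per thread over a flat index space; `LaunchPlan` and `plan_elementwise` are hypothetical names for illustration and are not part of Enigma.

```python
from math import prod
from typing import NamedTuple


class LaunchPlan(NamedTuple):
    """Hypothetical container for the sizing arguments of rt.execute()."""
    output_size: int               # bytes
    grid: tuple[int, int, int]     # total threads
    threads: tuple[int, int, int]  # threads per threadgroup


def plan_elementwise(shape, itemsize=4, threads_per_group=256):
    """Derive output_size, grid, and threads for a one-thread-per-element kernel.

    Assumes a 1D launch; itemsize=4 corresponds to float32.
    """
    n = prod(shape)
    if n < 1:
        raise ValueError(f"shape {shape!r} has no elements")
    return LaunchPlan(
        output_size=n * itemsize,
        grid=(n, 1, 1),
        threads=(threads_per_group, 1, 1),
    )


plan = plan_elementwise((1000,))
print(plan.output_size, plan.grid, plan.threads)
```

The fields can then be passed on as `rt.execute(compiled, inputs, plan.output_size, plan.grid, plan.threads)`.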

Returns

Raw bytes of the output buffer. Convert with np.frombuffer:

raw = rt.execute(compiled, [a, b], n * 4, (n, 1, 1), (256, 1, 1))
out = np.frombuffer(raw, dtype=np.float32)
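Because `np.frombuffer` infers the element count from the buffer length, a wrong `output_size` produces an array of the wrong length without raising. A minimal sketch of a length check to run before decoding follows; `check_output` is a hypothetical helper, not an Enigma function.

```python
def check_output(raw: bytes, count: int, itemsize: int = 4) -> bytes:
    """Return raw unchanged if it holds exactly `count` elements of `itemsize` bytes.

    Hypothetical helper; itemsize=4 corresponds to float32.
    """
    expected = count * itemsize
    if len(raw) != expected:
        raise ValueError(
            f"output buffer is {len(raw)} bytes, expected {expected} "
            f"({count} elements x {itemsize} bytes)"
        )
    return raw


# Example with a stand-in buffer of four float32 zeros.
print(len(check_output(bytes(16), count=4)))
```

With a real result this would read `np.frombuffer(check_output(raw, n), dtype=np.float32)`. An array created by `np.frombuffer` over a `bytes` object is read-only, so call `.copy()` on it if you need to modify the values.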

`prepare()`

Pre-allocate Metal buffers for repeated dispatch. Use this in hot loops to avoid per-call allocation overhead.

prepared: PreparedKernel = rt.prepare(
    compiled: CompiledKernel,
    inputs: list[np.ndarray],
    output_size: int,
)

`device_capabilities()`

Query hardware features of the active Metal device.

caps: DeviceCapabilities = rt.device_capabilities()

See DeviceCapabilities below.

`PreparedKernel`

Returned by rt.prepare(...). Holds pre-allocated Metal buffers.

`dispatch()`

prepared.dispatch(
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)

Dispatches the kernel. Does not return output — call read_output() afterward.

`dispatch_timed()`

elapsed_ms: float = prepared.dispatch_timed(
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)

Dispatches the kernel and returns wall-clock time in milliseconds.

`read_output()`

raw: bytes = prepared.read_output()

Reads the output buffer back to host memory.

Example: benchmarking loop

prepared = rt.prepare(compiled, [a, b], output_size=n * 4)

# Warm up
prepared.dispatch(grid=(n, 1, 1), threads=(256, 1, 1))

# Timed runs
times = [prepared.dispatch_timed(grid=(n, 1, 1), threads=(256, 1, 1)) for _ in range(10)]
print(f"mean: {sum(times)/len(times):.3f} ms")

result = np.frombuffer(prepared.read_output(), dtype=np.float32)
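A mean over a few runs is sensitive to a single slow dispatch, so reporting the minimum and median alongside it is often more informative. The sketch below wraps the loop above in a hypothetical `benchmark` helper; it relies only on the `dispatch()` and `dispatch_timed()` methods documented here, and the `warmup` and `runs` defaults are arbitrary choices.

```python
import statistics


def benchmark(prepared, grid, threads, warmup=1, runs=10):
    """Run warm-up dispatches, then timed dispatches, and summarize in ms.

    `prepared` is expected to be a PreparedKernel (or any object with the
    same dispatch() / dispatch_timed() methods).
    """
    for _ in range(warmup):
        prepared.dispatch(grid=grid, threads=threads)
    times = [
        prepared.dispatch_timed(grid=grid, threads=threads)
        for _ in range(runs)
    ]
    return {
        "min": min(times),
        "median": statistics.median(times),
        "mean": statistics.fmean(times),
    }
```

Usage would look like `stats = benchmark(prepared, (n, 1, 1), (256, 1, 1))` followed by `print(stats)`.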

`DeviceCapabilities`

Fields

Field	Type	Description
`gpu_family`	`str`	Human-readable GPU family, e.g. `"apple9"`
`gpu_family_raw`	`int`	Metal GPU family integer
`is_m3_or_newer`	`bool`	True if the device is an M3 chip or newer
`supports_async_copy`	`bool`	Async threadgroup copy support (M3+)
`supports_simdgroup_matrix`	`bool`	8×8 simdgroup matrix multiply
`simdgroup_size`	`int`	SIMD group size (typically 32)
`max_threadgroup_memory`	`int`	Max threadgroup memory in bytes
`max_threads_per_threadgroup`	`int`	Max threads per threadgroup (typically 1024)

`require_m3()`

caps.require_m3(feature_description: str)

Raises RuntimeError if the device is older than M3. Pass a description of what you’re gating:

caps.require_m3("async copy operations require M3 or newer")
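`require_m3()` suits hard requirements. Where an older device should take a slower path instead of failing, the `RuntimeError` can be caught. The sketch below shows one way to do that; `choose_variant` and its parameters are hypothetical, and checking `caps.is_m3_or_newer` directly would work equally well.

```python
def choose_variant(caps, feature_description, m3_variant, fallback_variant):
    """Return m3_variant on M3 or newer, otherwise fallback_variant.

    Hypothetical helper: relies only on require_m3() raising RuntimeError
    on devices older than M3, as documented above.
    """
    try:
        caps.require_m3(feature_description)
    except RuntimeError:
        return fallback_variant
    return m3_variant
```

For instance, `kernel_fn = choose_variant(caps, "async copy", async_kernel, plain_kernel)`, where both kernel names stand in for your own functions.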

Example

caps = rt.device_capabilities()
print(f"GPU: {caps.gpu_family}, SIMD size: {caps.simdgroup_size}")

if caps.supports_simdgroup_matrix:
    compiled = enigma.compile(fast_gemm)
else:
    compiled = enigma.compile(fallback_gemm)
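The capability fields can also inform the `threads` argument. The sketch below clamps a requested 1D threadgroup width to `max_threads_per_threadgroup` and rounds it down to a multiple of `simdgroup_size`. The rounding is a common Metal practice rather than something this runtime is documented to require, and `threads_1d` is a hypothetical helper.

```python
def threads_1d(caps, requested=256):
    """Pick a 1D threadgroup shape that respects the device limits in caps."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    limit = min(requested, caps.max_threads_per_threadgroup)
    simd = caps.simdgroup_size
    # Round down to a whole number of SIMD groups; if the limit is smaller
    # than one SIMD group, keep it as-is.
    width = (limit // simd) * simd or limit
    return (width, 1, 1)
```

The result can be passed as `threads=threads_1d(caps)` to `execute()`, `dispatch()`, or `dispatch_timed()`.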

Dispatch errors

When dispatch fails, rt.execute() raises a RuntimeError containing:

Kernel name
grid and threads used
The underlying Metal error code and description

Check the kernel name and geometry first — most dispatch failures are launch-shape mismatches.
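Some launch-shape mistakes can be caught on the host before dispatch. The following sketch checks that both tuples hold three positive integers and that the threadgroup does not exceed a thread limit. `validate_launch` is a hypothetical helper, and the default limit of 1024 mirrors the typical value listed for `max_threads_per_threadgroup`; pass the real value from `device_capabilities()` when available.

```python
def validate_launch(grid, threads, max_threads_per_threadgroup=1024):
    """Raise ValueError for launch shapes that are malformed or too large."""
    for name, shape in (("grid", grid), ("threads", threads)):
        if len(shape) != 3 or any(
            not isinstance(v, int) or v < 1 for v in shape
        ):
            raise ValueError(f"{name} must be three positive ints, got {shape!r}")
    total = threads[0] * threads[1] * threads[2]
    if total > max_threads_per_threadgroup:
        raise ValueError(
            f"threads {threads!r} is {total} threads per threadgroup, "
            f"limit is {max_threads_per_threadgroup}"
        )


validate_launch((1024, 1, 1), (256, 1, 1))  # passes silently
```

A check like this does not replace the runtime's own error reporting; it only gives an earlier and more specific message for the cases it covers.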
