MetalRuntime manages the Metal device, command queue, and buffer lifecycle. It dispatches compiled kernels and returns output data.

Constructor

rt = enigma.MetalRuntime(dylib_path: str | None = None)
dylib_path overrides the default path to the Swift runtime dylib. Leave it None in almost all cases.

execute()

One-shot dispatch. Allocates buffers, runs the kernel, reads output, and returns raw bytes.
raw: bytes = rt.execute(
    compiled: CompiledKernel,
    inputs: list[np.ndarray],
    output_size: int,
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| compiled | CompiledKernel | Compiled kernel from enigma.compile() |
| inputs | list[np.ndarray] | Input buffers in kernel parameter order |
| output_size | int | Output buffer size in bytes |
| grid | tuple | (gx, gy, gz): total threads |
| threads | tuple | (tx, ty, tz): threads per threadgroup |

Returns

Raw bytes of the output buffer. Convert with np.frombuffer:
raw = rt.execute(compiled, [a, b], n * 4, (n, 1, 1), (256, 1, 1))
out = np.frombuffer(raw, dtype=np.float32)
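
Because output_size is given in bytes rather than elements, it can help to derive it from the element count and to check the returned length before decoding. The following is a minimal stdlib-only sketch. `output_size_bytes` and `decode_f32` are hypothetical helper names, not part of enigma, and little-endian float32 output is an assumption that matches the `n * 4` example above.

```python
import struct

FLOAT32_SIZE = struct.calcsize("<f")  # 4 bytes per float32


def output_size_bytes(count: int, itemsize: int = FLOAT32_SIZE) -> int:
    """Byte size of an output buffer holding `count` elements."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return count * itemsize


def decode_f32(raw: bytes, count: int) -> tuple:
    """Decode `raw` as `count` little-endian float32 values.

    The length is checked first so a wrong output_size surfaces as a
    clear error instead of silently truncated or misaligned data.
    """
    expected = output_size_bytes(count)
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    return struct.unpack(f"<{count}f", raw)
```

With NumPy available, np.frombuffer(raw, dtype=np.float32) performs the same decoding; the length check is the part worth keeping.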

prepare()

Pre-allocate Metal buffers for repeated dispatch. Use this in hot loops to avoid per-call allocation overhead.
prepared: PreparedKernel = rt.prepare(
    compiled: CompiledKernel,
    inputs: list[np.ndarray],
    output_size: int,
)

device_capabilities()

Query hardware features of the active Metal device.
caps: DeviceCapabilities = rt.device_capabilities()
See DeviceCapabilities below.

PreparedKernel

Returned by rt.prepare(...). Holds pre-allocated Metal buffers.

dispatch()

prepared.dispatch(
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)
Dispatches the kernel. Does not return output — call read_output() afterward.

dispatch_timed()

elapsed_ms: float = prepared.dispatch_timed(
    grid: tuple[int, int, int],
    threads: tuple[int, int, int],
)
Dispatches the kernel and returns the elapsed wall-clock time in milliseconds.

read_output()

raw: bytes = prepared.read_output()
Reads the output buffer back to host memory.

Example: benchmarking loop

prepared = rt.prepare(compiled, [a, b], output_size=n * 4)

# Warm up
prepared.dispatch(grid=(n, 1, 1), threads=(256, 1, 1))

# Timed runs
times = [prepared.dispatch_timed(grid=(n, 1, 1), threads=(256, 1, 1)) for _ in range(10)]
print(f"mean: {sum(times)/len(times):.3f} ms")

result = np.frombuffer(prepared.read_output(), dtype=np.float32)
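
A mean over ten runs can be skewed by a single slow dispatch, so reporting the minimum and median alongside it is often more informative. As a sketch, the `times` list from the loop above could be summarized with the standard library. `summarize_ms` is a hypothetical helper, not part of enigma.

```python
import statistics


def summarize_ms(times):
    """Summarize a list of per-dispatch timings given in milliseconds."""
    if not times:
        raise ValueError("times must contain at least one measurement")
    return {
        "min": min(times),
        "median": statistics.median(times),
        "mean": statistics.fmean(times),
        # Sample standard deviation needs at least two measurements.
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }


# Illustrative values only; in practice pass the list built from dispatch_timed().
stats = summarize_ms([1.0, 2.0, 3.0, 10.0])
print(", ".join(f"{name}: {value:.3f} ms" for name, value in stats.items()))
```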

DeviceCapabilities

Fields

| Field | Type | Description |
| --- | --- | --- |
| gpu_family | str | Human-readable GPU family, e.g. "apple9" |
| gpu_family_raw | int | Metal GPU family integer |
| is_m3_or_newer | bool | True if the device is an M3 chip or newer |
| supports_async_copy | bool | Async threadgroup copy support (M3+) |
| supports_simdgroup_matrix | bool | 8×8 simdgroup matrix multiply support |
| simdgroup_size | int | SIMD group size (typically 32) |
| max_threadgroup_memory | int | Max threadgroup memory in bytes |
| max_threads_per_threadgroup | int | Max threads per threadgroup (typically 1024) |

require_m3()

caps.require_m3(feature_description: str)
Raises RuntimeError if the device is older than M3. Pass a description of what you’re gating:
caps.require_m3("async copy operations require M3 or newer")

Example

caps = rt.device_capabilities()
print(f"GPU: {caps.gpu_family}, SIMD size: {caps.simdgroup_size}")

if caps.supports_simdgroup_matrix:
    compiled = enigma.compile(fast_gemm)
else:
    compiled = enigma.compile(fallback_gemm)
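
The same capabilities object can also inform launch geometry. The sketch below picks a 1-D threads tuple from simdgroup_size and max_threads_per_threadgroup. Rounding down to a multiple of the SIMD width is a common GPU heuristic, not something enigma is documented to require. `pick_threads_1d` is a hypothetical helper, and the SimpleNamespace is a stand-in for a real DeviceCapabilities so the example runs without a Metal device.

```python
from types import SimpleNamespace


def pick_threads_1d(caps, requested: int = 256) -> tuple:
    """Choose a (tx, 1, 1) threadgroup shape that respects the device limits.

    Only the documented fields simdgroup_size and
    max_threads_per_threadgroup are read from `caps`.
    """
    simd = caps.simdgroup_size
    limit = caps.max_threads_per_threadgroup
    width = min(requested, limit)       # never exceed the device maximum
    width = (width // simd) * simd      # round down to a whole number of SIMD groups
    return (max(width, simd), 1, 1)     # use at least one full SIMD group


# Stand-in for rt.device_capabilities(), using the "typical" values from the Fields table.
fake_caps = SimpleNamespace(simdgroup_size=32, max_threads_per_threadgroup=1024)
threads = pick_threads_1d(fake_caps)
```

The resulting tuple can be passed as the threads argument to execute(), dispatch(), or dispatch_timed().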

Dispatch errors

When dispatch fails, rt.execute() raises a RuntimeError containing:
  • Kernel name
  • grid and threads used
  • The underlying Metal error code and description
Check the kernel name and geometry first — most dispatch failures are launch-shape mismatches.
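
Since launch-shape mismatches are the common case, a cheap pre-flight check on the Python side can catch them before the call reaches Metal. The sketch below is an assumption-laden example: `validate_launch` is a hypothetical helper, the checks are generic Metal launch constraints rather than anything enigma documents, and the default limit of 1024 is only the "typical" value. In real code the limit would come from caps.max_threads_per_threadgroup.

```python
from math import prod


def validate_launch(grid, threads, max_threads_per_threadgroup: int = 1024) -> None:
    """Raise ValueError if the launch shape is obviously invalid."""
    for name, shape in (("grid", grid), ("threads", threads)):
        if len(shape) != 3:
            raise ValueError(f"{name} must have 3 dimensions, got {len(shape)}")
        if any(not isinstance(dim, int) or dim < 1 for dim in shape):
            raise ValueError(f"{name} dimensions must be positive integers: {shape}")
    # Metal limits the total thread count of a threadgroup, not each axis separately.
    total = prod(threads)
    if total > max_threads_per_threadgroup:
        raise ValueError(
            f"threads {threads} has {total} threads per threadgroup, "
            f"limit is {max_threads_per_threadgroup}"
        )


# Passes silently for the geometry used in the examples on this page.
validate_launch(grid=(4096, 1, 1), threads=(256, 1, 1))
```

A check like this does not replace the RuntimeError from the runtime. It only turns the most frequent mistakes into an earlier, more specific message.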