# First Kernel

> Write, compile, and dispatch a minimal Enigma kernel end to end.

This page walks through writing, compiling, and running a vector-add kernel — the minimal end-to-end path in Enigma.

## Step 1: define the kernel

Decorate a Python function with `@enigma.kernel`. The function body is **traced**, not executed: each expression becomes an IR node, not a Python value.

```python theme={null}
import enigma

@enigma.kernel
def vector_add(A: enigma.f32, B: enigma.f32, C: enigma.f32):
    tid = enigma.thread_position_in_grid
    C[tid] = A[tid] + B[tid]
```

What this does:

* `A`, `B`, and `C` are typed buffer parameters (`f32` corresponds to `float` in Metal)
* `enigma.thread_position_in_grid` evaluates to the global thread ID as an `IRValue`
* Indexing `A[tid]` emits a load; assigning `C[tid] = ...` emits a store
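
To make "traced, not executed" concrete, here is a toy tracer in plain Python. It is not Enigma's implementation: the `Tracer`, `Value`, and `Buffer` classes and the tuple-shaped nodes are hypothetical, chosen only to show how operator overloading can record loads, adds, and stores instead of computing numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Tracer:
    """Hypothetical recorder that collects IR nodes as plain tuples."""
    nodes: list = field(default_factory=list)

    def emit(self, op, *args):
        # Append a node and hand back a symbolic reference to it.
        self.nodes.append((op, *args))
        return Value(self, len(self.nodes) - 1)


@dataclass
class Value:
    """Symbolic value: points at an IR node by index and holds no number."""
    tracer: Tracer
    node_id: int

    def __add__(self, other):
        return self.tracer.emit("add", self.node_id, other.node_id)


@dataclass
class Buffer:
    """Symbolic buffer: indexing records a load, assignment records a store."""
    tracer: Tracer
    name: str

    def __getitem__(self, index):
        return self.tracer.emit("load", self.name, index.node_id)

    def __setitem__(self, index, value):
        self.tracer.emit("store", self.name, index.node_id, value.node_id)


def trace_vector_add():
    t = Tracer()
    A, B, C = (Buffer(t, name) for name in "ABC")
    tid = t.emit("thread_position_in_grid")
    C[tid] = A[tid] + B[tid]  # same statement as the kernel body above
    return t.nodes


for node in trace_vector_add():
    print(node)
```

Running the body once yields five recorded nodes (one index query, two loads, one add, one store) and no arithmetic is performed. A real tracer would also carry types and buffer metadata.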

## Step 2: compile to a Metal library

```python theme={null}
compiled = enigma.compile(vector_add)
```

`enigma.compile` runs the full pipeline:

1. Traces the Python function to an IR
2. Lowers the IR to the Enigma MLIR dialect
3. Emits Metal Shading Language (MSL) source
4. Invokes `xcrun metal` and `xcrun metallib`
5. Returns a `CompiledKernel` with all artifacts
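
Step 4 shells out to Apple's command-line tools. As a rough sketch of what that stage can look like, the helper below builds the two command lines in the form Apple documents for precompiling shaders. The helper name `metal_build_commands`, the `-sdk macosx` choice, and the intermediate `.air` file are assumptions for illustration; this page does not show the flags Enigma actually passes.

```python
from pathlib import Path


def metal_build_commands(source, sdk="macosx"):
    """Return (compile_cmd, link_cmd) for a .metal source file.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    src = Path(source)
    air = src.with_suffix(".air")        # intermediate representation
    lib = src.with_suffix(".metallib")   # final Metal library
    compile_cmd = ["xcrun", "-sdk", sdk, "metal", "-c", str(src), "-o", str(air)]
    link_cmd = ["xcrun", "-sdk", sdk, "metallib", str(air), "-o", str(lib)]
    return compile_cmd, link_cmd


compile_cmd, link_cmd = metal_build_commands("vector_add.metal")
print(" ".join(compile_cmd))
print(" ".join(link_cmd))
```

On a Mac with the Xcode command-line tools installed, each list could be passed to `subprocess.run(cmd, check=True)`. On other platforms `xcrun` is not available, so this stage cannot run there.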

### Inspect the output

```python theme={null}
print(compiled.kernel_name)        # enigma_kernel_vector_add
print(compiled.metal_source[:500]) # generated .metal source
```

### Compile with verbose output

```python theme={null}
compiled = enigma.compile(
    vector_add,
    dump_ir=True,
    dump_mlir=True,
    keep_metal_source=True,
    work_dir="./build/enigma",
)
```

## Step 3: dispatch

```python theme={null}
import numpy as np

rt = enigma.MetalRuntime()
n = 4096

a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)

raw = rt.execute(
    compiled,
    inputs=[a, b],
    output_size=n * 4,       # bytes: 4096 elements × 4 bytes
    grid=(n, 1, 1),          # one thread per element
    threads=(256, 1, 1),     # threadgroup size
)
out = np.frombuffer(raw, dtype=np.float32)
```

### Key parameters

| Parameter     | Meaning                                                 |
| ------------- | ------------------------------------------------------- |
| `inputs`      | List of NumPy arrays passed as read-only device buffers |
| `output_size` | Size in **bytes** of the output buffer                  |
| `grid`        | `(gx, gy, gz)` — total threads in each dimension        |
| `threads`     | `(tx, ty, tz)` — threads per threadgroup                |
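
The arithmetic behind these parameters is easy to get wrong, so the sketch below derives them from an element count. `dispatch_shape` is a hypothetical helper, not part of Enigma. The `threadgroups` and `padded_threads` values are included only to show what happens when the element count is not a multiple of the threadgroup size; this page does not say whether Enigma requires that.

```python
def dispatch_shape(n_elements, itemsize=4, threads_per_group=256):
    """Derive dispatch parameters for a 1-D kernel (illustrative only)."""
    groups = -(-n_elements // threads_per_group)  # ceiling division
    return {
        "output_size": n_elements * itemsize,      # bytes, not elements
        "grid": (n_elements, 1, 1),                # one thread per element
        "threads": (threads_per_group, 1, 1),      # threadgroup size
        "threadgroups": groups,                    # groups needed to cover the data
        "padded_threads": groups * threads_per_group,
    }


print(dispatch_shape(4096))  # divides evenly: 16 groups, no padding
print(dispatch_shape(1000))  # 4 groups cover 1024 threads, 24 more than needed
```

If a dispatch is rounded up to `padded_threads`, the extra threads would index past the end of the buffers unless the kernel guards against it. The `vector_add` kernel above has no such guard, so an element count that divides evenly (4096 with 256 here) avoids the question.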

## Step 4: validate

```python theme={null}
np.testing.assert_allclose(out, a + b, rtol=1e-5, atol=1e-7)
print("pass")
```
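
For reference, `assert_allclose` passes when every element satisfies `abs(actual - desired) <= atol + rtol * abs(desired)`. The pure-Python sketch below applies the same rule to plain lists; `within_tolerance` is a hypothetical name used only for illustration.

```python
def within_tolerance(actual, desired, rtol=1e-5, atol=1e-7):
    """Element-wise check using the same rule as numpy.testing.assert_allclose."""
    return all(
        abs(x - y) <= atol + rtol * abs(y)
        for x, y in zip(actual, desired)
    )


print(within_tolerance([1.0], [1.000005]))  # difference of 5e-6 is inside the bound
print(within_tolerance([1.0], [1.001]))     # difference of 1e-3 is outside the bound
```

The relative term scales with the expected value, while `atol` keeps the check meaningful for expected values near zero.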

## Common first-run mistakes

| Mistake                         | Symptom                           |
| ------------------------------- | --------------------------------- |
| `output_size` too small         | Partial result or garbage at tail |
| `grid` smaller than data domain | Some elements not written         |
| Wrong dtype in `np.frombuffer`  | Nonsensical values                |
| Running on a non-Metal machine  | `MetalRuntime` dispatch exception |
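
Two of these mistakes can be reproduced without a GPU. The sketch below uses the standard-library `struct` module and a hand-built byte string as a stand-in for the bytes returned by dispatch; `raw`, `n`, and `itemsize` here are illustrative values, not Enigma output.

```python
import struct

n = 4
itemsize = 4  # bytes per float32

# Stand-in for the raw bytes a dispatch would return: four little-endian float32 values.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

# Size check: catches an output buffer that is too small before decoding it.
assert len(raw) == n * itemsize, "output buffer does not hold n elements"

# Correct dtype: the bytes decode back to the original floats.
as_f32 = struct.unpack("<4f", raw)

# Wrong dtype: the same bytes read as int32 give large, meaningless integers.
as_i32 = struct.unpack("<4i", raw)

print(as_f32)
print(as_i32)
```

The same reinterpretation happens when `np.frombuffer` is given a dtype that does not match what the kernel wrote, which is why the symptom is nonsensical values and not an error.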

## Next steps

* [Compile-Only Workflow](/getting-started/compile-only-workflow) — iterate without a GPU
* [Execution Model](/concepts/execution-model) — understand grid, threadgroups, and index queries
* [Examples](/examples) — more complete kernel patterns
