# Async Copy (Experimental)

> AIR-backed async device-threadgroup copy via simdgroup intrinsics.

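Async copy uses undocumented AIR intrinsics to move data between device and threadgroup memory without blocking. Each copy call returns an event; pass one or more events to `enigma.async_copy_wait` to block until those copies complete. Requires M3 or newer hardware.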

<Warning>
  Experimental. Uses `__asm("air.simdgroup_async_copy_*")` extern declarations (same technique as [percisely.xyz/gemm](https://percisely.xyz/gemm)). Apple may change or remove access to these intrinsics at any time.
</Warning>

## Functions

| Function                                                                                       | Description                                  |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| `enigma.async_copy_1d_d2t(dst, dst_offset, src, src_offset, count)`                            | 1D device -> threadgroup. Returns event      |
| `enigma.async_copy_1d_t2d(dst, dst_offset, src, src_offset, count)`                            | 1D threadgroup -> device. Returns event      |
| `enigma.async_copy_2d_d2t(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows)` | 2D tile device -> threadgroup. Returns event |
| `enigma.async_copy_2d_t2d(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows)` | 2D tile threadgroup -> device. Returns event |
| `enigma.async_copy_wait(*events)`                                                              | Block until events complete                  |

## Parameters

* **dst, src**: Tensor, RegisterTensor, or buffer name string
* **dst\_offset, src\_offset**: Element offset into the buffer
* **count**: Number of elements (1D)
* **dst\_epr, src\_epr**: Elements per row of the destination and source buffers, i.e. the row stride in elements (2D)
* **tile\_cols, tile\_rows**: Tile width (columns) and height (rows) (2D)
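
To make the 2D parameters concrete, here is a minimal host-side sketch of the index arithmetic they imply. It is plain Python, not the enigma API: `copy_2d_reference` is a hypothetical helper, and it assumes flat, row-major buffers with offsets and strides counted in elements. The behavior of the underlying intrinsics may differ in details not covered on this page.

```python
def copy_2d_reference(dst, dst_off, dst_epr, src, src_off, src_epr,
                      tile_cols, tile_rows):
    """Hypothetical reference model of a 2D tile copy (assumed semantics).

    Buffers are flat sequences; *_off is an element offset to the tile's
    first element and *_epr is the number of elements per row (row stride).
    """
    for r in range(tile_rows):
        s = src_off + r * src_epr  # start of row r in the source
        d = dst_off + r * dst_epr  # start of row r in the destination
        for c in range(tile_cols):
            dst[d + c] = src[s + c]
    return dst


# Source: a 4x8 row-major matrix holding 0..31 (src_epr = 8).
src = list(range(32))
# Destination: a packed 2x4 tile (dst_epr = 4).
dst = [0] * 8

# Copy the 2-row by 4-column tile whose top-left element is row 1, column 2.
copy_2d_reference(dst, 0, 4, src, 1 * 8 + 2, 8, tile_cols=4, tile_rows=2)
print(dst)  # [10, 11, 12, 13, 18, 19, 20, 21]
```

Under these assumptions, a 1D copy of `count` elements is the special case `tile_rows = 1` and `tile_cols = count`.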

## Example

```python
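@enigma.kernel
def tiled_load(A: enigma.f32, B: enigma.f32):
    # 64-element float tile in threadgroup memory
    tile = enigma.threadgroup_alloc("float", 64)
    # Offset and count are passed here as Metal `uint` values
    c0 = enigma.metal_cast(0, "uint")
    cnt = enigma.metal_cast(64, "uint")

    # Start copying 64 elements from A (device) into tile (threadgroup)
    ev = enigma.async_copy_1d_d2t(tile, c0, A, c0, cnt)
    # Block until the copy has completed
    enigma.async_copy_wait(ev)
    enigma.barrier()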
```

## Legacy aliases

| Function                                                 | Maps to                                               |
| -------------------------------------------------------- | ----------------------------------------------------- |
| `enigma.async_copy_to_threadgroup(src, dst, count, ...)` | `async_copy_1d_d2t` (note that `src` precedes `dst`)  |
| `enigma.async_copy_commit(token)`                        | No-op (commit is implicit)                            |
