Async copy uses undocumented AIR intrinsics to move data between device and threadgroup memory without blocking. Requires M3 or later hardware.
Experimental. Uses __asm("air.simdgroup_async_copy_*") extern declarations (same technique as percisely.xyz/gemm). Apple may change or remove access to these intrinsics at any time.

Functions

| Function | Description |
| --- | --- |
| enigma.async_copy_1d_d2t(dst, dst_offset, src, src_offset, count) | 1D device -> threadgroup. Returns event |
| enigma.async_copy_1d_t2d(dst, dst_offset, src, src_offset, count) | 1D threadgroup -> device. Returns event |
| enigma.async_copy_2d_d2t(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows) | 2D tile device -> threadgroup. Returns event |
| enigma.async_copy_2d_t2d(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows) | 2D tile threadgroup -> device. Returns event |
| enigma.async_copy_wait(*events) | Block until events complete |

Parameters

  • dst, src: Tensor, RegisterTensor, or buffer name string
  • dst_offset, src_offset: Element offset into the buffer
  • count: Number of elements (1D)
  • dst_epr, src_epr: Elements per row (2D stride)
  • tile_cols, tile_rows: Tile dimensions (2D)
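
The addressing implied by these parameters can be sketched on the host in plain Python. This is a reference model, not the enigma implementation: `copy_1d` and `copy_2d` are hypothetical helper names, and the code assumes flat, row-major buffers indexed in elements, with `dst_epr` and `src_epr` acting as row strides.

```python
# Host-side reference model of the copy addressing (not the enigma API).
# Assumption: buffers are flat, row-major sequences indexed in elements.

def copy_1d(dst, dst_offset, src, src_offset, count):
    """Copy `count` contiguous elements from src to dst."""
    for i in range(count):
        dst[dst_offset + i] = src[src_offset + i]


def copy_2d(dst, dst_off, dst_epr, src, src_off, src_epr, tile_cols, tile_rows):
    """Copy a tile_rows x tile_cols tile; *_epr is the row stride in elements."""
    for r in range(tile_rows):
        for c in range(tile_cols):
            dst[dst_off + r * dst_epr + c] = src[src_off + r * src_epr + c]


# A 4x8 row-major source matrix holding 0..31.
src = list(range(32))
# A 2x3 destination tile, packed tightly (3 elements per row).
tile = [0] * 6

# Take the 2x3 tile whose top-left corner is row 1, column 2 of the source.
copy_2d(tile, 0, 3, src, 1 * 8 + 2, 8, tile_cols=3, tile_rows=2)
print(tile)  # [10, 11, 12, 18, 19, 20]
```

In this model the source offset selects the tile's top-left element, and the two strides differ because the source is a full 8-column matrix while the destination is a tightly packed tile.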

Example

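@enigma.kernel
def tiled_load(A: enigma.f32, B: enigma.f32):
    # 64-element float tile in threadgroup memory
    tile = enigma.threadgroup_alloc("float", 64)
    # Offsets and the element count are cast to Metal uint
    c0 = enigma.metal_cast(0, "uint")
    cnt = enigma.metal_cast(64, "uint")

    # Start a 1D copy of 64 elements from A (device) into tile; returns an event
    ev = enigma.async_copy_1d_d2t(tile, c0, A, c0, cnt)
    # Block until the copy has completed
    enigma.async_copy_wait(ev)
    enigma.barrier()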

Legacy aliases

| Function | Maps to |
| --- | --- |
| enigma.async_copy_to_threadgroup(src, dst, count, ...) | async_copy_1d_d2t |
| enigma.async_copy_commit(token) | No-op (commit is implicit) |