# Registers, Copy & Pipeline

> Register tensors, explicit data movement, and software pipelining.

## Register tensors

`enigma.register_tensor` creates a small fixed-size tensor backed by per-thread SSA values (registers). Unlike a `Tensor`, which lives in device or threadgroup memory, a register tensor must be indexed with compile-time constants, such as the values produced by `enigma.range_constexpr`.

```python
enigma.register_tensor(shape, dtype="float", fill=0)
```

| Parameter | Type   | Default   | Description                    |
| --------- | ------ | --------- | ------------------------------ |
| `shape`   | tuple  | required  | Dimensions (e.g. `(4, 4)`)     |
| `dtype`   | str    | `"float"` | Element type                   |
| `fill`    | number | `0`       | Initial value for all elements |

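```python
# K, a_tile, and b_tile are assumed to be defined earlier in the kernel.
acc = enigma.register_tensor((8, 8), dtype="float", fill=0.0)  # 8x8 per-thread accumulator
with enigma.for_range(0, K) as k:  # traced loop over the reduction dimension
    # i and j come from range_constexpr, so acc is indexed with compile-time constants
    for i in enigma.range_constexpr(8):
        for j in enigma.range_constexpr(8):
            acc[i, j] = enigma.fma(a_tile[i, k], b_tile[k, j], acc[i, j])
```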

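One way to picture the compile-time index rule is to treat a register tensor as a fixed set of independently named scalars with no addressable memory behind them. The following plain-Python sketch models that idea only; `RegisterTensorModel` and its `r_i_j` naming scheme are hypothetical and are not part of enigma.

```python
from itertools import product


class RegisterTensorModel:
    """Toy model: one independent scalar slot per element, no addressable memory."""

    def __init__(self, shape, fill=0):
        self.shape = tuple(shape)
        # One named "register" per element, e.g. r_0_0, r_0_1, ...
        self.regs = {
            self._name(idx): fill
            for idx in product(*(range(n) for n in self.shape))
        }

    @staticmethod
    def _name(idx):
        return "r_" + "_".join(str(i) for i in idx)

    def _resolve(self, idx):
        idx = idx if isinstance(idx, tuple) else (idx,)
        # A register has no address, so the slot must be chosen while tracing:
        # this model accepts only plain Python ints (compile-time constants).
        if len(idx) != len(self.shape) or not all(type(i) is int for i in idx):
            raise TypeError("register tensor indices must be compile-time ints")
        if not all(0 <= i < n for i, n in zip(idx, self.shape)):
            raise IndexError(f"index {idx} out of range for shape {self.shape}")
        return self._name(idx)

    def __getitem__(self, idx):
        return self.regs[self._resolve(idx)]

    def __setitem__(self, idx, value):
        self.regs[self._resolve(idx)] = value


acc = RegisterTensorModel((2, 2), fill=0.0)
for i in range(2):  # stands in for range_constexpr: fully unrolled in Python
    for j in range(2):
        acc[i, j] = acc[i, j] + (i * 2 + j)
```

In this model, indexing with anything other than a Python `int` (for example a symbolic loop variable) raises `TypeError`, which mirrors why a traced loop index cannot select a register.
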
## Copy

`enigma.copy` moves `count` elements from `src` to `dst` (device or threadgroup buffers) using a traced `for_range` loop.

```python
enigma.copy(src, dst, count, src_offset=0, dst_offset=0, mask_fn=None, coalesced_width=1)
```

| Parameter                  | Type             | Default  | Description                                   |
| -------------------------- | ---------------- | -------- | --------------------------------------------- |
| `src`, `dst`               | Tensor           | required | Source and destination buffers                |
| `count`                    | int              | required | Number of elements                            |
| `src_offset`, `dst_offset` | int or IRValue   | `0`      | Base offsets                                  |
| `mask_fn`                  | callable or None | `None`   | Per-element predicate `fn(i) -> i1`           |
| `coalesced_width`          | int              | `1`      | Load/store width per iteration (1, 2, or 4)   |

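```python
# A (a device tensor) and block_start are assumed to be defined earlier in the kernel.
tile = enigma.threadgroup_alloc("float", 256)
enigma.copy(A, tile, count=256, src_offset=block_start)
enigma.barrier()  # synchronize the threadgroup before the tile is read
```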

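The parameters in the table above can be read against a small reference model. This is a plain-Python sketch on lists, not enigma's implementation: `copy_model` is a hypothetical name, and it assumes that `mask_fn` receives the 0-based element index, that masked-out elements leave `dst` untouched, and that `count` is a multiple of `coalesced_width`. The source does not specify any of these three points.

```python
def copy_model(src, dst, count, src_offset=0, dst_offset=0,
               mask_fn=None, coalesced_width=1):
    """Reference semantics on Python lists; returns the number of loop iterations."""
    if coalesced_width not in (1, 2, 4):
        raise ValueError("coalesced_width must be 1, 2, or 4")
    if count % coalesced_width != 0:
        # Assumption: remainder handling is unspecified, so this model rejects it.
        raise ValueError("count must be a multiple of coalesced_width")

    iterations = count // coalesced_width
    for it in range(iterations):
        base = it * coalesced_width
        for lane in range(coalesced_width):  # models one wide load/store
            i = base + lane
            # Assumption: elements whose predicate is false are skipped.
            if mask_fn is None or mask_fn(i):
                dst[dst_offset + i] = src[src_offset + i]
    return iterations


A = list(range(100, 116))  # 16-element stand-in for a device buffer
tile = [0] * 8             # 8-element stand-in for a threadgroup tile
n_valid = 6                # pretend only the first 6 elements are in bounds

iters = copy_model(A, tile, count=8, src_offset=4,
                   mask_fn=lambda i: i < n_valid, coalesced_width=4)
```

With a width of 4 the model needs two iterations instead of eight, and the mask keeps the last two tile elements at their initial value.
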
## Pipeline

`enigma.pipeline` creates a multi-stage ring buffer of threadgroup tiles so that loading upcoming tiles can overlap with compute on the current one.

```python
enigma.pipeline(dtype, size, stages=2)
```

| Parameter | Type | Default  | Description                      |
| --------- | ---- | -------- | -------------------------------- |
| `dtype`   | str  | required | Element type                     |
| `size`    | int  | required | Elements per tile                |
| `stages`  | int  | `2`      | Number of buffered stages (>= 2) |

| Method           | Description                                            |
| ---------------- | ------------------------------------------------------ |
| `pipe.front()`   | Current iteration's consume buffer (stage 0)           |
| `pipe.back()`    | Most-distant prefetch buffer (stage N-1)               |
| `pipe.stage(k)`  | Stage k buffer                                         |
| `pipe.advance()` | Rotate all stages by one (Python-side, no MSL emitted) |

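The rotation performed by `advance()` can be modeled as relabeling which physical buffer plays which stage, without moving any data. The sketch below is a plain-Python illustration under that assumption; `PipelineModel` and its `head` field are hypothetical, and the rotation direction (old stage 1 becomes the new front) is inferred from the usage example rather than stated by the source.

```python
class PipelineModel:
    """Toy ring of stage buffers; advance() only changes the stage labels."""

    def __init__(self, size, stages=2):
        if stages < 2:
            raise ValueError("stages must be >= 2")
        self.buffers = [[None] * size for _ in range(stages)]
        self.head = 0  # physical index of the buffer currently acting as stage 0

    def stage(self, k):
        if not 0 <= k < len(self.buffers):
            raise IndexError(f"stage {k} out of range")
        return self.buffers[(self.head + k) % len(self.buffers)]

    def front(self):
        return self.stage(0)

    def back(self):
        return self.stage(len(self.buffers) - 1)

    def advance(self):
        # Old stage 1 becomes stage 0; the old front becomes the new back.
        self.head = (self.head + 1) % len(self.buffers)


pipe = PipelineModel(size=4, stages=3)
b0, b1, b2 = pipe.stage(0), pipe.stage(1), pipe.stage(2)
pipe.advance()
```

After one `advance()`, the buffer that was just consumed (`b0`) is the new back and is free to receive the next prefetch.
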
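```python
# A, NUM_TILES, and compute are assumed to be defined earlier in the kernel.
pipe = enigma.pipeline("float", 256, stages=3)

# Prefill the first two stages with tiles 0 and 1
for s in range(2):
    enigma.copy(A, pipe.stage(s), count=256, src_offset=s * 256)
enigma.barrier()

with enigma.for_range(0, NUM_TILES) as t:
    compute(pipe.front())  # consume tile t
    # Prefetch tile t + 2 into the free stage. In the last two iterations this
    # offset lies past tile NUM_TILES - 1, so A must cover it or the copy must be masked.
    enigma.copy(A, pipe.back(), count=256, src_offset=(t + 2) * 256)
    enigma.barrier()
    pipe.advance()
```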

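To check the schedule in the example above, it can help to trace which tile index each stage holds per iteration. The following is a plain-Python simulation, not enigma code: it tracks tile indices in a `collections.deque`, assumes `advance()` rotates stage 1 into the front position, and uses a hypothetical `NUM_TILES = 5`.

```python
from collections import deque

NUM_TILES = 5
STAGES = 3
PREFETCH = STAGES - 1            # tiles loaded ahead of the one being consumed

stages = deque([None] * STAGES)  # stages[0] is the front, stages[-1] is the back
for s in range(PREFETCH):        # prefill: tiles 0 and 1
    stages[s] = s

consumed, out_of_range = [], []
for t in range(NUM_TILES):
    consumed.append(stages[0])   # models compute(pipe.front())
    nxt = t + PREFETCH           # tile index requested by the prefetch copy
    if nxt >= NUM_TILES:
        out_of_range.append(nxt) # this request lies past the last tile
    stages[-1] = nxt             # models the copy into pipe.back()
    stages.rotate(-1)            # models pipe.advance()
```

The trace consumes tiles in order, and it also shows that the final `stages - 1` iterations request tiles beyond the last one, which is where a `mask_fn` or padded input would matter.
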