# Layout Algebra

> CuTe-inspired Layout type and tiling operations for writing efficient tiled kernels.

Enigma's layout algebra is inspired by NVIDIA's CuTe library. A `Layout` is a `(shape, stride)` pair that maps multi-dimensional logical coordinates to linear memory offsets. Composing and dividing layouts is how you express tiling, thread-value partitioning, and vectorized access patterns.

## The Layout type

```python theme={null}
L = enigma.Layout(shape, stride)
```

A layout with `shape=(4, 8)` and `stride=(1, 4)` maps coordinate `(r, c)` to offset `r*1 + c*4`.

```python theme={null}
L = enigma.Layout((4, 8), (1, 4))
L((0, 0))   # → 0
L((1, 0))   # → 1
L((0, 1))   # → 4
L((2, 3))   # → 2*1 + 3*4 = 14
```
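The mapping above is a dot product of coordinate and stride. The following is a minimal pure-Python sketch of that arithmetic; it does not use Enigma, and `layout_offset` is a hypothetical helper name chosen for illustration.

```python
from itertools import product

def layout_offset(coord, stride):
    """Offset of a flat coordinate: the dot product of coord and stride."""
    return sum(c * d for c, d in zip(coord, stride))

shape, stride = (4, 8), (1, 4)

# Visit every coordinate of the 4×8 layout and collect its offset.
offsets = [layout_offset(rc, stride) for rc in product(*(range(s) for s in shape))]

# This particular layout is compact: the 32 coordinates hit offsets 0..31 exactly once.
is_compact = sorted(offsets) == list(range(32))
```

Checking that the offsets form a contiguous range is a quick way to confirm that a hand-written stride tuple addresses every element exactly once.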

### Total size

```python theme={null}
enigma.size(L)           # total elements: 4 * 8 = 32
enigma.size(L, mode=[0]) # size of mode 0: 4
enigma.size(L, mode=[1]) # size of mode 1: 8
```
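Size depends only on the shape, not on the stride. As a rough sketch of the arithmetic (plain Python, with `shape_size` as a hypothetical helper rather than an Enigma function), nested shapes are multiplied through:

```python
def shape_size(shape):
    """Product of all extents; nested tuples are multiplied through."""
    if isinstance(shape, int):
        return shape
    total = 1
    for mode in shape:
        total *= shape_size(mode)
    return total

total = shape_size((4, 8))               # whole layout: 32
mode0 = shape_size((4, 8)[0])            # a single mode: 4
nested = shape_size(((4, 64), (4, 4)))   # hierarchical shape: 256 * 16 = 4096
```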

## Creating layouts

| Function                                   | Description                              |
| ------------------------------------------ | ---------------------------------------- |
| `enigma.Layout(shape, stride)`             | Explicit shape and stride                |
| `enigma.make_layout(shape, stride=None)`   | Alias; omit stride for row-major default |
| `enigma.make_ordered_layout(shape, order)` | Custom dimension ordering                |
| `enigma.make_identity_layout(shape)`       | Column-major (order = `(0, 1, ...)`)     |

```python theme={null}
# Row-major 4×8
row_major = enigma.Layout((4, 8), (8, 1))

# Column-major 4×8
col_major = enigma.Layout((4, 8), (1, 4))

# Custom order: slow-dim first, fast-dim second
ordered = enigma.make_ordered_layout((4, 64), order=(1, 0))  # 4 rows, 64 cols, columns vary fastest
```
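To see how an `order` tuple could translate into strides, here is a small pure-Python sketch. It assumes the CuTe convention, in which the mode with the lowest order value gets stride 1; this matches the table entry that equates `order = (0, 1, ...)` with column-major, but it is an assumption about Enigma, and `ordered_strides` is a hypothetical helper.

```python
def ordered_strides(shape, order):
    """Compact strides where the mode with the lowest order value is fastest."""
    strides = [0] * len(shape)
    running = 1
    # Visit modes from fastest (lowest order value) to slowest.
    for mode in sorted(range(len(shape)), key=lambda m: order[m]):
        strides[mode] = running
        running *= shape[mode]
    return tuple(strides)

col_major_strides = ordered_strides((4, 8), order=(0, 1))   # (1, 4)
row_major_strides = ordered_strides((4, 8), order=(1, 0))   # (8, 1)
```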

## Transforming layouts

### `coalesce`

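Merges adjacent modes with compatible strides into a single flat mode. Two modes are compatible when the second mode's stride equals the first mode's shape times its stride, as with shape `(4, 8)` and stride `(1, 4)`. Use it after composition to simplify the layout.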

```python theme={null}
L = enigma.Layout(((4, 8),), ((1, 4),))  # nested shape
L2 = enigma.coalesce(L)                   # → Layout((32,), (1,))
```
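The merge rule can be sketched in plain Python for flat layouts. This is an illustration under the assumption that Enigma follows CuTe's rule (fold a mode into its left neighbor when its stride equals the neighbor's shape times stride, and drop size-1 modes); `coalesce_flat` is a hypothetical name.

```python
def coalesce_flat(shape, stride):
    """Merge adjacent flat modes when the next stride continues the previous mode."""
    out_shape, out_stride = [], []
    for s, d in zip(shape, stride):
        if s == 1:
            continue  # size-1 modes contribute nothing to the mapping
        if out_shape and d == out_shape[-1] * out_stride[-1]:
            out_shape[-1] *= s  # contiguous with the previous mode: fold it in
        else:
            out_shape.append(s)
            out_stride.append(d)
    if not out_shape:
        # Every mode had size 1; represent the result as a single trivial mode.
        return (1,), (0,)
    return tuple(out_shape), tuple(out_stride)

merged = coalesce_flat((4, 8), (1, 4))     # strides chain: 4 == 4 * 1
unmerged = coalesce_flat((4, 8), (8, 1))   # strides do not chain left to right
```

The row-major case stays two-dimensional because merging is only attempted between a mode and its left neighbor.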

### `complement`

Returns the layout that covers the elements **not** covered by the input layout, within a given total size.

```python theme={null}
Lc = enigma.complement(L)
```
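As a sketch of what a complement computes, the code below follows CuTe's definition for flat layouts: the complement is a layout that, added to the original, reaches every offset below a target size exactly once. Whether Enigma takes the target size as an argument is not shown above, so `complement_flat` and its `cotarget` parameter are hypothetical.

```python
from itertools import product

def complement_flat(shape, stride, cotarget):
    """Complement of a flat layout within `cotarget` offsets (CuTe-style sketch).

    Assumes nonzero strides that nest evenly, as in the example below.
    """
    out_shape, out_stride = [], []
    covered = 1  # extent of the offset range filled so far
    for s, d in sorted(zip(shape, stride), key=lambda m: m[1]):
        out_shape.append(d // covered)   # fill the gap below this stride
        out_stride.append(covered)
        covered = s * d
    out_shape.append(-(-cotarget // covered))  # ceil-divide for the remaining range
    out_stride.append(covered)
    return tuple(out_shape), tuple(out_stride)

def image(shape, stride):
    """All offsets a flat layout produces."""
    return [sum(c * d for c, d in zip(crd, stride))
            for crd in product(*(range(s) for s in shape))]

# Layout 4:2 touches offsets {0, 2, 4, 6}; take its complement within 24 offsets.
c_shape, c_stride = complement_flat((4,), (2,), cotarget=24)
combined = sorted(a + b for a in image((4,), (2,)) for b in image(c_shape, c_stride))
```

Here the complement comes out as shape `(2, 3)` with stride `(1, 8)`, and `combined` is exactly `0..23`.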

### `logical_divide` and `zipped_divide`

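Both operations split a layout into a tile part and a rest part. In CuTe's convention, `logical_divide` keeps the split inside each mode, giving `((tile_m, rest_m), (tile_n, rest_n))`. `zipped_divide` applies the same per-mode division and then regroups the result, so the first mode is the tile and the second mode is the rest.

```python theme={null}
tile = (16, 8)
Lt = enigma.zipped_divide(L, tile)   # L: the 2-D layout being tiled
# Lt shape: ((tile_m, tile_n), (rest_m, rest_n))
```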
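For the common case of a 2-D layout whose extents are multiples of the tile, the zipped result can be written down directly. The sketch below is plain Python with hypothetical helper names; it handles even division only and is not Enigma's implementation.

```python
def zipped_divide_2d(shape, stride, tile):
    """Tile a 2-D layout; assumes each extent is a multiple of the tile extent."""
    (m, n), (dm, dn), (tm, tn) = shape, stride, tile
    assert m % tm == 0 and n % tn == 0, "sketch handles even division only"
    new_shape = ((tm, tn), (m // tm, n // tn))      # (tile, rest)
    new_stride = ((dm, dn), (tm * dm, tn * dn))     # rest steps over whole tiles
    return new_shape, new_stride

def offset(coord, stride):
    """Dot product over a two-level coordinate ((i, j), (bi, bj))."""
    return sum(c * d for cs, ds in zip(coord, stride) for c, d in zip(cs, ds))

# Row-major 32×16 layout split into 16×8 tiles.
t_shape, t_stride = zipped_divide_2d((32, 16), (16, 1), (16, 8))

# Element (3, 5) inside tile (1, 1) is element (16 + 3, 8 + 5) of the original.
tiled = offset(((3, 5), (1, 1)), t_stride)
direct = (16 + 3) * 16 + (8 + 5) * 1
```

Because the tile is the first mode, fixing the second mode selects one whole tile, which is what makes block slicing convenient.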

### `composition`

Compose two layouts: `composition(a, b)` treats `b` as a re-indexing of `a`, so the result maps a coordinate `c` to `a(b(c))`.

```python theme={null}
composed = enigma.composition(a, b)
```
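Composition can be checked pointwise by feeding the output of `b` into `a`. The sketch below is plain Python with hypothetical helpers, and it assumes CuTe's convention that a linear index is unflattened with the first mode varying fastest.

```python
def idx_to_coord(idx, shape):
    """Unflatten a linear index, first mode varying fastest (CuTe convention)."""
    coord = []
    for s in shape:
        coord.append(idx % s)
        idx //= s
    return tuple(coord)

def apply(layout, idx):
    """Evaluate a flat layout at a linear index."""
    shape, stride = layout
    return sum(c * d for c, d in zip(idx_to_coord(idx, shape), stride))

a = ((4, 8), (8, 1))   # row-major 4×8
b = ((4,), (2,))       # picks every second index of a

# composition(a, b) evaluated pointwise: index i goes through b, then through a.
composed_offsets = [apply(a, apply(b, i)) for i in range(4)]   # [0, 16, 1, 17]
```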

### `blocked_product`

Compute the blocked outer product of two layouts:

```python theme={null}
Lb = enigma.blocked_product(a, b)
```
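As an illustration, the sketch below reproduces the blocked product of two rank-2 layouts as CuTe defines it: `a` is the block, `b` arranges copies of that block, and matching modes are grouped together. It assumes `a` is compact, and `blocked_product_2d` is a hypothetical helper, not Enigma's function.

```python
def blocked_product_2d(a, b):
    """Blocked product of two rank-2 layouts; assumes `a` is compact."""
    (a_shape, a_stride), (b_shape, b_stride) = a, b
    block = a_shape[0] * a_shape[1]   # offsets consumed by one copy of a
    shape = ((a_shape[0], b_shape[0]), (a_shape[1], b_shape[1]))
    stride = ((a_stride[0], block * b_stride[0]),
              (a_stride[1], block * b_stride[1]))
    return shape, stride

# A row-major 2×5 block repeated over a column-major 3×4 grid of blocks.
shape, stride = blocked_product_2d(((2, 5), (5, 1)), ((3, 4), (1, 3)))
```

The result has shape `((2, 3), (5, 4))` and stride `((5, 10), (1, 30))`: each block keeps its own strides, and moving to the next block skips a whole block of 10 offsets.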

### `recast_layout`

Rescale a layout for a different element bit width. For example, viewing a `float32` layout as a `float16` layout:

```python theme={null}
L_f16 = enigma.recast_layout(new_bits=16, old_bits=32, layout=L_f32)
```
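A rough sketch of the rescaling for the narrowing case, modeled on CuTe's behavior: the contiguous (stride-1) mode gains elements, and every other stride is scaled by the same factor. `recast_flat` is a hypothetical helper that handles flat layouts only.

```python
def recast_flat(shape, stride, new_bits, old_bits):
    """Rescale a flat layout to narrower elements (CuTe-style sketch).

    Assumes old_bits is a multiple of new_bits and one mode has stride 1.
    """
    assert old_bits % new_bits == 0, "sketch handles narrowing only"
    factor = old_bits // new_bits
    # The contiguous mode gains `factor` elements; every other mode steps further.
    new_shape = tuple(s * factor if d == 1 else s for s, d in zip(shape, stride))
    new_stride = tuple(d if d == 1 else d * factor for d in stride)
    return new_shape, new_stride

# A column-major 4×8 float32 layout viewed as float16: 8×8 with doubled column stride.
f16_shape, f16_stride = recast_flat((4, 8), (1, 4), new_bits=16, old_bits=32)
```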

## Thread-value layout

`make_layout_tv` is the central tiling primitive. It takes a thread layout and a value layout and returns:

1. A **tiler** describing the tile shape at each level
2. A **TV layout** mapping `(thread_id, value_id)` → tile coordinate

```python theme={null}
thr = enigma.make_ordered_layout((4, 64), order=(1, 0))   # 256 threads in 4×64
val = enigma.make_ordered_layout((4, 4), order=(1, 0))    # 4×4 values per thread

tiler_mn, tv_layout = enigma.make_layout_tv(thr, val)
```
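To make the two return values concrete, here is a plain-Python model of the mapping for the layouts above. It assumes CuTe's `make_layout_tv` semantics, in which each thread owns one contiguous value-shaped block of the tile; `tv_to_coord` is a hypothetical helper.

```python
THR_SHAPE = (4, 64)   # thread grid, row-major: tid = tm * 64 + tn
VAL_SHAPE = (4, 4)    # per-thread values, row-major: vid = vm * 4 + vn

# Tile extent per mode: threads times values along that mode.
tiler_mn = (THR_SHAPE[0] * VAL_SHAPE[0], THR_SHAPE[1] * VAL_SHAPE[1])

def tv_to_coord(tid, vid):
    """Map (thread_id, value_id) to an (m, n) coordinate inside the tile."""
    tm, tn = divmod(tid, THR_SHAPE[1])
    vm, vn = divmod(vid, VAL_SHAPE[1])
    # Assumed ownership: each thread holds one contiguous 4×4 block.
    return tm * VAL_SHAPE[0] + vm, tn * VAL_SHAPE[1] + vn

# Every (thread, value) pair lands on a distinct coordinate of the 16×256 tile.
coords = {tv_to_coord(t, v) for t in range(256) for v in range(16)}
```

With 256 threads and 16 values each, the 4096 pairs cover the 16×256 tile exactly once.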

### Using TV layouts in `@enigma.jit`

```python theme={null}
@enigma.jit
def tiled_kernel(mA, mB, mC):
    thr = enigma.make_ordered_layout((4, 64), order=(1, 0))
    val = enigma.make_ordered_layout((4, 4), order=(1, 0))
    tiler_mn, tv_layout = enigma.make_layout_tv(thr, val)

    # Partition tensor into blocks
    gA = enigma.tensor_zipped_divide(mA, tiler_mn)
    block_idx = ...

    # Slice one block
    blkA = gA[((None, None), block_idx)]

    # Per-thread fragment
    thrA = enigma.tensor_composition(blkA, tv_layout, tiler_mn)[(thread_idx, None)]
```

### Tensor operations

| Function                                       | Description                          |
| ---------------------------------------------- | ------------------------------------ |
| `enigma.tensor_zipped_divide(tensor, tiler)`   | Partition tensor into tiles          |
| `enigma.tensor_composition(tensor, tv, tiler)` | Map thread-value indices to tensor   |
| `tensor.load()`                                | Vectorized load of a thread fragment |
| `tensor.store(value)`                          | Vectorized store                     |

## Tiling workflow summary

1. Define thread and value layouts with `make_ordered_layout`
2. Call `make_layout_tv` to get tiler and TV layout
3. In `@enigma.jit`, use `tensor_zipped_divide` to partition the global tensor
4. Slice per-block with `blk = gtensor[((None, None), block_idx)]`
5. Compose with TV layout to get per-thread fragment
6. Use `.load()` and `.store()` on the fragment
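
The steps above can be simulated in plain Python on a small elementwise add. This sketch uses deliberately small, hypothetical sizes and list indexing in place of Enigma tensors. It only models which elements each thread touches, under the same contiguous-block assumption as CuTe's thread-value layouts.

```python
THR = (2, 4)    # thread grid (small sizes chosen for illustration)
VAL = (2, 2)    # values per thread
TILE = (THR[0] * VAL[0], THR[1] * VAL[1])   # steps 1-2: tile shape, 4×8

def fragment_coords(block, tid):
    """Steps 3-5: global (row, col) pairs owned by one thread of one block."""
    bm, bn = block
    tm, tn = divmod(tid, THR[1])
    return [(bm * TILE[0] + tm * VAL[0] + vm, bn * TILE[1] + tn * VAL[1] + vn)
            for vm in range(VAL[0]) for vn in range(VAL[1])]

M, N = 8, 16
A = [[r * N + c for c in range(N)] for r in range(M)]
B = [[1] * N for _ in range(M)]
C = [[None] * N for _ in range(M)]

for bm in range(M // TILE[0]):
    for bn in range(N // TILE[1]):
        for tid in range(THR[0] * THR[1]):
            coords = fragment_coords((bm, bn), tid)
            frag_a = [A[r][c] for r, c in coords]   # step 6: load
            frag_b = [B[r][c] for r, c in coords]
            for (r, c), x, y in zip(coords, frag_a, frag_b):
                C[r][c] = x + y                      # step 6: store
```

On a GPU the three loops run in parallel across blocks and threads; the sequential version is only meant to show that the partitioning covers every element once.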

## Debugging layouts

When a tiling produces unexpected output, print shape and stride at each step:

```python theme={null}
print(tiler_mn)      # Layout((tile_m, tile_n), ...)
print(tv_layout)     # Layout(((thr_m, thr_n), (val_m, val_n)), ...)
print(enigma.size(tv_layout, mode=[0]))   # number of threads
print(enigma.size(tv_layout, mode=[1]))   # values per thread
```

**Invariant:** If any tiler dimension exceeds the corresponding tensor dimension, Enigma raises an `EnigmaError` rather than silently producing an invalid layout.
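
The same check can be run ahead of time on the host when debugging shapes. The sketch below is plain Python; `TilerError` and `check_tiler` are hypothetical stand-ins, and Enigma's own error type and message may differ.

```python
class TilerError(ValueError):
    """Hypothetical stand-in for EnigmaError in this sketch."""

def check_tiler(tensor_shape, tiler):
    """Reject a tiler that is larger than the tensor in any dimension."""
    for dim, (extent, tile) in enumerate(zip(tensor_shape, tiler)):
        if tile > extent:
            raise TilerError(
                f"tiler dim {dim} is {tile}, but the tensor extent is only {extent}")

check_tiler((1024, 1024), (16, 256))   # fits, returns silently

try:
    check_tiler((8, 1024), (16, 256))  # 16 > 8 in dimension 0
    rejected = False
except TilerError:
    rejected = True
```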
