The Context-Ready Transformer

Mahesh Godavarti
A Carrot, Inc
m@acarrot.com

Abstract

We introduce the context-ready transformer, a new recurrent neural network architecture built from a $D$ -layer transformer block that pre-contextualizes each token before it enters the block. During left-to-right generation, a correction network combines the previous position’s block output—a cached summary of past context—with the current token embedding, so the token enters the block already contextualized rather than as a raw embedding. At sequential inference, the correction chain makes the architecture a recurrent neural network. For training, we unroll the correction process $K$ times over the full sequence, processing all positions in parallel at each step. A pretrained transformer can also be converted to a context-ready model by adding a zero-initialized correction FFN and fine-tuning. We evaluate across widths, depths, block sizes, and two datasets, with all comparisons against standard transformers, variants, and ablations. A $D{=}5$ model beats a 12-layer transformer while generating $1.7\times$ faster on an A100. With $K{=}10$ , a single-layer model ( $D{=}1$ ) beats a 6-layer transformer with a $2.6\times$ inference speedup, and sequential inference matches parallel $K{=}10$ to within 0.01 PPL. The architecture benefits most from wide representations and long contexts. On a pointer-chasing task, $D{=}1$ trained with BPTT solves all 10 composition levels, while standard transformers exhibit staircase-like depth dependence.

1 Introduction

In autoregressive generation, a standard $N$-layer transformer predicts the next token, assigns it a context-free base embedding, and devotes part or all of its $N$ layers to re-contextualizing it. This round-trip from context to token ID and back to context is an artifact of the architecture, not of the problem.

Context-ready inference. The context-ready transformer shortens this round-trip. Consider sequential left-to-right generation: as the model processes token $t$ , the block output $z_{t}$ encodes the context of tokens $0,\ldots,t$ . When token $t{+}1$ arrives with embedding $e_{t+1}$ , a correction network computes a correction from $z_{t}$ and $e_{t+1}$ . The token enters the block as $e_{t+1}$ plus this correction—carrying contextual information from all preceding tokens—rather than as a raw embedding. Fewer layers are needed to fully contextualize the token, because the correction has already done part of the work.

The training challenge. This sequential process is exact at inference but inherently serial: each position’s correction depends on the previous position’s fully computed block output. A classical RNN faces the same issue and solves it with backpropagation through time (BPTT), which unrolls the recurrence for all $T$ positions—making the sequential training depth scale with sequence length.

We take a different approach. We approximate the sequential process by unrolling it $K$ times over the full sequence, where $K$ is a small constant independent of $T$ . All $T$ positions are still processed in parallel—just as in a standard transformer—but the correction is refined over $K$ unrolling steps rather than $T$ . Starting from raw embeddings at all positions, we run a shared-weight block and compute the correction at each position $t$ from the block output at positions $0,\ldots,t{-}1$ . We then update all embeddings with these corrections and repeat. Each unrolling step refines the corrections using increasingly contextualized inputs. Crucially, $K$ controls the depth of the computation graph—not $T$ —so a classical RNN requires $O(T)$ -deep unrolling while the context-ready transformer requires only $O(K)$ . In practice, $K=5$ suffices for convergence at the depths tested ( $D\geq 5$ ; see Table 6).

How $K$ unrolling steps at training lead to zero iterations at inference. Two design choices make this possible.

• Non-cumulative correction. Each iteration computes a correction from scratch rather than building on the previous one. The iteration takes the form $x^{(k)}=e+F_{\theta}(x^{(k-1)},e)$: the base embedding plus a correction that depends on both the previous iterate’s block output and the token embedding. Unlike a residual network, which computes $x^{(K)}=e+f(x^{(0)})+\cdots+f(x^{(K-1)})$ (a sum of $K$ corrections), our formulation yields $x^{(K)}=e+F_{\theta}(x^{(K-1)},e)$: a single correction. Previous iterations serve only to bring $x^{(K-1)}$ close to the fixed point; once it has converged, additional iterations produce the same output.

• Past-only contextualization. The correction at position $t$ depends on two quantities: the block output $z_{t-1}$, which encodes the context of tokens $0,\ldots,t{-}1$, and the token embedding $e_{t}$. Since $z_{t-1}$ is already cached from processing the previous token, the correction can be computed as soon as token $t$ arrives.

Together, these two properties mean that sequential generation naturally produces the converged correction for each new token without iteration (Section 4).
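The non-cumulative property can be illustrated with a scalar toy. This is a minimal sketch, not the paper's model: `toy_F`, its coefficients, and the input value are hypothetical stand-ins for $F_{\theta}$ and an embedding. It contrasts the update $x \leftarrow e + F(x, e)$ with a residual-style update $x \leftarrow x + F(x, e)$.

```python
def toy_F(x, e):
    """Hypothetical contractive stand-in for F_theta(x, e)."""
    return 0.5 * x + 0.25 * e


def non_cumulative(e, steps):
    """x^(k) = e + F(x^(k-1), e): the correction is recomputed from scratch."""
    x = e  # x^(0) is the raw embedding
    for _ in range(steps):
        x = e + toy_F(x, e)
    return x


def cumulative(e, steps):
    """Residual-style update: successive corrections are summed."""
    x = e
    for _ in range(steps):
        x = x + toy_F(x, e)
    return x


e = 1.0
# For this toy map the fixed point solves x = e + 0.5*x + 0.25*e, i.e. x = 2.5*e.
fixed_point = 2.5 * e
converged = non_cumulative(e, 60)
one_more = non_cumulative(e, 61)
```

In this toy, the non-cumulative iterate settles at the fixed point and a further step leaves it unchanged, whereas the cumulative variant keeps growing with the number of steps.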

A new kind of RNN. At sequential inference, the correction at position $t$ depends on $z_{t-1}$, which depends on $z_{t-2}$, and so on, creating a recurrent computation that unfolds over all $T$ positions. The combination of non-cumulative and past-only structure is what enables efficient training: it converts the sequential recurrence into a fixed-point problem, so instead of exact $T$-step BPTT, we can train by unrolling $K$ steps over the full sequence in parallel ($K\ll T$). This is not equivalent to BPTT, because positions beyond the first $K$ receive approximate rather than exact gradients, but in practice $K=5$ suffices at the depths tested ($D\geq 5$).

Experimental evidence. We evaluate across widths $C\in\{50,\ldots,2048\}$ , depths $D\in\{1,\ldots,23\}$ , block sizes $\{64,256,512,1024\}$ , two datasets (OpenWebText, Wikipedia), and a synthetic reasoning task (Section 5). The most impactful findings: $D{=}5$ at $C{=}1120$ beats a 12-layer transformer ( $C{=}768$ ), halving inference depth and generating $1.7\times$ faster on an A100. With $K{=}10$ , $D{=}1$ at $C{=}2048$ beats a 6-layer transformer ( $C{=}1088$ ) with a $2.6\times$ inference speedup. Sequential inference matches $K{=}10$ training to within 0.01 PPL, confirming that streaming exactness holds in practice. On pointer chasing, $D{=}1$ trained with BPTT solves all 10 composition levels while standard transformers exhibit staircase-like depth dependence. The architecture benefits most from wide representations and long contexts, and requires a dedicated correction network to be effective (Section 5.8). Any pretrained transformer can be converted by adding a zero-initialized correction FFN and fine-tuning.

2 Related Work

Weight-shared and depth-adaptive architectures. ALBERT (Lan et al., 2020), Universal Transformers (Dehghani et al., 2019) with ACT (Graves, 2016), Deep Equilibrium Models (Bai et al., 2019), and Huginn (Geiping et al., 2025) share weights across layers or iterations and iteratively refine hidden states, but do not use a dedicated past-output/token-aware pre-block correction of the kind proposed here. We compare against standard transformers as the representative baseline for how tokens enter the block. Weight sharing in the context-ready transformer arises naturally from unrolling the sequential correction process into training iterations that process all positions in parallel, not as a design choice for parameter efficiency.

Early exit and layer skipping. LayerSkip (Elhoushi et al., 2024), ADEPT (Yoo et al., 2026), and PonderNet (Banino et al., 2021) reduce average layer count but require learned stopping mechanisms. Context-ready uses a fixed depth with no stopping criterion.

Lookahead Decoding (Fu et al., 2024) and CLLMs (Kou et al., 2024) apply Jacobi iteration to standard transformers as a decoding strategy. Bai et al. (2021) accelerate DEQ inference by parallelizing fixed-point solves via Jacobi-style updates. Context-ready is an architectural change: the $K$ -step unrolling can be viewed as Jacobi iterations on the correction fixed-point equation, but the non-cumulative past-only structure guarantees that a single left-to-right streaming pass recovers the exact correction without any iteration.

Subquadratic and recurrent alternatives. Mamba (Gu and Dao, 2024), RWKV (Peng et al., 2023), Griffin (De et al., 2024), and xLSTM (Beck et al., 2024) replace causal self-attention with compressed recurrent state to achieve linear-time inference. The context-ready transformer solves a different problem: it retains full causal self-attention inside the block and instead changes how tokens enter it. The two approaches are complementary, not competing—one could in principle apply pre-block correction to any of these architectures—and standard transformers remain the natural baseline for evaluating the correction mechanism.

Computational complexity of transformers. Log-precision fixed-depth transformers are confined to $\mathrm{TC}^{0}$ (Merrill and Sabharwal, 2023): they cannot solve problems requiring unbounded sequential composition, regardless of width. RNNs with arbitrary precision escape this limitation (Siegelmann and Sontag, 1995a; Siegelmann and Sontag, 1995b), but classical RNNs require gradients to flow through all $T$ time steps (BPTT), making them difficult to train on long sequences. The context-ready transformer at $D=1$ is recurrent at inference, but is trained by unrolling the correction $K$ times rather than exact $T$ -step BPTT—inheriting recurrent structure at inference while retaining transformer-style parallel training.

3 Method

3.1 Architecture

We first describe the architecture during sequential left-to-right generation—the setting where context-ready inference is exact. Let $T$ denote the sequence length, $C$ the embedding dimension, and $V$ the vocabulary size. The core component is a $D$ -block unit: $D$ transformer layers (each consisting of causal self-attention and a feed-forward network with residual connections), described in detail below. Processing token $t$ requires the outputs $z_{0},\ldots,z_{t-1}\in\mathbb{R}^{C}$ of this block unit from all preceding tokens. Let $e_{t}\in\mathbb{R}^{C}$ denote the token embedding at position $t$ , and define $z_{-1}=\mathbf{0}$ .

Correction. A dedicated correction FFN generates the correction for position $t$ from the cached block output $z_{t-1}$ and the current token embedding $e_{t}$ :

$$\texttt{correction}_{t}=\texttt{corr\_ffn}\bigl(\texttt{LN}(z_{t-1}+e_{t})\bigr) \tag{1}$$

The correction FFN is a feed-forward network (Linear( $C\to 4C$ ) $\to$ GELU $\to$ Linear( $4C\to C$ )) with its own weights, separate from the block’s FFN. The correction is token-aware: it depends on both $z_{t-1}$ (context of tokens $0,\ldots,t{-}1$ ) and $e_{t}$ (the current token embedding). Since both inputs are available when token $t$ arrives, the correction is causal—it depends on no future tokens.
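The following is a minimal pure-Python sketch of the correction FFN in Equation (1), written with plain lists so that it runs without any framework. The helper names, the toy weights, and the choice to omit the learned scale and shift of the layer norm are illustrative assumptions. Zero-initializing the output projection specifically is also an assumption: the text says only that the correction FFN is zero-initialized.

```python
import math


def layer_norm(x, eps=1e-5):
    """Layer norm without learned scale/shift (simplification for this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """Exact GELU via the error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(W, b, x):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def corr_ffn(z_prev, e, W1, b1, W2, b2):
    """correction_t = Linear(4C -> C)(GELU(Linear(C -> 4C)(LN(z_{t-1} + e_t))))."""
    u = layer_norm([z + x for z, x in zip(z_prev, e)])
    hidden = [gelu(v) for v in linear(W1, b1, u)]
    return linear(W2, b2, hidden)


C = 2                                   # toy width
W1 = [[0.1 * (i + j) for j in range(C)] for i in range(4 * C)]
b1 = [0.0] * (4 * C)
W2_zero = [[0.0] * (4 * C) for _ in range(C)]   # assumed zero-initialized output layer
b2 = [0.0] * C

z_prev, e = [0.3, -0.2], [1.0, 0.5]
correction = corr_ffn(z_prev, e, W1, b1, W2_zero, b2)
x_tilde = [x + c for x, c in zip(e, correction)]   # Equation (2)
```

With the output projection at zero, the correction vanishes and the block input equals the raw embedding, which is the sense in which a freshly converted model behaves like the original transformer.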

Contextualization. The new token enters the block with the correction added to its raw embedding:

$$\tilde{x}_{t}=e_{t}+\texttt{correction}_{t} \tag{2}$$

Block processing. A $D$ -block unit applies $D$ transformer blocks with separate weights and standard residual connections:

$$\begin{aligned}
h^{(0)} &= \tilde{x}_{t} \\
a^{(i)} &= h^{(i-1)}+\texttt{Attn}_{i}\bigl(\texttt{LN}^{a}_{i}(h^{(i-1)});\;\mathcal{K}_{i},\mathcal{V}_{i}\bigr),\quad i=1,\ldots,D \\
h^{(i)} &= a^{(i)}+\texttt{FFN}_{i}\bigl(\texttt{LN}^{f}_{i}(a^{(i)})\bigr) \\
z_{t} &= h^{(D)}
\end{aligned} \tag{3}$$

Each $\texttt{Attn}_{i}$ is causal self-attention with Rotary Position Embeddings (RoPE) (Su et al., 2024). Each $\texttt{FFN}_{i}$ : Linear( $C\to 4C$ ) $\to$ GELU $\to$ Linear( $4C\to C$ ). The parameter $D$ controls inference depth.

Prediction.

$$\texttt{logits}_{t}=W_{\text{head}}\cdot\texttt{LN}_{f}(z_{t}),\quad W_{\text{head}}\in\mathbb{R}^{V\times C} \tag{4}$$

After prediction, $z_{t}$ is cached and the KV caches $\mathcal{K}_{i},\mathcal{V}_{i}$ are updated for future tokens.

3.2 Parallel Training

Sequential inference is exact but inherently serial. For training, we unroll the correction process $K$ times, processing all $T$ positions in parallel at each step. Gradients flow through $K$ unrolling steps rather than $T$ time steps, making the architecture trainable like a transformer despite being recurrent at inference.

Given token embeddings $e=(e_{1},\ldots,e_{T})$, with $z_{0}^{(k)}=\mathbf{0}$ for all $k$ (the initial cache, written $z_{-1}$ in the 0-indexed notation of Section 3.1), initialize $\texttt{correction}^{(0)}=\mathbf{0}$. For $k=1,\ldots,K$:

$$\begin{aligned}
\tilde{x}_{t}^{(k-1)} &= e_{t}+\texttt{correction}_{t}^{(k-1)} && \text{(contextualize)} \\
h^{(0)} &= \tilde{x}_{t}^{(k-1)} \\
a^{(i)} &= h^{(i-1)}+\texttt{Attn}_{i}\bigl(\texttt{LN}^{a}_{i}(h^{(i-1)});\;\mathcal{K}_{i},\mathcal{V}_{i}\bigr),\quad i=1,\ldots,D \\
h^{(i)} &= a^{(i)}+\texttt{FFN}_{i}\bigl(\texttt{LN}^{f}_{i}(a^{(i)})\bigr) \\
z_{t}^{(k)} &= h^{(D)} && \text{(block output)} \\
\texttt{correction}_{t}^{(k)} &= \texttt{corr\_ffn}\bigl(\texttt{LN}(z_{t-1}^{(k)}+e_{t})\bigr),\quad t=1,\ldots,T
\end{aligned} \tag{5}$$

The $D$ transformer blocks share weights across iterations $k$ but have separate weights across layers $i$ .

Non-cumulative correction. Each iteration replaces the previous correction entirely: $\tilde{x}^{(k)}=e+\texttt{correction}^{(k)}$ , not $\tilde{x}^{(k)}=\tilde{x}^{(k-1)}+f(\tilde{x}^{(k-1)})$ . Only the last correction matters.

Past-only correction. The correction at position $t$ uses $z_{t-1}^{(k)}$. Corrections therefore propagate left to right: the correction at the first position is exact after one iteration, at the second position after two, and so on.
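The left-to-right propagation can be checked on a scalar toy. The sketch below is an illustration under stated assumptions, not the paper's implementation: `block` is a hypothetical causal map (a running mean followed by tanh, standing in for the $D$-block unit), `corr` is a hypothetical stand-in for the correction FFN, and positions are 0-indexed in the code. It compares the corrections from a streaming pass with those from $K$ parallel unrolling steps.

```python
import math


def block(x_tilde, t):
    """Toy causal block: the output at t depends only on inputs 0..t."""
    return math.tanh(sum(x_tilde[: t + 1]) / (t + 1))


def corr(z_prev, e_t):
    """Toy stand-in for corr_ffn(LN(z_{t-1} + e_t))."""
    return 0.5 * math.tanh(z_prev + e_t)


def streaming_corrections(e):
    """Sequential pass: each correction uses the cached previous block output."""
    x_tilde, out, z_prev = [], [], 0.0
    for t, e_t in enumerate(e):
        c = corr(z_prev, e_t)
        out.append(c)
        x_tilde.append(e_t + c)
        z_prev = block(x_tilde, t)
    return out


def unrolled_corrections(e, K):
    """K parallel unrolling steps, each replacing the previous correction."""
    T = len(e)
    c = [0.0] * T                                   # correction^(0) = 0
    for _ in range(K):
        x_tilde = [e_t + c_t for e_t, c_t in zip(e, c)]
        z = [block(x_tilde, t) for t in range(T)]   # all positions in parallel
        c = [corr(z[t - 1] if t > 0 else 0.0, e[t]) for t in range(T)]
    return c


e = [0.3, -1.2, 0.8, 0.5, -0.4, 1.1]
exact = streaming_corrections(e)
prefix_ok = {
    K: all(abs(a - b) < 1e-12
           for a, b in zip(unrolled_corrections(e, K)[:K], exact[:K]))
    for K in (1, 3, 6)
}
```

In this toy, the first $K$ corrections after $K$ unrolling steps agree with the streaming pass, and with $K$ equal to the sequence length the two coincide everywhere.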

Random-depth training ( $k_{\min}$ ). We sample $K\sim\text{Uniform}(k_{\min},K_{\max})$ each batch with $k_{\min}=2$ , forcing the model to produce good predictions at any depth, which empirically encourages contraction.
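As a small sketch of the sampling step, the snippet below draws one unrolling depth per batch. Treating the uniform distribution as discrete and inclusive of both endpoints is an assumption, as are the function name and the default value `k_max=5`.

```python
import random


def sample_training_depth(k_min=2, k_max=5, rng=random):
    """Draw the number of unrolling steps K for one batch (inclusive integer range assumed)."""
    return rng.randint(k_min, k_max)


rng = random.Random(0)          # seeded only to make the illustration repeatable
depths = [sample_training_depth(2, 5, rng) for _ in range(1000)]
```

Each batch would then run the unrolled forward pass for the sampled number of steps and compute the loss on the final step only.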

Loss and dropout. The training loss (cross-entropy) is computed on the logits from the final iteration $K$ only. Dropout masks are resampled independently at each unrolling step $k$ .

3.3 Streaming Inference

Algorithm 1 Context-Ready Streaming Inference

Require: Blocks $\texttt{Attn}_{1},\texttt{FFN}_{1},\ldots,\texttt{Attn}_{D},\texttt{FFN}_{D}$; correction FFN; KV caches $\mathcal{C}_{1},\ldots,\mathcal{C}_{D}$; previous block output $z_{\text{prev}}$ (init. $\mathbf{0}$)
1: for each new token with embedding $e$ do
2:     $\texttt{correction}\leftarrow\texttt{corr\_ffn}\bigl(\texttt{LN}(z_{\text{prev}}+e)\bigr)$   {from past context + token identity}
3:     $h\leftarrow e+\texttt{correction}$   {contextualized input}
4:     for $i=1,\ldots,D$ do
5:         $h\leftarrow h+\texttt{Attn}_{i}(\texttt{LN}^{a}_{i}(h);\;\mathcal{C}_{i})$; update $\mathcal{C}_{i}$
6:         $h\leftarrow h+\texttt{FFN}_{i}(\texttt{LN}^{f}_{i}(h))$
7:     end for
8:     $z_{\text{prev}}\leftarrow h$   {cache for next token’s correction}
9:     $\texttt{logits}\leftarrow W_{\text{head}}\cdot\texttt{LN}_{f}(h)$
10: end for

When a new token arrives with embedding $e_{t}$ , the model computes the correction from the cached $z_{t-1}$ and $e_{t}$ , passes the corrected embedding through the $D$ -block unit, caches the output $z_{t}$ , and predicts. This is one forward pass—no iteration over $K$ steps—regardless of the training depth $K$ .
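The streaming loop can be sketched in pure Python as follows. This is a toy under explicit assumptions rather than the paper's code: the class name `StreamingModel` is hypothetical, attention is single-head with identity query/key/value projections and no RoPE, layer norms have no learned parameters, and the correction and feed-forward maps are passed in as plain callables.

```python
import math


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def attend(q, keys, values):
    """Softmax attention over cached entries; causal because the cache holds only the past."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / total for j in range(len(q))]


class StreamingModel:
    def __init__(self, D, C, corr_fn, ffn_fn):
        self.caches = [([], []) for _ in range(D)]   # per-layer (keys, values)
        self.z_prev = [0.0] * C                      # previous block output, init. 0
        self.corr_fn, self.ffn_fn = corr_fn, ffn_fn

    def step(self, e):
        """One forward pass per token; no iteration over K."""
        corr = self.corr_fn(layer_norm([z + x for z, x in zip(self.z_prev, e)]))
        h = [x + c for x, c in zip(e, corr)]         # contextualized input
        for keys, values in self.caches:
            u = layer_norm(h)
            keys.append(u)                           # toy: identity K/V projections
            values.append(u)
            h = [x + y for x, y in zip(h, attend(u, keys, values))]
            h = [x + y for x, y in zip(h, self.ffn_fn(layer_norm(h)))]
        self.z_prev = h                              # cache for the next token's correction
        return h                                     # caller applies LN_f and W_head


model = StreamingModel(
    D=2, C=3,
    corr_fn=lambda u: [0.1 * v for v in u],          # hypothetical correction map
    ffn_fn=lambda u: [math.tanh(v) for v in u],      # hypothetical feed-forward map
)
tokens = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0], [0.0, 2.0, 1.0]]
outputs = [model.step(e) for e in tokens]
```

Each call to `step` appends exactly one entry to every layer's cache and overwrites `z_prev`, mirroring the single pass per token described above.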

Why inference needs no iteration. During training, $K$ unrolling steps refine the corrections. At inference, this iteration is unnecessary: since earlier positions are already computed and cached, the correction for token $t$ is fully determined by $z_{t-1}$ and $e_{t}$ in a single pass (Theorem 2). The training approximation matters only during training: the first $K$ positions converge exactly after $K$ steps (Lemma 1 in Appendix A.2), and for later positions, the approximation error shrinks geometrically with $K$ when the correction operator is contractive (Theorem 3).

4 Theoretical Analysis

Full formal statements and proofs are in Appendix A.

Theorem 1 (Structural characterization).

Why non-cumulative and past-only? Under Assumptions I–II (Appendix A.1), if a weight-shared architecture unrolls a shared block $K$ times during training and applies it once per token during streaming, then for the unrolled training to converge to the same output that streaming produces, the correction must be non-cumulative ( $\tilde{x}=e+\texttt{correction}$ , not a sum of successive increments) and past-only (the correction at position $t$ depends only on $e_{t}$ and corrections from positions $1,\ldots,t{-}1$ ). The resulting system has a unique fixed point, and streaming computes it exactly.

Full proof in Appendix A.1.

Theorem 2 (Exact streaming fixed point).

Why is inference exact without iteration? During sequential generation, the correction at position $t$ depends only on $z_{0},\ldots,z_{t-1}$ , which are already computed and cached. By prefix consistency (appending tokens does not change the operator at earlier positions, which holds by causal masking), the correction is exact in a single pass.

Full proof in Appendix A.2.

Theorem 3 (Training convergence).

How fast does the training iteration converge? If the correction operator $G$ is $L$ -Lipschitz with $L<1$ , then $K$ unrolling steps reduce the error to the fixed point by a factor of $L^{K}$ . This governs the training approximation: positions beyond the first $K$ receive approximate corrections, and the approximation improves geometrically with $K$ .
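To give a feel for the geometric rate, the sketch below evaluates the factor $L^{K}$ and the smallest $K$ that reaches a target reduction. The function names and the value $L=0.5$ are illustrative assumptions; $L$ here is the global Lipschitz constant of the theorem, not a measured quantity.

```python
import math


def residual_factor(L, K):
    """Factor by which K unrolling steps shrink the distance to the fixed point."""
    return L ** K


def steps_for_tolerance(L, tol):
    """Smallest K with L**K <= tol, valid for 0 < L < 1 and 0 < tol < 1."""
    if not (0.0 < L < 1.0 and 0.0 < tol < 1.0):
        raise ValueError("requires 0 < L < 1 and 0 < tol < 1")
    return math.ceil(math.log(tol) / math.log(L))


factor_at_5 = residual_factor(0.5, 5)        # 0.5**5 = 0.03125
steps_needed = steps_for_tolerance(0.5, 0.01)
```

With the hypothetical $L=0.5$, five steps leave about 3% of the initial error, and seven steps are needed to get below 1%.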

Full formal statement in Appendix A.3.

Proposition 1 (Depth separation).

Why can the context-ready architecture use fewer layers? Under a stylized state-tracking abstraction (Appendix A.4):

(a) Context-ready propagation is handled by the correction chain. The context-ready architecture needs only $D$ layers for the per-token map; propagation across the sequence is handled by the recurrent correction chain rather than extra transformer layers.

(b) Standard transformers need depth for propagation. With attention window $W$, a standard transformer needs at least $\lceil T/W\rceil$ layers just for information from the earliest tokens to reach position $T$, on top of the layers needed for the per-token map.

Full statement and proof in Appendix A.4. When propagation and local computation cannot be interleaved (as in pointer chasing), the standard transformer needs at least $\lceil T/W\rceil$ additional layers; in general, some layers may serve both roles.
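The propagation bound in (b) is simple arithmetic, sketched below. The function name and the sequence length `T = 380` are hypothetical choices for illustration; only the window size 38 comes from the pointer-chasing setup.

```python
def min_propagation_layers(T, W):
    """ceil(T / W): layers needed for information to travel T positions with window W."""
    if T <= 0 or W <= 0:
        raise ValueError("T and W must be positive")
    return -(-T // W)   # integer ceiling division


layers_needed = min_propagation_layers(380, 38)      # hypothetical T, window from Section 5.6
layers_one_longer = min_propagation_layers(381, 38)
```

Under this stylized count, one extra position beyond a multiple of the window costs a full extra layer, whereas the context-ready correction chain carries the information forward without adding depth.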

5 Experiments

5.1 Setup

Data. OpenWebText (Gokaslan and Cohen, 2019) with byte-pair encoding (BPE, vocabulary 32,000). Context lengths: 64, 256, 512, and 1024 depending on the experiment. Ablations use English Wikipedia (BPE 16k); additional Wikipedia results in Appendix C.8. All results are validation perplexity (PPL) on held-out splits.

Training. AdamW optimizer (Loshchilov and Hutter, 2019), gradient clipping at 1.0. Learning rate $2\times 10^{-4}$ unless noted. Training depth $K=5$ with $k_{\min}=2$ by default; $K=10$ where noted. Dropout 0.2, FFN expansion factor 4. Full hyperparameters in Appendix C.

FLOP accounting. (Following standard convention in the literature, we count multiply-accumulate operations and label them FLOPs; actual floating-point operations are roughly $2\times$ this count.) We report total FLOPs per token, including the transformer blocks, correction FFN, and prediction head ($W_{\text{head}}\in\mathbb{R}^{V\times C}$, costing $VC$ FLOPs/token). Each transformer block costs $12C^{2}$ FLOPs ($4C^{2}$ for attention projections, $8C^{2}$ for the FFN). A context-ready model with $D$ blocks costs $D\times 12C^{2}+8C^{2}+VC$; a standard $N$-layer transformer costs $N\times 12C^{2}+VC$. Same-width comparisons (Tables 2, 3) share the same prediction head and are FLOP-matched. Cross-width comparisons (Table 1) explore depth-width tradeoffs: wider models pay a larger prediction head, so total FLOPs differ. Despite higher total FLOPs, the wider, shallower context-ready models deliver lower wall-clock inference time because fewer sequential layers dominate latency on modern GPUs (Section 5.4).
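The two cost formulas can be written out directly. In this sketch the function names are hypothetical, and $V=32{,}000$ is taken from the OpenWebText tokenizer described in the setup.

```python
def context_ready_flops(D, C, V):
    """D blocks at 12*C^2 each, plus the 8*C^2 correction FFN and the V*C prediction head."""
    return D * 12 * C**2 + 8 * C**2 + V * C


def transformer_flops(N, C, V):
    """N blocks at 12*C^2 each, plus the V*C prediction head."""
    return N * 12 * C**2 + V * C


V = 32_000
flops_millions = {
    "D=5, C=1120": context_ready_flops(5, 1120, V) / 1e6,
    "D=1, C=2048": context_ready_flops(1, 2048, V) / 1e6,
    "N=6, C=1088": transformer_flops(6, 1088, V) / 1e6,
    "N=12, C=768": transformer_flops(12, 768, V) / 1e6,
}
```

Rounded to the nearest million, these four configurations come out at 121M, 149M, 120M, and 110M FLOPs per token respectively, which matches the figures quoted for them in the experiments.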

Baselines. Standard transformers with separate weights per layer and RoPE attention (Su et al., 2024). All results are single runs. The breadth of the evaluation—across widths ( $C{=}50$ – $2048$ ), depths ( $D{=}1$ – $23$ ), block sizes (64–1024), two datasets, a synthetic task, and multiple training strategies—provides stronger evidence than multi-seed runs on a single configuration: the context-ready architecture wins consistently across all these axes.

5.2 Cross-Width Results

Table 1: Context-ready vs. standard transformers on OpenWebText at two compute scales. FLOPs/tok includes the prediction head ($VC$). Despite higher total FLOPs, the wider context-ready models provide lower wall-clock inference time (Section 5.4). All context-ready models use add variant, $K{=}5$, $k_{\min}{=}2$.

Model	FLOPs/tok	$C/N$	Val PPL	$\Delta$
D=5 $C{=}1120$	121M	224	36.38
D=6 $C{=}1024$	117M	171	36.56
Roformer $N{=}6$ $C{=}1088$	120M	181	37.76	$-$ 1.38
Roformer $N{=}12$ $C{=}768$	110M	64	37.83	$-$ 1.45
Roformer $N{=}2$ $C{=}1888$	146M	944	42.99
Roformer $N{=}24$ $C{=}1088$	376M	45	28.68
Roformer $N{=}12$ $C{=}1536$	389M	128	29.01
D=6 $C{=}2048$	401M	341	29.04	$+$ 0.03
Roformer $N{=}6$ $C{=}2176$	411M	363	30.35

Table 1 compares context-ready models against standard transformers across depth-width tradeoffs at two compute scales. At the smaller scale (block size 256, 100K iterations), context-ready $D{=}5$ at $C{=}1120$ achieves 36.38 PPL, beating both roformer $N{=}6$ at $C{=}1088$ (37.76, $\Delta=-1.38$ ) and roformer $N{=}12$ at $C{=}768$ (37.83, $\Delta=-1.45$ ). At the larger scale (block size 256, 200K iterations), context-ready $D{=}6$ at $C{=}2048$ (29.04) matches roformer $N{=}12$ at $C{=}1536$ (29.01), despite using a much higher width-to-depth ratio. Depth has diminishing returns: going from $N{=}12$ to $N{=}24$ gains only 0.33 PPL.

5.3 Correction Efficiency

Table 2: Correction efficiency at $C=1024$, block size 256, 200K iterations (OWT). Same width throughout, so the prediction head is identical across all rows and the comparison is FLOP-matched.

Model	Inference FLOPs	Val PPL
Roformer $N{=}12$	$144C^{2}$	33.41
Roformer $N{=}13$	$156C^{2}$	32.82
D=12 context-ready	$\mathbf{152C^{2}}$	32.28
Roformer $N{=}14$	$168C^{2}$	32.34
Roformer $N{=}24$	$288C^{2}$	29.42
D=23 context-ready	$\mathbf{284C^{2}}$	28.89

Table 2 isolates the value of the correction mechanism at fixed width ( $C=1024$ ), so the prediction head is identical across all rows and the comparison is FLOP-matched. $D=12$ context-ready ( $152C^{2}$ FLOPs) beats roformer $N{=}13$ ( $156C^{2}$ ) by 0.54 PPL and matches roformer $N{=}14$ ( $168C^{2}$ ). The correction FFN adds only $8C^{2}$ FLOPs yet provides a genuine PPL improvement at the same parameter budget. At deeper scale, $D=23$ ( $284C^{2}$ ) edges out $N=24$ ( $288C^{2}$ ) by 0.53 PPL—directionally consistent with the $D{=}12$ result, but a single-run margin that should be read as evidence of parity rather than robust superiority.

5.4 Width Scaling

Table 3: Width scaling: $D{=}2$ context-ready vs. roformer $N{=}2$ at the same width (block size 64, Chinchilla-matched token budgets, OWT). The relative advantage grows with width.

$C$	$N{=}2$ PPL	$D{=}2$ PPL	$\Delta$	Relative
256	158.83	143.57	$-$ 15.26	9.6%
512	95.48	84.69	$-$ 10.79	11.3%
1024	72.83	60.84	$-$ 11.99	16.5%

Table 3 tests whether the correction advantage is a small-scale artifact. Comparing $D{=}2$ vs. $N{=}2$ at the same width with Chinchilla-matched token budgets, the relative improvement grows from 9.6% at $C=256$ to 16.5% at $C=1024$ .

Token-matched results. At $C=1024$ with block size 64, $D{=}x$ beats $N{=}x$ at every depth tested once training progresses past a crossover point: $D{=}1$ beats $N{=}1$ by 34.4 PPL (crossover at ${\sim}424$ M tokens), $D{=}2$ by 7.6 PPL ( ${\sim}565$ M), $D{=}3$ by 3.0 PPL ( ${\sim}835$ M), and $D{=}6$ by 0.7 PPL ( ${\sim}1{,}032$ M). All gaps are still growing at the end of training. The correction mechanism provides a consistent advantage at every depth; the advantage is largest when depth is small, consistent with the correction doing the most work when the block has the fewest layers.

Inference latency. We measure autoregressive generation speed on an A100 over 10,000 tokens with KV caching. $D{=}1$ $C{=}2048$ (149M FLOPs/tok) generates at 919 tokens/s vs. 351 tokens/s for roformer $N{=}6$ $C{=}1088$ (120M FLOPs/tok)—a $2.6\times$ speedup despite higher total FLOPs. $D{=}5$ $C{=}1120$ (121M FLOPs/tok) generates at 349 tokens/s vs. 201 tokens/s for roformer $N{=}12$ $C{=}768$ (110M FLOPs/tok)—a $1.7\times$ speedup. The wider, shallower models are faster because fewer sequential layers dominate inference latency on modern GPUs, even when total FLOPs are higher. Per-token latency is flat across sequence length, confirming that KV caching amortizes attention cost to $O(T)$ per token. Full timing details in Appendix C.9.

KV cache savings. Fewer layers also reduce KV cache memory ( $C\times D$ per token). Despite being wider, $D{=}5$ at $C{=}1120$ uses $1.6\times$ less cache than $N{=}12$ at $C{=}768$ ; $D{=}1$ at $C{=}2048$ uses $3.2\times$ less than $N{=}6$ at $C{=}1088$ .
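The cache comparison follows from the $C\times D$ per-token count stated above, as the short sketch below shows. The function name is hypothetical, and the count is in relative units, ignoring any constant factor shared by both models.

```python
def kv_cache_units(C, num_layers):
    """Per-token KV cache size in relative units (width times number of layers)."""
    return C * num_layers


# Standard transformer cache divided by context-ready cache.
ratio_d5 = kv_cache_units(768, 12) / kv_cache_units(1120, 5)    # N=12, C=768 vs. D=5, C=1120
ratio_d1 = kv_cache_units(1088, 6) / kv_cache_units(2048, 1)    # N=6, C=1088 vs. D=1, C=2048
```

The two ratios evaluate to roughly 1.65 and 3.19, consistent with the $1.6\times$ and $3.2\times$ savings quoted in the text.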

5.5 Single-Layer Performance ( $D{=}1$ )

Proposition 1(a) predicts that $D{=}1$ may match deeper transformers when the task is dominated by context propagation and sufficient $K$ and width are provided. We test $D{=}1$ , $C{=}2048$ (149M FLOPs/tok) against roformer $N{=}6$ , $C{=}1088$ (120M FLOPs/tok) at block size 1024 using three training strategies: fixed $K{=}10$ , random-depth $K{=}10$ with $k_{\min}{=}2$ , and fine-tuning from a pretrained $N{=}1$ roformer (batch 16).

Table 4: Three training strategies for $D{=}1$ $C{=}2048$ (149M FLOPs/tok) vs. roformer $N{=}6$ $C{=}1088$ (120M FLOPs/tok), block size 1024, OWT. Fine-tune starts from $N{=}1$ $C{=}2048$ pretrained for 85K iterations; “total iters” includes pretraining.

Strategy	PPL at 100K	Best PPL	Notes
Roformer $N{=}6$ (baseline)	35.37	—
$K{=}10$ , fixed depth	34.40	33.63 (110K)	Beats $N{=}6$ at ${\sim}$ 65K
$K{=}10$ , $k_{\min}{=}2$	36.28	33.63 (135K)	+1.88 penalty at 100K
$K{=}10$ , fine-tuned	46.66$^{*}$	31.35 (215K)	Best final PPL
$^{*}$At 100K total (15K fine-tune); fine-tune reaches 35.92 at 150K total.

Table 4 compares the three strategies. In these $D{=}1$ experiments, larger training depth $K$ substantially improves the model’s ability to exploit the available recurrent depth. With fixed $K{=}10$ , $D{=}1$ surpasses $N{=}6$ at ${\sim}65$ K iterations and reaches 34.40 at 100K ( $\Delta=-0.97$ , still improving) at $2\times$ per-iteration cost.

Random-depth training ( $k_{\min}{=}2$ ) incurs a modest penalty of 1.88 PPL at 100K relative to fixed depth, but converges to the same quality ${\sim}25$ K later (both reach 33.63). The penalty yields $\hat{L}$ values consistent with contraction ( $\hat{L}\approx 0.55$ vs. $\hat{L}>1.0$ for fixed $K$ ).

Fine-tuning from a pretrained $N{=}1$ roformer ( $K{=}1$ ) with a zero-initialized correction FFN at $K{=}10$ reaches the best final PPL: 31.35 at 215K total iterations (the $N{=}6$ baseline is at 100K, so total compute differs).

The gap between $D{=}1$ and $N{=}6$ shrinks with block size ( $+1.19$ at 256, $+0.46$ at 512, $+0.31$ at 1024 with $K{=}5$ ), consistent with longer contexts providing more sequential steps. Full block-size scaling in Appendix C.5.

5.6 Pointer Chasing: Depth Separation

Table 5: Pointer chasing (10-hop, 5 keys, 10 values, windowed attention). $N$-layer transformers exhibit a staircase-like depth dependence: deeper models solve more levels. The context-ready $D{=}1$ model (BPTT) solves all levels.

Model	Levels solved	Iters
Roformer $N{=}1$	1 / 11	50K
Roformer $N{=}3$	3 / 11	50K
Roformer $N{=}5$	6 / 11	50K
Roformer $N{=}10$	7 / 11	50K
Roformer $N{=}11$	8 / 11	50K
Roformer $N{=}12$	11 / 11	50K
D=1 context-ready (BPTT)	11 / 11	${\sim}$ 16K

An $N$ -layer transformer can compose at most $N$ sequential reasoning steps in a single forward pass (Merrill and Sabharwal, 2023). To test whether the context-ready architecture can exceed this depth limit, we design a pointer-chasing task. The input contains a base table that maps keys to values, followed by $H{=}10$ levels of index tables, each of which maps new keys to keys at the previous level. Answering a query at level $\ell$ therefore requires chaining $\ell$ sequential lookups through the tables, and we use windowed causal attention (window $=38$ ) to prevent the model from bypassing these chains by attending directly to the base table. A full task specification with a worked example is given in Appendix B.

Table 5 shows a staircase-like depth dependence: deeper transformers solve more levels, while shallow transformers fail well before full depth. The context-ready $D{=}1$ model, trained with BPTT to exploit the full sequential depth of the recurrent correction chain, solves all 11 levels in ${\sim}16$ K iterations and scales to 20 hops (all 21 levels).

5.7 Fine-Tuning from Pretrained

Any standard transformer can be converted to context-ready by adding a zero-initialized correction FFN and fine-tuning. To isolate the effect of conversion from additional training, we compare against a same-iteration control: the original roformer trained for the same total number of iterations without conversion (per-iteration compute differs because fine-tuning runs the block $K$ times). At $C{=}1408$, $N{=}12$ reaches 29.92 PPL at 200K iterations; continued training to 400K yields 27.20. Converting at 200K and fine-tuning to 400K total iterations yields 26.14, an improvement of 1.06 PPL over the continued-training baseline at matched iterations. At $C{=}1024$, converting $N{=}24$ to $D{=}24$ improves PPL from 29.42 to 28.99 ($-0.43$) in 18K fine-tuning iterations. The zero-initialized correction ensures no disruption at conversion (the model is function-preserving). During fine-tuning, $D{=}12$ sees a transient $+0.11$ PPL increase before recovering, while $D{=}24$ shows no transient increase.

5.8 Ablations and Diagnostics

We compare against alternative architectures that attempt the same goal. Among the variants tested, the gain depends on a dedicated correction network with its own weights.

Convergence and sequential exactness. Table 6 shows the full depth progression. Convergence is geometric: at $D{=}1$ (block size 1024), $K{=}2$ closes 91% of the $K{=}1$ -to- $K{=}5$ gap and $K{=}3$ closes 98%, consistent with Theorem 3. Sequential $K{=}1$ matches parallel $K{=}10$ to within 0.01 PPL at every configuration, confirming Theorem 2. The correction contribution (“Corr.” column) grows as $D$ shrinks: 38–55 PPL at $D{=}1$ vs. 3.9 at $D{=}23$ .
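The gap-closure percentages can be recomputed from the $D{=}1$, block size 1024 row of Table 6 (84.13, 34.42, 30.39, and 29.51 PPL at $K{=}1,2,3,5$). The helper name below is hypothetical.

```python
def gap_closed(ppl_k1, ppl_k, ppl_k5):
    """Fraction of the K=1 to K=5 perplexity gap that is closed at depth K."""
    return (ppl_k1 - ppl_k) / (ppl_k1 - ppl_k5)


# D=1, block size 1024 row of Table 6.
k1, k2, k3, k5 = 84.13, 34.42, 30.39, 29.51
closed_at_2 = gap_closed(k1, k2, k5)
closed_at_3 = gap_closed(k1, k3, k5)
```

These evaluate to about 0.91 and 0.98, matching the 91% and 98% stated above.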

Table 6: PPL at parallel $K{=}1{,}2{,}3{,}5{,}10$ and sequential $K{=}1$ (OWT). $D{=}1$ $C{=}2048$; $D{=}5{,}12{,}23$ $C{=}1024$, block size 256. The “Corr.” column measures how much the correction mechanism contributes: the PPL difference between $K{=}1$ (no correction) and $K{=}5$ (converged).

$D$	bs	$K{=}1$	$K{=}2$	$K{=}3$	$K{=}5$	$K{=}10$	Seq	Corr.
1	256	70.80	36.21	33.23	32.70	32.79	32.80	38.1
1	512	73.22	34.33	31.24	30.61	30.68	30.69	42.6
1	1024	84.13	34.42	30.39	29.51	29.43	29.43	54.7
5	256	58.17	40.18	38.80	38.60	38.61	38.62	19.6
12	256	38.33	32.58	32.30	32.29	32.29	32.29	6.0
23	256	32.80	29.03	28.89	28.88	28.88	28.88	3.9

Contraction. With $k_{\min}=2$ , empirical $\hat{L}\in[0.50,0.72]$ (measured as $\hat{L}=\max_{t}\|c_{K,t}-c_{K-1,t}\|/\|c_{K-1,t}-c_{K-2,t}\|$ ); without $k_{\min}$ , $\hat{L}\in[0.88,1.20]$ . $\hat{L}$ is a trajectory-local diagnostic, not the global $L$ in Theorem 3.
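A sketch of how the diagnostic $\hat{L}$ could be computed from three consecutive sets of corrections is given below. The function names and the toy vectors are hypothetical, and skipping positions whose denominator is zero (already converged) is an assumption not stated in the text.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def diff(a, b):
    return [x - y for x, y in zip(a, b)]


def contraction_estimate(c_km2, c_km1, c_k):
    """max over positions t of ||c_{K,t} - c_{K-1,t}|| / ||c_{K-1,t} - c_{K-2,t}||.

    Each argument is a list of per-position correction vectors from one unrolling step.
    """
    ratios = []
    for a, b, c in zip(c_km2, c_km1, c_k):
        denom = norm(diff(b, a))
        if denom > 0.0:                 # assumption: skip positions that no longer move
            ratios.append(norm(diff(c, b)) / denom)
    return max(ratios)


# Two toy positions with successive corrections at steps K-2, K-1, K.
c_km2 = [[0.0, 0.0], [0.0, 0.0]]
c_km1 = [[1.0, 0.0], [0.0, 2.0]]
c_k = [[1.5, 0.0], [0.0, 3.2]]
L_hat = contraction_estimate(c_km2, c_km1, c_k)
```

In the toy data the per-position ratios are 0.5 and 0.6, so the estimate is about 0.6; values below 1 on a trajectory are what the text describes as consistent with contraction.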

Correction FFN is essential. Using the block’s own residual as the correction (“block_head”) gives no improvement (27.32 vs. 27.19 for a standard transformer). With a dedicated correction FFN, context-ready beats the FLOP-matched baseline by 1.8 PPL. Tying correction FFN weights to the block FFN collapses performance. The add variant ( $\texttt{LN}(z_{t-1}+e_{t})$ ) matches or beats token-blind at all depths.

Training efficiency. $K{=}5$ suffices at $D\geq 5$ ; $K{=}2$ with torch.compile achieves $1.7\times$ faster training at 1.07 PPL cost. Full ablation tables in Appendix C.2.

6 Conclusion

A standard transformer assigns each new token a context-free embedding and relies entirely on depth to contextualize it. The context-ready transformer shortcuts this process: a correction derived from the previous position’s block output pre-contextualizes the token before it enters the block. Two structural choices—non-cumulative correction and past-only contextualization—make this exact at streaming inference (Theorem 2) and trainable with $K$ unrolling steps (each processing all $T$ positions in parallel) rather than full BPTT.

A $D{=}5$ model beats a 12-layer transformer while generating $1.7\times$ faster; $D{=}1$ beats a 6-layer transformer with a $2.6\times$ speedup, and sequential inference matches parallel $K{=}10$ to within 0.01 PPL. The advantage grows with width and context length. Fewer layers also reduce KV cache memory. On pointer chasing, $D{=}1$ solves all composition levels that standard transformers need proportional depth to reach. Any pretrained transformer can be converted by adding a zero-initialized correction FFN and fine-tuning.

Limitations. Datasets and scale. Results are on OpenWebText, Wikipedia, and a synthetic task. We have validated at 110–150M and 375–410M FLOPs/token but not yet on standard benchmarks or at billion-parameter scale. Training cost. From-scratch training runs the block $K$ times per iteration rather than once. Backpropagation through $K$ steps also requires storing activations for all $K$ passes, scaling activation memory by $K\times$ . Several approaches can reduce this cost: (i) pretrain as a standard transformer at $K{=}1$ cost, then convert and fine-tune—in our experiments this matches or exceeds from-scratch context-ready training (Sections 5.5 and 5.7); (ii) random-depth training ( $k_{\min}$ ), which samples $K$ each batch and converges to the same quality as fixed $K$ (Section 5.5). Prefill. Processing a prompt of length $T$ in parallel requires $K$ unrolling steps, giving effective prefill depth $K\times D$ vs. $N$ for a standard transformer.

References

Bai et al. (2019) Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, 2019.
Bai et al. (2021) Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Accelerating feedforward computation via parallel nonlinear equation solving. In International Conference on Machine Learning, 2021.
Banino et al. (2021) Andrea Banino, Jan Balaguer, and Charles Blundell. PonderNet: Learning to ponder. In ICML Workshop on Uncertainty and Robustness in Deep Learning, 2021.
Beck et al. (2024) Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, et al. xLSTM: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024.
De et al. (2024) Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
Dehghani et al. (2019) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019.
Elhoushi et al. (2024) Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. LayerSkip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710, 2024.
Fu et al. (2024) Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of LLM inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024.
Geiping et al. (2025) Jonas Geiping, Tom Goldstein, Avi Schwarzschild, C. Bayan Bruss, et al. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
Gokaslan and Cohen (2019) Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
Graves (2016) Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
Gu and Dao (2024) Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of ICML 2024, 2024.
Kou et al. (2024) Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. CLLMs: Consistency large language models. arXiv preprint arXiv:2403.00835, 2024.
Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.
Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
Merrill and Sabharwal (2023) William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023.
Peng et al. (2023) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of EMNLP 2023, 2023.
Siegelmann and Sontag (1995a) Hava T. Siegelmann and Eduardo D. Sontag. Computational capabilities of recurrent NARX neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 26(4):535–544, 1995a.
Siegelmann and Sontag (1995b) Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995b.
Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Yoo et al. (2026) Seunghyun Yoo et al. ADEPT: Adaptive dynamic early-exit process for transformers. arXiv preprint arXiv:2601.03700, 2026.

NeurIPS Paper Checklist

1. Claims
Answer: [Yes]
Justification: The abstract and introduction state the architectural contributions and experimental results with specific numbers. Limitations (datasets, scale, training cost) are discussed explicitly in Section 6.
2. Limitations
Answer: [Yes]
Justification: Section 6 includes a dedicated Limitations paragraph covering dataset scope, scale, training cost, and prefill depth.
3. Theory assumptions and proofs
Answer: [Yes]
Justification: All theorems and propositions are numbered and cross-referenced. Assumptions are stated explicitly in the appendix proofs (Appendix A.1, A.2, A.3, A.4). Main-body statements reference the appendix for full proofs.
4. Experimental result reproducibility
Answer: [Yes]
Justification: The architecture is fully described in Section 3. Hyperparameters, optimizer settings, learning rate schedules, and training details are provided in Appendix C. All experiments use publicly available data (OpenWebText, Wikipedia).
5. Open access to data and code
Answer: [No]
Justification: Code is not released at submission time due to patent considerations. The architecture and training procedure are described in sufficient detail to reproduce.
6. Experimental setting/details
Answer: [Yes]
Justification: Section 5.1 describes data, tokenization, context lengths, optimizer, and baselines. Appendix C provides full hyperparameter tables.
7. Experiment statistical significance
Answer: [No]
Justification: All results are single runs. We acknowledge this explicitly and report trends across multiple configurations (widths, depths, block sizes) rather than relying on individual comparisons.
8. Experiments compute resources
Answer: [Yes]
Justification: Section 5.1 reports FLOPs per token and training iterations. Inference timing is measured on a single A100 GPU (Appendix C.9).
9. Code of ethics
Answer: [Yes]
Justification: The research conforms with the NeurIPS Code of Ethics. No human subjects, private data, or dual-use concerns.
10. Broader impacts
Answer: [N/A]
Justification: This is foundational architecture research. The work reduces inference cost for language models, which has broadly positive efficiency implications but no direct path to specific negative applications beyond those inherent to language modeling in general.
11. Safeguards
Answer: [N/A]
Justification: No pretrained language models or scraped datasets are released.
12. Licenses for existing assets
Answer: [Yes]
Justification: OpenWebText is cited [Gokaslan and Cohen, 2019]. Wikipedia is publicly available. All referenced works are cited.
13. New assets
Answer: [N/A]
Justification: No new datasets, models, or code are released with this submission.
14. Crowdsourcing and research with human subjects
Answer: [N/A]
Justification: No crowdsourcing or human subjects research.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
Answer: [N/A]
Justification: No human subjects research.
16. Declaration of LLM usage
Answer: [N/A]
Justification: LLMs were not used as a component of the core methodology.

Appendix A Full Proofs

Notation. Throughout the appendix, $e$ denotes the base token embeddings, $G$ denotes the full correction operator, and $c^{(k)}$ denotes the correction vector at iteration $k$ . The fixed point is $c^{*}=G(c^{*})$ .

A.1 Proof of Theorem 1 (Structural Characterization)

Formal statement.

Setup. Let $e_{t}\in\mathbb{R}^{d}$ denote the base embedding at position $t$ , let Block denote a $D$ -layer transformer block with causal attention, and let $F:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{d}$ be a correction function. The architecture processes position $t$ as follows: form the corrected input $\tilde{x}_{t}=e_{t}+F(e_{t},z_{t-1})$ , compute $z_{t}=\texttt{Block}(\tilde{x}_{t};\,\tilde{x}_{<t})$ , and cache $z_{t}$ for future positions.

Assumptions.

(I) Additive correction. The corrected input has the form $\tilde{x}_{t}=e_{t}+F(e_{t},z_{t-1})$, where $F$ is a continuous correction function that maps into a bounded ball $\overline{B}(0,B)\subset\mathbb{R}^{d}$.
(II) Single-pass streaming. During inference, tokens are processed left-to-right, one at a time. The correction at position $t$ uses the cached block output $z_{t-1}$ from the previous position and the current embedding $e_{t}$. By causality, $z_{t-1}$ depends only on positions $\leq t{-}1$.

Conclusions.

(a) Past-only. By streaming (Property II), the correction at position $t$ depends only on previously computed outputs: $F(e_{t},z_{t-1})$ where $z_{t-1}$ encodes positions $0,\ldots,t{-}1$.
(b) Non-cumulative. Among additive unrolling strategies, only the non-cumulative form $\tilde{x}_{t}^{(k)}=e_{t}+F(e_{t},z_{t-1}^{(k)})$ is compatible with streaming. The cumulative (resnet) alternative $\tilde{x}_{t}^{(k)}=\tilde{x}_{t}^{(k-1)}+F(\tilde{x}_{t}^{(k-1)},z_{t-1}^{(k)})$ fails because $F(\tilde{x}_{t}^{*},z_{t-1}^{*})=0$ at the fixed point.
(c) Existence and uniqueness. The non-cumulative past-only system has a unique fixed point, and streaming computes it exactly.

Proof.

Part (a): Past-only. Streaming (Property II) processes tokens left-to-right. When token $t$ arrives, the correction uses $z_{t-1}$ (cached from the previous token) and $e_{t}$ . Since causal attention ensures $z_{t-1}$ depends only on positions $\leq t{-}1$ , the correction at position $t$ is a function of past outputs only.

Part (b): Non-cumulative. Given the additive correction form (Property I), there are two ways to unroll $K$ training steps:

Non-cumulative: $\tilde{x}_{t}^{(k)}=e_{t}+F(e_{t},z_{t-1}^{(k)})$ . Each step recomputes the correction from the base embedding $e_{t}$ . At the fixed point, $F(e_{t},z_{t-1}^{*})=c_{t}^{*}$ , a nonzero correction. In streaming, past outputs are exact (from cache), so a single evaluation gives $\tilde{x}_{t}=e_{t}+F(e_{t},z_{t-1}^{*})=e_{t}+c_{t}^{*}=\tilde{x}_{t}^{*}$ . Exact.

Cumulative (resnet): $\tilde{x}_{t}^{(k)}=\tilde{x}_{t}^{(k-1)}+F(\tilde{x}_{t}^{(k-1)},z_{t-1}^{(k)})$ . Each step adds an increment to the previous output. At the fixed point, $\tilde{x}_{t}^{*}=\tilde{x}_{t}^{*}+F(\tilde{x}_{t}^{*},z_{t-1}^{*})$ , which forces $F(\tilde{x}_{t}^{*},z_{t-1}^{*})=0$ : the correction function learns to output zero at convergence. In streaming, the single step starts from $\tilde{x}_{t}^{(0)}=e_{t}$ and gives $\tilde{x}_{t}=e_{t}+F(e_{t},z_{t-1}^{*})$ . But $F$ was trained to vanish at $\tilde{x}_{t}^{*}$ , not at $e_{t}$ , so the result is neither zero nor the correct fixed point $\tilde{x}_{t}^{*}$ .

Therefore, among additive unrolling strategies, the non-cumulative form is the one that reproduces the correct fixed-point output during streaming.

Part (c): Existence and uniqueness. By part (a), the system is triangular: $\tilde{x}_{1}^{*}$ is determined first (from $z_{0}=\mathbf{0}$ and $e_{1}$ ), then $z_{1}^{*}$ , then $\tilde{x}_{2}^{*}$ , and so on. Each step is a deterministic evaluation, so the fixed point exists, is unique, and is constructively computable by left-to-right evaluation—exactly the streaming computation. ∎

A.2 Proof of Theorem 2 (Exact Streaming)

Formal statement. Let $c^{*(T)}$ be the unique fixed point for a sequence of length $T$ . Assume prefix consistency: for any $T^{\prime}>T$ and $t\leq T$ , the correction operator satisfies $G_{t}^{(T^{\prime})}=G_{t}^{(T)}$ , i.e., appending tokens beyond position $T$ does not change the operator at earlier positions. (This holds by construction for any causal architecture where the operator at position $t$ depends only on positions $\leq t$ .) Then:

(a) Prefix invariance. $c^{*(T^{\prime})}_{t}=c^{*(T)}_{t}$ for $t\leq T$ and any $T^{\prime}>T$.
(b) Exactness. If past corrections are at the fixed point, the streaming operator produces the exact fixed-point correction for the new token.
(c) No contraction needed. Exactness holds without any contraction assumption.

The following lemma establishes that the Jacobi iteration converges in finitely many steps, which is the foundation for the streaming exactness proof.

Lemma 1 (Finite-step exactness).

Let $G$ be a past-only correction operator. Then: (a) The fixed point $c^{*}$ exists and is unique, without contraction. (b) After $k$ Jacobi iterations from any $c^{(0)}$ : $c_{t}^{(k)}=c_{t}^{*}$ for all $t\leq k$ . (c) The iteration reaches the exact fixed point after at most $T$ steps: $c^{(T)}=c^{*}$ .

Proof.

(a) By construction: $G_{1}$ is a constant, so $c_{1}^{*}$ is determined. Given $c_{1}^{*},\ldots,c_{t-1}^{*}$ , we have $c_{t}^{*}=G_{t}(c_{1}^{*},\ldots,c_{t-1}^{*})$ uniquely.

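(b) By induction on $k$. Base: $c_{1}^{(1)}=G_{1}=c_{1}^{*}$. Step: assume $c_{s}^{(k)}=c_{s}^{*}$ for all $s\leq k$. For any $t\leq k+1$, $G_{t}$ reads only positions $s<t$, hence only positions $s\leq k$, so $c_{t}^{(k+1)}=G_{t}(c_{1}^{(k)},\ldots,c_{t-1}^{(k)})=G_{t}(c_{1}^{*},\ldots,c_{t-1}^{*})=c_{t}^{*}$.

(c) Apply (b) with $k=T$: every position $t\leq T$ satisfies $c_{t}^{(T)}=c_{t}^{*}$. ∎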
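
Lemma 1 can be checked numerically on a toy example. The sketch below uses a hypothetical scalar past-only operator `G` (a `tanh` of the mean of earlier corrections plus the current embedding); it stands in for the correction FFN composed with the block and is not the paper's operator. Positions are 0-indexed in the code, so after $k$ parallel (Jacobi) sweeps the first $k$ entries are exact.

```python
import math

def G(t, c, e):
    """Toy past-only operator: position t reads only c[0..t-1] and e[t]."""
    past = sum(c[:t]) / t if t > 0 else 0.0
    return math.tanh(0.8 * past + e[t])

def sequential_fixed_point(e):
    """Left-to-right evaluation: each position uses already-final earlier values."""
    c = []
    for t in range(len(e)):
        c.append(G(t, c, e))
    return c

def jacobi_step(c, e):
    """One parallel sweep: every position is updated from the previous iterate."""
    return [G(t, c, e) for t in range(len(e))]

e = [0.3, -1.2, 0.7, 0.1, -0.4]
c_star = sequential_fixed_point(e)

c = [5.0, -3.0, 2.0, 9.0, -7.0]      # arbitrary starting point
exact_prefix = []
for k in range(1, len(e) + 1):
    c = jacobi_step(c, e)
    exact_prefix.append(c[:k] == c_star[:k])   # first k positions already exact
```

No contraction property of `G` is used: exactness comes only from the triangular dependency structure.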

Proof of Theorem 2.

(a) By prefix consistency, $G_{t}^{(T^{\prime})}=G_{t}^{(T)}$ for $t\leq T$ . Hence the fixed-point equations for positions $1,\ldots,T$ are identical in the length- $T$ and length- $T^{\prime}$ systems, so $c^{*(T^{\prime})}_{t}=c^{*(T)}_{t}$ .

(b) $c^{*(T+1)}_{T+1}=G^{(T+1)}_{T+1}(c^{*(T+1)}_{1},\ldots,c^{*(T+1)}_{T})$. By (a), $c^{*(T+1)}_{t}=c^{*(T)}_{t}$ for $t\leq T$. The streaming operator computes exactly $G^{(T+1)}_{T+1}$ using cached corrections $c^{*(T)}$. Concretely, the abstract operator $G_{t}$ is realized by the correction FFN applied to the cached block output $z_{t-1}$ and the current token embedding $e_{t}$: once earlier positions are at their fixed-point values, $z_{t-1}$ is fully determined, and the streaming step computes $c_{t}=G_{t}(c_{1}^{*},\ldots,c_{t-1}^{*};e_{t})$ exactly.

(c) Parts (a) and (b) use only prefix consistency and the triangular structure of Lemma 1(a); no Lipschitz or contraction bound on $G$ is invoked. ∎
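
The streaming computation and its prefix invariance can be sketched with scalar stand-ins. Both `correction` and `block` below are hypothetical toy functions chosen only to respect the dependency structure (the correction sees $e_{t}$ and $z_{t-1}$; the block at position $t$ sees corrected inputs $0,\ldots,t$); they are not the paper's networks.

```python
import math

def correction(e_t, z_prev):
    """Toy stand-in for the correction FFN F(e_t, z_{t-1})."""
    return 0.5 * math.tanh(e_t + z_prev)

def block(x_hist):
    """Toy stand-in for a causal block: output at t reads only inputs 0..t."""
    return math.tanh(sum(x_hist) / len(x_hist))

def stream(embeddings):
    """Left-to-right pass with one correction and one block call per token."""
    x_hist, c_hist, z_hist = [], [], []
    z_prev = 0.0                       # no context before the first token
    for e_t in embeddings:
        c_t = correction(e_t, z_prev)  # non-cumulative: always starts from e_t
        x_hist.append(e_t + c_t)
        z_prev = block(x_hist)         # cached for the next position
        c_hist.append(c_t)
        z_hist.append(z_prev)
    return c_hist, z_hist

e = [0.2, -0.5, 1.0, 0.3, -0.8]
c_short, z_short = stream(e[:3])
c_long, z_long = stream(e)
```

Appending tokens leaves the corrections and cached outputs of the first three positions unchanged, which is the property that makes caching valid.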

A.3 Proof of Theorem 3 (Convergence)

Formal statement. Let $G_{t}(c_{<t};e)$ denote the full correction operator at position $t$ . Assume $\|G(c;e)-G(c^{\prime};e^{\prime})\|\leq L\|c-c^{\prime}\|+M\|e-e^{\prime}\|$ for all $c,c^{\prime},e,e^{\prime}$ . If $L<1$ , then:

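(a) Contraction. $\|c^{(k)}-c^{*}\|\leq L^{k}\|c^{(0)}-c^{*}\|$.
(b) Warm-start bound. Let $c^{*}$ and $c^{*\prime}$ be the fixed points for embeddings $e$ and $e^{\prime}$, respectively. Then $\|G(c^{*};e^{\prime})-c^{*\prime}\|\leq\frac{LM}{1-L}\|e^{\prime}-e\|$.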

Proof.

Part (a). By the Banach fixed-point theorem with contraction constant $L<1$ .

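Part (b). Since $c^{*}=G(c^{*};e)$ and $c^{*\prime}=G(c^{*\prime};e^{\prime})$, the Lipschitz assumption gives $\|c^{*}-c^{*\prime}\|\leq L\|c^{*}-c^{*\prime}\|+M\|e-e^{\prime}\|$, so $\|c^{*}-c^{*\prime}\|\leq\frac{M}{1-L}\|e^{\prime}-e\|$. Then $\|G(c^{*};e^{\prime})-c^{*\prime}\|=\|G(c^{*};e^{\prime})-G(c^{*\prime};e^{\prime})\|\leq L\|c^{*}-c^{*\prime}\|\leq\frac{LM}{1-L}\|e^{\prime}-e\|$. ∎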
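
Both parts of Theorem 3 can be checked on a scalar affine operator, for which the bounds hold with equality. The operator $G(c;e)=Lc+Me$ and the constants below are illustrative assumptions; exact rational arithmetic avoids rounding effects.

```python
from fractions import Fraction as Fr

L, M = Fr(1, 2), Fr(3)               # toy Lipschitz constants

def G(c, e):
    """Affine toy operator: L-Lipschitz in c and M-Lipschitz in e."""
    return L * c + M * e

def fixed_point(e):
    return M * e / (1 - L)           # solves c = G(c, e)

e, e_new = Fr(1), Fr(5, 4)
c_star, c_star_new = fixed_point(e), fixed_point(e_new)

# Part (a): the error shrinks by a factor L per iteration.
c, errors = Fr(0), []
for _ in range(4):
    errors.append(abs(c - c_star))
    c = G(c, e)

# Part (b): one step from the old fixed point under the new embedding.
warm_error = abs(G(c_star, e_new) - c_star_new)
bound = L * M / (1 - L) * abs(e_new - e)
```

Here `warm_error` and `bound` both equal 3/4, so for this operator the warm-start bound is tight.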

Theorem 3 assumes $L<1$ but does not say how to verify this from the per-position Jacobian structure. The following lemma provides a practical bound: given bounds on the partial derivatives $\|\partial G_{t}/\partial c_{s}\|$ , one can choose a weighted norm that makes the global Lipschitz constant explicit.

Lemma 2 (Causal contraction bound).

Let $G(c;e)$ be the full-sequence correction operator with past-only dependencies, and suppose $\|\partial G_{t}/\partial c_{s}\|_{\textup{op}}\leq a_{t,s}$ for $s<t$ , where $\|A\|_{\textup{op}}=\sup_{\|x\|=1}\|Ax\|$ is the operator norm. For positive weights $w=(w_{1},\ldots,w_{T})$ , define $\|c\|_{w}=\max_{t}w_{t}\|c_{t}\|$ . Then $G$ is $L_{w}$ -Lipschitz in $\|\cdot\|_{w}$ with constant:

$$L_{w}=\max_{1\leq t\leq T}w_{t}\sum_{s=1}^{t-1}\frac{a_{t,s}}{w_{s}}.$$

In particular, if $L_{w}<1$ , then $G$ is a contraction.

Proof.

For each $t$ : $\|G_{t}(c)-G_{t}(c^{\prime})\|\leq\sum_{s<t}a_{t,s}\|c_{s}-c_{s}^{\prime}\|\leq\sum_{s<t}\frac{a_{t,s}}{w_{s}}\|c-c^{\prime}\|_{w}$ . Multiplying by $w_{t}$ and taking the maximum over $t$ gives $\|G(c)-G(c^{\prime})\|_{w}\leq L_{w}\|c-c^{\prime}\|_{w}$ . ∎
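
The effect of the weights in Lemma 2 can be seen on a small example. The sketch below evaluates $L_{w}$ for an assumed bound matrix with $a_{t,s}=3/5$ for all $s<t$ and $T=3$; the values are illustrative only. Positions are 0-indexed in the code.

```python
from fractions import Fraction as Fr

def weighted_lipschitz(a, w):
    """L_w = max_t w[t] * sum_{s<t} a[t][s] / w[s] for a strictly lower-triangular bound matrix."""
    T = len(w)
    return max(w[t] * sum(a[t][s] / w[s] for s in range(t)) for t in range(T))

T = 3
a = [[Fr(3, 5) if s < t else Fr(0) for s in range(T)] for t in range(T)]

uniform = [Fr(1)] * T
geometric = [Fr(1, 2 ** t) for t in range(T)]    # w_t = 2^{-t}

L_uniform = weighted_lipschitz(a, uniform)       # 6/5: no contraction certified
L_geometric = weighted_lipschitz(a, geometric)   # 9/20: contraction certified
```

With uniform weights the bound exceeds 1, while geometrically decaying weights certify a contraction for the same operator, which is the reason for allowing a weighted norm.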

When $K$ is chosen at training time, one may want to know how close $c^{(K)}$ is to $c^{*}$ without computing $c^{*}$ . The next lemma gives a computable bound using only consecutive iterates.

Lemma 3 (A posteriori error bound).

If $L<1$ , then after $k$ iterations: $\|c^{(k)}-c^{*}\|\leq\frac{L}{1-L}\|c^{(k)}-c^{(k-1)}\|$ .

Proof.

By the triangle inequality, $\|c^{(k)}-c^{*}\|\leq\sum_{j=1}^{\infty}\|c^{(k+j)}-c^{(k+j-1)}\|\leq\sum_{j=1}^{\infty}L^{j}\|c^{(k)}-c^{(k-1)}\|=\frac{L}{1-L}\|c^{(k)}-c^{(k-1)}\|$ . ∎
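
Lemma 3 suggests a stopping rule that needs only consecutive iterates. The sketch below is one possible use, assuming a known contraction constant $L$ and a scalar toy operator; `iterate_until` is a hypothetical helper, not part of the paper's training procedure.

```python
from fractions import Fraction as Fr

def iterate_until(G, c0, L, tol, max_iter=100):
    """Iterate c <- G(c) until the a posteriori bound L/(1-L) * |c_k - c_{k-1}| is at most tol."""
    c_prev, k = c0, 0
    while k < max_iter:
        c = G(c_prev)
        k += 1
        bound = L / (1 - L) * abs(c - c_prev)
        if bound <= tol:
            return c, k, bound
        c_prev = c
    raise RuntimeError("tolerance not reached within max_iter")

L = Fr(1, 2)

def G(c):
    """Toy contraction with constant 1/2 and fixed point c* = 2."""
    return L * c + 1

c, k, bound = iterate_until(G, Fr(0), L, tol=Fr(1, 4))
true_error = abs(c - 2)
```

The rule stops after three iterations with `bound` equal to 1/4, and the true error (also 1/4 here) does not exceed it.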

Theorem 3(a) gives a global contraction rate $L^{k}$ that treats all positions uniformly, but this is overly pessimistic. Because $G$ is past-only, its Jacobian is strictly lower-triangular and therefore nilpotent, so position $t$ reaches its exact fixed point after at most $t$ iterations rather than $T$ . The following lemma exploits this structure to give a tighter, position-dependent error bound via causal path sums.

Lemma 4 (Finite-depth error bound).

Let $G$ be a past-only correction operator with Jacobian bounds $a_{t,s}$ . Let $A$ be the strictly lower-triangular matrix with entries $a_{t,s}$ , and $B=\max_{t}\|G_{t}(0)\|$ . After $N$ Jacobi iterations from $c^{(0)}=0$ :

$$\|c_{t}^{(N)}-c_{t}^{*}\|\leq\sum_{k=N}^{T-1}[A^{k}]_{t,\cdot}\mathbf{1}\cdot B=[(A^{N}(I-A)^{-1})_{t,\cdot}]\mathbf{1}\cdot B$$

Proof.

By the integral form of the mean value theorem, the error $\delta^{(k)}=c^{(k)}-c^{*}$ satisfies $\delta^{(k+1)}_{t}=\sum_{s<t}J_{t,s}^{(k)}\delta^{(k)}_{s}$, where $\|J_{t,s}^{(k)}\|\leq a_{t,s}$ by the Jacobian bound assumption. Write $|v|$ for the vector of per-position norms $(\|v_{t}\|)_{t}$. Bounding by the entrywise matrix $A$ and iterating from $\delta^{(0)}=-c^{*}$ gives $\|\delta^{(N)}_{t}\|\leq[A^{N}|c^{*}|]_{t}$. To bound $|c^{*}|$: the fixed point satisfies $c_{t}^{*}=G_{t}(c_{<t}^{*})$, so $\|c_{t}^{*}\|\leq\|G_{t}(0)\|+\sum_{s<t}a_{t,s}\|c_{s}^{*}\|$, i.e., $|c^{*}|\leq B\cdot\mathbf{1}+A|c^{*}|$ entrywise, hence $|c^{*}|\leq(I-A)^{-1}B\cdot\mathbf{1}$ (well-defined since $A$ is nilpotent, and order-preserving since $(I-A)^{-1}=\sum_{k=0}^{T-1}A^{k}$ has nonnegative entries). Substituting: $\|\delta^{(N)}_{t}\|\leq[A^{N}(I-A)^{-1}]_{t,\cdot}\mathbf{1}\cdot B$. ∎
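
The bound of Lemma 4 can be evaluated directly from the finite sum $\sum_{k=N}^{T-1}A^{k}\mathbf{1}\cdot B$, since $A$ is nilpotent. The sketch below uses an assumed bound matrix with all sub-diagonal entries equal to $1/2$ and $T=4$; positions are 0-indexed, so the first $N$ entries of the result are zero.

```python
from fractions import Fraction as Fr

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def finite_depth_bound(A, B, N):
    """Per-position bound sum_{k=N}^{T-1} [A^k 1]_t * B for strictly lower-triangular A."""
    T = len(A)
    v = [Fr(1)] * T
    for _ in range(N):                 # v = A^N 1
        v = matvec(A, v)
    total = [Fr(0)] * T
    for _ in range(N, T):              # accumulate A^N 1 + ... + A^{T-1} 1
        total = [s + x for s, x in zip(total, v)]
        v = matvec(A, v)
    return [B * s for s in total]

T = 4
A = [[Fr(1, 2) if s < t else Fr(0) for s in range(T)] for t in range(T)]
bound = finite_depth_bound(A, B=Fr(1), N=2)   # [0, 0, 1/4, 7/8]
```

After two iterations the bound vanishes at the first two positions and grows with position, reflecting that later positions depend on longer causal paths.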

A.4 Proof of Proposition 1 (Depth Separation)

Formal statement.

Setup. Suppose the data is generated by a process with state update $\mathbf{s}_{t}=h^{*}(\mathbf{s}_{t-1},\mathbf{e}_{t})$, where $\mathbf{s}_{t}\in\mathbb{R}^{m}$ is the process state and $\mathbf{e}_{t}=(e_{t-W+1},\ldots,e_{t})\in\mathbb{R}^{CW}$ collects the token embeddings $e_{s}\in\mathbb{R}^{C}$ in the current window. The context-ready architecture (Equations 1–3.1) with attention window $W$ maintains a state $\hat{\mathbf{b}}_{t}\in\mathbb{R}^{nW}$ over the full window of $W$ positions, where $n$ is the per-position state dimension. The state evolves via $\hat{\mathbf{b}}_{t}=h(\hat{\mathbf{b}}_{t-1},\mathbf{e}_{t})$, where $h:\mathbb{R}^{nW}\times\mathbb{R}^{CW}\to\mathbb{R}^{nW}$ composes the correction FFN and block (Equations 1–3.1). Let $L_{h}$ be the Lipschitz constant of $h$ in its first argument.

To compare these two systems, let $\pi:\mathbb{R}^{m}\to\mathbb{R}^{nW}$ be a map from the process state to the architecture’s state space. The architecture faithfully tracks the process when the following diagram commutes: advancing the process state by $h^{*}$ and then projecting gives the same result as projecting and then advancing by $h$ . The commutation error

$$\varepsilon_{D}\;=\;\sup_{\mathbf{s},\,\mathbf{e}}\;\bigl\|h\bigl(\pi(\mathbf{s}),\,\mathbf{e}\bigr)-\pi\bigl(h^{*}(\mathbf{s},\,\mathbf{e})\bigr)\bigr\|$$

measures how far the diagram is from commuting: it captures both the block’s finite-depth approximation error and any information lost when the two state spaces differ.

Assumptions.

(i) $L_{h}<1$ and $\varepsilon_{D}<\infty$.
(ii) Prediction sufficiency. $\pi$ preserves prediction-relevant information: $p(x_{t+1}\mid\mathbf{s}_{t})=p(x_{t+1}\mid\pi(\mathbf{s}_{t}))$.
(iii) Lipschitz readout. The prediction function $\phi:\mathbb{R}^{nW}\to\Delta$ is $L_{\phi}$-Lipschitz.

Conclusions.

(a) Context-ready error bound. The accumulated state error satisfies $\|\hat{\mathbf{b}}_{t}-\pi(\mathbf{s}_{t})\|\leq\varepsilon_{D}/(1-L_{h})$ uniformly in $t$. By prediction sufficiency and Lipschitz readout, the prediction error is bounded by $L_{\phi}\varepsilon_{D}/(1-L_{h})$. If $\varepsilon_{D}=0$, then streaming is exact for all $t$.
(b) Standard transformer receptive-field bound. Let $N$ be the number of layers in a standard transformer with attention window $W$. Then position $t$ has no computational path to any position before $t-NW$, so the transformer cannot represent any function whose output at position $t$ depends on inputs before $t-NW$.

Proof.

Part (a). At step $t$ , the architecture computes $\hat{\mathbf{b}}_{t}=h(\hat{\mathbf{b}}_{t-1},\mathbf{e}_{t})$ while the projected true state satisfies $\pi(\mathbf{s}_{t})=\pi(h^{*}(\mathbf{s}_{t-1},\mathbf{e}_{t}))$ . By the triangle inequality:

$$\|\hat{\mathbf{b}}_{t}-\pi(\mathbf{s}_{t})\|\;\leq\;\underbrace{\|h(\hat{\mathbf{b}}_{t-1},\mathbf{e}_{t})-h(\pi(\mathbf{s}_{t-1}),\mathbf{e}_{t})\|}_{\leq\;L_{h}\|\hat{\mathbf{b}}_{t-1}-\pi(\mathbf{s}_{t-1})\|}\;+\;\underbrace{\|h(\pi(\mathbf{s}_{t-1}),\mathbf{e}_{t})-\pi(h^{*}(\mathbf{s}_{t-1},\mathbf{e}_{t}))\|}_{\leq\;\varepsilon_{D}}.$$

Unrolling with $\hat{\mathbf{b}}_{0}=\pi(\mathbf{s}_{0})$ gives $\|\hat{\mathbf{b}}_{t}-\pi(\mathbf{s}_{t})\|\leq\varepsilon_{D}\sum_{j=0}^{t-1}L_{h}^{j}\leq\varepsilon_{D}/(1-L_{h})$ .
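
The unrolled recursion can be illustrated numerically. The sketch below iterates the worst case $\mathrm{err}_{t}=L_{h}\,\mathrm{err}_{t-1}+\varepsilon_{D}$ from $\mathrm{err}_{0}=0$ with assumed values $L_{h}=1/2$ and $\varepsilon_{D}=1/10$, and compares it with the uniform limit $\varepsilon_{D}/(1-L_{h})$.

```python
from fractions import Fraction as Fr

def worst_case_errors(L_h, eps, steps):
    """Unroll err_t = L_h * err_{t-1} + eps starting from err_0 = 0."""
    err, out = Fr(0), []
    for _ in range(steps):
        err = L_h * err + eps
        out.append(err)
    return out

L_h, eps = Fr(1, 2), Fr(1, 10)       # illustrative constants
errs = worst_case_errors(L_h, eps, steps=20)
limit = eps / (1 - L_h)              # 1/5, independent of t
```

The error grows monotonically but never reaches `limit`, so the bound does not degrade with sequence length.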

Part (b). With attention window $W$ , the output of layer $\ell$ at position $t$ depends only on positions in $[t-\ell W,\,t]$ . After $N$ layers, position $t$ has no computational path to any position before $t-NW$ . ∎

Remark (Depth separation).

Parts (a) and (b) together suggest a depth separation: the context-ready architecture propagates context through the correction chain at no additional depth cost, while a standard windowed transformer must allocate layers for propagation. When propagation and local computation cannot be interleaved (as in the pointer-chasing task, where each hop requires a separate lookup), the standard transformer needs at least $\lceil T/W\rceil$ additional layers. In general, some layers may serve both roles, so $N_{\mathrm{map}}+\lceil T/W\rceil$, where $N_{\mathrm{map}}$ denotes the number of layers needed for the local computation alone, is an upper bound on the required depth.

Appendix B Pointer Chasing Details

Motivation. Fixed-depth transformers are confined to $\mathrm{TC}^{0}$ [Merrill and Sabharwal, 2023]: an $N$ -layer transformer can compose at most $N$ sequential reasoning steps in a single forward pass. We design a synthetic task that directly tests this depth limit. Answering a query requires chaining a variable number of sequential lookups, so a model that can only perform a fixed number of parallel steps will fail once the required chain length exceeds its depth. The context-ready architecture sidesteps this barrier because its recurrent correction chain provides sequential computation at inference, even with a single block ( $D{=}1$ ).

Task definition. The pointer-chasing task has $H$ hops and $M$ keys per level. The input contains a base table (level 0) mapping $M$ keys to values, followed by $H$ index tables (levels $1,\ldots,H$ ), each mapping $M$ keys to keys of the previous level via random permutations (bijections). After each table, a query section provides dense targets: a triplet Q key answer for every key at every level defined so far. Resolving a query at level $\ell$ requires $\ell$ sequential lookups.

Worked example ( $H{=}2$ , $M{=}3$ , 10 values). The base table maps A $\to$ v3, B $\to$ v0, C $\to$ v8. Index table 1 maps D $\to$ A, E $\to$ B, F $\to$ C. Index table 2 maps G $\to$ D, H $\to$ E, I $\to$ F. The encoding uses reversed triplets (value=key) so that causal attention can see the value to the left of the key:

Level 0 (base table + queries):
Input: v3 = A v0 = B v8 = C $|$
Target: _ _ _ _ _ _ _ _ _ _
Input: Q A v3 Q B v0 Q C v8 $|$
Target: _ v3 _ _ v0 _ _ v8 _ _

Level 1 (index table + queries):
Input: A = D B = E C = F $|$
Target: _ _ _ _ _ _ _ _ _ _
Input: Q D v3 Q E v0 Q F v8 $|$
Target: _ v3 _ _ v0 _ _ v8 _ _

Level 2 (index table + final query):
Input: D = G E = H F = I $|$ Q G
Target: _ _ _ _ _ _ _ _ _ _ _ v3

Targets (the non-underscore entries) appear only at key positions in query sections. Level-0 queries are trivial lookups (Q A $\to$ v3). Level-1 queries require one composition (Q D $\to$ A $\to$ v3). Level-2 queries require two compositions (Q G $\to$ D $\to$ A $\to$ v3). The final token is the actual test query with no answer provided in the input. Dense targets at every level are essential: without them, the model cannot learn multi-hop composition even with BPTT.

Each level uses its own key namespace (A, B, C at level 0; D, E, F at level 1; G, H, I at level 2; etc.) to prevent ambiguity. Key ordering within each table is fixed (not shuffled), so the model can exploit positional patterns via RoPE.
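
The table encoding and the lookup chain of the worked example can be sketched as follows. This is an illustration of the task definition only; the function names and the plain-string token representation are our assumptions, not the paper's data pipeline.

```python
def encode_table(mapping):
    """Encode {key: value} as reversed triplets 'value = key', in table order."""
    tokens = []
    for key, value in mapping.items():
        tokens += [value, "=", key]
    return tokens

def resolve(key, base, index_tables):
    """Chain lookups down to level 0; returns (answer, number of compositions)."""
    hops = 0
    for table in reversed(index_tables):   # highest level first
        if key in table:
            key = table[key]
            hops += 1
    return base[key], hops

# Tables from the worked example (H=2, M=3).
base = {"A": "v3", "B": "v0", "C": "v8"}
index_tables = [{"D": "A", "E": "B", "F": "C"},   # level 1
                {"G": "D", "H": "E", "I": "F"}]   # level 2

level1_tokens = encode_table(index_tables[0])
answer, hops = resolve("G", base, index_tables)
```

Here `resolve("G", ...)` follows G $\to$ D $\to$ A $\to$ v3 and reports two compositions, matching the level-2 query in the example.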

Settings. $H=10$ hops, $M=5$ keys, 10 values, embedding dimension $C=256$ , 4 attention heads, batch size 64, window size 38, fixed key ordering, per-level key tokens, RoPE attention. Learning rate $1\times 10^{-4}$ .

Why windowed attention. Without windowed attention, all models—including deep transformers—can directly attend from any query position to the base table, achieving ${\sim}1/M$ accuracy without genuine composition. Windowed attention ( $w=38$ ) ensures that higher-level query sections cannot see the base table, forcing the model to chain through intermediate levels. This reveals the true depth-limited structure of fixed-depth transformers.

Wave propagation in BPTT. The $D{=}1$ model solves levels sequentially: level 0 converges first, then level 1, then 2, and so on. This is visible in the training dynamics (see progression below). The wave pattern is consistent with corrections propagating through the recurrent chain.

BPTT progression. The progression below uses a smaller configuration ( $C{=}128$ , lr $=1\text{e-3}$ , 20 values) to demonstrate wave propagation at reduced compute:

Iter	L0	L1	L2	L3	L4	L5	L6	L7	L8	L9	L10
3K	1.00	0.74	0.27	0.20	0.19	0.18	0.18	0.18	0.17	0.16	0.16
8K	1.00	1.00	0.80	0.59	0.36	0.21	0.21	0.19	0.20	0.18	0.15
13.5K	1.00	1.00	0.99	0.99	0.94	0.83	0.51	0.24	0.20	0.20	0.18
23K	1.00	1.00	1.00	1.00	1.00	0.99	0.99	0.99	0.99	0.98	0.98

20-hop scaling. At $C=512$ with $\text{lr}=1\text{e-4}$ , the same $D{=}1$ architecture solves all 21 levels (20 hops) in ${\sim}4$ K iterations, confirming that the recurrent mechanism scales to deeper composition chains.

Appendix C Extended Experimental Details

C.1 Hyperparameters

Table A1: Hyperparameters by experiment.

Hyperparameter	${\sim}85$ M	${\sim}340$ M	Token-matched	Width scaling
Block size	256	256	64	64
Batch size	64	32	1024	1024
Learning rate	$2\times 10^{-4}$	$2\times 10^{-4}$	$2\times 10^{-4}$	$2\times 10^{-4}$
Training iters	100K	200K	varies	varies
$K$ / $k_{\min}$	5 / 2	5 / 2	5 / 2	5 / 2
Dropout	0.2	0.2	0.2	0.2
Vocab size	32,000	32,000	32,000	32,000

C.2 Ablations

Table A2: Correction FFN ablation ($C=446$, Wikipedia, BPE 16k). “Roformer-hFFN” denotes a standard roformer with an extra FFN layer to match the FLOP cost of the correction FFN.

Model	FLOPs/tok	Val PPL	Seq. $K{=}1$
D=3 corr_ffn $K{=}5$	$44C^{2}$	23.98	23.96
Roformer-hFFN $N{=}3$	$44C^{2}$	25.78	—
D=3 block_head (no corr_ffn)	$36C^{2}$	27.32	28.46
Roformer $N{=}3$	$36C^{2}$	27.19	—

Table A3: Token-aware correction variants ($C=446$, Wikipedia).

$D$	Variant	FLOPs	PPL	Seq $K{=}1$	$L$
2	corr_ffn (token-blind)	$32C^{2}$	26.68	26.72	0.74
	corr_ffn_add	$32C^{2}$	26.09	26.48	0.54
	corr_ffn_concat	$36C^{2}$	25.48	25.82	0.54
3	corr_ffn (token-blind)	$44C^{2}$	23.98	23.96	—
	corr_ffn_add	$44C^{2}$	23.79	24.12	0.55
	corr_ffn_concat	$48C^{2}$	23.41	23.73	0.74

Table A4: Scale dependence ($C=50$–$768$, Wikipedia).

$C$	Comparison	Context-Ready	Baseline	$\Delta$
50	D=3 vs. Roformer-hFFN $N{=}3$	84.3	83.0	+1.3
74	D=3 vs. Roformer-hFFN $N{=}3$	62.1	61.4	+0.7
446	D=3 vs. Roformer $N{=}4$	23.79	24.85	$-$ 1.06
768	D=3 vs. Roformer $N{=}4$	18.66	20.05	$-$ 1.39

At very small widths ( $C\leq 74$ ), the correction FFN’s overhead outweighs its benefit; the correction advantage emerges at moderate widths ( $C\geq 446$ ) and grows with scale. All main-body claims are based on results at $C\geq 256$ .

Table A5: $k_{\min}$ ablation ($C=50$, $D{=}1$, $K=10$).

Metric	$k_{\min}=2$	No $k_{\min}$
Val PPL ( $K=10$ )	84.32	84.16
Seq $K=1$	84.61	84.19
Parallel $K=1$	118.35	130.95
Empirical $L$	0.72	0.94

C.3 $K=10$ Training Details

All $K{=}10$ experiments: block size 1024, $\text{lr}=2\text{e-4}$ , softmax attention, $n_{\text{head}}=16$ , OWT.

Table A6: $D{=}1$ $K{=}10$ (fixed, no $k_{\min}$, batch 16) vs. roformer $N{=}6$ (batch 16), block size 1024.

Iter	$N{=}6$	$D{=}1$ $K{=}10$	Gap
40K	43.24	43.83	+0.59
60K	39.21	39.27	+0.06
65K	38.55	38.49	$-$ 0.06
80K	36.99	36.38	$-$ 0.61
100K	35.37	34.40	$-$ 0.97

C.4 Sequential $K{=}1$ Validation

Full depth progression for all configurations, confirming Theorem 2.

Table A7: $D{=}1$, $C{=}2048$ depth progression across block sizes (OWT, 100K–400K iters).

Block size	Iters	Par. $K{=}1$	Par. $K{=}2$	Par. $K{=}3$	Par. $K{=}5$	Par. $K{=}10$	Seq. $K{=}1$
256	100K	84.44	43.77	40.52	39.95	40.02	40.03
256	400K	70.80	36.21	33.23	32.70	32.79	32.80
512	100K	81.02	38.25	35.01	34.35	34.41	34.43
512	200K	73.22	34.33	31.24	30.61	30.68	30.69
1024	200K	84.13	34.42	30.39	29.51	29.43	29.43

Table A8: Higher-$D$ depth progression ($C{=}1024$, block size 256, OWT).

$D$	Par. $K{=}1$	Par. $K{=}2$	Par. $K{=}3$	Par. $K{=}5$	Par. $K{=}10$	Seq. $K{=}1$
5	58.17	40.18	38.80	38.60	38.61	38.62
8	51.60	40.12	39.22	39.10	39.10	39.10
12	38.33	32.58	32.30	32.29	32.29	32.29
23	32.80	29.03	28.89	28.88	28.88	28.88

At higher $D$, convergence is faster: $D{=}12$ and $D{=}23$ match $K{=}5$ within 0.01 PPL at $K{=}3$; $D{=}8$ is within 0.12 PPL. The ratio of parallel $K{=}1$ PPL to converged ($K{=}5$) PPL shrinks with $D$ (from $2.85\times$ at $D{=}1$, block size 1024, to $1.14\times$ at $D{=}23$), confirming that the correction mechanism accounts for a larger fraction of quality at low $D$.

C.5 Block Size Scaling

Longer context lengths give the $D{=}1$ correction chain more sequential steps to accumulate depth, so the gap to $N{=}6$ should shrink with block size. Table A9 confirms this: at $K{=}5$ , the gap narrows from $+1.19$ at block size 256 to $+0.31$ at 1024. With $K{=}10$ at block size 1024, $D{=}1$ overtakes $N{=}6$ entirely (Section 5.5).

Table A9: $D{=}1$ $C{=}2048$ vs. $N{=}6$ $C{=}1088$ gap at 200K iterations across block sizes (OWT).

Block size	$N{=}6$ PPL	$D{=}1$ PPL	Gap
256	34.15	35.34	+1.19
512	30.11	30.57	+0.46
1024	29.22	29.53	+0.31

C.6 Token-Matched Training Curves

To isolate the effect of the correction mechanism from FLOP differences, we compare $D{=}x$ context-ready against $N{=}x$ standard transformers at the same embedding dimension ( $C=1024$ , block size 64), so both see the same number of tokens per training iteration. At every depth tested, the context-ready model overtakes the baseline after a crossover point and the gap continues to grow.

Table A10: Token-matched results at $C{=}1024$, block size 64, OWT. Final values at 1,126M tokens.

Comparison	$N{=}x$ PPL	$D{=}x$ PPL	Gap	Crossover
$D{=}1$ vs. $N{=}1$	114.8	80.4	$-$ 34.4	${\sim}424$ M
$D{=}2$ vs. $N{=}2$	73.7	66.1	$-$ 7.6	${\sim}565$ M
$D{=}3$ vs. $N{=}3$	62.4	59.4	$-$ 3.0	${\sim}835$ M
$D{=}6$ vs. $N{=}6$	53.0	52.2	$-$ 0.7	${\sim}1{,}032$ M

C.7 Fine-Tuning Details

Any pretrained $N$ -layer transformer can be converted to a $D{=}N$ context-ready model by adding a zero-initialized correction FFN and fine-tuning. The zero initialization ensures no disruption at conversion: the correction is identically zero, so the model behaves exactly as the original transformer. As fine-tuning progresses, the correction FFN learns to exploit cached context, yielding PPL improvements. At $D{=}12$ , fine-tuning causes a transient $+0.11$ PPL increase before recovering; at $D{=}24$ , there is no transient increase.
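The no-disruption property of zero initialization can be illustrated with a small pure-Python sketch. Several details here are assumptions rather than the paper's specification: the correction is modeled as a two-layer ReLU FFN over the concatenation of the cached output `z_prev` and the embedding `e_t`, only the output layer is zeroed, and all function names are hypothetical.

```python
import random

def matvec(W, x):
    # Dense matrix-vector product on plain lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def make_correction_ffn(width, hidden, seed=0):
    rng = random.Random(seed)
    # Input layer: small random weights over the concatenation [z_prev; e_t].
    W1 = [[rng.gauss(0.0, 0.02) for _ in range(2 * width)] for _ in range(hidden)]
    # Output layer: all zeros (assumed), so the correction is exactly zero at conversion.
    W2 = [[0.0] * hidden for _ in range(width)]
    return W1, W2

def block_input(e_t, z_prev, ffn):
    # Token enters the block as its embedding plus the correction.
    W1, W2 = ffn
    correction = matvec(W2, relu(matvec(W1, z_prev + e_t)))
    return [e + c for e, c in zip(e_t, correction)]

width, hidden = 4, 8
ffn = make_correction_ffn(width, hidden)
e_t = [0.5, -1.0, 0.25, 2.0]      # toy token embedding
z_prev = [1.0, 0.0, -0.5, 0.3]    # toy cached block output from the previous position
x_t = block_input(e_t, z_prev, ffn)
print(x_t == e_t)  # True: the block sees the raw embedding, as in the original model
```

With the output layer at zero, the block input equals the raw embedding, so the converted model reproduces the pretrained transformer until fine-tuning moves the output weights away from zero.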

Table A11: Fine-tuning pretrained roformers to context-ready (block size 256, OWT). “Continued baseline” is the roformer trained for the same total iterations without conversion. Entries marked ^∗ report the gain vs. the pre-conversion checkpoint, since a continued-training control is not available.

Conversion	Baseline	Fine-tuned	Cont. baseline	$\Delta$ vs. cont.	FT iters
$N{=}12$ $C{=}1408$ $\to$ $D{=}12$	29.92	26.14	27.20	$-$ 1.06	200K
$N{=}24$ $C{=}1024$ $\to$ $D{=}24$	29.42	28.99	—	$-$ 0.43^∗	18K
$N{=}12$ $C{=}1024$ $\to$ $D{=}12$	33.41	32.21	—	$-$ 1.20^∗	50K

C.8 Wikipedia Results

For completeness, we include results on English Wikipedia (BPE 16k, context 256, 100K iterations).

Table A12: Context-ready vs. baselines on Wikipedia. FLOPs/tok includes the prediction head ($VC$), which is identical within each width group.

Width	Model	FLOPs/tok	Val PPL	$\Delta$
$C{=}768$	$D{=}5$ concat	42.5M	16.69
$C{=}768$	Roformer $N{=}6$	42.5M	17.95	$-$ 1.26
$C{=}446$	$D{=}6$ add	15.9M	20.40
	Roformer-hFFN $N{=}6$	15.9M	21.44	$-$ 1.04
	$D{=}2$ concat	7.2M	25.48
	Roformer $N{=}3$	7.2M	27.19	$-$ 1.71

C.9 Inference Timing

We measure autoregressive generation speed on a single A100 GPU over 10,000 tokens with KV caching at batch size 1 (single-sequence generation).

Table A13: Inference latency: context-ready vs. standard transformers (A100, 10K tokens, KV caching).

Model	Params	tok/s	ms/tok	Speedup
$D{=}1$ $C{=}2048$	215M	919	1.09	$2.6\times$
Roformer $N{=}6$ $C{=}1088$	155M	351	2.85	$2.6\times$
$D{=}5$ $C{=}1120$	157M	349	2.86	$1.7\times$
Roformer $N{=}12$ $C{=}768$	134M	201	4.96	$1.7\times$

Per-token latency is nearly flat across sequence length in both comparisons ($<\!4\%$ growth from $T{=}100$ to $T{=}10{,}000$), confirming that KV caching amortizes attention cost to $O(T)$ per token. The context-ready models are faster despite having more parameters, because the number of sequential layers dominates inference latency on modern GPUs. Fewer layers also reduce total KV cache memory (proportional to $C\times D$ per token): $D{=}1$ $C{=}2048$ uses $3.2\times$ less cache than $N{=}6$ $C{=}1088$; $D{=}5$ $C{=}1120$ uses $1.6\times$ less than $N{=}12$ $C{=}768$.
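The speedup and cache figures follow from Table A13 and the $C\times D$ proxy by simple arithmetic. The Python sketch below reproduces them; the dictionary layout and helper names are illustrative, and `layers` stands for $D$ in context-ready models and $N$ in roformers.

```python
# Layer count, width, and throughput copied from Table A13.
models = {
    "D=1 C=2048": {"layers": 1, "width": 2048, "tok_per_s": 919},
    "N=6 C=1088": {"layers": 6, "width": 1088, "tok_per_s": 351},
    "D=5 C=1120": {"layers": 5, "width": 1120, "tok_per_s": 349},
    "N=12 C=768": {"layers": 12, "width": 768, "tok_per_s": 201},
}

def cache_units(m):
    # Relative KV cache size per token, using the C x D proxy from the text.
    return m["layers"] * m["width"]

def compare(fast, slow):
    f, s = models[fast], models[slow]
    speedup = f["tok_per_s"] / s["tok_per_s"]
    cache_saving = cache_units(s) / cache_units(f)
    return round(speedup, 1), round(cache_saving, 1)

pair_small = compare("D=1 C=2048", "N=6 C=1088")
pair_large = compare("D=5 C=1120", "N=12 C=768")
print("D=1 vs N=6  (speedup, cache saving):", pair_small)
print("D=5 vs N=12 (speedup, cache saving):", pair_large)
```

The proxy counts only layers times width; constant factors such as storing both keys and values cancel in the ratio.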
