License: CC BY-NC-ND 4.0
arXiv:2606.27538v1 [cs.CL] 25 Jun 2026

The Context-Ready Transformer

Mahesh Godavarti
A Carrot, Inc
m@acarrot.com
Abstract

We introduce the context-ready transformer, a new recurrent neural network architecture built from a $D$-layer transformer block that pre-contextualizes each token before it enters the block. During left-to-right generation, a correction network combines the previous position’s block output (a cached summary of past context) with the current token embedding, so the token enters the block already contextualized rather than as a raw embedding. At sequential inference, the correction chain makes the architecture a recurrent neural network. For training, we unroll the correction process $K$ times over the full sequence, processing all positions in parallel at each step. A pretrained transformer can also be converted to a context-ready model by adding a zero-initialized correction FFN and fine-tuning. We evaluate across widths, depths, block sizes, and two datasets, with all comparisons against standard transformers, variants, and ablations. A $D{=}5$ model beats a 12-layer transformer while generating $1.7\times$ faster on an A100. With $K{=}10$, a single-layer model ($D{=}1$) beats a 6-layer transformer with a $2.6\times$ inference speedup, and sequential inference matches parallel $K{=}10$ to within 0.01 PPL. The architecture benefits most from wide representations and long contexts. On a pointer-chasing task, $D{=}1$ trained with BPTT solves all 10 composition levels, while standard transformers exhibit staircase-like depth dependence.

1 Introduction

In autoregressive generation, a standard $N$-layer transformer predicts the next token, assigns it a context-free base embedding, and devotes part or all of its $N$ layers to re-contextualizing it. This round-trip from context to token ID back to context is an artifact of the architecture, not of the problem.

Context-ready inference. The context-ready transformer shortens this round-trip. Consider sequential left-to-right generation: as the model processes token $t$, the block output $z_{t}$ encodes the context of tokens $0,\ldots,t$. When token $t{+}1$ arrives with embedding $e_{t+1}$, a correction network computes a correction from $z_{t}$ and $e_{t+1}$. The token enters the block as $e_{t+1}$ plus this correction, carrying contextual information from all preceding tokens, rather than as a raw embedding. Fewer layers are needed to fully contextualize the token, because the correction has already done part of the work.

The training challenge. This sequential process is exact at inference but inherently serial: each position’s correction depends on the previous position’s fully computed block output. A classical RNN faces the same issue and solves it with backpropagation through time (BPTT), which unrolls the recurrence for all $T$ positions, making the sequential training depth scale with sequence length.

We take a different approach. We approximate the sequential process by unrolling it $K$ times over the full sequence, where $K$ is a small constant independent of $T$. All $T$ positions are still processed in parallel, just as in a standard transformer, but the correction is refined over $K$ unrolling steps rather than $T$. Starting from raw embeddings at all positions, we run a shared-weight block and compute the correction at each position $t$ from the block output at positions $0,\ldots,t{-}1$. We then update all embeddings with these corrections and repeat. Each unrolling step refines the corrections using increasingly contextualized inputs. Crucially, $K$, not $T$, controls the depth of the computation graph, so a classical RNN requires $O(T)$-deep unrolling while the context-ready transformer requires only $O(K)$. In practice, $K=5$ suffices for convergence at the depths tested ($D\geq 5$; see Table 6).

How $K$ unrolling steps at training lead to zero iterations at inference. Two design choices make this possible.

  • Non-cumulative correction. Each iteration computes a correction from scratch rather than building on the previous one. The iteration takes the form $x^{(k)}=e+F_{\theta}(x^{(k-1)},e)$: the base embedding plus a correction that depends on both the previous iterate’s block output and the token embedding. Unlike a residual network, which computes $x^{(K)}=e+f(x^{(0)})+\cdots+f(x^{(K-1)})$ (a sum of $K$ corrections), our formulation yields $x^{(K)}=e+F_{\theta}(x^{(K-1)},e)$: a single correction. Previous iterations serve only to bring $x^{(K-1)}$ close to the fixed point; once it has converged, additional iterations produce the same output.

  • Past-only contextualization. The correction at position $t$ depends on two quantities: the block output $z_{t-1}$, which encodes the context of tokens $0,\ldots,t{-}1$, and the token embedding $e_{t}$. Since $z_{t-1}$ is already cached from processing the previous token, the correction can be computed as soon as token $t$ arrives.

Together, these two properties mean that sequential generation naturally produces the converged correction for each new token without iteration (Section 4).

A new kind of RNN. At sequential inference, the correction at position $t$ depends on $z_{t-1}$, which depends on $z_{t-2}$, and so on, creating a recurrent computation that unfolds over all $T$ positions. The combination of non-cumulative and past-only structure is what enables efficient training: it converts the sequential recurrence into a fixed-point problem, so instead of exact $T$-step BPTT, we can train by unrolling $K$ steps over the full sequence in parallel ($K\ll T$). This is not equivalent to BPTT, since positions beyond the first $K$ receive approximate rather than exact gradients, but in practice $K=5$ suffices at the depths tested.

Experimental evidence. We evaluate across widths $C\in\{50,\ldots,2048\}$, depths $D\in\{1,\ldots,23\}$, block sizes $\{64,256,512,1024\}$, two datasets (OpenWebText, Wikipedia), and a synthetic reasoning task (Section 5). The most impactful findings: $D{=}5$ at $C{=}1120$ beats a 12-layer transformer ($C{=}768$), halving inference depth and generating $1.7\times$ faster on an A100. With $K{=}10$, $D{=}1$ at $C{=}2048$ beats a 6-layer transformer ($C{=}1088$) with a $2.6\times$ inference speedup. Sequential inference matches $K{=}10$ training to within 0.01 PPL, confirming that streaming exactness holds in practice. On pointer chasing, $D{=}1$ trained with BPTT solves all 10 composition levels while standard transformers exhibit staircase-like depth dependence. The architecture benefits most from wide representations and long contexts, and requires a dedicated correction network to be effective (Section 5.8). Any pretrained transformer can be converted by adding a zero-initialized correction FFN and fine-tuning.

2 Related Work

Weight-shared and depth-adaptive architectures. ALBERT (Lan et al., 2020), Universal Transformers (Dehghani et al., 2019) with ACT (Graves, 2016), Deep Equilibrium Models (Bai et al., 2019), and Huginn (Geiping et al., 2025) share weights across layers or iterations and iteratively refine hidden states, but do not use a dedicated past-output/token-aware pre-block correction of the kind proposed here. We compare against standard transformers as the representative baseline for how tokens enter the block. Weight sharing in the context-ready transformer arises naturally from unrolling the sequential correction process into training iterations that process all positions in parallel, not as a design choice for parameter efficiency.

Early exit and layer skipping. LayerSkip (Elhoushi et al., 2024), ADEPT (Yoo et al., 2026), and PonderNet (Banino et al., 2021) reduce average layer count but require learned stopping mechanisms. Context-ready uses a fixed depth with no stopping criterion.

Lookahead Decoding (Fu et al., 2024) and CLLMs (Kou et al., 2024) apply Jacobi iteration to standard transformers as a decoding strategy. Bai et al. (2021) accelerate DEQ inference by parallelizing fixed-point solves via Jacobi-style updates. Context-ready is an architectural change: the $K$-step unrolling can be viewed as Jacobi iterations on the correction fixed-point equation, but the non-cumulative past-only structure guarantees that a single left-to-right streaming pass recovers the exact correction without any iteration.

Subquadratic and recurrent alternatives. Mamba (Gu and Dao, 2024), RWKV (Peng et al., 2023), Griffin (De et al., 2024), and xLSTM (Beck et al., 2024) replace causal self-attention with compressed recurrent state to achieve linear-time inference. The context-ready transformer solves a different problem: it retains full causal self-attention inside the block and instead changes how tokens enter it. The two approaches are complementary, not competing—one could in principle apply pre-block correction to any of these architectures—and standard transformers remain the natural baseline for evaluating the correction mechanism.

Computational complexity of transformers. Log-precision fixed-depth transformers are confined to $\mathrm{TC}^{0}$ (Merrill and Sabharwal, 2023): they cannot solve problems requiring unbounded sequential composition, regardless of width. RNNs with arbitrary precision escape this limitation (Siegelmann and Sontag, 1995a; Siegelmann and Sontag, 1995b), but classical RNNs require gradients to flow through all $T$ time steps (BPTT), making them difficult to train on long sequences. The context-ready transformer at $D=1$ is recurrent at inference, but is trained by unrolling the correction $K$ times rather than by exact $T$-step BPTT, so it inherits recurrent structure at inference while retaining transformer-style parallel training.

3 Method

3.1 Architecture

We first describe the architecture during sequential left-to-right generation, the setting where context-ready inference is exact. Let $T$ denote the sequence length, $C$ the embedding dimension, and $V$ the vocabulary size. The core component is a $D$-block unit: $D$ transformer layers (each consisting of causal self-attention and a feed-forward network with residual connections), described in detail below. Processing token $t$ requires the outputs $z_{0},\ldots,z_{t-1}\in\mathbb{R}^{C}$ of this block unit from all preceding tokens. Let $e_{t}\in\mathbb{R}^{C}$ denote the token embedding at position $t$, and define $z_{-1}=\mathbf{0}$.

Correction. A dedicated correction FFN generates the correction for position $t$ from the cached block output $z_{t-1}$ and the current token embedding $e_{t}$:

$$\texttt{correction}_{t}=\texttt{corr\_ffn}\bigl(\texttt{LN}(z_{t-1}+e_{t})\bigr) \tag{1}$$

The correction FFN is a feed-forward network (Linear($C\to 4C$) $\to$ GELU $\to$ Linear($4C\to C$)) with its own weights, separate from the block’s FFN. The correction is token-aware: it depends on both $z_{t-1}$ (context of tokens $0,\ldots,t{-}1$) and $e_{t}$ (the current token embedding). Since both inputs are available when token $t$ arrives, the correction is causal: it depends on no future tokens.

Contextualization. The new token enters the block with the correction added to its raw embedding:

$$\tilde{x}_{t}=e_{t}+\texttt{correction}_{t} \tag{2}$$
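Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration only: the parameter-free layer norm, the tanh approximation of GELU, the bias terms, the random weights, and the names `contextualize` and `params` are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free layer norm over the last axis (learned gain/bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact variant is not stated).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def corr_ffn(u, W1, b1, W2, b2):
    # Linear(C -> 4C) -> GELU -> Linear(4C -> C).
    return gelu(u @ W1 + b1) @ W2 + b2

def contextualize(z_prev, e_t, params):
    # Eq. (1): correction from the cached z_{t-1} and the current embedding e_t.
    correction = corr_ffn(layer_norm(z_prev + e_t), *params)
    # Eq. (2): the block input is the raw embedding plus the correction.
    return e_t + correction

C = 8
rng = np.random.default_rng(0)
params = (rng.normal(0, 0.1, (C, 4 * C)), np.zeros(4 * C),
          rng.normal(0, 0.1, (4 * C, C)), np.zeros(C))
e_t = rng.normal(size=C)
z_prev = np.zeros(C)            # z_{-1} = 0 for the first token
x_tilde = contextualize(z_prev, e_t, params)
```

Both inputs to `contextualize` are available as soon as the token arrives, which is the causality property noted above.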

Block processing. A $D$-block unit applies $D$ transformer blocks with separate weights and standard residual connections:

$$\begin{aligned}
h^{(0)} &= \tilde{x}_{t} \\
a^{(i)} &= h^{(i-1)}+\texttt{Attn}_{i}\bigl(\texttt{LN}^{a}_{i}(h^{(i-1)});\;\mathcal{K}_{i},\mathcal{V}_{i}\bigr),\quad i=1,\ldots,D \\
h^{(i)} &= a^{(i)}+\texttt{FFN}_{i}\bigl(\texttt{LN}^{f}_{i}(a^{(i)})\bigr) \\
z_{t} &= h^{(D)}
\end{aligned} \tag{3}$$

Each $\texttt{Attn}_{i}$ is causal self-attention with Rotary Position Embeddings (RoPE) (Su et al., 2024). Each $\texttt{FFN}_{i}$ is Linear($C\to 4C$) $\to$ GELU $\to$ Linear($4C\to C$). The parameter $D$ controls inference depth.

Prediction.

$$\texttt{logits}_{t}=W_{\text{head}}\cdot\texttt{LN}_{f}(z_{t}),\quad W_{\text{head}}\in\mathbb{R}^{V\times C} \tag{4}$$

After prediction, $z_{t}$ is cached and the KV caches $\mathcal{K}_{i},\mathcal{V}_{i}$ are updated for future tokens.

3.2 Parallel Training

Sequential inference is exact but inherently serial. For training, we unroll the correction process $K$ times, processing all $T$ positions in parallel at each step. Gradients flow through $K$ unrolling steps rather than $T$ time steps, making the architecture trainable like a transformer despite being recurrent at inference.

Given token embeddings $e=(e_{1},\ldots,e_{T})$, with $z_{0}^{(k)}=\mathbf{0}$ for all $k$ (the initial cache, written $z_{-1}$ in the 0-indexed notation of Section 3.1), initialize $\texttt{correction}^{(0)}=\mathbf{0}$. For $k=1,\ldots,K$:

$$\begin{aligned}
\tilde{x}_{t}^{(k-1)} &= e_{t}+\texttt{correction}_{t}^{(k-1)} && \text{(contextualize)} \\
h^{(0)} &= \tilde{x}_{t}^{(k-1)} \\
a^{(i)} &= h^{(i-1)}+\texttt{Attn}_{i}\bigl(\texttt{LN}^{a}_{i}(h^{(i-1)});\;\mathcal{K}_{i},\mathcal{V}_{i}\bigr) && i=1,\ldots,D \\
h^{(i)} &= a^{(i)}+\texttt{FFN}_{i}\bigl(\texttt{LN}^{f}_{i}(a^{(i)})\bigr) \\
z_{t}^{(k)} &= h^{(D)} && \text{(block output)} \\
\texttt{correction}_{t}^{(k)} &= \texttt{corr\_ffn}\bigl(\texttt{LN}(z_{t-1}^{(k)}+e_{t})\bigr) && t=1,\ldots,T
\end{aligned} \tag{5}$$

The $D$ transformer blocks share weights across iterations $k$ but have separate weights across layers $i$.

Non-cumulative correction. Each iteration replaces the previous correction entirely: $\tilde{x}^{(k)}=e+\texttt{correction}^{(k)}$, not $\tilde{x}^{(k)}=\tilde{x}^{(k-1)}+f(\tilde{x}^{(k-1)})$. Only the last correction matters.

Past-only correction. The correction at position $t$ uses $z_{t-1}^{(k)}$. Corrections propagate left to right: the correction at the first position converges after one iteration, the one at the second position after two, and so on.
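The left-to-right propagation can be checked on a toy model. The sketch below replaces the transformer block with a hypothetical causal map (a running mean of the inputs followed by a tanh layer) and the correction FFN with a single tanh layer; `W`, `A`, `block_parallel`, `unroll`, and `stream` are illustrative names, not the paper's implementation. Under these assumptions, $K$ parallel unrolling steps reproduce the streaming corrections at the first $K$ positions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 6, 4
W = rng.normal(0, 0.5, (C, C))   # toy block weights (hypothetical)
A = rng.normal(0, 0.5, (C, C))   # toy correction weights (hypothetical)
e = rng.normal(size=(T, C))      # token embeddings, one row per position

def corr_fn(z_prev, e_t):
    # Toy stand-in for corr_ffn(LN(z_{t-1} + e_t)).
    return np.tanh((z_prev + e_t) @ A)

def block_parallel(x):
    # Toy causal block: position t mixes the running mean of inputs up to t.
    prefix_mean = np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]
    return x + np.tanh(prefix_mean @ W)

def unroll(e, K):
    corr = np.zeros_like(e)                       # correction^{(0)} = 0
    for _ in range(K):
        z = block_parallel(e + corr)              # non-cumulative: always e + corr
        z_shift = np.vstack([np.zeros((1, e.shape[1])), z[:-1]])  # previous-position output
        corr = corr_fn(z_shift, e)                # past-only correction
    return corr

def stream(e):
    z_prev = np.zeros(e.shape[1])
    running_sum = np.zeros(e.shape[1])            # plays the role of the KV cache
    corrs = []
    for t, e_t in enumerate(e):
        c = corr_fn(z_prev, e_t)
        x = e_t + c
        running_sum += x
        z_prev = x + np.tanh((running_sum / (t + 1)) @ W)
        corrs.append(c)
    return np.array(corrs)

c_seq = stream(e)
for K in (1, 2, 3):
    # The first K corrections agree with streaming after K unrolling steps.
    assert np.allclose(unroll(e, K)[:K], c_seq[:K])
```

With $K = T$ the two procedures agree at every position; for smaller $K$, later positions are only approximated.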

Random-depth training ($k_{\min}$). We sample $K\sim\text{Uniform}(k_{\min},K_{\max})$ each batch with $k_{\min}=2$, forcing the model to produce good predictions at any depth, which empirically encourages contraction.

Loss and dropout. The training loss (cross-entropy) is computed on the logits from the final iteration $K$ only. Dropout masks are resampled independently at each unrolling step $k$.
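A per-batch depth draw of this kind might look as follows. The helper name `sample_unroll_depth` is hypothetical, and treating both endpoints as inclusive integers is an assumption about how the uniform distribution above is discretized.

```python
import random

def sample_unroll_depth(k_min=2, k_max=5, rng=random):
    # Inclusive integer draw in [k_min, k_max]; endpoint handling is assumed.
    return rng.randint(k_min, k_max)

rng = random.Random(0)
depths = [sample_unroll_depth(2, 5, rng) for _ in range(1000)]
```

Each training batch would then run the unrolled forward pass with its own sampled depth.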

3.3 Streaming Inference

Algorithm 1 Context-Ready Streaming Inference
Require: Blocks $\texttt{Attn}_{1},\texttt{FFN}_{1},\ldots,\texttt{Attn}_{D},\texttt{FFN}_{D}$; correction FFN; KV caches $\mathcal{C}_{1},\ldots,\mathcal{C}_{D}$; previous block output $z_{\text{prev}}$ (init. $\mathbf{0}$)
for each new token with embedding $e$ do
  $\texttt{correction}\leftarrow\texttt{corr\_ffn}\bigl(\texttt{LN}(z_{\text{prev}}+e)\bigr)$  {from past context + token identity}
  $h\leftarrow e+\texttt{correction}$  {contextualized input}
  for $i=1,\ldots,D$ do
    $h\leftarrow h+\texttt{Attn}_{i}(\texttt{LN}^{a}_{i}(h);\;\mathcal{C}_{i})$;  update $\mathcal{C}_{i}$
    $h\leftarrow h+\texttt{FFN}_{i}(\texttt{LN}^{f}_{i}(h))$
  end for
  $z_{\text{prev}}\leftarrow h$  {cache for next token’s correction}
  $\texttt{logits}\leftarrow W_{\text{head}}\cdot\texttt{LN}_{f}(h)$
end for

When a new token arrives with embedding $e_{t}$, the model computes the correction from the cached $z_{t-1}$ and $e_{t}$, passes the corrected embedding through the $D$-block unit, caches the output $z_{t}$, and predicts. This is one forward pass, with no iteration over $K$ steps, regardless of the training depth $K$.

Why inference needs no iteration. During training, $K$ unrolling steps refine the corrections. At inference, this iteration is unnecessary: since earlier positions are already computed and cached, the correction for token $t$ is fully determined by $z_{t-1}$ and $e_{t}$ in a single pass (Theorem 2). The training approximation matters only during training: the first $K$ positions converge exactly after $K$ steps (Lemma 1 in Appendix A.2), and for later positions, the approximation error shrinks geometrically with $K$ when the correction operator is contractive (Theorem 3).
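A single streaming step of Algorithm 1 can be sketched in NumPy as below. This is an illustration under simplifying assumptions: single-head attention, no RoPE, no biases, parameter-free layer norm, and random weights. The names `init_layer`, `attend`, and `stream_step` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, W2):
    # Linear(C -> 4C) -> GELU -> Linear(4C -> C), biases omitted.
    return gelu(x @ W1) @ W2

def init_layer(rng, C, s=0.1):
    names = ("Wq", "Wk", "Wv", "Wo")
    layer = {n: rng.normal(0, s, (C, C)) for n in names}
    layer["W1"] = rng.normal(0, s, (C, 4 * C))
    layer["W2"] = rng.normal(0, s, (4 * C, C))
    layer["K"], layer["V"] = [], []              # per-layer KV cache
    return layer

def attend(h, layer):
    # Single-head attention over cached keys/values (RoPE, multi-head omitted).
    u = layer_norm(h)
    layer["K"].append(u @ layer["Wk"])           # update the cache with this token
    layer["V"].append(u @ layer["Wv"])
    K, V = np.stack(layer["K"]), np.stack(layer["V"])
    scores = K @ (u @ layer["Wq"]) / np.sqrt(len(u))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return (w @ V) @ layer["Wo"]

def stream_step(e, z_prev, layers, corr, W_head):
    h = e + ffn(layer_norm(z_prev + e), *corr)   # correction, then contextualized input
    for layer in layers:                         # D blocks with separate weights
        h = h + attend(h, layer)
        h = h + ffn(layer_norm(h), layer["W1"], layer["W2"])
    return h, W_head @ layer_norm(h)             # new z_prev and logits

rng = np.random.default_rng(0)
C, V, D = 8, 11, 2
layers = [init_layer(rng, C) for _ in range(D)]
corr = (rng.normal(0, 0.1, (C, 4 * C)), rng.normal(0, 0.1, (4 * C, C)))
W_head = rng.normal(0, 0.1, (V, C))
E = rng.normal(size=(V, C))                      # toy embedding table
z_prev = np.zeros(C)
for tok in [3, 1, 4]:
    z_prev, logits = stream_step(E[tok], z_prev, layers, corr, W_head)
```

Each token costs one pass through the $D$ blocks, and the only state carried between tokens is `z_prev` plus the per-layer caches.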

4 Theoretical Analysis

Full formal statements and proofs are in Appendix A.

Theorem 1 (Structural characterization).

Why non-cumulative and past-only? Under Assumptions I–II (Appendix A.1), if a weight-shared architecture unrolls a shared block $K$ times during training and applies it once per token during streaming, then for the unrolled training to converge to the same output that streaming produces, the correction must be non-cumulative ($\tilde{x}=e+\texttt{correction}$, not a sum of successive increments) and past-only (the correction at position $t$ depends only on $e_{t}$ and corrections from positions $1,\ldots,t{-}1$). The resulting system has a unique fixed point, and streaming computes it exactly.

Full proof in Appendix A.1.

Theorem 2 (Exact streaming fixed point).

Why is inference exact without iteration? During sequential generation, the correction at position $t$ depends only on $z_{0},\ldots,z_{t-1}$, which are already computed and cached. By prefix consistency (appending tokens does not change the operator at earlier positions, which holds by causal masking), the correction is exact in a single pass.

Full proof in Appendix A.2.

Theorem 3 (Training convergence).

How fast does the training iteration converge? If the correction operator $G$ is $L$-Lipschitz with $L<1$, then $K$ unrolling steps reduce the error to the fixed point by a factor of $L^{K}$. This governs the training approximation: positions beyond the first $K$ receive approximate corrections, and the approximation improves geometrically with $K$.
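To give a sense of scale, the bound can be written out for one illustrative value of $L$. The notation $c^{(k)}$ for the $k$-th correction iterate and $c^{\ast}$ for the fixed point is introduced here for the example, and $L=0.55$ is borrowed from the empirical $\hat{L}$ reported in Section 5.5, which is a trajectory-local diagnostic rather than the global constant in the theorem, so the numbers are indicative only.

```latex
c^{(k)} = G\bigl(c^{(k-1)}\bigr),\quad c^{\ast} = G(c^{\ast})
\;\Longrightarrow\;
\bigl\|c^{(K)} - c^{\ast}\bigr\| \;\le\; L^{K}\,\bigl\|c^{(0)} - c^{\ast}\bigr\|,
\qquad
L = 0.55:\quad L^{5} \approx 0.050,\quad L^{10} \approx 0.0025 .
```

Under this assumption, five unrolling steps would shrink the initial error by roughly a factor of 20.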

Full formal statement in Appendix A.3.

Proposition 1 (Depth separation).

Why can the context-ready architecture use fewer layers? Under a stylized state-tracking abstraction (Appendix A.4):

  (a) Context-ready propagation is handled by the correction chain. The context-ready architecture needs only $D$ layers for the per-token map; propagation across the sequence is handled by the recurrent correction chain rather than extra transformer layers.

  (b) Standard transformers need depth for propagation. With attention window $W$, a standard transformer needs at least $\lceil T/W\rceil$ layers just for information from the earliest tokens to reach position $T$, on top of the layers needed for the per-token map.

Full statement and proof in Appendix A.4. When propagation and local computation cannot be interleaved (as in pointer chasing), the standard transformer needs at least $\lceil T/W\rceil$ additional layers; in general, some layers may serve both roles.

5 Experiments

5.1 Setup

Data. OpenWebText (Gokaslan and Cohen, 2019) with byte-pair encoding (BPE, vocabulary 32,000). Context lengths: 64, 256, 512, and 1024 depending on the experiment. Ablations use English Wikipedia (BPE 16k); additional Wikipedia results in Appendix C.8. All results are validation perplexity (PPL) on held-out splits.

Training. AdamW optimizer (Loshchilov and Hutter, 2019), gradient clipping at 1.0. Learning rate $2\times 10^{-4}$ unless noted. Training depth $K=5$ with $k_{\min}=2$ by default; $K=10$ where noted. Dropout 0.2, FFN expansion factor 4. Full hyperparameters in Appendix C.

FLOP accounting. We report total FLOPs per token, including the transformer blocks, correction FFN, and prediction head ($W_{\text{head}}\in\mathbb{R}^{V\times C}$, costing $VC$ FLOPs/token). (Following standard convention in the literature, we count multiply-accumulate operations and label them FLOPs; actual floating-point operations are roughly $2\times$ this count.) Each transformer block costs $12C^{2}$ FLOPs ($4C^{2}$ for attention projections, $8C^{2}$ for the FFN). A context-ready model with $D$ blocks costs $D\times 12C^{2}+8C^{2}+VC$; a standard $N$-layer transformer costs $N\times 12C^{2}+VC$. Same-width comparisons (Tables 2 and 3) share the same prediction head and are FLOP-matched. Cross-width comparisons (Table 1) explore depth-width tradeoffs: wider models pay a larger prediction head, so total FLOPs differ. Despite higher total FLOPs, the wider, shallower context-ready models deliver lower wall-clock inference time because fewer sequential layers dominate latency on modern GPUs (Section 5.4).
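The accounting above is simple enough to reproduce directly. The helper below is a sketch (the name `flops_per_token` is hypothetical) that uses the OpenWebText vocabulary of 32,000 from Section 5.1 as its default for $V$; it recovers the FLOPs/tok figures quoted for the cross-width models.

```python
def flops_per_token(num_blocks, C, V=32_000, context_ready=False):
    block = 12 * C * C                          # 4C^2 attention projections + 8C^2 FFN
    corr = 8 * C * C if context_ready else 0    # correction FFN, context-ready only
    head = V * C                                # prediction head
    return num_blocks * block + corr + head

for label, kwargs in [
    ("D=5  C=1120", dict(num_blocks=5, C=1120, context_ready=True)),
    ("N=12 C=768 ", dict(num_blocks=12, C=768)),
    ("D=1  C=2048", dict(num_blocks=1, C=2048, context_ready=True)),
    ("N=6  C=1088", dict(num_blocks=6, C=1088)),
]:
    print(label, round(flops_per_token(**kwargs) / 1e6), "M FLOPs/tok")
```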

Baselines. Standard transformers with separate weights per layer and RoPE attention (Su et al., 2024). All results are single runs. The breadth of the evaluation, across widths ($C{=}50$–$2048$), depths ($D{=}1$–$23$), block sizes (64–1024), two datasets, a synthetic task, and multiple training strategies, provides stronger evidence than multi-seed runs on a single configuration: the context-ready architecture wins consistently across all these axes.

5.2 Cross-Width Results

Table 1: Context-ready vs. standard transformers on OpenWebText at two compute scales. FLOPs/tok includes the prediction head ($VC$). Despite higher total FLOPs, the wider context-ready models provide lower wall-clock inference time (Section 5.4). All context-ready models use add variant, $K{=}5$, $k_{\min}{=}2$.

| Model | FLOPs/tok | $C/N$ | Val PPL | $\Delta$ |
|---|---|---|---|---|
| $D{=}5$ $C{=}1120$ | 121M | 224 | 36.38 | |
| $D{=}6$ $C{=}1024$ | 117M | 171 | 36.56 | |
| Roformer $N{=}6$ $C{=}1088$ | 120M | 181 | 37.76 | $-1.38$ |
| Roformer $N{=}12$ $C{=}768$ | 110M | 64 | 37.83 | $-1.45$ |
| Roformer $N{=}2$ $C{=}1888$ | 146M | 944 | 42.99 | |
| Roformer $N{=}24$ $C{=}1088$ | 376M | 45 | 28.68 | |
| Roformer $N{=}12$ $C{=}1536$ | 389M | 128 | 29.01 | |
| $D{=}6$ $C{=}2048$ | 401M | 341 | 29.04 | $+0.03$ |
| Roformer $N{=}6$ $C{=}2176$ | 411M | 363 | 30.35 | |

Table 1 compares context-ready models against standard transformers across depth-width tradeoffs at two compute scales. At the smaller scale (block size 256, 100K iterations), context-ready $D{=}5$ at $C{=}1120$ achieves 36.38 PPL, beating both roformer $N{=}6$ at $C{=}1088$ (37.76, $\Delta=-1.38$) and roformer $N{=}12$ at $C{=}768$ (37.83, $\Delta=-1.45$). At the larger scale (block size 256, 200K iterations), context-ready $D{=}6$ at $C{=}2048$ (29.04) matches roformer $N{=}12$ at $C{=}1536$ (29.01), despite using a much higher width-to-depth ratio. Depth has diminishing returns: going from $N{=}12$ to $N{=}24$ gains only 0.33 PPL.

5.3 Correction Efficiency

Table 2: Correction efficiency at $C=1024$, block size 256, 200K iterations (OWT). Same width throughout, so the prediction head is identical across all rows and the comparison is FLOP-matched.

| Model | Inference FLOPs | Val PPL |
|---|---|---|
| Roformer $N{=}12$ | $144C^{2}$ | 33.41 |
| Roformer $N{=}13$ | $156C^{2}$ | 32.82 |
| $D{=}12$ context-ready | $\mathbf{152C^{2}}$ | 32.28 |
| Roformer $N{=}14$ | $168C^{2}$ | 32.34 |
| Roformer $N{=}24$ | $288C^{2}$ | 29.42 |
| $D{=}23$ context-ready | $\mathbf{284C^{2}}$ | 28.89 |

Table 2 isolates the value of the correction mechanism at fixed width ($C=1024$), so the prediction head is identical across all rows and the comparison is FLOP-matched. $D=12$ context-ready ($152C^{2}$ FLOPs) beats roformer $N{=}13$ ($156C^{2}$) by 0.54 PPL and matches roformer $N{=}14$ ($168C^{2}$). The correction FFN adds only $8C^{2}$ FLOPs yet provides a genuine PPL improvement at the same parameter budget. At deeper scale, $D=23$ ($284C^{2}$) edges out $N=24$ ($288C^{2}$) by 0.53 PPL. This is directionally consistent with the $D{=}12$ result, but it is a single-run margin that should be read as evidence of parity rather than robust superiority.

5.4 Width Scaling

Table 3: Width scaling: $D{=}2$ context-ready vs. roformer $N{=}2$ at the same width (block size 64, Chinchilla-matched token budgets, OWT). The relative advantage grows with width.

| $C$ | $N{=}2$ PPL | $D{=}2$ PPL | $\Delta$ | Relative |
|---|---|---|---|---|
| 256 | 158.83 | 143.57 | $-15.26$ | 9.6% |
| 512 | 95.48 | 84.69 | $-10.79$ | 11.3% |
| 1024 | 72.83 | 60.84 | $-11.99$ | 16.5% |

Table 3 tests whether the correction advantage is a small-scale artifact. Comparing $D{=}2$ vs. $N{=}2$ at the same width with Chinchilla-matched token budgets, the relative improvement grows from 9.6% at $C=256$ to 16.5% at $C=1024$.

Token-matched results. At $C=1024$ with block size 64, $D{=}x$ beats $N{=}x$ at every depth tested once training progresses past a crossover point: $D{=}1$ beats $N{=}1$ by 34.4 PPL (crossover at ${\sim}424$M tokens), $D{=}2$ by 7.6 PPL (${\sim}565$M), $D{=}3$ by 3.0 PPL (${\sim}835$M), and $D{=}6$ by 0.7 PPL (${\sim}1{,}032$M). All gaps are still growing at the end of training. The correction mechanism provides a consistent advantage at every depth; the advantage is largest when depth is small, consistent with the correction doing the most work when the block has the fewest layers.

Inference latency. We measure autoregressive generation speed on an A100 over 10,000 tokens with KV caching. $D{=}1$ $C{=}2048$ (149M FLOPs/tok) generates at 919 tokens/s vs. 351 tokens/s for roformer $N{=}6$ $C{=}1088$ (120M FLOPs/tok), a $2.6\times$ speedup despite higher total FLOPs. $D{=}5$ $C{=}1120$ (121M FLOPs/tok) generates at 349 tokens/s vs. 201 tokens/s for roformer $N{=}12$ $C{=}768$ (110M FLOPs/tok), a $1.7\times$ speedup. The wider, shallower models are faster because fewer sequential layers dominate inference latency on modern GPUs, even when total FLOPs are higher. Per-token latency is flat across sequence length, confirming that KV caching amortizes attention cost to $O(T)$ per token. Full timing details in Appendix C.9.

KV cache savings. Fewer layers also reduce KV cache memory ($C\times D$ per token). Despite being wider, $D{=}5$ at $C{=}1120$ uses $1.6\times$ less cache than $N{=}12$ at $C{=}768$; $D{=}1$ at $C{=}2048$ uses $3.2\times$ less than $N{=}6$ at $C{=}1088$.
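The two cache ratios follow directly from the width-times-layers count per token; the arithmetic below simply spells them out, using only the figures quoted in this paragraph.

```latex
\frac{12 \times 768}{5 \times 1120} = \frac{9216}{5600} \approx 1.6,
\qquad
\frac{6 \times 1088}{1 \times 2048} = \frac{6528}{2048} \approx 3.2 .
```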

5.5 Single-Layer Performance ($D{=}1$)

Proposition 1(a) predicts that $D{=}1$ may match deeper transformers when the task is dominated by context propagation and sufficient $K$ and width are provided. We test $D{=}1$, $C{=}2048$ (149M FLOPs/tok) against roformer $N{=}6$, $C{=}1088$ (120M FLOPs/tok) at block size 1024 using three training strategies: fixed $K{=}10$, random-depth $K{=}10$ with $k_{\min}{=}2$, and fine-tuning from a pretrained $N{=}1$ roformer (batch 16).

Table 4: Three training strategies for $D{=}1$ $C{=}2048$ (149M FLOPs/tok) vs. roformer $N{=}6$ $C{=}1088$ (120M FLOPs/tok), block size 1024, OWT. Fine-tune starts from $N{=}1$ $C{=}2048$ pretrained for 85K iterations; “total iters” includes pretraining.

| Strategy | PPL at 100K | Best PPL | Notes |
|---|---|---|---|
| Roformer $N{=}6$ (baseline) | 35.37 | | |
| $K{=}10$, fixed depth | 34.40 | 33.63 (110K) | Beats $N{=}6$ at ${\sim}65$K |
| $K{=}10$, $k_{\min}{=}2$ | 36.28 | 33.63 (135K) | +1.88 penalty at 100K |
| $K{=}10$, fine-tuned | 46.66† | 31.35 (215K) | Best final PPL |

† At 100K total (15K fine-tune); fine-tune reaches 35.92 at 150K total.

Table 4 compares the three strategies. In these $D{=}1$ experiments, larger training depth $K$ substantially improves the model’s ability to exploit the available recurrent depth. With fixed $K{=}10$, $D{=}1$ surpasses $N{=}6$ at ${\sim}65$K iterations and reaches 34.40 at 100K ($\Delta=-0.97$, still improving) at $2\times$ per-iteration cost.

Random-depth training ($k_{\min}{=}2$) incurs a modest penalty of 1.88 PPL at 100K relative to fixed depth, but converges to the same quality ${\sim}25$K later (both reach 33.63). The penalty yields $\hat{L}$ values consistent with contraction ($\hat{L}\approx 0.55$ vs. $\hat{L}>1.0$ for fixed $K$).

Fine-tuning from a pretrained $N{=}1$ roformer ($K{=}1$) with a zero-initialized correction FFN at $K{=}10$ reaches the best final PPL: 31.35 at 215K total iterations (the $N{=}6$ baseline is at 100K, so total compute differs).

The gap between $D{=}1$ and $N{=}6$ shrinks with block size ($+1.19$ at 256, $+0.46$ at 512, $+0.31$ at 1024 with $K{=}5$), consistent with longer contexts providing more sequential steps. Full block-size scaling in Appendix C.5.

5.6 Pointer Chasing: Depth Separation

Table 5: Pointer chasing (10-hop, 5 keys, 10 values, windowed attention). $N$-layer transformers exhibit a staircase-like depth dependence: deeper models solve more levels. The context-ready $D{=}1$ model (BPTT) solves all levels.

| Model | Levels solved | Iters |
|---|---|---|
| Roformer $N{=}1$ | 1 / 11 | 50K |
| Roformer $N{=}3$ | 3 / 11 | 50K |
| Roformer $N{=}5$ | 6 / 11 | 50K |
| Roformer $N{=}10$ | 7 / 11 | 50K |
| Roformer $N{=}11$ | 8 / 11 | 50K |
| Roformer $N{=}12$ | 11 / 11 | 50K |
| $D{=}1$ context-ready (BPTT) | 11 / 11 | ${\sim}16$K |

An $N$-layer transformer can compose at most $N$ sequential reasoning steps in a single forward pass (Merrill and Sabharwal, 2023). To test whether the context-ready architecture can exceed this depth limit, we design a pointer-chasing task. The input contains a base table that maps keys to values, followed by $H{=}10$ levels of index tables, each of which maps new keys to keys at the previous level. Answering a query at level $\ell$ therefore requires chaining $\ell$ sequential lookups through the tables, and we use windowed causal attention (window $=38$) to prevent the model from bypassing these chains by attending directly to the base table. A full task specification with a worked example is given in Appendix B.

Table 5 shows a staircase-like depth dependence: deeper transformers solve more levels, while shallow transformers fail well before full depth. The context-ready $D{=}1$ model, trained with BPTT to exploit the full sequential depth of the recurrent correction chain, solves all 11 levels in ${\sim}16$K iterations and scales to 20 hops (all 21 levels).

5.7 Fine-Tuning from Pretrained

Any standard transformer can be converted to context-ready by adding a zero-initialized correction FFN and fine-tuning. To isolate the effect of conversion from additional training, we compare against a same-iteration control: the original roformer trained for the same total number of iterations without conversion (per-iteration compute differs because fine-tuning runs the block $K$ times). At $C{=}1408$, $N{=}12$ reaches 29.92 PPL at 200K iterations; continued training to 400K yields 27.20. Converting at 200K and fine-tuning to 400K total iterations yields 26.14, a reduction of 1.06 PPL ($\Delta=-1.06$) relative to the continued-training baseline at matched iterations. At $C{=}1024$, converting $N{=}24$ to $D{=}24$ improves from 29.42 to 28.99 ($-0.43$) in 18K fine-tuning iterations. The zero-initialized correction ensures no disruption at conversion (the model is function-preserving). During fine-tuning, $D{=}12$ sees a transient $+0.11$ PPL increase before recovering, while $D{=}24$ shows no transient increase.
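The function-preserving property can be spelled out in one line. The sketch below assumes that zero initialization is applied to the output projection of the correction FFN (written $W_{2}$, with bias $b_{2}$); the paper does not state which parameters are zeroed, and the bias terms are notation introduced here.

```latex
\texttt{corr\_ffn}(u) = W_{2}\,\mathrm{GELU}(W_{1}u + b_{1}) + b_{2},
\qquad
W_{2}=\mathbf{0},\; b_{2}=\mathbf{0}
\;\Longrightarrow\;
\texttt{correction}_{t} = \mathbf{0}
\;\Longrightarrow\;
\tilde{x}_{t} = e_{t} .
```

With $\tilde{x}_{t}=e_{t}$ the blocks receive exactly the raw embeddings, so at the moment of conversion the model computes the same outputs as the original transformer.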

5.8 Ablations and Diagnostics

We compare against alternative architectures that attempt the same goal. Among the variants tested, the gain depends on a dedicated correction network with its own weights.

Convergence and sequential exactness. Table 6 shows the full depth progression. Convergence is geometric: at D=1D{=}1 (block size 1024), K=2K{=}2 closes 91% of the K=1K{=}1-to-K=5K{=}5 gap and K=3K{=}3 closes 98%, consistent with Theorem 3. Sequential K=1K{=}1 matches parallel K=10K{=}10 to within 0.01 PPL at every configuration, confirming Theorem 2. The correction contribution (“Corr.” column) grows as DD shrinks: 38–55 PPL at D=1D{=}1 vs. 3.9 at D=23D{=}23.

Table 6: PPL at parallel $K=1,2,3,5,10$ and sequential $K=1$ (OWT). $D=1$ at $C=2048$; $D=5,12,23$ at $C=1024$, block size 256. The “Corr.” column measures how much the correction mechanism contributes: the PPL difference between $K=1$ (no correction) and $K=5$ (converged).

| $D$ | Block size | $K=1$ | $K=2$ | $K=3$ | $K=5$ | $K=10$ | Seq | Corr. |
|---|---|---|---|---|---|---|---|---|
| 1 | 256 | 70.80 | 36.21 | 33.23 | 32.70 | 32.79 | 32.80 | 38.1 |
| 1 | 512 | 73.22 | 34.33 | 31.24 | 30.61 | 30.68 | 30.69 | 42.6 |
| 1 | 1024 | 84.13 | 34.42 | 30.39 | 29.51 | 29.43 | 29.43 | 54.7 |
| 5 | 256 | 58.17 | 40.18 | 38.80 | 38.60 | 38.61 | 38.62 | 19.6 |
| 12 | 256 | 38.33 | 32.58 | 32.30 | 32.29 | 32.29 | 32.29 | 6.0 |
| 23 | 256 | 32.80 | 29.03 | 28.89 | 28.88 | 28.88 | 28.88 | 3.9 |
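The gap-closure percentages quoted for $D=1$ at block size 1024 can be reproduced from the Table 6 row with a few lines of arithmetic. This is a sketch of that calculation; the helper name `gap_closed` is ours.

```python
# Parallel-K perplexities for D=1, block size 1024 (values copied from Table 6).
ppl = {1: 84.13, 2: 34.42, 3: 30.39, 5: 29.51, 10: 29.43}

def gap_closed(ppl_by_k, k, k_start=1, k_ref=5):
    """Fraction of the K=k_start -> K=k_ref PPL gap closed after k unrolling steps."""
    total = ppl_by_k[k_start] - ppl_by_k[k_ref]
    return (ppl_by_k[k_start] - ppl_by_k[k]) / total

closed_k2 = gap_closed(ppl, 2)
closed_k3 = gap_closed(ppl, 3)
print(f"K=2 closes {closed_k2:.0%}, K=3 closes {closed_k3:.0%}")
```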

Contraction. With $k_{\min}=2$, empirical $\hat{L}\in[0.50,0.72]$ (measured as $\hat{L}=\max_{t}\|c_{K,t}-c_{K-1,t}\|/\|c_{K-1,t}-c_{K-2,t}\|$); without $k_{\min}$, $\hat{L}\in[0.88,1.20]$. $\hat{L}$ is a trajectory-local diagnostic, not the global $L$ in Theorem 3.
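The diagnostic $\hat{L}$ can be computed from three consecutive correction iterates. The sketch below implements the stated ratio on toy data; the function name and the `eps` guard against a zero denominator are assumptions of this example, not details from the experiments.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def diff(a, b):
    return [x - y for x, y in zip(a, b)]

def empirical_contraction(c_km2, c_km1, c_k, eps=1e-12):
    """L_hat = max_t ||c_{K,t} - c_{K-1,t}|| / ||c_{K-1,t} - c_{K-2,t}||.

    Each argument is a list over positions t of correction vectors.
    Positions whose previous step did not move (denominator <= eps) are
    skipped; that guard is an assumption of this sketch.
    """
    ratios = []
    for a, b, c in zip(c_km2, c_km1, c_k):
        den = norm(diff(b, a))
        if den > eps:
            ratios.append(norm(diff(c, b)) / den)
    return max(ratios)

# Toy iterates: position 0 shrinks its step by 0.5, position 1 by 0.25.
c0 = [[0.0, 0.0], [0.0, 0.0]]
c1 = [[1.0, 0.0], [0.0, 4.0]]
c2 = [[1.5, 0.0], [0.0, 5.0]]
L_hat = empirical_contraction(c0, c1, c2)
print(L_hat)
```

Because the maximum is taken over positions along a single trajectory, a value below 1 here does not by itself certify the global Lipschitz condition of Theorem 3.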

Correction FFN is essential. Using the block’s own residual as the correction (“block_head”) gives no improvement (27.32 vs. 27.19 for a standard transformer). With a dedicated correction FFN, context-ready beats the FLOP-matched baseline by 1.8 PPL (23.98 vs. 25.78, Table A2). Tying correction FFN weights to the block FFN collapses performance. The add variant ($\texttt{LN}(z_{t-1}+e_{t})$) matches or beats token-blind at all depths.

Training efficiency. $K=5$ suffices at $D\geq 5$; $K=2$ with torch.compile achieves $1.7\times$ faster training at 1.07 PPL cost. Full ablation tables are in Appendix C.2.

6 Conclusion

A standard transformer assigns each new token a context-free embedding and relies entirely on depth to contextualize it. The context-ready transformer shortcuts this process: a correction derived from the previous position’s block output pre-contextualizes the token before it enters the block. Two structural choices, non-cumulative correction and past-only contextualization, make this exact at streaming inference (Theorem 2) and trainable with $K$ unrolling steps (each processing all $T$ positions in parallel) rather than full BPTT.

A $D=5$ model beats a 12-layer transformer while generating $1.7\times$ faster; $D=1$ beats a 6-layer transformer with a $2.6\times$ speedup, and sequential inference matches parallel $K=10$ to within 0.01 PPL. The advantage grows with width and context length. Fewer layers also reduce KV cache memory. On pointer chasing, $D=1$ solves all composition levels that standard transformers need proportional depth to reach. Any pretrained transformer can be converted by adding a zero-initialized correction FFN and fine-tuning.

Limitations. Datasets and scale. Results are on OpenWebText, Wikipedia, and a synthetic task. We have validated at 110–150M and 375–410M FLOPs/token but not yet on standard benchmarks or at billion-parameter scale. Training cost. From-scratch training runs the block $K$ times per iteration rather than once. Backpropagation through $K$ steps also requires storing activations for all $K$ passes, scaling activation memory by $K\times$. Two approaches can reduce this cost: (i) pretrain as a standard transformer at $K=1$ cost, then convert and fine-tune, which in our experiments matches or exceeds from-scratch context-ready training (Sections 5.5 and 5.7); (ii) random-depth training ($k_{\min}$), which samples $K$ each batch and converges to the same quality as fixed $K$ (Section 5.5). Prefill. Processing a prompt of length $T$ in parallel requires $K$ unrolling steps, giving effective prefill depth $K\times D$ vs. $N$ for a standard transformer.

References

  • Bai et al. (2019) Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, 2019.
  • Bai et al. (2021) Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Accelerating feedforward computation via parallel nonlinear equation solving. In International Conference on Machine Learning, 2021.
  • Banino et al. (2021) Andrea Banino, Jan Balaguer, and Charles Blundell. PonderNet: Learning to ponder. In ICML Workshop on Uncertainty and Robustness in Deep Learning, 2021.
  • Beck et al. (2024) Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, et al. xLSTM: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024.
  • De et al. (2024) Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
  • Dehghani et al. (2019) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019.
  • Elhoushi et al. (2024) Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. LayerSkip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710, 2024.
  • Fu et al. (2024) Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of LLM inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024.
  • Geiping et al. (2025) Jonas Geiping, Tom Goldstein, Avi Schwarzschild, C. Bayan Bruss, et al. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
  • Gokaslan and Cohen (2019) Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  • Graves (2016) Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
  • Gu and Dao (2024) Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of ICML 2024, 2024.
  • Kou et al. (2024) Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. CLLMs: Consistency large language models. arXiv preprint arXiv:2403.00835, 2024.
  • Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  • Merrill and Sabharwal (2023) William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023.
  • Peng et al. (2023) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of EMNLP 2023, 2023.
  • Siegelmann and Sontag (1995a) Hava T. Siegelmann and Eduardo D. Sontag. Computational capabilities of recurrent NARX neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 26(4):535–544, 1995a.
  • Siegelmann and Sontag (1995b) Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995b.
  • Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  • Yoo et al. (2026) Seunghyun Yoo et al. ADEPT: Adaptive dynamic early-exit process for transformers. arXiv preprint arXiv:2601.03700, 2026.

NeurIPS Paper Checklist

  1. Claims
     Answer: [Yes]
     Justification: The abstract and introduction state the architectural contributions and experimental results with specific numbers. Limitations (datasets, scale, training cost) are discussed explicitly in Section 6.

  2. Limitations
     Answer: [Yes]
     Justification: Section 6 includes a dedicated Limitations paragraph covering dataset scope, scale, training cost, and single-run reporting.

  3. Theory assumptions and proofs
     Answer: [Yes]
     Justification: All theorems and propositions are numbered and cross-referenced. Assumptions are stated explicitly in the appendix proofs (Appendix A.1, A.2, A.3, A.4). Main-body statements reference the appendix for full proofs.

  4. Experimental result reproducibility
     Answer: [Yes]
     Justification: The architecture is fully described in Section 3. Hyperparameters, optimizer settings, learning rate schedules, and training details are provided in Appendix C. All experiments use publicly available data (OpenWebText, Wikipedia).

  5. Open access to data and code
     Answer: [No]
     Justification: Code is not released at submission time due to patent considerations. The architecture and training procedure are described in sufficient detail to reproduce.

  6. Experimental setting/details
     Answer: [Yes]
     Justification: Section 5.1 describes data, tokenization, context lengths, optimizer, and baselines. Appendix C provides full hyperparameter tables.

  7. Experiment statistical significance
     Answer: [No]
     Justification: All results are single runs. We acknowledge this explicitly and report trends across multiple configurations (widths, depths, block sizes) rather than relying on individual comparisons.

  8. Experiments compute resources
     Answer: [Yes]
     Justification: Section 5.1 reports FLOPs per token and training iterations. Inference timing is measured on a single A100 GPU (Appendix C.9).

  9. Code of ethics
     Answer: [Yes]
     Justification: The research conforms with the NeurIPS Code of Ethics. No human subjects, private data, or dual-use concerns.

  10. Broader impacts
     Answer: [N/A]
     Justification: This is foundational architecture research. The work reduces inference cost for language models, which has broadly positive efficiency implications but no direct path to specific negative applications beyond those inherent to language modeling in general.

  11. Safeguards
     Answer: [N/A]
     Justification: No pretrained language models or scraped datasets are released.

  12. Licenses for existing assets
     Answer: [Yes]
     Justification: OpenWebText is cited [Gokaslan and Cohen, 2019]. Wikipedia is publicly available. All referenced works are cited.

  13. New assets
     Answer: [N/A]
     Justification: No new datasets, models, or code are released with this submission.

  14. Crowdsourcing and research with human subjects
     Answer: [N/A]
     Justification: No crowdsourcing or human subjects research.

  15. Institutional review board (IRB) approvals or equivalent for research with human subjects
     Answer: [N/A]
     Justification: No human subjects research.

  16. Declaration of LLM usage
     Answer: [N/A]
     Justification: LLMs were not used as a component of the core methodology.

Appendix A Full Proofs

Notation. Throughout the appendix, $e$ denotes the base token embeddings, $G$ denotes the full correction operator, and $c^{(k)}$ denotes the correction vector at iteration $k$. The fixed point is $c^{*}=G(c^{*})$.

A.1 Proof of Theorem 1 (Structural Characterization)

Formal statement.

Setup. Let $e_{t}\in\mathbb{R}^{d}$ denote the base embedding at position $t$, let Block denote a $D$-layer transformer block with causal attention, and let $F:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{d}$ be a correction function. The architecture processes position $t$ as follows: form the corrected input $\tilde{x}_{t}=e_{t}+F(e_{t},z_{t-1})$, compute $z_{t}=\texttt{Block}(\tilde{x}_{t};\,\tilde{x}_{<t})$, and cache $z_{t}$ for future positions.

Assumptions.

  (I) Additive correction. The corrected input has the form $\tilde{x}_{t}=e_{t}+F(e_{t},z_{t-1})$, where $F$ is a continuous correction function that maps into a bounded ball $\overline{B}(0,B)\subset\mathbb{R}^{d}$.

  (II) Single-pass streaming. During inference, tokens are processed left-to-right, one at a time. The correction at position $t$ uses the cached block output $z_{t-1}$ from the previous position and the current embedding $e_{t}$. By causality, $z_{t-1}$ depends only on positions $\leq t-1$.

Conclusions.

  (a) Past-only. By streaming (Property II), the correction at position $t$ depends only on previously computed outputs: $F(e_{t},z_{t-1})$ where $z_{t-1}$ encodes positions $0,\ldots,t-1$.

  (b) Non-cumulative. Among additive unrolling strategies, only the non-cumulative form $\tilde{x}_{t}^{(k)}=e_{t}+F(e_{t},z_{t-1}^{(k)})$ is compatible with streaming. The cumulative (resnet) alternative $\tilde{x}_{t}^{(k)}=\tilde{x}_{t}^{(k-1)}+F(\tilde{x}_{t}^{(k-1)},z_{t-1}^{(k)})$ fails because $F(\tilde{x}_{t}^{*},z_{t-1}^{*})=0$ at the fixed point.

  (c) Existence and uniqueness. The non-cumulative past-only system has a unique fixed point, and streaming computes it exactly.

Proof.

Part (a): Past-only. Streaming (Property II) processes tokens left-to-right. When token $t$ arrives, the correction uses $z_{t-1}$ (cached from the previous token) and $e_{t}$. Since causal attention ensures $z_{t-1}$ depends only on positions $\leq t-1$, the correction at position $t$ is a function of past outputs only.

Part (b): Non-cumulative. Given the additive correction form (Property I), there are two ways to unroll $K$ training steps:

Non-cumulative: $\tilde{x}_{t}^{(k)}=e_{t}+F(e_{t},z_{t-1}^{(k)})$. Each step recomputes the correction from the base embedding $e_{t}$. At the fixed point, $F(e_{t},z_{t-1}^{*})=c_{t}^{*}$, a nonzero correction. In streaming, past outputs are exact (from cache), so a single evaluation gives $\tilde{x}_{t}=e_{t}+F(e_{t},z_{t-1}^{*})=e_{t}+c_{t}^{*}=\tilde{x}_{t}^{*}$, which is exact.

Cumulative (resnet): $\tilde{x}_{t}^{(k)}=\tilde{x}_{t}^{(k-1)}+F(\tilde{x}_{t}^{(k-1)},z_{t-1}^{(k)})$. Each step adds an increment to the previous output. At the fixed point, $\tilde{x}_{t}^{*}=\tilde{x}_{t}^{*}+F(\tilde{x}_{t}^{*},z_{t-1}^{*})$, which forces $F(\tilde{x}_{t}^{*},z_{t-1}^{*})=0$: the correction function learns to output zero at convergence. In streaming, the single step starts from $\tilde{x}_{t}^{(0)}=e_{t}$ and gives $\tilde{x}_{t}=e_{t}+F(e_{t},z_{t-1}^{*})$. But $F$ was trained to vanish at $\tilde{x}_{t}^{*}$, not at $e_{t}$, so the result is neither zero nor the correct fixed point $\tilde{x}_{t}^{*}$.

Therefore, among additive unrolling strategies, the non-cumulative form is the one that reproduces the correct fixed-point output during streaming.

Part (c): Existence and uniqueness. By part (a), the system is triangular: $\tilde{x}_{1}^{*}$ is determined first (from $z_{0}=\mathbf{0}$ and $e_{1}$), then $z_{1}^{*}$, then $\tilde{x}_{2}^{*}$, and so on. Each step is a deterministic evaluation, so the fixed point exists, is unique, and is constructively computable by left-to-right evaluation, which is exactly the streaming computation. ∎

A.2 Proof of Theorem 2 (Exact Streaming)

Formal statement. Let $c^{*(T)}$ be the unique fixed point for a sequence of length $T$. Assume prefix consistency: for any $T^{\prime}>T$ and $t\leq T$, the correction operator satisfies $G_{t}^{(T^{\prime})}=G_{t}^{(T)}$, i.e., appending tokens beyond position $T$ does not change the operator at earlier positions. (This holds by construction for any causal architecture where the operator at position $t$ depends only on positions $\leq t$.) Then:

  (a) Prefix invariance. $c^{*(T^{\prime})}_{t}=c^{*(T)}_{t}$ for $t\leq T$ and any $T^{\prime}>T$.

  (b) Exactness. If past corrections are at the fixed point, the streaming operator produces the exact fixed-point correction for the new token.

  (c) No contraction needed. Exactness holds without any contraction assumption.

The following lemma establishes that the Jacobi iteration converges in finitely many steps, which is the foundation for the streaming exactness proof.

Lemma 1 (Finite-step exactness).

Let $G$ be a past-only correction operator. Then: (a) The fixed point $c^{*}$ exists and is unique, without contraction. (b) After $k$ Jacobi iterations from any $c^{(0)}$: $c_{t}^{(k)}=c_{t}^{*}$ for all $t\leq k$. (c) The iteration reaches the exact fixed point after at most $T$ steps: $c^{(T)}=c^{*}$.

Proof.

(a) By construction: $G_{1}$ is a constant, so $c_{1}^{*}$ is determined. Given $c_{1}^{*},\ldots,c_{t-1}^{*}$, we have $c_{t}^{*}=G_{t}(c_{1}^{*},\ldots,c_{t-1}^{*})$ uniquely.

(b) By induction. Base: $c_{1}^{(1)}=G_{1}=c_{1}^{*}$. Step: assume $c_{s}^{(k)}=c_{s}^{*}$ for $s\leq k$. Then $c_{k+1}^{(k+1)}=G_{k+1}(c_{1}^{(k)},\ldots,c_{k}^{(k)})=G_{k+1}(c_{1}^{*},\ldots,c_{k}^{*})=c_{k+1}^{*}$.

(c) Set $k=T$ in (b). ∎
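Lemma 1 can be checked numerically on a toy past-only operator. The sketch below uses a hypothetical scalar operator `G` with random weights (deliberately not a contraction) and compares parallel Jacobi iteration against left-to-right streaming; positions are 0-indexed, so the first `k` positions should be exact after `k` iterations.

```python
import math
import random

random.seed(0)
T = 6
# Hypothetical past-only operator: position t sees only corrections at s < t.
bias = [random.uniform(-1, 1) for _ in range(T)]
weight = [[random.uniform(-2, 2) for _ in range(T)] for _ in range(T)]

def G(t, c):
    """Correction at position t as a function of c[0..t-1] only."""
    return math.tanh(bias[t] + sum(weight[t][s] * c[s] for s in range(t)))

def streaming():
    """Left-to-right evaluation: one pass, each position computed once."""
    c = []
    for t in range(T):
        c.append(G(t, c))
    return c

def jacobi(c0, num_iters):
    """Parallel update of all positions from the previous iterate."""
    c = list(c0)
    for _ in range(num_iters):
        c = [G(t, c) for t in range(T)]
    return c

c_star = streaming()
c_init = [random.uniform(-5, 5) for _ in range(T)]  # arbitrary starting point
exact_prefix = [jacobi(c_init, k)[:k] == c_star[:k] for k in range(T + 1)]
print(exact_prefix)
```

The equality is exact rather than approximate because, once the earlier positions hold their fixed-point values, the Jacobi update at the next position performs the same arithmetic as the streaming step.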

Proof of Theorem 2.

(a) By prefix consistency, $G_{t}^{(T^{\prime})}=G_{t}^{(T)}$ for $t\leq T$. Hence the fixed-point equations for positions $1,\ldots,T$ are identical in the length-$T$ and length-$T^{\prime}$ systems, so $c^{*(T^{\prime})}_{t}=c^{*(T)}_{t}$.

(b) $c^{*(T+1)}_{T+1}=G^{(T+1)}_{T+1}(c^{*(T+1)}_{1},\ldots,c^{*(T+1)}_{T})$. By (a), $c^{*(T+1)}_{t}=c^{*(T)}_{t}$ for $t\leq T$. The streaming operator computes exactly $G^{(T+1)}_{T+1}$ using cached corrections $c^{*(T)}$. Concretely, the abstract operator $G_{t}$ is realized by the correction FFN applied to the cached block output $z_{t-1}$ and the current token embedding $e_{t}$: once earlier positions are at their fixed-point values, $z_{t-1}$ is fully determined, and the streaming step computes $c_{t}=G_{t}(c_{1}^{*},\ldots,c_{t-1}^{*};e_{t})$ exactly.

(c) By induction from the base case and part (b). ∎

A.3 Proof of Theorem 3 (Convergence)

Formal statement. Let $G_{t}(c_{<t};e)$ denote the full correction operator at position $t$. Assume $\|G(c;e)-G(c^{\prime};e^{\prime})\|\leq L\|c-c^{\prime}\|+M\|e-e^{\prime}\|$ for all $c,c^{\prime},e,e^{\prime}$. Write $c^{*}$ and $c^{*\prime}$ for the fixed points under embeddings $e$ and $e^{\prime}$, respectively. If $L<1$, then:

  (a) Contraction. $\|c^{(k)}-c^{*}\|\leq L^{k}\|c^{(0)}-c^{*}\|$.

  (b) Warm-start bound. $\|G(c^{*};e^{\prime})-c^{*\prime}\|\leq\frac{LM}{1-L}\|e^{\prime}-e\|$.

Proof.

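Part (a). By the Banach fixed-point theorem with contraction constant $L<1$.

Part (b). Since $c^{*}=G(c^{*};e)$ and $c^{*\prime}=G(c^{*\prime};e^{\prime})$, the Lipschitz assumption gives $\|c^{*}-c^{*\prime}\|\leq M\|e-e^{\prime}\|+L\|c^{*}-c^{*\prime}\|$, so $\|c^{*}-c^{*\prime}\|\leq\frac{M}{1-L}\|e^{\prime}-e\|$. Applying the assumption again with the embedding fixed at $e^{\prime}$, $\|G(c^{*};e^{\prime})-c^{*\prime}\|=\|G(c^{*};e^{\prime})-G(c^{*\prime};e^{\prime})\|\leq L\|c^{*}-c^{*\prime}\|\leq\frac{LM}{1-L}\|e^{\prime}-e\|$. ∎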

Theorem 3 assumes $L<1$ but does not say how to verify this from the per-position Jacobian structure. The following lemma provides a practical bound: given bounds on the partial derivatives $\|\partial G_{t}/\partial c_{s}\|$, one can choose a weighted norm that makes the global Lipschitz constant explicit.

Lemma 2 (Causal contraction bound).

Let $G(c;e)$ be the full-sequence correction operator with past-only dependencies, and suppose $\|\partial G_{t}/\partial c_{s}\|_{\textup{op}}\leq a_{t,s}$ for $s<t$, where $\|A\|_{\textup{op}}=\sup_{\|x\|=1}\|Ax\|$ is the operator norm. For positive weights $w=(w_{1},\ldots,w_{T})$, define $\|c\|_{w}=\max_{t}w_{t}\|c_{t}\|$. Then $G$ is $L_{w}$-Lipschitz in $\|\cdot\|_{w}$ with constant:

$$L_{w}=\max_{1\leq t\leq T}w_{t}\sum_{s=1}^{t-1}\frac{a_{t,s}}{w_{s}}.$$

In particular, if $L_{w}<1$, then $G$ is a contraction.

Proof.

For each $t$: $\|G_{t}(c)-G_{t}(c^{\prime})\|\leq\sum_{s<t}a_{t,s}\|c_{s}-c_{s}^{\prime}\|\leq\sum_{s<t}\frac{a_{t,s}}{w_{s}}\|c-c^{\prime}\|_{w}$, where the second inequality uses $w_{s}\|c_{s}-c_{s}^{\prime}\|\leq\|c-c^{\prime}\|_{w}$. Multiplying by $w_{t}$ and taking the maximum over $t$ gives $\|G(c)-G(c^{\prime})\|_{w}\leq L_{w}\|c-c^{\prime}\|_{w}$. ∎
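The role of the weights in Lemma 2 can be seen on a small example. The sketch below evaluates $L_w$ for a hypothetical bound matrix in which each position depends on its immediate predecessor with gain 2: the unweighted norm gives $L_w=2$, while geometrically decaying weights certify a contraction. The matrix and weight choices are illustrative assumptions; positions are 0-indexed.

```python
def weighted_lipschitz(a, w):
    """L_w = max_t  w[t] * sum_{s<t} a[t][s] / w[s]  (0-indexed positions)."""
    T = len(w)
    return max(w[t] * sum(a[t][s] / w[s] for s in range(t)) for t in range(T))

# Toy Jacobian bounds: each position depends on its predecessor with gain 2.
T = 4
a = [[2.0 if s == t - 1 else 0.0 for s in range(T)] for t in range(T)]

uniform = weighted_lipschitz(a, [1.0] * T)
geometric = weighted_lipschitz(a, [4.0 ** -t for t in range(T)])
print(uniform, geometric)
```

Down-weighting later positions is possible because dependencies only run forward in the sequence, which is what makes a suitable weighted norm available for a strictly causal operator.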

When $K$ is chosen at training time, one may want to know how close $c^{(K)}$ is to $c^{*}$ without computing $c^{*}$. The next lemma gives a computable bound using only consecutive iterates.

Lemma 3 (A posteriori error bound).

If $L<1$, then after $k$ iterations: $\|c^{(k)}-c^{*}\|\leq\frac{L}{1-L}\|c^{(k)}-c^{(k-1)}\|$.

Proof.

By the triangle inequality, $\|c^{(k)}-c^{*}\|\leq\sum_{j=1}^{\infty}\|c^{(k+j)}-c^{(k+j-1)}\|\leq\sum_{j=1}^{\infty}L^{j}\|c^{(k)}-c^{(k-1)}\|=\frac{L}{1-L}\|c^{(k)}-c^{(k-1)}\|$. ∎
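Both the a priori rate of Theorem 3(a) and the a posteriori bound of Lemma 3 can be checked on a scalar toy contraction. The map $G(c)=0.5\cos(c)$ below is a stand-in chosen only because its Lipschitz constant is known to be at most 0.5; it is not the correction operator of the model.

```python
import math

L = 0.5  # Lipschitz constant of G below, since |d/dc 0.5*cos(c)| <= 0.5

def G(c):
    """Toy scalar contraction standing in for the correction operator."""
    return L * math.cos(c)

# Reference fixed point from a long run (200 iterations is far past convergence).
c_star = 0.0
for _ in range(200):
    c_star = G(c_star)

c_prev, c = 3.0, G(3.0)  # c^(0) and c^(1)
e0 = abs(3.0 - c_star)
checks = []
for k in range(1, 11):
    err = abs(c - c_star)
    a_posteriori = L / (1 - L) * abs(c - c_prev)  # Lemma 3 bound
    a_priori = L ** k * e0                        # Theorem 3(a) bound
    checks.append(err <= a_posteriori and err <= a_priori)
    c_prev, c = c, G(c)
print(all(checks))
```

The a posteriori bound needs only two consecutive iterates, which is why it can serve as a stopping diagnostic when the fixed point itself is unknown.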

Theorem 3(a) gives a global contraction rate $L^{k}$ that treats all positions uniformly, but this is overly pessimistic. Because $G$ is past-only, its Jacobian is strictly lower-triangular and therefore nilpotent, so position $t$ reaches its exact fixed point after at most $t$ iterations rather than $T$. The following lemma exploits this structure to give a tighter, position-dependent error bound via causal path sums.

Lemma 4 (Finite-depth error bound).

Let $G$ be a past-only correction operator with Jacobian bounds $a_{t,s}$. Let $A$ be the strictly lower-triangular matrix with entries $a_{t,s}$, and $B=\max_{t}\|G_{t}(0)\|$. After $N$ Jacobi iterations from $c^{(0)}=0$:

$$\|c_{t}^{(N)}-c_{t}^{*}\|\leq\sum_{k=N}^{T-1}[A^{k}]_{t,\cdot}\mathbf{1}\cdot B=[(A^{N}(I-A)^{-1})_{t,\cdot}]\mathbf{1}\cdot B$$

Proof.

By the integral form of the mean value theorem, the error $e^{(k)}=c^{(k)}-c^{*}$ satisfies $e^{(k+1)}_{t}=\sum_{s<t}J_{t,s}^{(k)}e^{(k)}_{s}$ where $\|J_{t,s}^{(k)}\|\leq a_{t,s}$ by the Jacobian bound assumption. Bounding by the entrywise matrix $A$ and iterating from $e^{(0)}=-c^{*}$ gives $\|e^{(N)}_{t}\|\leq[A^{N}|c^{*}|]_{t}$. To bound $|c^{*}|$: the fixed point satisfies $c_{t}^{*}=G_{t}(c_{<t}^{*})$, so $\|c_{t}^{*}\|\leq\|G_{t}(0)\|+\sum_{s<t}a_{t,s}\|c_{s}^{*}\|$, i.e., $|c^{*}|\leq B\cdot\mathbf{1}+A|c^{*}|$, hence $|c^{*}|\leq(I-A)^{-1}B\cdot\mathbf{1}$ (well-defined since $A$ is nilpotent). Substituting: $\|e^{(N)}_{t}\|\leq[A^{N}(I-A)^{-1}]_{t,\cdot}\mathbf{1}\cdot B$. ∎
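The two forms of the Lemma 4 bound can be compared numerically. The sketch below evaluates the path sum term by term and, separately, via $A^{N}(I-A)^{-1}\mathbf{1}$ using forward substitution; the matrix entries and the value of `B` are assumed toy values, and positions are 0-indexed (so position `t` has a zero bound once `N > t`).

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def path_sum_bound(A, B, N):
    """B * sum_{k=N}^{T-1} A^k 1, computed term by term."""
    T = len(A)
    v = [1.0] * T
    for _ in range(N):          # v = A^N 1
        v = matvec(A, v)
    total = [0.0] * T
    for _ in range(N, T):       # accumulate A^N 1 + ... + A^(T-1) 1
        total = [s + x for s, x in zip(total, v)]
        v = matvec(A, v)
    return [B * s for s in total]

def resolvent_bound(A, B, N):
    """B * A^N (I - A)^{-1} 1, using forward substitution for (I - A)^{-1} 1."""
    T = len(A)
    y = []
    for t in range(T):          # y_t = 1 + sum_{s<t} A[t][s] * y_s
        y.append(1.0 + sum(A[t][s] * y[s] for s in range(t)))
    for _ in range(N):
        y = matvec(A, y)
    return [B * x for x in y]

# Toy strictly lower-triangular Jacobian bounds (assumed values).
T = 5
A = [[0.5 / (t - s) if s < t else 0.0 for s in range(T)] for t in range(T)]
B = 2.0
bounds = {N: path_sum_bound(A, B, N) for N in range(T + 1)}
print(bounds[2])
```

The leading zeros in each bound vector reflect nilpotency: early positions are already exact, and only positions deeper in the sequence carry residual error.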

A.4 Proof of Proposition 1 (Depth Separation)

Formal statement.

Setup. Suppose the data is generated by a process with state update $\mathbf{s}_{t}=h^{*}(\mathbf{s}_{t-1},\mathbf{e}_{t})$, where $\mathbf{s}_{t}\in\mathbb{R}^{m}$ is the process state and $\mathbf{e}_{t}=(\mathbf{e}_{t-W+1},\ldots,\mathbf{e}_{t})\in\mathbb{R}^{CW}$ collects the token embeddings in the current window. The context-ready architecture (Equations 13.1) with attention window $W$ maintains a state $\hat{\mathbf{b}}_{t}\in\mathbb{R}^{nW}$ over the full window of $W$ positions, where $n$ is the per-position state dimension. The state evolves via $\hat{\mathbf{b}}_{t}=h(\hat{\mathbf{b}}_{t-1},\mathbf{e}_{t})$, where $h:\mathbb{R}^{nW}\times\mathbb{R}^{CW}\to\mathbb{R}^{nW}$ composes the correction FFN and block (Equations 13.1). Let $L_{h}$ be the Lipschitz constant of $h$ in its first argument.

To compare these two systems, let $\pi:\mathbb{R}^{m}\to\mathbb{R}^{nW}$ be a map from the process state to the architecture’s state space. The architecture faithfully tracks the process when the following diagram commutes: advancing the process state by $h^{*}$ and then projecting gives the same result as projecting and then advancing by $h$. The commutation error

$$\varepsilon_{D}\;=\;\sup_{\mathbf{s},\,\mathbf{e}}\;\bigl\|h\bigl(\pi(\mathbf{s}),\,\mathbf{e}\bigr)-\pi\bigl(h^{*}(\mathbf{s},\,\mathbf{e})\bigr)\bigr\|$$

measures how far the diagram is from commuting: it captures both the block’s finite-depth approximation error and any information lost when the two state spaces differ.

Assumptions.

  (i) $L_{h}<1$ and $\varepsilon_{D}<\infty$.

  (ii) Prediction sufficiency. $\pi$ preserves prediction-relevant information: $p(x_{t+1}\mid\mathbf{s}_{t})=p(x_{t+1}\mid\pi(\mathbf{s}_{t}))$.

  (iii) Lipschitz readout. The prediction function $\phi:\mathbb{R}^{nW}\to\Delta$ is $L_{\phi}$-Lipschitz.

Conclusions.

  (a) Context-ready error bound. The accumulated state error satisfies $\|\hat{\mathbf{b}}_{t}-\pi(\mathbf{s}_{t})\|\leq\varepsilon_{D}/(1-L_{h})$ uniformly in $t$. By prediction sufficiency and Lipschitz readout, the prediction error is bounded by $L_{\phi}\varepsilon_{D}/(1-L_{h})$. If $\varepsilon_{D}=0$, then streaming is exact for all $t$.

  (b) Standard transformer receptive-field bound. Let $N$ be the number of layers in a standard transformer with attention window $W$. Then position $t$ has no computational path to any position before $t-NW$, so the transformer cannot represent any function whose output at position $t$ depends on inputs before $t-NW$.

Proof.

Part (a). At step $t$, the architecture computes $\hat{\mathbf{b}}_{t}=h(\hat{\mathbf{b}}_{t-1},\mathbf{e}_{t})$ while the projected true state satisfies $\pi(\mathbf{s}_{t})=\pi(h^{*}(\mathbf{s}_{t-1},\mathbf{e}_{t}))$. By the triangle inequality:

$$\|\hat{\mathbf{b}}_{t}-\pi(\mathbf{s}_{t})\|\;\leq\;\underbrace{\|h(\hat{\mathbf{b}}_{t-1},\mathbf{e}_{t})-h(\pi(\mathbf{s}_{t-1}),\mathbf{e}_{t})\|}_{\leq\;L_{h}\|\hat{\mathbf{b}}_{t-1}-\pi(\mathbf{s}_{t-1})\|}\;+\;\underbrace{\|h(\pi(\mathbf{s}_{t-1}),\mathbf{e}_{t})-\pi(h^{*}(\mathbf{s}_{t-1},\mathbf{e}_{t}))\|}_{\leq\;\varepsilon_{D}}.$$

Unrolling with $\hat{\mathbf{b}}_{0}=\pi(\mathbf{s}_{0})$ gives $\|\hat{\mathbf{b}}_{t}-\pi(\mathbf{s}_{t})\|\leq\varepsilon_{D}\sum_{j=0}^{t-1}L_{h}^{j}\leq\varepsilon_{D}/(1-L_{h})$.

Part (b). With attention window $W$, the output of layer $\ell$ at position $t$ depends only on positions in $[t-\ell W,\,t]$. After $N$ layers, position $t$ has no computational path to any position before $t-NW$. ∎
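The two bounds in the proof reduce to short calculations, sketched below. The values of `eps_d`, `l_h`, the query position, and the layer count are illustrative assumptions (the window of 38 is borrowed from the pointer-chasing settings); the function names are ours.

```python
def state_error_bound(eps_d, l_h, t):
    """Unrolled bound eps_D * sum_{j<t} L_h^j on ||b_hat_t - pi(s_t)||."""
    return eps_d * sum(l_h ** j for j in range(t))

def earliest_visible(t, n_layers, window):
    """Earliest position with a computational path to t after n_layers layers."""
    return max(0, t - n_layers * window)

eps_d, l_h = 0.01, 0.9               # assumed commutation error and Lipschitz constant
limit = eps_d / (1 - l_h)            # uniform-in-t bound from part (a)
errors = [state_error_bound(eps_d, l_h, t) for t in (1, 10, 100, 1000)]
reach = earliest_visible(t=400, n_layers=6, window=38)
print(errors, limit, reach)
```

The recurrent bound saturates as $t$ grows, whereas the receptive field of the windowed transformer extends back only $NW$ positions regardless of sequence length.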

Remark (Depth separation).

Parts (a) and (b) together suggest a depth separation: the context-ready architecture propagates context through the correction chain at no additional depth cost, while a standard windowed transformer must allocate layers for propagation. When propagation and local computation cannot be interleaved, as in the pointer-chasing task where each hop requires a separate lookup, the standard transformer needs at least $\lceil T/W\rceil$ additional layers. In general, some layers may serve both roles, so $N_{\mathrm{map}}+\lceil T/W\rceil$ is an upper bound on the required depth.

Appendix B Pointer Chasing Details

Motivation. Fixed-depth transformers are confined to $\mathrm{TC}^{0}$ [Merrill and Sabharwal, 2023]: an $N$-layer transformer can compose at most $N$ sequential reasoning steps in a single forward pass. We design a synthetic task that directly tests this depth limit. Answering a query requires chaining a variable number of sequential lookups, so a model that can only perform a fixed number of parallel steps will fail once the required chain length exceeds its depth. The context-ready architecture sidesteps this barrier because its recurrent correction chain provides sequential computation at inference, even with a single block ($D=1$).

Task definition. The pointer-chasing task has $H$ hops and $M$ keys per level. The input contains a base table (level 0) mapping $M$ keys to values, followed by $H$ index tables (levels $1,\ldots,H$), each mapping $M$ keys to keys of the previous level via random permutations (bijections). After each table, a query section provides dense targets: a triplet Q key answer for every key at every level defined so far. Resolving a query at level $\ell$ requires $\ell$ sequential lookups.

Worked example ($H=2$, $M=3$, 10 values). The base table maps A → v3, B → v0, C → v8. Index table 1 maps D → A, E → B, F → C. Index table 2 maps G → D, H → E, I → F. The encoding uses reversed triplets (value = key) so that causal attention can see the value to the left of the key:

Level 0 (base table + queries):
Input: v3 = A v0 = B v8 = C ||
Target: _ _ _ _ _ _ _ _ _ _
Input: Q A v3 Q B v0 Q C v8 ||
Target: _ v3 _ _ v0 _ _ v8 _ _

Level 1 (index table + queries):
Input: A = D B = E C = F ||
Target: _ _ _ _ _ _ _ _ _ _
Input: Q D v3 Q E v0 Q F v8 ||
Target: _ v3 _ _ v0 _ _ v8 _ _

Level 2 (index table + final query):
Input: D = G E = H F = I || Q G
Target: _ _ _ _ _ _ _ _ _ _ _ v3

Targets (the non-underscore entries) appear only at key positions in query sections. Level-0 queries are trivial lookups (Q A → v3). Level-1 queries require one composition (Q D → A → v3). Level-2 queries require two compositions (Q G → D → A → v3). The final token is the actual test query with no answer provided in the input. Dense targets at every level are essential: without them, the model cannot learn multi-hop composition even with BPTT.

Each level uses its own key namespace (A, B, C at level 0; D, E, F at level 1; G, H, I at level 2; etc.) to prevent ambiguity. Key ordering within each table is fixed (not shuffled), so the model can exploit positional patterns via RoPE.
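The lookup structure of the task can be sketched in a few lines. The code below resolves a query by chaining through the index tables, first on the worked example and then on a randomly generated instance; the dictionary layout, key naming scheme (`k{level}_{i}`), and function names are hypothetical and do not reproduce the token encoding used for training.

```python
import random

def make_instance(num_hops, num_keys, num_values, rng):
    """Build a base table plus num_hops index tables (hypothetical data layout)."""
    # Level-0 keys map to values; level-l keys map to level-(l-1) keys.
    keys = [[f"k{level}_{i}" for i in range(num_keys)] for level in range(num_hops + 1)]
    base = {k: f"v{rng.randrange(num_values)}" for k in keys[0]}
    tables = [base]
    for level in range(1, num_hops + 1):
        targets = keys[level - 1][:]
        rng.shuffle(targets)                      # random permutation (bijection)
        tables.append(dict(zip(keys[level], targets)))
    return keys, tables

def resolve(key, level, tables):
    """Follow `level` index-table hops, then read the value from the base table."""
    for l in range(level, 0, -1):
        key = tables[l][key]
    return tables[0][key]

# Worked example from the text (H=2, M=3).
example = [
    {"A": "v3", "B": "v0", "C": "v8"},   # base table
    {"D": "A", "E": "B", "F": "C"},      # index table 1
    {"G": "D", "H": "E", "I": "F"},      # index table 2
]
answer = resolve("G", 2, example)

keys, tables = make_instance(num_hops=10, num_keys=5, num_values=10, rng=random.Random(0))
deep_answer = resolve(keys[10][0], 10, tables)
print(answer, deep_answer)
```

Each hop in `resolve` depends on the result of the previous one, which is the sequential dependency the task is designed to expose.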

Settings. $H=10$ hops, $M=5$ keys, 10 values, embedding dimension $C=256$, 4 attention heads, batch size 64, window size 38, fixed key ordering, per-level key tokens, RoPE attention. Learning rate $1\times 10^{-4}$.

Why windowed attention. Without windowed attention, all models, including deep transformers, can directly attend from any query position to the base table, achieving ${\sim}1/M$ accuracy without genuine composition. Windowed attention ($w=38$) ensures that higher-level query sections cannot see the base table, forcing the model to chain through intermediate levels. This reveals the true depth-limited structure of fixed-depth transformers.

Wave propagation in BPTT. The $D=1$ model solves levels sequentially: level 0 converges first, then level 1, then 2, and so on. This is visible in the training dynamics (see progression below). The wave pattern is consistent with corrections propagating through the recurrent chain.

BPTT progression. The progression below uses a smaller configuration ($C=128$, lr $=1\text{e-3}$, 20 values) to demonstrate wave propagation at reduced compute:

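| Iter | L0 | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8 | L9 | L10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3K | 1.00 | 0.74 | 0.27 | 0.20 | 0.19 | 0.18 | 0.18 | 0.18 | 0.17 | 0.16 | 0.16 |
| 8K | 1.00 | 1.00 | 0.80 | 0.59 | 0.36 | 0.21 | 0.21 | 0.19 | 0.20 | 0.18 | 0.15 |
| 13.5K | 1.00 | 1.00 | 0.99 | 0.99 | 0.94 | 0.83 | 0.51 | 0.24 | 0.20 | 0.20 | 0.18 |
| 23K | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 | 0.98 |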
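One way to read the wave pattern off the progression is to count, at each checkpoint, how many consecutive levels starting from L0 exceed an accuracy threshold. The sketch below does this with the table values; the 0.9 threshold and the helper name `solved_prefix` are assumptions of this example.

```python
# Per-level accuracy at each checkpoint (values copied from the progression table).
progress = {
    "3K":    [1.00, 0.74, 0.27, 0.20, 0.19, 0.18, 0.18, 0.18, 0.17, 0.16, 0.16],
    "8K":    [1.00, 1.00, 0.80, 0.59, 0.36, 0.21, 0.21, 0.19, 0.20, 0.18, 0.15],
    "13.5K": [1.00, 1.00, 0.99, 0.99, 0.94, 0.83, 0.51, 0.24, 0.20, 0.20, 0.18],
    "23K":   [1.00, 1.00, 1.00, 1.00, 1.00, 0.99, 0.99, 0.99, 0.99, 0.98, 0.98],
}

def solved_prefix(acc, threshold=0.9):
    """Number of consecutive levels, starting at L0, with accuracy >= threshold."""
    count = 0
    for a in acc:
        if a < threshold:
            break
        count += 1
    return count

front = {it: solved_prefix(acc) for it, acc in progress.items()}
print(front)
```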

20-hop scaling. At $C=512$ with lr $=1\text{e-4}$, the same $D=1$ architecture solves all 21 levels (20 hops) in ${\sim}4$K iterations, confirming that the recurrent mechanism scales to deeper composition chains.

Appendix C Extended Experimental Details

C.1 Hyperparameters

Table A1: Hyperparameters by experiment.

| Hyperparameter | ${\sim}85$M | ${\sim}340$M | Token-matched | Width scaling |
|---|---|---|---|---|
| Block size | 256 | 256 | 64 | 64 |
| Batch size | 64 | 32 | 1024 | 1024 |
| Learning rate | $2\times 10^{-4}$ | $2\times 10^{-4}$ | $2\times 10^{-4}$ | $2\times 10^{-4}$ |
| Training iters | 100K | 200K | varies | varies |
| $K$ / $k_{\min}$ | 5 / 2 | 5 / 2 | 5 / 2 | 5 / 2 |
| Dropout | 0.2 | 0.2 | 0.2 | 0.2 |
| Vocab size | 32,000 | 32,000 | 32,000 | 32,000 |

C.2 Ablations

Table A2: Correction FFN ablation (C=446C=446, Wikipedia, BPE 16k). “Roformer-hFFN” denotes a standard roformer with an extra FFN layer to match the FLOP cost of the correction FFN.
Model FLOPs/tok Val PPL Seq. K=1K{=}1
D=3 corr_ffn K=5K{=}5 44C244C^{2} 23.98 23.96
Roformer-hFFN N=3N{=}3 44C244C^{2} 25.78
D=3 block_head (no corr_ffn) 36C236C^{2} 27.32 28.46
Roformer N=3N{=}3 36C236C^{2} 27.19
Table A3: Token-aware correction variants ($C=446$, Wikipedia).

| $D$ | Variant | FLOPs | PPL | Seq $K{=}1$ | $L$ |
|---|---|---|---|---|---|
| 2 | corr_ffn (token-blind) | $32C^{2}$ | 26.68 | 26.72 | 0.74 |
| 2 | corr_ffn_add | $32C^{2}$ | 26.09 | 26.48 | 0.54 |
| 2 | corr_ffn_concat | $36C^{2}$ | 25.48 | 25.82 | 0.54 |
| 3 | corr_ffn (token-blind) | $44C^{2}$ | 23.98 | 23.96 | |
| 3 | corr_ffn_add | $44C^{2}$ | 23.79 | 24.12 | 0.55 |
| 3 | corr_ffn_concat | $48C^{2}$ | 23.41 | 23.73 | 0.74 |

Table A4: Scale dependence ($C=50$–$768$, Wikipedia).

| $C$ | Comparison | Context-Ready | Baseline | $\Delta$ |
|---|---|---|---|---|
| 50 | D=3 vs. Roformer-hFFN $N{=}3$ | 84.3 | 83.0 | +1.3 |
| 74 | D=3 vs. Roformer-hFFN $N{=}3$ | 62.1 | 61.4 | +0.7 |
| 446 | D=3 vs. Roformer $N{=}4$ | 23.79 | 24.85 | -1.06 |
| 768 | D=3 vs. Roformer $N{=}4$ | 18.66 | 20.05 | -1.39 |

At very small widths ($C\leq 74$), the correction FFN’s overhead outweighs its benefit; the correction advantage emerges at moderate widths ($C\geq 446$) and grows with scale. All main-body claims are based on results at $C\geq 256$.

Table A5: $k_{\min}$ ablation ($C=50$, $D{=}1$, $K=10$).

| Metric | $k_{\min}=2$ | No $k_{\min}$ |
|---|---|---|
| Val PPL ($K=10$) | 84.32 | 84.16 |
| Seq $K=1$ | 84.61 | 84.19 |
| Parallel $K=1$ | 118.35 | 130.95 |
| Empirical $L$ | 0.72 | 0.94 |

C.3 $K=10$ Training Details

All $K{=}10$ experiments: block size 1024, lr $=2\text{e-4}$, softmax attention, $n_{\text{head}}=16$, OWT.

Table A6: $D{=}1$ $K{=}10$ (fixed, no $k_{\min}$, batch 16) vs. roformer $N{=}6$ (batch 16), block size 1024.

| Iter | $N{=}6$ | $D{=}1$ $K{=}10$ | Gap |
|---|---|---|---|
| 40K | 43.24 | 43.83 | +0.59 |
| 60K | 39.21 | 39.27 | +0.06 |
| 65K | 38.55 | 38.49 | -0.06 |
| 80K | 36.99 | 36.38 | -0.61 |
| 100K | 35.37 | 34.40 | -0.97 |

C.4 Sequential $K{=}1$ Validation

Full depth progression for all configurations, confirming Theorem 2.

Table A7: $D{=}1$ $C{=}2048$ depth progression across block sizes (OWT, 100K–400K iters).

| Block size | Iters | Par. $K{=}1$ | Par. $K{=}2$ | Par. $K{=}3$ | Par. $K{=}5$ | Par. $K{=}10$ | Seq. $K{=}1$ |
|---|---|---|---|---|---|---|---|
| 256 | 100K | 84.44 | 43.77 | 40.52 | 39.95 | 40.02 | 40.03 |
| 256 | 400K | 70.80 | 36.21 | 33.23 | 32.70 | 32.79 | 32.80 |
| 512 | 100K | 81.02 | 38.25 | 35.01 | 34.35 | 34.41 | 34.43 |
| 512 | 200K | 73.22 | 34.33 | 31.24 | 30.61 | 30.68 | 30.69 |
| 1024 | 200K | 84.13 | 34.42 | 30.39 | 29.51 | 29.43 | 29.43 |

Table A8: Higher-$D$ depth progression ($C{=}1024$, block size 256, OWT).

| $D$ | Par. $K{=}1$ | Par. $K{=}2$ | Par. $K{=}3$ | Par. $K{=}5$ | Par. $K{=}10$ | Seq. $K{=}1$ |
|---|---|---|---|---|---|---|
| 5 | 58.17 | 40.18 | 38.80 | 38.60 | 38.61 | 38.62 |
| 8 | 51.60 | 40.12 | 39.22 | 39.10 | 39.10 | 39.10 |
| 12 | 38.33 | 32.58 | 32.30 | 32.29 | 32.29 | 32.29 |
| 23 | 32.80 | 29.03 | 28.89 | 28.88 | 28.88 | 28.88 |

At higher $D$, convergence is faster: $D{=}12$ and $D{=}23$ match $K{=}5$ within 0.01 PPL at $K{=}3$; $D{=}8$ is within 0.12 PPL. The ratio of parallel $K{=}1$ perplexity to converged perplexity shrinks with $D$ (from $2.85\times$ at $D{=}1$ to $1.14\times$ at $D{=}23$), confirming that the correction mechanism accounts for a larger fraction of quality at low $D$.
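
The ratios quoted above can be recomputed from the tabulated perplexities. The sketch below divides parallel $K{=}1$ PPL by parallel $K{=}5$ PPL for each depth. Using $K{=}5$ as the reference and the block-size-1024 row for $D{=}1$ are assumptions on our part; they are the choices that reproduce the quoted $2.85\times$ and $1.14\times$ figures.

```python
# (Par. K=1 PPL, Par. K=5 PPL), copied from Tables A7 and A8.
# The D=1 entry uses the block-size-1024, 200K-iteration row of Table A7.
rows = {
    "D=1": (84.13, 29.51),
    "D=5": (58.17, 38.60),
    "D=8": (51.60, 39.10),
    "D=12": (38.33, 32.29),
    "D=23": (32.80, 28.88),
}


def parallel_ratio(ppl_k1, ppl_ref):
    """How much worse one uncorrected parallel pass is than the reference."""
    return ppl_k1 / ppl_ref


ratios = {name: parallel_ratio(k1, ref) for name, (k1, ref) in rows.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x")
# D=1: 2.85x
# D=5: 1.51x
# D=8: 1.32x
# D=12: 1.19x
# D=23: 1.14x
```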

C.5 Block Size Scaling

Longer context lengths give the $D{=}1$ correction chain more sequential steps to accumulate depth, so the gap to $N{=}6$ should shrink with block size. Table A9 confirms this: at $K{=}5$, the gap narrows from $+1.19$ at block size 256 to $+0.31$ at 1024. With $K{=}10$ at block size 1024, $D{=}1$ overtakes $N{=}6$ entirely (Section 5.5).

Table A9: $D{=}1$ $C{=}2048$ vs. $N{=}6$ $C{=}1088$ gap at 200K iterations across block sizes (OWT).

| Block size | $N{=}6$ PPL | $D{=}1$ PPL | Gap |
|---|---|---|---|
| 256 | 34.15 | 35.34 | +1.19 |
| 512 | 30.11 | 30.57 | +0.46 |
| 1024 | 29.22 | 29.53 | +0.31 |

C.6 Token-Matched Training Curves

To isolate the effect of the correction mechanism from FLOP differences, we compare $D{=}x$ context-ready against $N{=}x$ standard transformers at the same embedding dimension ($C=1024$, block size 64), so both see the same number of tokens per training iteration. At every depth tested, the context-ready model overtakes the baseline after a crossover point and the gap continues to grow.

Table A10: Token-matched results at $C=1024$, block size 64, OWT. Final values at 1,126M tokens.

| Comparison | $N{=}x$ PPL | $D{=}x$ PPL | Gap | Crossover |
|---|---|---|---|---|
| $D{=}1$ vs. $N{=}1$ | 114.8 | 80.4 | -34.4 | $\sim$424M |
| $D{=}2$ vs. $N{=}2$ | 73.7 | 66.1 | -7.6 | $\sim$565M |
| $D{=}3$ vs. $N{=}3$ | 62.4 | 59.4 | -3.0 | $\sim$835M |
| $D{=}6$ vs. $N{=}6$ | 53.0 | 52.2 | -0.7 | $\sim$1,032M |
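
To relate the token counts in this section to training iterations, the sketch below multiplies the token-matched batch size and block size from Table A1. It assumes one optimizer step per batch with no gradient accumulation and that batch size counts sequences; neither is stated explicitly, so the iteration figures it prints are approximate. The helper name `tokens_to_iters` is hypothetical.

```python
# Token-matched column of Table A1.
batch_size = 1024   # sequences per iteration (assumed unit)
block_size = 64     # tokens per sequence

# Assumption: one optimizer step per batch, no gradient accumulation.
tokens_per_iter = batch_size * block_size
print(tokens_per_iter)  # 65536


def tokens_to_iters(tokens_millions):
    """Convert a token count in millions to an approximate iteration count."""
    return tokens_millions * 1e6 / tokens_per_iter


# Crossover points and the final token budget from the token-matched results.
for label, millions in [("D=1 crossover", 424), ("D=2 crossover", 565),
                        ("D=3 crossover", 835), ("D=6 crossover", 1032),
                        ("final", 1126)]:
    print(f"{label}: about {tokens_to_iters(millions):,.0f} iterations")
```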

C.7 Fine-Tuning Details

Any pretrained $N$-layer transformer can be converted to a $D{=}N$ context-ready model by adding a zero-initialized correction FFN and fine-tuning. The zero initialization ensures no disruption at conversion: the correction is identically zero, so the model behaves exactly as the original transformer. As fine-tuning progresses, the correction FFN learns to exploit cached context, yielding PPL improvements. At $D{=}12$, fine-tuning causes a transient $+0.11$ PPL increase before recovering; at $D{=}24$, there is no transient increase.
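
The no-disruption property can be checked on a toy example. The sketch below is a minimal pure-Python illustration, not the paper's implementation: it assumes a token-blind two-layer correction FFN with ReLU, applied to the previous block output and added to the token embedding, and it assumes that "zero-initialized" means a zero output projection. All names and sizes are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def correction_ffn(z_prev, W_in, W_out):
    """Two-layer FFN on the cached block output (ReLU is an assumption)."""
    hidden = [max(0.0, h) for h in matvec(W_in, z_prev)]
    return matvec(W_out, hidden)


def block_input(e_t, z_prev, W_in, W_out):
    """Token embedding plus correction: what enters the transformer block."""
    correction = correction_ffn(z_prev, W_in, W_out)
    return [e + c for e, c in zip(e_t, correction)]


random.seed(0)
C, H = 4, 8  # toy width and hidden size (hypothetical)
W_in = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(H)]
W_out = [[0.0] * H for _ in range(C)]  # assumed zero-initialized projection

e_t = [0.5, -1.0, 2.0, 0.25]     # token embedding from the pretrained model
z_prev = [1.0, 0.3, -0.7, 0.9]   # cached block output of the previous position

# At conversion the correction vanishes, so the block sees the raw embedding.
print(block_input(e_t, z_prev, W_in, W_out) == e_t)  # True
```

Because only the output projection is zero in this sketch, the input projection can still be initialized randomly, and the correction becomes nonzero as soon as fine-tuning updates `W_out`.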

Table A11: Fine-tuning pretrained roformers to context-ready (block size 256, OWT). “Continued baseline” is the roformer trained for the same total iterations without conversion.

| Conversion | Baseline | Fine-tuned | Cont. baseline | $\Delta$ vs. cont. | FT iters |
|---|---|---|---|---|---|
| $N{=}12$ $C{=}1408$ $\to$ $D{=}12$ | 29.92 | 26.14 | 27.20 | -1.06 | 200K |
| $N{=}24$ $C{=}1024$ $\to$ $D{=}24$ | 29.42 | 28.99 | | -0.43† | 18K |
| $N{=}12$ $C{=}1024$ $\to$ $D{=}12$ | 33.41 | 32.21 | | -1.20† | 50K |

† Gain vs. pre-conversion checkpoint; continued-training control not available.

C.8 Wikipedia Results

For completeness, we include results on English Wikipedia (BPE 16k, context 256, 100K iterations).

Table A12: Context-ready vs. baselines on Wikipedia. FLOPs/tok includes the prediction head ($VC$), which is identical within each width group.

| Width | Model | FLOPs/tok | Val PPL | $\Delta$ |
|---|---|---|---|---|
| $C{=}768$ | D=5 concat | 42.5M | 16.69 | |
| $C{=}768$ | Roformer $N{=}6$ | 42.5M | 17.95 | -1.26 |
| $C{=}446$ | D=6 add | 15.9M | 20.40 | |
| $C{=}446$ | Roformer-hFFN $N{=}6$ | 15.9M | 21.44 | -1.04 |
| $C{=}446$ | D=2 concat | 7.2M | 25.48 | |
| $C{=}446$ | Roformer $N{=}3$ | 7.2M | 27.19 | -1.71 |

C.9 Inference Timing

Autoregressive generation speed measured on a single A100 GPU over 10,000 tokens with KV caching, batch size 1 (single-sequence generation).

Table A13: Inference latency: context-ready vs. standard transformers (A100, 10K tokens, KV caching).

| Model | Params | tok/s | ms/tok | Speedup |
|---|---|---|---|---|
| $D{=}1$ $C{=}2048$ | 215M | 919 | 1.09 | $2.6\times$ |
| Roformer $N{=}6$ $C{=}1088$ | 155M | 351 | 2.85 | |
| $D{=}5$ $C{=}1120$ | 157M | 349 | 2.86 | $1.7\times$ |
| Roformer $N{=}12$ $C{=}768$ | 134M | 201 | 4.96 | |

Per-token latency is flat across sequence length in both comparisons ($<4\%$ growth from $T{=}100$ to $T{=}10{,}000$), confirming that KV caching amortizes attention cost to $O(T)$ per token. The context-ready models are faster despite having more parameters, because fewer sequential layers dominate inference latency on modern GPUs. In addition, fewer layers reduce total KV cache memory ($C\times D$ per token): $D{=}1$ $C{=}2048$ uses $3.2\times$ less cache than $N{=}6$ $C{=}1088$; $D{=}5$ $C{=}1120$ uses $1.6\times$ less than $N{=}12$ $C{=}768$.
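
The speedup and cache ratios above follow from simple arithmetic on the measured throughput and the $C\times D$ cache size. The sketch below reproduces them; the dictionary layout and helper names are illustrative, and the cache size is counted in units of $C\times D$ per token as in the text, ignoring any constant factor for keys versus values or numeric precision.

```python
# Width, layer count, and measured throughput from the inference timing table.
models = {
    "D=1 C=2048": {"layers": 1, "width": 2048, "tok_per_s": 919},
    "N=6 C=1088": {"layers": 6, "width": 1088, "tok_per_s": 351},
    "D=5 C=1120": {"layers": 5, "width": 1120, "tok_per_s": 349},
    "N=12 C=768": {"layers": 12, "width": 768, "tok_per_s": 201},
}


def speedup(fast, slow):
    """Throughput ratio between two models."""
    return models[fast]["tok_per_s"] / models[slow]["tok_per_s"]


def cache_per_token(name):
    """KV cache size per token in units of C x D (constant factors dropped)."""
    return models[name]["width"] * models[name]["layers"]


def cache_reduction(small, large):
    """How many times less cache `small` needs than `large`."""
    return cache_per_token(large) / cache_per_token(small)


for ctx, base in [("D=1 C=2048", "N=6 C=1088"), ("D=5 C=1120", "N=12 C=768")]:
    print(f"{ctx} vs {base}: "
          f"{speedup(ctx, base):.1f}x faster, "
          f"{cache_reduction(ctx, base):.1f}x less cache")
# D=1 C=2048 vs N=6 C=1088: 2.6x faster, 3.2x less cache
# D=5 C=1120 vs N=12 C=768: 1.7x faster, 1.6x less cache
```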