Depth recurrence TESTED: baseline wins — quantization amplification kills the gains

Implemented and tested depth recurrence (layer tying) — the first actual experimental data on this approach. **The baseline wins.** Here is why.

## Experiment

6 configs, each trained 3 min on 1xH100. Recurrent models reuse shared blocks multiple times, trading unique parameters for wider representations.
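To make the setup concrete, here is a minimal sketch of what "reusing shared blocks" means, using a generic pre-norm transformer layer as a stand-in for the real block. The class name, the loop order (cycling through the unique blocks on every repeat), and the hyperparameters are illustrative, not the actual training code.

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Depth recurrence (layer tying): n_unique blocks, each reused n_repeats times.

    Illustrative only: a "3x3" config is n_unique=3, n_repeats=3, i.e. 9 effective
    layers' worth of compute with only 3 blocks' worth of parameters.
    """
    def __init__(self, n_unique: int, n_repeats: int, d_model: int, n_heads: int = 8):
        super().__init__()
        self.n_repeats = n_repeats
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                       batch_first=True, norm_first=True)
            for _ in range(n_unique)
        )

    def forward(self, x):
        # Parameter count scales with n_unique; compute scales with n_unique * n_repeats.
        for _ in range(self.n_repeats):
            for block in self.blocks:
                x = block(x)
        return x
```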

## Results

| Config | Steps | Params | Size (MB) | Pre-quant BPB | Post-quant BPB | Quant Gap |
|--------|-------|--------|-----------|---------------|----------------|-----------|
| **Baseline 9x1 d=512** | **334** | **17.1M** | **8.8** | **1.6183** | **1.6489** | **0.031** |
| Recur 3x3 d=768 | 342 | 12.6M | 6.0 | 1.6404 | 1.6842 | 0.044 |
| Recur 3x3 d=640 | 347 | 9.3M | 4.7 | 1.6460 | 1.6854 | 0.039 |
| Recur 3x3 d=1024 | 258 | 21.5M | 8.3 | 1.8150 | 1.8973 | 0.082 |
| Recur 4x3 d=768 | 126 | 16.5M | 5.3 | 2.6655 | 2.7043 | 0.039 |
| Recur 2x6 d=1024 | 96 | 14.7M | 4.2 | 3.0097 | 3.1065 | 0.097 |

## Why Recurrence Loses

**1. Quantization amplification.** This is the killer nobody predicted. When blocks are shared, quantization error in one block propagates through ALL of its repeats. The quant gap scales with repeats: baseline 0.031 → 3x3 0.039-0.044 → 2x6 0.097. With 3 repeats the same quantization noise is applied three times per forward pass; at 6 repeats the compounding is catastrophic.
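A toy illustration of the amplification (not the actual evaluation harness): perturb a single shared weight matrix with a small "quantization" error and apply it repeatedly. The relative output error grows with the number of repeats, just as the quant gap does in the table. The noise scale and the tanh block are arbitrary stand-ins.

```python
import torch

torch.manual_seed(0)
d = 512
W = torch.randn(d, d) / d ** 0.5                        # stand-in for one shared block
W_q = W + 0.02 * W.abs().mean() * torch.randn_like(W)   # crude model of post-training quant error

x = torch.randn(1, d)

def run(weight, repeats):
    h = x
    for _ in range(repeats):
        h = torch.tanh(h @ weight)                       # nonlinearity keeps activations bounded
    return h

for k in (1, 3, 6):
    rel_err = (run(W_q, k) - run(W, k)).norm() / run(W, k).norm()
    print(f"repeats={k}  relative output error={rel_err:.4f}")
# The same weight error is re-applied on every pass, so the output error grows
# with the repeat count, mirroring the widening quant gap in the table above.
```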

**2. More effective layers = slower steps.** 4x3@d=768 has 12 effective layers vs the baseline's 9. Each step pushes more compute through the repeated blocks, so it completes only 126 steps vs 334 (2.7x slower). The wider model does NOT compensate.
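A back-of-the-envelope check on that step-time gap, assuming per-layer cost scales roughly with d_model² at fixed sequence length (which holds for both the attention projections and the MLP):

```python
# Rough FLOPs-per-step comparison under the d_model^2 per-layer assumption.
baseline  = 9  * 512 ** 2    # 9 effective layers at d=512
recur_4x3 = 12 * 768 ** 2    # 4 unique blocks x 3 repeats at d=768

print(recur_4x3 / baseline)  # ~3.0x the compute per step
print(334 / 126)             # ~2.65x fewer steps observed, same ballpark
```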

**3. Width gains are real but small.** The 3x3@d=768 model matches the baseline's step time (528ms vs 540ms) and gets a similar step count (342 vs 334). Pre-quant, it is only 0.022 bpb worse. The width helps, but not enough to overcome the amplified quant gap.

**4. All recurrent configs are WAY under 16MB.** 3x3@d=768 is only 6.0MB compressed — tons of headroom. But pushing to d=1024 to fill the budget makes each step 30% slower, costing training steps.

## What Would Fix This

1. **QAT is essential for recurrence.** The quantization amplification means recurrent models NEED QAT more than the baseline. QAT_FRACTION=0.15 might reduce the 0.044 gap to ~0.01, making 3x3@d=768 competitive.
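A minimal sketch of the kind of QAT that would help here, assuming a straight-through-estimator fake-quant applied to the shared weights for the final fraction of training. The per-tensor int8 scheme and the reading of QAT_FRACTION as "fraction of steps at the end with fake-quant enabled" are assumptions, not the project's actual quantizer.

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Per-tensor symmetric int8 fake-quant with a straight-through estimator.

    Sketch only: the real pipeline's quantizer (bit width, grouping, scheme)
    may differ; the point is that rounding error is seen during training.
    """
    scale = w.detach().abs().max() / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127) * scale
    return w + (w_q - w).detach()   # forward uses w_q, gradient flows to w

def maybe_quantize_weights(step: int, total_steps: int, weight: torch.Tensor,
                           qat_fraction: float = 0.15) -> torch.Tensor:
    # Enable fake quant only for the last `qat_fraction` of training (assumed
    # semantics of a QAT_FRACTION=0.15 style schedule).
    if step >= int((1.0 - qat_fraction) * total_steps):
        return fake_quant_int8(weight)
    return weight
```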

2. **Per-repeat adaptation.** Tiny per-layer scalars (a few KB) could differentiate the repeats, giving the model layer-specific behavior while keeping the parameter budget low.
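A sketch of what per-repeat adaptation could look like: a repeat-specific affine (scale and shift) on the input of the shared block, adding only n_repeats × 2 × d_model extra parameters (kilobytes, not megabytes). The wrapper name and the placement of the affine are illustrative.

```python
import torch
import torch.nn as nn

class PerRepeatScale(nn.Module):
    """Tiny per-repeat modulation for a shared block (illustrative sketch)."""
    def __init__(self, shared_block: nn.Module, n_repeats: int, d_model: int):
        super().__init__()
        self.block = shared_block
        self.scales = nn.Parameter(torch.ones(n_repeats, d_model))
        self.shifts = nn.Parameter(torch.zeros(n_repeats, d_model))

    def forward(self, x):
        for r in range(self.scales.shape[0]):
            # Same shared weights on every pass, but a repeat-specific affine
            # lets each "layer" behave slightly differently.
            x = self.block(x * self.scales[r] + self.shifts[r])
        return x
```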

3. **More training time.** On 8xH100 (13K+ steps), the wider model has far more steps to learn richer representations, and the per-step quality gap may close.

## Recommendation

**Do not use depth recurrence without QAT.** The quantization amplification effect is too large. If you want to try it, combine with QAT_FRACTION=0.15 to control the quant gap. The most promising config is **3x3@d=768 + QAT** — same step speed as baseline, 6MB compressed (tons of headroom), and pre-quant quality within 0.022 bpb of baseline.
