LR and warmdown sweep: default hyperparams are near-optimal

Ran systematic ablations on learning rate and warmdown schedule (all on 1xH100, 10min, SwiGLU+QAT, eval@1024).
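For readers not staring at the config, this is a minimal sketch of the kind of schedule under test (not the exact training code): constant LR for most of training, then a linear warmdown to zero over the final fraction of steps. The function name and signature are illustrative; only the knob names (`warmdown_fraction`, `warmdown_iters`) and the base LR of 0.04 come from the configs discussed here.

```python
def lr_at_step(step: int, total_steps: int, base_lr: float = 0.04,
               warmdown_fraction: float = 0.3) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_fraction of steps."""
    warmdown_iters = int(total_steps * warmdown_fraction)
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return base_lr
    # Fraction of the warmdown still remaining: 1.0 at the start of warmdown, ~0.0 at the end.
    remaining = (total_steps - step) / warmdown_iters
    return base_lr * remaining
```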

**Results (post-quant int8+zlib BPB):**
| Config | Post-quant BPB |
|--------|:------:|
| warmdown_iters=1200 (original) | 1.3382 |
| warmdown_fraction=0.3 | **1.3381** |
| warmdown_fraction=0.2 | 1.3412 |
| matrix_lr=0.06 + warmdown_fraction=0.3 | 1.3436 |

**Conclusions:**
1. The default learning rates (matrix_lr=0.04, scalar_lr=0.04) are already near-optimal. Raising matrix_lr to 0.06 hurts by +0.006 bpb.
2. A 30% warmdown fraction is the sweet spot; 20% is too short (not enough time at the lower learning rates).
3. warmdown_fraction is a cleaner knob than warmdown_iters but gives effectively identical results (1.3381 vs 1.3382); the run is robust to how the warmdown is specified (see the sketch after this list).
4. The cosine warmdown tested earlier was -0.007 bpb vs linear.
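
To make conclusion 3 concrete, here is a small sketch of why the two knobs are interchangeable: the fraction-based and iteration-based parameterizations describe the same linear warmdown once converted. The step count below is invented for illustration and is not the actual run length.

```python
total_steps = 10_000                                     # hypothetical, not the real run length
warmdown_fraction = 0.3
warmdown_iters = int(total_steps * warmdown_fraction)    # fraction -> iters (3000 here)

# Linear warmdown multiplier at a given step, written both ways.
def mult_from_fraction(step):
    start = total_steps - int(total_steps * warmdown_fraction)
    return 1.0 if step < start else (total_steps - step) / (total_steps - start)

def mult_from_iters(step):
    start = total_steps - warmdown_iters
    return 1.0 if step < start else (total_steps - step) / warmdown_iters

# The two parameterizations produce identical LR curves.
assert all(mult_from_fraction(s) == mult_from_iters(s) for s in range(total_steps))
```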

**Implication:** The biggest remaining wins are in architecture (depth recurrence, weight sharing), better quantization, or eval-time tricks — not hyperparameter tuning.
