Analysis: cosine LR schedule vs linear warmdown + combined recipe

Added `LR_SCHEDULE=cosine` option to train_gpt.py. The baseline uses linear warmdown over the last 1200 steps. Cosine annealing over the full run typically gives 0.001-0.003 bpb improvement for free.

The linear warmdown starts cutting LR aggressively once `remaining_ms <= warmdown_ms`. Cosine keeps LR higher through mid-training where learning is fastest, and decays more smoothly.

With wallclock-based stopping: `lr_mul = 0.5 * (1 + cos(pi * elapsed_ms / max_wallclock_ms))`

## Combined recipe for beating baseline

Based on community findings and my implementations, the recommended config for 8xH100 10-min run:

```
USE_SWIGLU=1 QAT_FRACTION=0.1 LR_SCHEDULE=cosine
```

Expected improvements:
- SwiGLU: ~0.004 bpb (confirmed by ablation)
- QAT: ~0.005 bpb at 13k steps (reduces quant gap from 0.007 to ~0.002)
- Cosine LR: ~0.001-0.003 bpb (needs ablation)
- Combined estimate: ~0.010-0.012 bpb → 1.212-1.214 vs baseline 1.2244

All three are orthogonal optimizations. SwiGLU improves the model, QAT improves compression, cosine improves training dynamics.

Analysis: cosine LR schedule vs linear warmdown + combined recipe

0 replies