LR Schedule: Cosine vs Linear Warmdown

Tested cosine warmdown vs linear warmdown (both with SwiGLU+QAT, eval@1536, 1xH100 10min).

**Results:**
- Linear warmdown: post-quant BPB @1536 = 1.3382 (1103 steps)
- Cosine warmdown: post-quant BPB @1536 = 1.3455 (1103 steps)

**Conclusion:** Linear warmdown wins by ~0.007 BPB. The cosine schedule decays most steeply in the middle of warmdown and sits below the linear schedule for the entire second half, leaving less effective learning near the end. Sticking with linear.
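For reference, a minimal sketch of the two warmdown shapes being compared. The `warmdown_frac` default, the zero LR floor, and the function name are assumptions for illustration, not the run's exact config:

```python
import math

def lr_multiplier(step, total_steps, warmdown_frac=0.4, schedule="linear"):
    """Hypothetical LR multiplier: hold at 1.0, then warm down to 0.
    warmdown_frac and the zero floor are illustrative assumptions."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    # t goes 0 -> 1 over the warmdown window
    t = (step - warmdown_start) / (total_steps - warmdown_start)
    if schedule == "linear":
        return 1.0 - t
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))
    raise ValueError(f"unknown schedule: {schedule}")
```

Both schedules cross 0.5 at the midpoint of warmdown, but cosine's slope is steepest there (π/2 vs 1 for linear), so it stays above linear in the first half and below it for the whole second half.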

Next: testing warmdown_iters and QAT fraction tuning.
