Tested cosine warmdown vs linear warmdown (both with SwiGLU+QAT, eval @ 1536, 1×H100, 10 min).
**Results:**
- Linear warmdown: post-quant BPB @1536 = 1.3382 (1103 steps)
- Cosine warmdown: post-quant BPB @1536 = 1.3455 (1103 steps)
**Conclusion:** Linear warmdown wins by ~0.007 BPB. The cosine schedule decays too aggressively through the middle of the warmdown, so the model spends the final stretch at a learning rate too low to make effective progress. Sticking with linear.
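For reference, the two shapes being compared can be sketched as below. This is a minimal illustration, not the actual training code; the function name, signature, and step counts are hypothetical.

```python
import math

def warmdown_lr(step, total_steps, warmdown_iters, base_lr, shape="linear"):
    """Hypothetical LR schedule: constant, then warm down to 0 over the
    final `warmdown_iters` steps, either linearly or with a cosine."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_iters  # 0 -> 1 across the warmdown
    if shape == "linear":
        return base_lr * (1.0 - frac)
    if shape == "cosine":
        # Cosine stays high early in the warmdown, then drops steeply
        # through the middle, ending shallow near zero.
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
    raise ValueError(f"unknown shape: {shape}")
```

At 25% of the way through the warmdown, cosine is still at ~0.85× base LR vs linear's 0.75×, but it crosses below linear after the midpoint and spends the tail at a much lower rate.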
Next: testing warmdown_iters and QAT fraction tuning.
LR Schedule: Cosine vs Linear Warmdown