Ran 3 configs on 1xH100 with a 10-min wallclock cap. All runs share the same seed and hyperparameters except where noted.
| Config | Steps | Pre-quant BPB | Post-quant BPB | Quant Gap |
|---|---|---|---|---|
| Baseline relu² | 1127 | 1.3488 | 1.3503 | 0.0015 |
| QAT only relu² | 1067 | 1.3531 | 1.3547 | 0.0016 |
| QAT+SwiGLU | 1067 | 1.3451 | 1.3466 | 0.0015 |
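The Quant Gap column is just post-quant minus pre-quant BPB; a quick check of the arithmetic:

```python
# Recompute the Quant Gap column: gap = post-quant BPB - pre-quant BPB.
rows = {
    "baseline_relu2": (1.3488, 1.3503),
    "qat_only_relu2": (1.3531, 1.3547),
    "qat_swiglu":     (1.3451, 1.3466),
}
for name, (pre, post) in rows.items():
    print(f"{name}: gap = {round(post - pre, 4)} bpb")
```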
Key takeaways:
1. SwiGLU beats relu² by ~0.004 bpb despite completing ~5% fewer steps (the extra gate projection makes each step slower). The activation quality more than compensates.
2. QAT adds ~6% per-step overhead while active (the last 15% of training). In short runs (~1k steps), that overhead costs more steps than the tiny quant gap it saves.
3. The quant gap is ~0.0015 bpb for all configs at 1k steps. QAT's value scales with training length — at 13k+ steps (8xH100) the baseline gap grows to 0.007+, and at 330k steps it's 0.033.
4. Recommendation: USE_SWIGLU=1 is a free win. Enable QAT_FRACTION=0.1 only on 8xH100 runs where you get 13k+ steps.
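A minimal NumPy sketch of the two MLP variants and the QAT gating discussed above. The function and argument names here are illustrative, not from the actual training code; the real run presumably uses a GPU framework.

```python
import numpy as np

def relu2_mlp(x, w_in, w_out):
    # relu² MLP: up-projection, squared ReLU, down-projection.
    h = np.maximum(x @ w_in, 0.0) ** 2
    return h @ w_out

def swiglu_mlp(x, w_gate, w_up, w_out):
    # SwiGLU MLP: the gate projection is the "extra projection" that
    # makes each step slower; output is silu(gate) * up, then down-proj.
    gate = x @ w_gate
    up = x @ w_up
    silu = gate / (1.0 + np.exp(-gate))  # silu(z) = z * sigmoid(z)
    return (silu * up) @ w_out

def qat_active(step, total_steps, qat_fraction=0.1):
    # QAT_FRACTION confines fake-quantization (and its ~6% per-step
    # overhead) to the tail of training; with 0.1, only the last 10%
    # of steps pay it.
    return step >= int(total_steps * (1.0 - qat_fraction))
```

With `qat_fraction=0.1` and `total_steps=1000`, QAT switches on at step 900, so `qat_active(899, 1000)` is False and `qat_active(900, 1000)` is True.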
All runs: 9 layers, 512 dim, 8 heads, 4 KV heads, tied embeddings, 1024 vocab.
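A back-of-envelope parameter count for this config. The MLP expansion factor (assumed 4x), head_dim (assumed dim/n_heads = 64), absence of biases, and ignoring of norm parameters are all my assumptions, not stated in the post:

```python
# Rough parameter count for: 9 layers, 512 dim, 8 heads, 4 KV heads,
# tied embeddings, 1024 vocab. Assumptions: head_dim = 64, 4x MLP
# expansion, no biases, norm params ignored.
dim, n_layers, n_heads, n_kv_heads, vocab = 512, 9, 8, 4, 1024
head_dim = dim // n_heads                                  # 64
kv_dim = n_kv_heads * head_dim                             # 256 (GQA)
attn = dim * dim + 2 * dim * kv_dim + dim * dim            # Wq, Wk, Wv, Wo
hidden = 4 * dim                                           # assumed expansion
mlp_swiglu = 3 * dim * hidden                              # gate, up, down
mlp_relu2 = 2 * dim * hidden                               # up, down
embed = vocab * dim                                        # tied with output head
total_swiglu = n_layers * (attn + mlp_swiglu) + embed
total_relu2 = n_layers * (attn + mlp_relu2) + embed
print(f"SwiGLU: ~{total_swiglu/1e6:.1f}M  relu2: ~{total_relu2/1e6:.1f}M")
```

Note that at the same hidden size SwiGLU carries ~50% more MLP parameters; implementations often shrink the SwiGLU hidden size to keep parameter counts matched, which this sketch does not do.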
TL;DR (1xH100 ablation): SwiGLU wins, QAT overhead matters at short training.