Analyzed baseline vs 4-hour run. The quantization gap is the single biggest source of BPB degradation.
Baseline (10min): pre-quant 1.2172, post-quant 1.2244 — gap of 0.0072 bpb
4-hour run: pre-quant 1.1749, post-quant 1.2074 — gap of 0.0325 bpb
The gap grows ~4.5x with 24x more training (0.0072 → 0.0325 bpb). The model learns fine-grained weight distributions that int8 per-row quantization destroys. The 4-hour run loses more bpb to quantization (0.0325) than it nets over the quantized baseline (1.2244 - 1.2074 = 0.0170); quantization erases most of the benefit of the extra training.
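For reference, the int8 per-row scheme can be round-tripped offline to measure how much each matrix is perturbed. A numpy sketch, assuming symmetric per-row scales (the actual export path may differ):

```python
import numpy as np

def quant_dequant_int8_per_row(w):
    """Round-trip symmetric int8 per-row quantization; the dequantized
    result is what the exported model actually computes with."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    scale = np.maximum(scale, 1e-12)                      # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127)           # simulated int8 codes
    return q * scale                                      # back to float

# Mean absolute perturbation on a random 512x512 matrix as a sanity check.
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
err = np.abs(w - quant_dequant_int8_per_row(w)).mean()
```

Running this per weight matrix on both checkpoints would show which layers' rows drift hardest as training sharpens the weight distributions.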
Potential fixes:
1. Quantization-aware training (QAT) — simulate int8 rounding in forward pass during last N steps
2. Learned step-size quantization (LSQ) instead of naive per-row
3. Mixed-precision quantization — keep critical layers (first/last, embeddings) at higher precision
4. Straight-through estimator for gradients through quantization
5. Knowledge distillation from fp32 teacher to quantized student in final steps
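Items 1 and 4 combine naturally: fake-quantize the weights in the forward pass during the last N steps and pass gradients straight through the rounding. A minimal PyTorch sketch; the names and where it hooks into the trainer are assumptions, not the current code:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulate int8 per-row rounding in the forward pass (item 1),
    straight-through estimator in the backward pass (item 4)."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 127.0
        return torch.clamp(torch.round(w / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: treat rounding as identity

def qat_linear(x, weight, bias=None):
    # Hypothetical drop-in for F.linear during the final QAT steps:
    # the loss is computed against the weights the quantizer will emit.
    return torch.nn.functional.linear(x, FakeQuantSTE.apply(weight), bias)
```

The idea is that the optimizer sees the post-quantization loss directly, so it steers weights toward values that survive rounding instead of fine-grained distributions that int8 destroys.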
Also worth exploring: the model has 9 layers x 512 dim. With tied embeddings and a 1024 vocab, the embedding table is tiny (1024 x 512 ≈ 512K params). Most params are in the attention/MLP matrices. Could we use more aggressive compression (4-bit) on less sensitive layers while keeping 8-bit for critical ones?
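A quick offline test of that idea: quantize each matrix at a per-layer bit width and compare reconstruction error. Sketch assuming symmetric per-row scales; the layer names are hypothetical:

```python
import numpy as np

def quant_dequant_per_row(w, bits):
    """Symmetric per-row quantization at a configurable bit width."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8, 7 for int4
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)                 # guard all-zero rows
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def mixed_precision_quant(layers, keep_8bit):
    """Quantize each named matrix: 8-bit for layers in keep_8bit,
    4-bit for everything else."""
    return {name: quant_dequant_per_row(w, 8 if name in keep_8bit else 4)
            for name, w in layers.items()}
```

Sweeping which layers sit in `keep_8bit` and evaluating post-quant BPB would show whether the first/last/embedding layers really are the sensitive ones.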