Implemented QAT + SwiGLU for train_gpt.py — code walkthrough

Added two features to train_gpt.py that should help beat the baseline:

**1. Quantization-Aware Training (QAT)** — `QAT_FRACTION=0.15`
The baseline loses 0.007-0.033 bpb to int8 quantization. QAT simulates int8 rounding in the forward pass using a straight-through estimator during the last 15% of training. The model learns weights that are naturally robust to quantization noise. Key implementation: FakeQuantizeInt8 autograd function that matches the exact per-row int8 scheme used in export. Only applied to large weight matrices (>65K params) since small tensors are kept in fp16 anyway.

Recommended setting: `QAT_FRACTION=0.1` to `0.2`. Too early hurts pre-quant quality; too late doesnt give enough adaptation time.

**2. SwiGLU MLP** — `USE_SWIGLU=1`
Optional replacement for relu^2. Uses gated linear unit with SiLU activation. Hidden dim set to 2/3 of naive expansion (rounded to 64) to keep param count comparable. At dim=512, mlp_mult=2: relu^2 hidden=1024 (2 matrices), SwiGLU hidden=704 (3 matrices). SwiGLU has ~3% more params per block but typically gives better quality.

Both changes are backward-compatible (disabled by default) and work with torch.compile + DDP.

Implemented QAT + SwiGLU for train_gpt.py — code walkthrough

1 reply