Added two features to the baseline train_gpt.py that should compound into a meaningful BPB improvement.
## 1. SwiGLU MLP — `USE_SWIGLU=1`
New `SwiGLUMLP` class replaces relu² with gated SiLU. Hidden dim is set to `2/3 * mlp_mult * dim`, rounded to a multiple of 64 for alignment. At dim=512, mlp_mult=2: relu² uses hidden=1024 (2 matrices, 1,048,576 params), SwiGLU uses hidden=704 (3 matrices, 1,081,344 params, ~3% more). The ~0.004 bpb quality gain outweighs the small parameter increase.
The Block class takes a `use_swiglu` flag, so existing configs work unchanged.
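A minimal sketch of the SwiGLU block as described above (the layer names and the plain `nn.Linear` layers are illustrative; the actual class presumably reuses the file's CastedLinear):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Gated-SiLU MLP: proj(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int, mlp_mult: int = 2):
        super().__init__()
        # hidden = 2/3 * mlp_mult * dim, rounded up to a multiple of 64
        hidden = ((2 * mlp_mult * dim) // 3 + 63) // 64 * 64
        self.c_gate = nn.Linear(dim, hidden, bias=False)
        self.c_up   = nn.Linear(dim, hidden, bias=False)
        self.c_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.silu(self.c_gate(x)) * self.c_up(x))

# dim=512, mlp_mult=2 -> hidden=704; 3 * 512 * 704 = 1,081,344 params
```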
## 2. Quantization-Aware Training — `QAT_FRACTION=0.15`
A `FakeQuantizeInt8` autograd function with a straight-through estimator. It uses the same per-row int8 scheme as the export quantizer (INT8_CLIP_Q percentile clipping, per-row scales) and is applied only to large matrices (>65K params), matching the export path.
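A minimal sketch of the fake-quant op, assuming symmetric per-row int8 with percentile clipping on |w| (the constant name `INT8_CLIP_Q` comes from the export path; its value here is an assumption):

```python
import torch

INT8_CLIP_Q = 0.999  # assumed clip percentile; the real value comes from the env var

class FakeQuantizeInt8(torch.autograd.Function):
    """Forward: round-trip the weight through per-row int8. Backward: straight-through."""

    @staticmethod
    def forward(ctx, w: torch.Tensor) -> torch.Tensor:
        # per-row clip point at the INT8_CLIP_Q percentile of |w|
        clip = torch.quantile(w.abs().float(), INT8_CLIP_Q, dim=1, keepdim=True)
        scale = clip.clamp_min(1e-8) / 127.0           # per-row scale
        q = (w.float() / scale).round_().clamp_(-127, 127)
        return (q * scale).to(w.dtype)                 # dequantized weight

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        return grad_out  # straight-through estimator: treat quantization as identity
```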
Integrated via a global `_qat_enabled` flag checked in `CastedLinear.forward`. The training loop toggles it based on wallclock fraction (or step fraction if there is no wallclock cap). QAT is disabled during validation so eval sees the real, unquantized weights.
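Putting the pieces together, a sketch of how the flag could thread through `CastedLinear` and the training loop, reusing the `FakeQuantizeInt8` sketch above (the >65K threshold and QAT_FRACTION come from the description; the helper names, the wallclock plumbing, and the reading that QAT covers the final fraction of the run are assumptions):

```python
import time
import torch.nn as nn
import torch.nn.functional as F

_qat_enabled = False  # module-level flag, read inside CastedLinear.forward

class CastedLinear(nn.Linear):
    """Sketch: fake-quantize the weight only when QAT is on and the matrix is large."""
    def forward(self, x):
        w = self.weight
        if _qat_enabled and w.numel() > 65_536:  # same selection rule as the export path
            w = FakeQuantizeInt8.apply(w)
        return F.linear(x, w.type_as(x))

QAT_FRACTION = 0.15      # fraction of the run trained with fake-quant (assumed: the final part)
train_start = time.time()
wallclock_cap = 600.0    # seconds; None when there is no wallclock cap

def maybe_enable_qat(step: int, total_steps: int) -> None:
    global _qat_enabled
    if _qat_enabled:
        return
    if wallclock_cap:
        frac = (time.time() - train_start) / wallclock_cap
    else:
        frac = step / total_steps
    if frac >= 1.0 - QAT_FRACTION:
        _qat_enabled = True  # flips once; validation would set it back to False around eval
```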
torch.compile handles the flag change via graph guards — one recompilation when QAT activates, which is fine since it only happens once.
## Expected combined effect
- SwiGLU: ~0.004 bpb improvement (confirmed by community ablation on 1xH100)
- QAT: reduces quantization gap from 0.007 to ~0.002 at 13k steps → ~0.005 bpb saved
- Combined: ~0.009 bpb improvement → estimated 1.2155 bpb vs baseline 1.2244
Both features are env-var gated and backward-compatible. Total file is 1196 lines (well under the 1,500-line limit).
## Implementation note on torch.compile + QAT
The `_qat_enabled` global is read inside `CastedLinear.forward`, which is captured by torch.compile. Dynamo guards on Python globals, so when the flag flips from False to True it triggers a one-time recompilation. This adds maybe 10-15 seconds mid-training but only happens once. It could be optimized by pre-compiling both paths, but the one-time cost is negligible compared with the ~10-minute total run.
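For reference, a standalone toy example (not from train_gpt.py) showing the guard-on-global behavior:

```python
import torch

_qat_enabled = False  # global read inside the compiled function

@torch.compile
def f(x):
    # Dynamo guards on this global; flipping it invalidates the cached graph
    return x * 2 if _qat_enabled else x

f(torch.randn(4))   # first compilation (guard: _qat_enabled == False)
_qat_enabled = True
f(torch.randn(4))   # guard fails once -> one recompilation with the new branch
```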