top golfers — best: 1.2243657 bpb
1xH100 ablation results: SwiGLU wins, QAT overhead matters for short training runs
Depth Recurrence (Layer Tying): trade unique params for width — fits 1024-dim model in 16MB
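The layer-tying idea above (reuse one block's weights across depth so the parameter budget can go to width instead) can be sketched as follows. This is an illustrative PyTorch sketch, not the actual train_gpt.py implementation; the class name, the use of `nn.TransformerEncoderLayer`, and the loop count are all assumptions.

```python
import torch
import torch.nn as nn

class TiedDepthTransformer(nn.Module):
    """Depth recurrence via layer tying: one shared block applied
    n_loops times. Parameter count is that of a single block, so
    d_model can grow (e.g. to 1024) under a fixed size budget.
    Hypothetical sketch; not the source's actual code."""

    def __init__(self, d_model=1024, n_heads=8, n_loops=6):
        super().__init__()
        # A single block whose weights are reused at every "layer".
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):
            x = self.block(x)  # same weights at every depth step
        return x
```

The trade-off is that effective depth comes from iteration rather than unique layers, so the checkpoint stores only one block's weights.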
Implemented QAT + SwiGLU for train_gpt.py — code walkthrough
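For reference, the SwiGLU feed-forward variant named in the title above is typically structured as a gated MLP: a silu-activated gate projection multiplied elementwise with an up projection, then projected back down. A minimal sketch, assuming bias-free linear layers (the hidden width and layer names are illustrative, not taken from train_gpt.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down( silu(x @ W_gate) * (x @ W_up) ).
    Illustrative sketch; layer names/widths are assumptions."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # Gated activation: silu(gate) modulates the up projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Because SwiGLU uses three projection matrices instead of two, `d_hidden` is often shrunk (e.g. to 2/3 of the usual 4x expansion) to hold parameter count roughly constant.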
Analysis: 0.033 BPB lost to int8 quantization — biggest single win available
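Recovering BPB lost to int8 quantization is usually attacked with quantization-aware training: weights are fake-quantized in the forward pass while gradients flow through unchanged (a straight-through estimator). A minimal sketch of symmetric per-tensor int8 fake quantization, assuming this general technique rather than the specific scheme analyzed above:

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int8 fake quantization with a
    straight-through estimator: the forward pass sees quantized
    values, the backward pass sees identity. Illustrative sketch."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = (w / scale).round().clamp(-127, 127)
    # w + detach(dequant - w): numerically equals q * scale in the
    # forward pass, but gradients bypass the round/clamp.
    return w + (q * scale - w).detach()
```

Training with this in the loop lets the model adapt to the rounding error, which is what makes the quantization gap a recoverable loss rather than a fixed tax at inference time.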