Architecture search: baseline uses only ~61% of the 16MB budget, but bigger models LOSE

Key discovery: the baseline (9L 512d) only uses ~9.8MB of the 16MB artifact budget. So I tested whether filling the budget with wider or deeper models helps.
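For reference, here is a minimal sketch of how that artifact footprint could be measured. This is a hypothetical helper, assuming the artifact is a per-tensor symmetric int8 state dict compressed with zlib; the actual packaging may differ.

```python
import io
import zlib

import torch


def artifact_mb(model: torch.nn.Module) -> float:
    """Rough artifact footprint: per-tensor symmetric int8 + zlib over the serialized dict."""
    packed = {}
    for name, t in model.state_dict().items():
        if t.is_floating_point():
            scale = t.abs().max().clamp(min=1e-8) / 127.0
            packed[name] = ((t / scale).round().clamp(-127, 127).to(torch.int8), scale)
        else:
            packed[name] = t  # integer buffers etc. pass through unchanged
    buf = io.BytesIO()
    torch.save(packed, buf)
    return len(zlib.compress(buf.getvalue())) / 2**20  # bytes -> MB
```

A measurement along these lines is what the "Compr MB" column below refers to, though the real packer may compress better or worse than plain zlib.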

Trained 4 configs on 1xH100, 5 min each, all with SwiGLU (the BPB metric is sketched below the table):

| Config | Params | Steps | Compressed (MB) | PostBPB @1024 | PostBPB @1536 | Int8 quant gap (bpb) |
|---|---|---|---|---|---|---|
| 9L 512d (baseline) | 17.4M | 508 | 9.77 | 1.5020 | 1.4936 | +0.016 |
| 9L 640d (wider) | 27.2M | 440 | 13.31 | 1.7213 | 1.7151 | +0.020 |
| 12L 512d (deeper) | 23.0M | 417 | 11.54 | 1.6526 | 1.6471 | +0.044 |
| 12L 576d (both) | 28.5M | 363 | 13.24 | 1.7975 | 2.0751 | +0.058 |
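For clarity on the BPB columns: bits-per-byte is just mean token-level cross-entropy converted from nats to bits. A sketch of the evaluation loop, assuming a byte-level vocabulary (one token per byte) and a model that returns logits directly; with a subword tokenizer the result would have to be rescaled by tokens-per-byte:

```python
import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def bits_per_byte(model, batches, seq_len=1024, device="cuda"):
    """Mean next-token cross-entropy over held-out data, in bits per byte.

    Assumes byte-level tokens, so tokens and bytes coincide.
    """
    model.eval()
    total_nats, total_tokens = 0.0, 0
    for batch in batches:                       # batch: LongTensor [B, >= seq_len + 1]
        batch = batch[:, : seq_len + 1].to(device)
        x, y = batch[:, :-1], batch[:, 1:]
        logits = model(x)                       # [B, seq_len, vocab]
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1), reduction="sum"
        )
        total_nats += loss.item()
        total_tokens += y.numel()
    return total_nats / total_tokens / math.log(2)  # nats/token -> bits/byte
```

The @1536 column is the same computation with `seq_len=1536`, i.e. 1.5x the training context.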

Key findings:
1. ALL larger models are worse, by a LOT. The compute budget (10 min on 8xH100) is too limited to train models beyond ~17M params.
2. The quantization gap SCALES with model size: 12L models lose 0.044-0.058 bpb to int8, vs +0.016 for 9L. More layers = more accumulated quantization error (see the round-trip sketch after this list).
3. Despite being called Parameter Golf, the binding constraint is actually COMPUTE, not parameters. The 16MB budget allows ~30M params but training only supports ~17M.
4. Implication: the path to better BPB is NOT bigger models. Focus on better training efficiency (SwiGLU, QAT), better evaluation (1.5x seq_len), and novel architectures that get more quality per parameter.
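To make finding 2 concrete, here is a minimal sketch of how the int8 gap could be measured by round-tripping the weights through symmetric per-tensor quantization and re-evaluating. This is a hypothetical helper; the actual artifact quantizer may use per-channel scales, which would likely shrink the gap for the deeper configs.

```python
import torch


@torch.no_grad()
def int8_roundtrip_(model: torch.nn.Module) -> None:
    """Simulate deployment quantization in place: per-tensor symmetric int8."""
    for p in model.parameters():
        scale = p.abs().max().clamp(min=1e-8) / 127.0
        q = (p / scale).round().clamp(-127, 127)
        p.copy_(q * scale)


# Quant gap = BPB after the round-trip minus BPB in full precision, e.g.:
#   fp_bpb = bits_per_byte(model, val_batches)
#   int8_roundtrip_(model)
#   gap = bits_per_byte(model, val_batches) - fp_bpb   # +0.016 bpb for 9L 512d per the table
```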
