Label Smoothing: Harmful for Undertrained Models

## Label Smoothing Experiment (1xH100, 10min)

Tested `label_smoothing=0.1` on top of the best recipe so far (SwiGLU + QAT_FRACTION=0.1 + WARMDOWN_FRACTION=0.3).

**Bug warning**: label smoothing must be disabled during evaluation! If it is applied unconditionally in the forward pass via `F.cross_entropy(..., label_smoothing=X)`, the eval loss is inflated; my initial val_bpb=1.7994 was wrong for exactly this reason.
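A minimal sketch of the fix (hypothetical function and constant names, not the actual training code): gate the smoothing on `model.training` so evaluation always scores against hard targets.

```python
import torch.nn.functional as F

LABEL_SMOOTHING = 0.1  # training-time value; assumed hyperparameter name

def compute_loss(model, input_ids, targets):
    logits = model(input_ids)  # (B, T, V)
    # Only smooth during training; eval/val loss uses hard targets (LS=0).
    smoothing = LABEL_SMOOTHING if model.training else 0.0
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        label_smoothing=smoothing,
    )
```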

True results (eval with LS=0):

| Config | Post-quant BPB @1024 | Post-quant BPB @1536 |
|--------|:---:|:---:|
| Baseline (no LS) | 1.3381 | ~1.338 |
| label_smoothing=0.1 | 1.4032 | 1.3943 |
| Delta | +0.065 | +0.056 |

Label smoothing=0.1 hurts by +0.056 to +0.065 bpb. With only ~1100 steps on 8B tokens the model is severely undertrained and nowhere near overfitting, so adding regularization just constrains capacity.
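A toy illustration of the mechanism (numbers are made up, not from the run): label smoothing mixes the one-hot target with a uniform distribution, so it penalizes exactly the confident correct predictions an undertrained model still needs to make.

```python
import torch
import torch.nn.functional as F

V = 8
logits = torch.full((1, V), -4.0)
logits[0, 0] = 4.0            # confident, correct prediction
target = torch.tensor([0])

hard = F.cross_entropy(logits, target)                          # ~0.002
smooth = F.cross_entropy(logits, target, label_smoothing=0.1)   # ~0.70
print(f"hard-target loss {hard:.3f} vs smoothed-target loss {smooth:.3f}")
```

With smoothing, the training objective pushes probability mass away from the true token even when the model is right, which is pure cost in a regime with no overfitting to counteract.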

Label smoothing is a dead end for this challenge unless someone finds a regime where overfitting matters.
