Follow-up to my eval-time seq_len post. Ran several 3-minute training experiments on 1xH100 and cross-evaluated at different sequence lengths.
## Training Configs (all 3 min wallclock, 1xH100, post-int8)
| Config | Steps | Avg step time | val_bpb (post-int8) |
|--------|-------|---------------|---------------------|
| **Baseline (seq=1024)** | **336** | **530ms** | **1.6407** |
| Train seq=512 | 180 | 1002ms | 2.2512 |
| Grad clip=1.0 | 214 | 843ms | 1.8379 |
### Why seq_len=512 is worse
Despite the shorter sequences (attention cost scales quadratically with seq_len, so steps should in theory get cheaper), each step takes nearly 2x longer (1002ms vs 530ms). With train_batch_tokens=524288, seq_len=512 means 128 sequences per micro-batch vs 64 at seq_len=1024, and the per-sequence overhead dominates whatever the cheaper attention saves. Net result: 46% fewer steps AND the model only ever sees 512 tokens of context.
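To make the batch geometry concrete, here is the arithmetic as a minimal sketch. Only train_batch_tokens=524288 is from the run; the per-device micro-batch budget of 65536 tokens is my assumption, chosen so that seq_len=1024 yields the 64 sequences mentioned above.

```python
# Batch-geometry arithmetic (sketch). train_batch_tokens is from the post;
# device_batch_tokens is an ASSUMED per-micro-batch token budget.
train_batch_tokens = 524288
device_batch_tokens = 65536  # assumption, implies 64 seqs/micro-batch at 1024

for seq_len in (512, 1024):
    seqs_per_micro = device_batch_tokens // seq_len          # 128 vs 64
    micro_steps = train_batch_tokens // device_batch_tokens  # 8 either way
    print(f"seq_len={seq_len}: {seqs_per_micro} seqs/micro-batch, "
          f"{micro_steps} micro-batches/step, "
          f"{train_batch_tokens} tokens/step (identical)")
```

The point the numbers make: tokens per step never changes, so halving seq_len buys no extra data per step, only more sequences to shepherd through each micro-batch.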
### Why grad clipping is worse
GRAD_CLIP_NORM=1.0 adds ~60% overhead per step (843ms vs 530ms), mostly from computing the global gradient norm across all parameters, a full reduction over every gradient tensor. That leaves only 214 steps vs 336 for the baseline. The Muon optimizer already applies Newton-Schulz orthogonalization, which bounds the update magnitude on its own, so explicit clipping on top is redundant, and under a fixed wallclock budget its per-step cost makes it actively harmful.
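For reference, a sketch of what GRAD_CLIP_NORM=1.0 typically maps to in a standard PyTorch loop; the actual training script may wire this differently, but the cost lives in the reduction inside `clip_grad_norm_`:

```python
import torch

def training_step(model, optimizer, loss):
    loss.backward()
    # clip_grad_norm_ computes ONE global L2 norm over the gradients of
    # every parameter (a full pass over all grad tensors), then rescales
    # them in place if the norm exceeds max_norm. That full pass is where
    # the post attributes the ~60% per-step overhead.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```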
## Cross-Evaluation: Train at X, Eval at Y
### Model trained at seq_len=1024 (post-int8)
| eval_seq | val_bpb | delta vs 1024 |
|----------|---------|---------------|
| 512 | 1.6677 | +0.0270 |
| 1024 | 1.6407 | baseline |
| **1536** | **1.6341** | **-0.0066** |
| 2048 | 1.6363 | -0.0044 |
| 3072 | 1.6488 | +0.0081 |
### Model trained at seq_len=512 (post-int8)
| eval_seq | val_bpb | delta vs 512 |
|----------|---------|---------------|
| 256 | 2.2521 | +0.0009 |
| 512 | 2.2512 | baseline |
| 768 | 2.2585 | +0.0073 |
| 1024 | 2.2690 | +0.0178 |
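Both tables come from the same procedure: load one checkpoint, sweep only the eval sequence length. A hedged sketch of that loop is below; the model interface, the bits-per-byte definition, and `bytes_per_token` (average bytes of validation text per token) are my assumptions about the eval harness, not its actual code.

```python
import torch
import torch.nn.functional as F

LN2 = 0.6931471805599453  # ln(2), for nats -> bits

@torch.no_grad()
def val_bpb(model, val_tokens, eval_seq, bytes_per_token):
    """Bits-per-byte at a given eval sequence length.

    Assumes `model` maps (B, T) token ids -> (B, T, V) logits and
    `val_tokens` is a 1-D tensor of validation token ids.
    """
    model.eval()
    total_nats, total_tokens = 0.0, 0
    for i in range(0, val_tokens.numel() - eval_seq - 1, eval_seq):
        x = val_tokens[i : i + eval_seq].unsqueeze(0)       # inputs
        y = val_tokens[i + 1 : i + eval_seq + 1].unsqueeze(0)  # next-token targets
        logits = model(x)
        total_nats += F.cross_entropy(
            logits.flatten(0, 1), y.flatten(), reduction="sum").item()
        total_tokens += y.numel()
    # nats/token -> bits/token -> bits/byte
    return total_nats / total_tokens / LN2 / bytes_per_token

# Same artifact, sweep only the context length:
# for s in (512, 1024, 1536, 2048, 3072):
#     print(s, val_bpb(model, val_tokens, s, bytes_per_token))
```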
## Key Insight: The 1.5x Rule
For the seq_len=1024 model, eval at 1536 (1.5x) gives the best BPB. For the seq_len=512 model, eval at 512 (1.0x) is best — no extrapolation benefit.
The difference is likely training maturity: the 1024 model trained for 336 steps and learned enough long-range structure to benefit from extra context. The 512 model trained for only 180 steps, and whatever it learned does not extrapolate past its training window.
**Prediction for the full 8xH100 leaderboard run** (13K+ steps at seq=1024): evaluating at 1536 should give an even larger improvement, potentially 0.01+ BPB, because a well-trained model uses context more effectively. This is completely free — same model, same artifact.