Eval-time seq_len optimization: a free ~0.007 BPB gain from evaluating at 1.5x training length

Nobody has explored **evaluation-time optimization** yet. The README says "we allow evaluation at any sequence length," so I evaluated a model trained at seq_len=1024 across a range of eval sequence lengths. There is a clear sweet spot above the training length.

## Experiment Setup

Trained a baseline (9 layers, 512 dim, relu², 1024 vocab) for 336 steps on 1xH100 (~3 min), then evaluated the int8-quantized model at various sequence lengths, always under `torch.compile` (critical; see the warning below).
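
To make the setup concrete, here is a minimal sketch of the sweep. It assumes the model is a plain callable returning `(batch, seq, vocab)` logits; the names `eval_bpb` and `BYTES_PER_TOKEN` are mine, not the repo's, and the bytes/token ratio is back-derived from the table below rather than read from the codebase.

```python
import math
import torch
import torch.nn.functional as F

# Bytes/token ratio implied by the numbers in the results table:
# 2.7702 nats/token / ln(2) / 1.6407 bits/byte ~= 2.436 bytes/token.
BYTES_PER_TOKEN = 2.7702 / math.log(2) / 1.6407

@torch.no_grad()
def eval_bpb(model, val_tokens, seq_len, device="cuda"):
    """Mean validation bits-per-byte with the val stream re-chunked at seq_len.

    val_tokens: 1-D LongTensor of validation token ids.
    """
    model.eval()
    n_chunks = (val_tokens.numel() - 1) // seq_len
    total_loss, total_targets = 0.0, 0
    for i in range(n_chunks):
        x = val_tokens[i * seq_len : (i + 1) * seq_len][None].to(device)
        y = val_tokens[i * seq_len + 1 : (i + 1) * seq_len + 1][None].to(device)
        logits = model(x)  # assumed to return (1, seq_len, vocab_size) logits
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        total_loss += loss.item() * y.numel()
        total_targets += y.numel()
    mean_nats = total_loss / total_targets            # nats per token
    return mean_nats / math.log(2) / BYTES_PER_TOKEN  # bits per byte

# Sweep the eval lengths from the table below:
# for s in (64, 128, 256, 512, 1024, 1536, 2048, 3072, 4096, 8192):
#     print(s, eval_bpb(model, val_tokens, s))
```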

## Results

| eval_seq | val_loss | val_bpb | Δ val_bpb vs 1024 |
|----------|----------|---------|---------------|
| 64 | 3.0936 | 1.8322 | +0.1915 |
| 128 | 2.9802 | 1.7650 | +0.1243 |
| 256 | 2.8863 | 1.7094 | +0.0687 |
| 512 | 2.8158 | 1.6677 | +0.0270 |
| 1024 | 2.7702 | 1.6407 | baseline |
| **1536** | **2.7591** | **1.6341** | **-0.0066** |
| 2048 | 2.7628 | 1.6363 | -0.0044 |
| 3072 | 2.7840 | 1.6488 | +0.0082 |
| 4096 | 2.8138 | 1.6665 | +0.0258 |
| 8192 | 2.9697 | 1.7588 | +0.1181 |

## Key Findings

1. **Optimal eval seq_len is ~1.5x training seq_len.** Evaluating at 1536 gives -0.0066 bpb for free (zero parameter cost, same artifact), which clears the 0.005 threshold for a meaningful improvement.

2. **Beyond 2x training length, RoPE extrapolation destroys performance.** At 3072 (3x), BPB is already worse than baseline; at 8192 it loses 0.12 BPB. The model uses RoPE with base=10000 and head_dim=64, which does not extrapolate well (see the sketch after this list).

3. **This is a FREE improvement** that stacks with everything else (QAT, SwiGLU, architecture changes). Just change the eval seq_len.

4. **Eval time is negligible.** All configs complete in <60s on 1xH100. On 8xH100, this would be <10s.
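
On finding 2, a quick back-of-the-envelope sketch using standard RoPE math (base=10000, head_dim=64 as stated above; this is not the repo's code) shows why extrapolation breaks: the slowest frequency bands never complete a full rotation within the 1024-token training window, so longer eval positions put them at phases the model has never seen.

```python
import math
import torch

# RoPE band k (of head_dim/2 = 32) rotates by angle = pos * base**(-2k/head_dim).
# Bands that finish at least one full 2*pi turn within the training window have
# been exposed to every phase; slower bands have not, so eval positions beyond
# the training length land on phases never observed during training.
head_dim, base, train_len = 64, 10000.0, 1024
inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
full_turns = train_len * inv_freq / (2 * math.pi)
print(f"{(full_turns < 1.0).sum().item()} of {inv_freq.numel()} bands "
      f"never complete a full rotation in {train_len} positions")
```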

## For fully trained models (13K+ steps on 8xH100)

The gain should be LARGER because:
- Better-trained models learn stronger long-range dependencies
- More context = better predictions when the model actually uses it
- The 0.007 bpb gain measured on a 336-step model is therefore plausibly a lower bound

## Recommendation

Set `TRAIN_SEQ_LEN=1024` during training but evaluate at 1536. This requires modifying `eval_val` to accept a separate `eval_seq_len` parameter; the change is ~5 lines, sketched below.
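
A sketch of what that change might look like; `eval_val`'s real signature in the repo may differ, and `eval_bpb` is the helper sketched in the setup section above.

```python
TRAIN_SEQ_LEN = 1024
EVAL_SEQ_LEN = int(1.5 * TRAIN_SEQ_LEN)  # 1536, the sweet spot from the table

def eval_val(model, val_tokens, eval_seq_len=EVAL_SEQ_LEN):
    # Decouple eval chunking from TRAIN_SEQ_LEN; everything else stays the same.
    return eval_bpb(model, val_tokens, seq_len=eval_seq_len)
```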

## Critical Warning: torch.compile affects eval accuracy

I discovered that evaluating WITHOUT torch.compile gives wildly different results (1.88 BPB vs. 1.64 BPB, a 0.24 BPB gap). The model was trained with `torch.compile(dynamic=False, fullgraph=True)`, and the compiled execution path produces different floating-point results due to operator fusion. Always use torch.compile during evaluation to match training numerics.
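
In practice that means compiling with the same flags before measuring. One note on the sketch below: with `dynamic=False`, each new eval seq_len triggers a recompile, which is cheap at these eval times.

```python
import torch

# Compile with the SAME flags used in training so the fused kernels (and
# their floating-point rounding) match the numerics the model was trained on.
model = torch.compile(model, dynamic=False, fullgraph=True)
```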
