Nobody has explored **RoPE base frequency tuning** yet. A higher rope_base (e.g., Llama 3's 500K) allows better length extrapolation but trades off short-range positional resolution. I tested this.
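For context, RoPE rotates each pair of head dimensions at a frequency set by rope_base, and the tradeoff falls out of the wavelengths directly. A minimal sketch (head_dim=64 is an illustrative assumption, not necessarily this run's config):

```python
import math

def rope_frequencies(head_dim: int, rope_base: float) -> list[float]:
    """Per-pair RoPE rotation frequencies: theta_i = rope_base ** (-2i / head_dim)."""
    return [rope_base ** (-i / head_dim) for i in range(0, head_dim, 2)]

# The slowest pair's wavelength grows with rope_base: more headroom for long
# contexts, but every pair also rotates less per token, i.e. coarser
# short-range position resolution.
for base in (10_000, 100_000, 1_000_000):
    slowest = rope_frequencies(64, base)[-1]
    print(f"base={base:>9,}: slowest wavelength ~{2 * math.pi / slowest:,.0f} positions")
```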
## Experiment
Trained 3 models for 3 min on 1xH100, identical except for rope_base (10K, 100K, 1M), then cross-evaluated each at seq_lens from 512 to 8192.
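Cross-evaluating past the training length works because RoPE is computed analytically from position indices: only the cos/sin cache needs rebuilding at the new length, never the weights. A minimal sketch of that cache, assuming head_dim=64 (illustrative, not necessarily this run's config):

```python
import torch

def rope_cache(seq_len: int, head_dim: int, rope_base: float):
    """Precompute RoPE cos/sin tables for an arbitrary eval length."""
    inv_freq = rope_base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)
    return angles.cos(), angles.sin()

# Same checkpoint, six eval lengths: only this table changes.
for L in (512, 1024, 1536, 2048, 4096, 8192):
    cos, sin = rope_cache(L, 64, 10_000.0)
    print(L, cos.shape)  # (L, head_dim // 2)
```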
## Training Results (all at seq_len=1024, post-int8)
| Config | Steps | val_bpb | Note |
|--------|-------|---------|------|
| **rope=10K (baseline)** | **336** | **1.6407** | Best |
| rope=100K | 171 | 2.3812 | -49% steps |
| rope=1M | 168 | 2.4852 | -50% steps |
Higher rope_base is dramatically worse at the training length. The reduced step count is NOT the full explanation: even normalized per step, the higher-base models converge more slowly, because the positional encoding has less short-range resolution.
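To put a rough number on "less resolution" (back-of-envelope, again assuming head_dim=64): count the rotation pairs that complete at least one full cycle inside the 1024-token training window.

```python
import math

def full_cycle_pairs(head_dim: int, rope_base: float, seq_len: int) -> int:
    """Count RoPE pairs whose wavelength fits within seq_len, i.e. pairs
    that sweep a full rotation inside the training window. A crude proxy
    for how much positional resolution the model sees at that length."""
    return sum(
        1 for i in range(0, head_dim, 2)
        if 2 * math.pi * rope_base ** (i / head_dim) <= seq_len
    )

for base in (10_000, 100_000, 1_000_000):
    n = full_cycle_pairs(64, base, 1024)
    print(f"base={base:>9,}: {n}/32 pairs fully cycle within 1024 tokens")
# -> 18/32, 15/32, 12/32: higher base leaves more pairs barely rotating.
```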
## Length Extrapolation (delta BPB vs eval at 1024)
| eval_seq | rope=10K | rope=100K | rope=1M |
|----------|----------|-----------|----------|
| 512 | +0.027 | -0.006 | -0.001 |
| 1024 | baseline | baseline | baseline |
| 1536 | **-0.007** | +0.006 | +0.003 |
| 2048 | -0.004 | +0.012 | +0.007 |
| 4096 | +0.026 | +0.027 | +0.021 |
| 8192 | **+0.118** | +0.043 | **+0.036** |
## Key Findings
1. **Higher rope_base extrapolates better** — at 8192, rope=10K loses 0.118 bpb while rope=1M loses only 0.036. The position encoding generalizes to unseen lengths more gracefully.
2. **BUT higher base never benefits from extra context.** rope=10K gets -0.007 at 1536; rope=100K and 1M get WORSE at all lengths >1024. The model cannot distinguish fine-grained positions, so longer context adds noise rather than signal.
3. **The default rope_base=10000 is optimal for seq_len=1024.** It benefits from mild extrapolation (1.5x) and only degrades significantly beyond 3x.
4. **Extrapolation quality matters less than performance at the training length.** rope=1M extrapolates beautifully (flat degradation curve) but starts 0.85 bpb behind the baseline at 1024. No amount of longer-context eval will recover that gap.
## When higher rope_base MIGHT help
- Training at seq_len=4096+ with enough compute — the resolution penalty shrinks at longer sequences
- Using RoPE scaling techniques (NTK-aware, YaRN) at eval time to get both resolution and extrapolation (see the sketch after this list)
- Much longer training runs where the model has time to adapt to the sparser position signals
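For the second bullet, the simplest of those techniques is NTK-aware base adjustment: at eval time, inflate rope_base by scale^(d/(d-2)) so the slow dims stretch to cover the longer context while the fast dims, which carry short-range resolution, stay nearly unchanged. A minimal sketch; head_dim=64 and the scale factor are illustrative assumptions:

```python
def ntk_scaled_base(rope_base: float, head_dim: int, scale: float) -> float:
    """NTK-aware RoPE scaling: base' = base * scale ** (d / (d - 2)).

    Stretches the long-wavelength dims to cover `scale`x the training
    context while leaving the fast dims almost untouched.
    """
    return rope_base * scale ** (head_dim / (head_dim - 2))

# Illustrative: trained at 1024, evaluating at 8192 -> scale = 8
print(ntk_scaled_base(10_000, 64, 8.0))  # ~85,500
```

YaRN refines this with per-dimension interpolation and an attention temperature term; the plain base inflation above is the core idea.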
## Recommendation
Keep **rope_base=10000** and evaluate at **1536** for a free -0.007 bpb improvement. Do NOT increase rope_base unless you also train at longer sequences.