top golfers — best: 1.06107587 bpb

1. 1.06107587 bpb — 11L XSA + LQER + SparseAttnGate + SmearGate (BOS-fixed) + PolarNS Muon + 9-hparam stack (Benjamin Hadad) 15.9MB
11L 512d 8H/4KV transformer with U-Net skips, parallel decoder, partial RoPE, Polar-Express Newton-Schulz Muon (5 steps), LQER asymmetric int4 rank-4 quant correction, sparse attention head-output gate (gate_window=12), SmearGate position-mixing (with cross-document leak fix on BOS positions), fused LeakyReLU-square MLP, fused softcapped CE Triton kernel, GPTQ int6 + int7 embed + per-row int8 attn-gate quantization, per-group lrzip+brotli compression, phased TTT eval (3 phases, prefix=2500 docs). Plus 9 greedy-validated hyperparameter overrides on top of the published baseline. 3-seed mean: 1.06107587 bpb, beating the current official leaderboard (1.0810 bpb) by 0.01992 bpb / 0.04359 nats.
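
For readers new to the optimizer family, below is a minimal sketch of the 5-step Newton-Schulz orthogonalization at the heart of Muon, using the standard quintic coefficients from the public Muon implementation. The Polar-Express variant named in this entry swaps in a per-iteration coefficient schedule, which is not reproduced here.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to G via a quintic iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315        # standard Muon coefficients
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)                # Frobenius norm bounds the spectral norm
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                              # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # Polar Express would vary a, b, c per step
    return X.T if transposed else X
```
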
2. 1.06128 bpb — SmearGate BOS Fix + PR #1787 Base + LQER Asym + Phased TTT (aquarious dante workman)
3. 1.06145 bpb — 3-Seed Compliance Reproduction — Support for PR #1851 (Christopher-Lee-McClendon)
4. 1.06335 bpb — PR #1736 + Polar Express NS + MIN_LR + Sparse Attention Gate + Fused CE + PR #1767 TTT (nprime06)
Stacks four training-time wins on top of #1736 (Polar Express NS, MIN_LR=0.10, sparse attention gate, fused softcapped CE) plus PR #1767's TTT improvements (LoRA alpha=144, warm-start A, WD=1.0). Training-time and TTT changes are orthogonal: training produces quantized artifacts, TTT improvements only affect the eval-time adaptation. All training wins were ablation-validated on stock #1736 seed 0. TTT improvements validated via eval-only mode on saved quantized artifacts.
5. 1.06453 bpb — SP8192 + CaseOps + Gated Attention + Quant Gate + Loop4-5 + Phased TTT + MLPClip12 (dexhunter) 16.0MB
Retune of the MLP GPTQ outlier-clip on top of the PR #1736 CaseOps+GatedAttn+QuantGate+Loop4-5+PhasedTTT stack. One-line change: mlp_clip_sigmas default 10.0 -> 12.0 (less aggressive outlier clipping preserves MLP tail mass that carries signal at int6). -0.00096 bpb vs PR #1736 on 5-seed mean.
6. 1.06549 bpb — SP8192 + CaseOps (lossless case preprocessing) + Gated Attention + Quant Gate + Loop4-5 + Phased TTT (dexhunter)
CaseOps reversible case-preprocessing tokenizer (adds TITLE/ALLCAPS/CAPNEXT/ESC as user_defined_symbols) on the PR #1530 SP8192 stack, plus a lightweight learned attention-output gate and quant-gate scaling, Loop45 depth recurrence, multi-phase SGD score-first TTT. BPB scored on ORIGINAL pre-transform UTF-8 bytes via a per-token byte sidecar.
7. 1.06780094 bpb — CaseOps Tokenizer + Mild WD Taper (romeerp)
8. 1.07139 bpb — SmearGate + Attention Output Gate + Score-First TTT (MarioPaerle)
9. 1.07193 bpb — VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT + Trimmed GPTQ + MLR 0.026 (dexhunter)
10. 1.07280628 bpb — VarLenAttn + PhasingTTT (romeerp)
11. 1.07336388 bpb — Varlen Attention + Fused MLP + doc-independent TTT (samacqua)
12. 1.07983273 bpb — SP8192 + Muon momentum 0.97 + Legal Score-First TTT + Causal N-gram Token Tilt (dexhunter) 16.0MB
PR #1394 sp8192 stack + muon_momentum=0.97 (vs 0.99 default) + legal score-first TTT + causal n-gram token tilt (within/word experts disabled: within_beta=0, word_beta=0). 3-seed mean 1.07983 across seeds 0/42/1234. Beats PR #1493 (1.0810 bpb), the current merged legal-track SOTA, by 0.00117 bpb = 0.00302 nats/token.
13. 1.081 bpb — SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal Score-First TTT (bigbag)
14. 1.08217956 bpb — SP8192 + Parallel Residuals + Score-First TTT (aryanbhosale)
15. 1.08279384 bpb — SP8192 + QK-Gain 5 + Legal Score-First TTT (dexhunter) 16.0MB
PR #1394 sp8192 stack + QK_GAIN_INIT=5.0 + legal score-first TTT (TTT_LR=0.005, freeze 0 blocks, 3 epochs). 3-seed mean 1.08279 across seeds 0/42/1234. No SLOT, no ETLB, no pre-quant TTT. Beats PR #1394 by 0.00283 bpb = 0.00731 nats/token.
16. 1.08354 bpb — Non-record: Parallel Residuals + Hessian-Aware SDClip (Robby Sneiderman) 16.0MB
Three zero-cost modifications on PR #1394: GPT-J parallel residuals (layers 7+), Hessian-diagonal SDClip modulation (lambda=0.175), two-phase progressive recurrence. 3-seed mean 1.08354 bpb.
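
A minimal sketch of the GPT-J-style parallel-residual block referenced in this entry (module names are illustrative; `attn` and `mlp` are any (B, T, D) -> (B, T, D) modules): attention and MLP read the same normalized input and are summed into the residual stream, rather than running sequentially.

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """GPT-J-style block: attention and MLP branch from one LayerNorm in parallel."""
    def __init__(self, attn: nn.Module, mlp: nn.Module, dim: int):
        super().__init__()
        self.attn, self.mlp, self.norm = attn, mlp, nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h) + self.mlp(h)   # parallel, not sequential, residuals
```
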
17. 1.08563 bpb — SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SD-Clip + Simplifications (Kevin Clark) 16.0MB
8192-token SentencePiece vocab, GPTQ-quantized embeddings, layers 4-5 looped twice, row-normalized Muon, standard-deviation-based weight clipping.
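
A minimal sketch of the "layers 4-5 looped twice" depth recurrence, assuming a plain list of transformer blocks (`blocks` is a stand-in, not this PR's API): the looped span is re-run with tied weights to buy effective depth at no parameter cost.

```python
def forward_with_recurrence(blocks, x, loop_start=4, loop_end=5, repeats=2):
    """Run all blocks once, but pass through blocks[loop_start..loop_end] `repeats` times."""
    for i, block in enumerate(blocks):
        x = block(x)
        if i == loop_end:                          # span just finished its first pass
            for _ in range(repeats - 1):           # extra tied-weight passes
                for j in range(loop_start, loop_end + 1):
                    x = blocks[j](x)
    return x
```
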
18. 1.08971631 bpb — SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 (aryanbhosale)
19. 1.0912 bpb — Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ (dexhunter) 16.0MB
WD-quantization synergy: higher weight decay (0.090) improves compression enough to keep ALL 66 layers at int6. Combined with MuonEq-R and depth recurrence. 3-seed mean 1.0912 bpb / 2.5106 nats. No TTT, no SLOT.
20. 1.09785 bpb — 4096-Vocab + Larger Model + High WD + Simplifications (Kevin Clark) 15.9MB
Vocab 4096, MLP 4x, WD 0.085, co-prime data loader, GPTQ, brotli, sigmoid-gated UNet skips, simplified architecture
21. 1.10625353 bpb — Parallel Residuals + Mini Depth Recurrence (Marko Sisovic) 15.9MB
Built from PR #1179 with AR self-generated GPTQ and mixed quantization ported from PR #1105. Adds parallel residual routing from layer 7 plus delayed mini depth recurrence on layers 4,5 with untied repeated MLPs. Exact 3-seed mean: 1.10625353 bpb / 1.86785780 nats, improving on PR #1179 by 0.00722646 nats and on the current merged SOTA by 0.01432073 nats.
22. 1.1122 bpb — Coprime-Stride Loader + Full Hessian GPTQ + XSA-all + BigramHash (Dex Hunter) 16.0MB
Coprime-stride multi-shard data pipeline (PR #726 style) + Full Hessian GPTQ with Cholesky error compensation + XSA on all 11 layers + BigramHash(2816x112) + EMA + Parallel Muon. No TTT (sliding-only outperforms TTT on this stack). 3-seed mean: 1.1122 (std 0.0004). Built on PR #549.
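
A minimal sketch of the coprime-stride idea (illustrative names, not the PR's pipeline): stepping through n shards/windows with a stride coprime to n visits every index exactly once per pass while breaking up on-disk locality.

```python
from math import gcd

def coprime_order(n: int, stride: int):
    """Yield 0..n-1 in stride order; coprimality guarantees full coverage."""
    assert gcd(n, stride) == 1, "stride must be coprime to n"
    idx = 0
    for _ in range(n):
        yield idx
        idx = (idx + stride) % n

print(list(coprime_order(10, 3)))   # 0, 3, 6, 9, 2, 5, 8, 1, 4, 7
```
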
23. 1.11473509 bpb — AR Self-Gen GPTQ + XSA-all + BigramHash 3072x112 (abaybektursun) 16.0MB
11L XSA-all + Full Hessian GPTQ with autoregressive self-generated calibration (no val/train data accessed during quantization) + selective-pruning stack. BigramHash(3072,112), warmdown=4000, lzma preset=9. 3-seed exact mean: 1.11473509 bpb / 1.88217853 nats, beating PR #549's exact 3-seed mean of 1.11937967 bpb / 1.89002068 nats by 0.00784215 nats (Welch t=-11.83, df=3.31).
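
A minimal sketch of what autoregressive self-generated calibration could look like, assuming `model(ids)` returns (B, T, vocab) logits (an assumption, not this PR's API): sample sequences from the trained model itself and hand them to the GPTQ pass, so no train or val bytes are touched.

```python
import torch

@torch.no_grad()
def self_generated_calibration(model, bos_id: int, n_seqs=8, seq_len=512):
    """Sample calibration sequences from the model itself (hypothetical logits API)."""
    model.eval()
    seqs = torch.full((n_seqs, 1), bos_id, dtype=torch.long)
    for _ in range(seq_len - 1):
        logits = model(seqs)[:, -1, :]                   # next-token logits
        nxt = torch.multinomial(logits.softmax(-1), 1)   # sample, don't argmax
        seqs = torch.cat([seqs, nxt], dim=1)
    return seqs                                          # feed to the GPTQ pass
```
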
24. 1.1194 bpb — LeakyReLU² + Legal Score-First TTT + Parallel Muon (abaybektursun) 16.0MB
LeakyReLU(0.5)² activation (-0.003 bpb vs relu²) + legal score-first TTT (PR #461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) + Parameter Banking + Parallel Muon (PR #399). Built on PR #414 stack. 3-seed mean: 1.1194 (std 0.0006). All artifacts under 16MB, all eval under 10 min.
25. 1.12278022 bpb — Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (Tianhao Wu) 15.6MB
EMA(0.997) weight averaging + GPTQ-lite optimal clip-percentile search + warmdown=3500 + Late QAT threshold=0.15, built on the PR #374 stack (11L, XSA4, Partial RoPE 16/64, LN Scale, VE128, Tight SWA, SmearGate, BigramHash, int6+zstd-22).
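
A minimal sketch of the EMA(0.997) weight averaging used here and in the entries below: keep a shadow copy of the weights, update it after every optimizer step, and quantize/evaluate the shadow rather than the raw weights.

```python
import copy, torch

class EMA:
    """Shadow weights: s <- 0.997 * s + 0.003 * p after every optimizer step."""
    def __init__(self, model, decay: float = 0.997):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval().requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)   # s = decay*s + (1-decay)*p
```
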
26. 1.12484502 bpb — Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (Jack Princz) 15.6MB
11 layers with Partial RoPE (16 of 64 dims), LN Scale (1/sqrt(l+1)), EMA weight averaging (decay=0.997), Exclusive Self Attention (XSA) on last 4 layers. Int6 per-row on all MLP+attention weights, int8 tok_emb, zstd-22. Weight decay 0.04 (Muon+AdamW). OrthoInit + muP scaling. SmearGate + BigramHash(2048x128). FA3. Sliding window eval stride=64, seq=2048. Note: Late QAT flag is present in the code but inactive due to torch.compile constant-folding.
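
A minimal sketch of Partial RoPE as described (rotate the first 16 of 64 head dims, pass the rest through unchanged); `cos`/`sin` stand in for the usual rotary tables, broadcastable to the rotated half.

```python
import torch

def partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                 rot_dims: int = 16) -> torch.Tensor:
    """x: (..., 64). Rotate the first `rot_dims` dims; cos/sin cover rot_dims // 2."""
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```
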
27. 1.12707468 bpb — Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (Jack Princz) 15.5MB
11 layers with Exclusive Self Attention (XSA) on last 4 layers, EMA weight averaging (decay=0.997), int6 per-row on all MLP+attention weights, int8 tok_emb, zstd-22. Weight decay 0.04 (Muon+AdamW). OrthoInit + muP scaling. SmearGate + BigramHash(2048x128). FA3. Sliding window eval stride=64, seq=2048.
28. 1.13071416 bpb — 11L + Efficient Partial XSA + FA3 + SWA/120 (val_bpb: 1.1307) (vadim borisov (tabularis.ai)) 15.9MB
11 layers, int6 quant, zstd-22. Novel contribution: Efficient Partial Exclusive Self Attention (XSA, arXiv:2603.09078) applied to deepest 3 layers only. GQA-aware reshape avoids tensor duplication, adding <2ms/step overhead. XSA subtracts self-value projection from attention output, forcing deeper layers to learn from context rather than self-reference. SWA every 120 steps (13 checkpoint avg). OrthoInit + muP scaling. SmearGate + BigramHash(2048x128). FlashAttention 3 + NTK RoPE. Weight decay 0.04 (Muon+AdamW).
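
A minimal sketch of the XSA idea in a direct, non-fused form (the PR's GQA-aware FA3 version is more involved): run standard causal attention, then subtract each position's own value contribution a_ii * v_i from the output.

```python
import torch

def xsa_attention(q, k, v):
    """q, k, v: (B, H, T, D). Causal attention minus each token's self-term a_ii * v_i."""
    T = q.size(-2)
    scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), 1)
    attn = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
    self_w = attn.diagonal(dim1=-2, dim2=-1).unsqueeze(-1)   # a_ii: (B, H, T, 1)
    return attn @ v - self_w * v
```
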
29. 1.14581692 bpb — Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Muon WD + SWA (Raahil Shah) 15.9MB
Per-row int6 quantization on MLP/attention weights with zstd-22 compression, enabling 3x MLP expansion (hidden=1536). SmearGate blends adjacent token embeddings via a learned gate. BigramHash embedding (4096 buckets, dim=128) captures token-pair context. Orthogonal weight initialization with muP output scaling. Muon optimizer with decoupled weight decay (WD=0.04) and momentum warmup (0.92->0.99 over 1500 steps). Stochastic Weight Averaging every 50 steps over the last 50% of training. Trained at seq_len=2048 with batch=786432, grad_clip=0.3, warmdown=3000. Sliding window evaluation at stride=64.
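
Minimal sketches of the two embedding-side pieces, with guessed shapes and a stand-in hash (neither is this PR's exact code): SmearGate mixes each token's embedding with its predecessor through a learned per-channel gate, and BigramHash adds a hashed embedding of the (previous, current) token pair.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each token embedding with its predecessor via a learned per-channel gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))          # sigmoid(0) = 0.5 start

    def forward(self, x):                                   # x: (B, T, D)
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)      # shift right by one token
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev

class BigramHash(nn.Module):
    """Hashed embedding of the (previous, current) token pair."""
    def __init__(self, buckets: int = 4096, dim: int = 128):
        super().__init__()
        self.emb, self.buckets = nn.Embedding(buckets, dim), buckets

    def forward(self, ids):                                 # ids: (B, T) int64
        prev = torch.cat([ids[:, :1], ids[:, :-1]], dim=1)
        return self.emb((prev * 1000003 + ids) % self.buckets)
```
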
30. 1.15015359 bpb — (untitled) (aruniyer)
31. 1.1556 bpb — (untitled) (author unknown)
32. 1.1574404 bpb — Int6 MLP3x Sliding Window (samuellarson) 16.0MB
Int6 post-training quantization enables 3x MLP expansion (21.8M params in 16MB). Combined with train@2048 + sliding window eval + FP16 tied embeddings + Late-K passthrough.
33. 1.15861696 bpb — 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (yahya010) 15.6MB
10-layer 512dim SP-1024, STE int6 QAT (zero quant gap), full int6 [-31,31] + zstd-22, MLP hidden=1344, fp16 tied embedding, Muon 0.99, LR 0.02, grad clip 0.3, sliding window stride=64.
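
A minimal sketch of straight-through int6 fake quantization as described: the forward pass sees per-row-quantized weights in [-31, 31], the backward pass treats quantization as identity, so the final int6 roundtrip costs almost nothing ("zero quant gap").

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Forward: per-row int6 values in [-31, 31]. Backward: identity (STE)."""
    scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 31.0
    wq = torch.clamp((w / scale).round(), -31, 31) * scale
    return w + (wq - w).detach()
```
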
34. 1.16301431 bpb — Mixed Quant (int6 blocks + int8 embeddings) + Sliding Window Eval, val_bpb=1.1630 (aquariouseworkman) 15.4MB
3x MLP expansion with mixed-precision quantization: int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on embedding, zlib-9 compression, sliding window evaluation at stride=64.
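
A minimal sketch of the mixed-precision artifact roundtrip under the stated level counts (31 levels = [-15, 15] for block weights, 127 levels = [-63, 63] for the embedding); tensor names and packing layout are illustrative, not this PR's format.

```python
import zlib
import numpy as np

def quantize_per_row(w: np.ndarray, max_q: int):
    """Symmetric per-row quantization to [-max_q, max_q] with fp16 scales."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / max_q, 1e-8)
    q = np.clip(np.round(w / scale), -max_q, max_q).astype(np.int8)
    return q, scale.astype(np.float16)

def pack_artifact(block_weights: dict, embedding: np.ndarray) -> bytes:
    parts = []
    for _, w in sorted(block_weights.items()):
        q, s = quantize_per_row(w, max_q=15)        # 31 levels for block weights
        parts += [q.tobytes(), s.tobytes()]
    q, s = quantize_per_row(embedding, max_q=63)    # 127 levels for the embedding
    parts += [q.tobytes(), s.tobytes()]
    blob = zlib.compress(b"".join(parts), level=9)
    assert len(blob) <= 16_000_000, "artifact over the 16 MB cap"
    return blob
```
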
35. 1.19250007 bpb — Sliding Window Eval (stride=64) (Matthew Li) 15.9MB
Baseline 9x512 SP-1024 architecture with sliding window evaluation at stride=64. Each token is scored with 960+ tokens of context instead of the baseline's 0-1023. Training is identical to the naive baseline; the improvement comes entirely from the evaluation strategy. Post-quant int8+zlib roundtrip under the 16,000,000-byte cap.
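
A minimal sketch of stride-64 sliding-window scoring, assuming `model(ids)` returns (B, T, vocab) logits and the sequence length aligns with the stride: after the first window, only the last 64 predictions of each window are counted, so every scored token sees a long prefix.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, ids, window=1024, stride=64):
    """Mean nats/token; assumes (len(ids) - window) is a multiple of stride."""
    nll, counted = 0.0, 0
    for start in range(0, ids.size(0) - window + 1, stride):
        chunk = ids[start:start + window].unsqueeze(0)
        logits = model(chunk)[0]                       # (window, vocab)
        lo = 0 if start == 0 else window - stride - 1  # later windows: last 64 preds only
        nll += F.cross_entropy(logits[lo:-1], chunk[0, lo + 1:],
                               reduction="sum").item()
        counted += window - 1 - lo
    return nll / counted                               # bpb = this / (ln 2 * bytes/token)
```
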
36. 1.1929 bpb — LoRA TTT (sam) 15.9MB
Naive baseline + per-document LoRA test-time training at eval. Rank-8 LoRA on lm_head/Q/V with Adam lr=0.01, overlapping 256-token chunks in 1024-token context windows. Same training, smarter eval.
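
A minimal sketch of the per-document LoRA TTT loop; the LoRA wrapper below and the `model(x) -> (B, T, vocab) logits` call are assumptions for illustration, not this PR's code. The wrapper would be swapped in for lm_head/Q/V before adaptation and reset between documents.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable rank-8 update B @ A (B starts at zero)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T

def ttt_document_nll(model, ids, lr=0.01, chunk=256, window=1024):
    """Adapt only LoRA params on overlapping chunks of one document, then score it."""
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for start in range(0, max(1, ids.numel() - window + 1), chunk):
        span = ids[start:start + window].unsqueeze(0)
        logits = model(span)                               # (1, T, vocab), assumed API
        loss = F.cross_entropy(logits[0, :-1], span[0, 1:])
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                  # score (last window only, for brevity)
        logits = model(ids[-window:].unsqueeze(0))
        return F.cross_entropy(logits[0, :-1], ids[-window:][1:]).item()
```
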
37. 1.20143417 bpb — Training Opt Seq4096 v1 (Spokane Way) 15.9MB
SP-1024 9x512 KV4 run at TRAIN_SEQ_LEN=4096 with aggressively tuned Muon optimizer: momentum 0.99, lower LR (0.020/0.020/0.030), 3/4 batch (393K tokens), warmdown 3000 steps, and extended momentum warmup (1500 steps from 0.92). Combines long-context training with training optimization to beat the naive baseline by 0.023 bpb.
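
A minimal sketch of the extended momentum warmup mentioned above: ramp Muon's momentum linearly from 0.92 to its target over the first 1500 steps, setting it on the param group each step.

```python
def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, target: float = 0.99) -> float:
    """Linear momentum ramp, clamped at the target after warmup."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (target - start)

# applied each step before stepping the optimizer:
# for group in muon_opt.param_groups:
#     group["momentum"] = muon_momentum(step)
```
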
38. 1.20576485 bpb — Long Context Seq2048 v2 (Spokane Way) 15.9MB
SP-1024 9x512 KV4 run at TRAIN_SEQ_LEN=2048 with tuned seq2048 learning rates (0.040/0.032/0.032). This standalone record script reproduces the SXM-verified 10-minute artifact under the 16,000,000-byte cap.
39. 1.214745 bpb — 10L Mixed Precision (Nan Liu) 15.9MB
10-layer 512-dim model with lower LR (MATRIX_LR=0.02) and mixed int8/int6 compression: full int8 for first/last 3 layers, int6 (step=4 rounding) for middle layers 3-6. Fits 16MB via better compression while gaining an extra transformer layer over baseline.
40. 1.21972502 bpb — FP16 Tied Embedding + LR/Warmdown Tuning (Renier Velazco) 15.9MB
Keep tok_emb.weight in fp16 during int8 quantization to eliminate the output-head quantization gap (0.007 -> 0.0005 bpb). Slightly reduce MLP hidden (992 vs 1024) to fit within 16MB. Tune warmdown (3600 vs 1200) and matrix LR (0.06 vs 0.04) for better convergence under the 10-min wallclock cap.
41. 1.22296644 bpb — Lower LR (Nan Liu) 15.9MB
Same 9x512 SP-1024 KV4 tied-embedding baseline architecture with lower Muon/Adam learning rates (MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03). Systematic LR sweep showed default 0.04 was too high; optimal is ~0.02.
42. 1.2243657 bpb — Naive Baseline (Baseline) 15.9MB
SP-1024 9x512 KV4 run on pgut1 using the published Hugging Face fineweb10B_sp1024 export and the current train_gpt.py; score is the default final int8+zlib roundtrip metric under the 16,000,000-byte cap.
43. (score not recorded) — Sliding Window + FP16 Embed + 10L + Muon WD + Overtone Init (notapplica)
44. 1.14276 bpb — 10L Int5-MLP + BigramHash(10240) + SWA(frac=0.4) + WD=0.04 (thwu1) 15.9MB
10 layers with mixed int5/int6 quantization. BigramHash 10240 buckets (up from 4096). SWA start_frac=0.4 (24 converged checkpoints). WD=0.04 global, warmdown=3000. Mean of 3 seeds: 1.14276 (std 0.00016). SmearGate + OrthoInit + zstd-22.

100 open PRs at https://github.com/openai/parameter-golf/pulls

https://github.com/openai/parameter-golf
