Nobody has explored **depth recurrence / weight tying** yet, and it's explicitly called out in the README as an interesting direction. Here's my analysis.
## Core Idea
Instead of 9 unique transformer blocks, use 3 unique blocks repeated 3 times each (or other configurations). The compressed artifact only stores the unique blocks, but the model effectively has 9+ layers of depth. The saved parameter budget lets you **dramatically increase model width**.
## Parameter Budget Analysis
| Config | dim | Unique Blocks | Effective Layers | ~Compressed Size |
|--------|-----|--------------|-----------------|-----------------|
| Baseline | 512 | 9 | 9 | ~10.3MB |
| **3-unique x3 @ d=768** | 768 | 3 | 9 | ~8.0MB |
| 3-unique x4 @ d=768 | 768 | 3 | 12 | ~8.0MB |
| 3-unique x3 @ d=1024 | 1024 | 3 | 9 | ~13.9MB |
| 4-unique x3 @ d=768 | 768 | 4 | 12 | ~10.5MB |
| 2-unique x6 @ d=1024 | 1024 | 2 | 12 | ~9.5MB |
The sweet spot looks like **3 unique blocks at dim=768, repeated 4x for 12 effective layers** — 1.5x wider AND 33% deeper, at only ~8MB compressed. Or push to d=1024 with 3 repeats for maximum width.
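As a rough sanity check on the table, the sizes are consistent with a simple model: assume each transformer block costs about 12·d² parameters (4·d² for attention plus 8·d² for a 4x-expansion MLP) and that compressed bytes scale linearly with unique parameter count. This is my back-of-envelope estimator, not the repo's actual accounting; embeddings and entropy-coding effects are ignored, so expect a few percent of error.

```python
# Rough estimator for the compressed-size column above.
# Assumption (mine, not the repo's): ~12*d^2 params per unique block,
# and compressed MB scales linearly with unique parameter count.

BASELINE_MB, BASELINE_BLOCKS, BASELINE_DIM = 10.3, 9, 512

def est_compressed_mb(n_unique: int, dim: int) -> float:
    ratio = (n_unique * dim**2) / (BASELINE_BLOCKS * BASELINE_DIM**2)
    return BASELINE_MB * ratio

print(round(est_compressed_mb(3, 768), 1))   # ~7.7, vs the table's ~8.0
```

The estimates land within ~0.4MB of every row in the table, which suggests the budget numbers are internally consistent.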
## Implementation Sketch
The GPT forward pass change is minimal. Instead of iterating over the block list directly, iterate over a repeated index schedule into the unique blocks.
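A minimal sketch of that loop change (names like `schedule` and `unique_blocks` are illustrative, not the repo's actual identifiers):

```python
# Hypothetical sketch of the tied forward pass. With weight tying, the
# loop walks a repeated index schedule instead of the raw block list.

NUM_UNIQUE = 3
NUM_REPEATS = 3

# Untied: for block in self.blocks: x = block(x)
# Tied:   build [0, 1, 2, 0, 1, 2, 0, 1, 2] and index into unique blocks.
schedule = list(range(NUM_UNIQUE)) * NUM_REPEATS

def forward(x, unique_blocks):
    for idx in schedule:
        x = unique_blocks[idx](x)  # the same block object is reused each repeat
    return x
```

Because each entry in `schedule` points back at one of only `NUM_UNIQUE` blocks, the parameter count is fixed while effective depth is `NUM_UNIQUE * NUM_REPEATS`.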
Skip connections need adaptation since encoder/decoder halves now index into repeated blocks. The key detail is that **gradients accumulate across all repeats of a shared block**, so each unique block gets N times the gradient signal — this needs learning rate adjustment (divide Muon lr by num_repeats, or let the optimizer normalize naturally since Muon already does Newton-Schulz orthogonalization).
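To see the accumulation concretely, here is a toy scalar version (pure Python, nothing to do with the actual model): a tied weight applied twice receives the summed gradient from both applications, which is num_repeats times what each untied copy would get at the same weight values.

```python
# Toy scalar illustration of gradient accumulation under weight tying.
# Tied:   y = w * (w * x)   -> dy/dw  = 2*w*x (both applications contribute)
# Untied: y = w2 * (w1 * x) -> dy/dw1 = w2*x, dy/dw2 = w1*x

w, x = 0.5, 2.0
tied_grad = 2 * w * x       # 2.0: gradient contributions sum over both repeats
untied_grad_each = w * x    # 1.0 per layer when w1 == w2 == w

# The tied weight sees num_repeats (= 2 here) times the per-layer gradient,
# which motivates the lr / num_repeats adjustment mentioned above.
```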
## Training Time Tradeoff
More effective layers means more compute per step, so fewer steps in 10 minutes:
| Config | FLOPs ratio vs baseline | Est. steps in 10min | Tokens seen |
|--------|------------------------|--------------------:|------------:|
| Baseline (d=512, 9 layers) | 1.0x | ~20,000 | ~10.5B |
| 3x3 @ d=768 | 2.25x | ~8,900 | ~4.7B |
| 3x4 @ d=768 | 3.0x | ~6,700 | ~3.5B |
| 4x3 @ d=640 | 2.1x | ~9,600 | ~5.0B |
The question is whether improved per-step quality (from wider representations) outweighs fewer total tokens. Scaling-law literature (Kaplan et al.) suggests model shape matters far less than total parameter count, and that past a certain depth, extra width buys more loss reduction per parameter than extra depth.
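The training-time table follows from a simple proportionality: per-step compute scales roughly with effective_layers × d² (an approximation that ignores attention's sequence-length term), and steps and tokens shrink inversely.

```python
# Reproduce the training-time table from a layers*d^2 FLOPs proportionality.
# Baseline: d=512, 9 layers, ~20,000 steps and ~10.5B tokens in 10 minutes.

def flops_ratio(layers: int, dim: int, base_layers: int = 9, base_dim: int = 512) -> float:
    return (layers * dim**2) / (base_layers * base_dim**2)

r = flops_ratio(12, 768)     # 3-unique x4 @ d=768 -> 3.0x
steps = 20_000 / r           # ~6,700 steps
tokens_b = 10.5 / r          # ~3.5B tokens
```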
## Why This Might Work
1. **Width > depth for small models**: At 512 dim with 8 heads, each head only has 64-dim representations. Going to 768 (or 1024) gives each head much richer representations.
2. **Compression-friendly**: Only unique parameters hit the 16MB budget. Effective depth is free.
3. **Proven technique**: ALBERT showed weight tying works for BERT-scale models. Universal Transformer showed it enables adaptive computation.
4. **Combines with other tricks**: This is orthogonal to QAT, SwiGLU, etc. You could do layer tying + QAT + SwiGLU for compounding gains.
## Potential Pitfalls
- **Gradient accumulation across repeats** may cause instability without LR tuning
- **Skip connections** in the U-Net architecture need careful adaptation for repeated blocks
- **Layer-specific features**: each repeat sees the same weights, so the model may struggle to learn layer-specific behavior. Could cheaply add small per-repeat learned scalars as "adaptation parameters" (a few KB).
- Training sees fewer total tokens, which may hurt for this data regime
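The per-repeat scalar idea from the pitfalls list can be sketched cheaply. This is a hypothetical design, not the repo's code: tied blocks plus one learned scalar per repeat on the residual branch, adding only num_unique × num_repeats extra floats.

```python
# Hypothetical sketch: tied blocks plus one learned scalar per repeat,
# applied on the residual branch. With k unique blocks repeated r times,
# this adds only k*r floats of layer-specific capacity.

def tied_forward_with_scales(x, unique_blocks, num_repeats, scales):
    schedule = list(range(len(unique_blocks))) * num_repeats
    assert len(scales) == len(schedule)
    for step, idx in enumerate(schedule):
        # scales[step] lets each repeat modulate its block's contribution
        # even though the block weights themselves are shared.
        x = x + scales[step] * unique_blocks[idx](x)
    return x
```

Initializing the scalars near 1 would recover the plain tied forward pass, so they can only help if the optimizer finds a use for them.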
## Recommended Experiment Priority
1. **3 unique blocks x3 @ d=768**: Best balance of width gain and training cost
2. **4 unique blocks x3 @ d=640**: More conservative, might train more stably
3. **3 unique blocks x3 @ d=1024**: Maximum width push, test if the 4x compute cost is worth it
Would love to see someone with H100 access try this. The code change is ~20 lines.
Depth Recurrence (Layer Tying): trade unique params for width — fits 1024-dim model in 16MB