Attention Residuals: How Kimi Is Rethinking Transformer Depth

Every transformer you've ever used stacks layers with a dead-simple formula: take the input, add the layer's output, move on. x + layer(x). Fixed weight of 1. No questions asked. The Kimi team at Moonshot AI just published a paper that asks: what if that's been wrong the whole time?

The Problem Nobody Talks About

Standard residual connections accumulate layer outputs with equal weight. Layer 1 contributes the same as layer 47. The hidden state grows without bound as you stack more layers, and each individual layer's contribution gets diluted into the noise. This is called PreNorm dilution, and it gets worse the deeper your model goes.

At 100+ layers, the early layers are essentially screaming into a hurricane. Their signal is there, mathematically, but it's buried under the sum of everything that came after.

For most of transformer history, we've papered over this with normalization tricks: RMSNorm, LayerNorm, various pre-norm and post-norm arrangements. They help. They don't solve the underlying problem.
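To make the dilution concrete, here's a minimal sketch in plain PyTorch, written for this post rather than taken from the paper: the standard pre-norm update x + layer(x), plus a toy loop showing how each layer's share of the residual stream shrinks as depth grows. The PreNormBlock class, the stand-in Linear sublayer, and the toy dimensions are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Standard pre-norm residual block: x <- x + f(norm(x)), with a fixed weight of 1."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)   # RMSNorm is the more common choice in modern LLMs
        self.f = nn.Linear(dim, dim)    # stand-in for an attention or MLP sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(self.norm(x))  # every layer's output is added with weight 1


# Toy illustration of PreNorm dilution: the residual stream's norm keeps growing,
# so each individual layer's relative contribution shrinks with depth.
torch.manual_seed(0)
dim, depth = 64, 48
blocks = [PreNormBlock(dim) for _ in range(depth)]

x = torch.randn(1, dim)
for i, block in enumerate(blocks, start=1):
    delta = block.f(block.norm(x))            # this layer's contribution
    x = x + delta                             # fixed-weight residual update
    share = (delta.norm() / x.norm()).item()  # fraction of the stream this layer adds
    if i in (1, 8, 24, 48):
        print(f"layer {i:2d}: |x| = {x.norm().item():6.2f}, layer share = {share:.3f}")
```

Run it and you should see the stream's norm grow while the per-layer share falls off: with everything added at weight 1, later layers are contributing to an ever-larger sum, which is exactly the dilution the paper is reacting to.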