X @Avi Chawla
Avi Chawla · 2026-03-16 09:17
Big release from Kimi! They just released a new way to handle residual connections in Transformers.

In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection. If you consider this across 40+ layers, the hidden state at any layer is just the equal-weighted sum of all previous layer outputs. Every layer contributes with weight = 1, so every layer gets equal importance.

This creates a problem called PreNorm dilution, where as the hidden st ...
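The equal-weighting claim above can be verified with a tiny sketch (my own illustration, not Kimi's code): with the standard update `h = h + sublayer(h)`, the hidden state after L sub-layers is exactly the input plus the unweighted sum of every sub-layer output.

```python
# Minimal sketch of a pre-norm residual stream. `sublayer` is a
# hypothetical stand-in for attention/MLP; real models apply it to a
# normalized copy of h, but the residual arithmetic is the same.

def sublayer(i, h):
    # Toy layer-dependent update (placeholder for attn/MLP output).
    return 0.1 * (i + 1)

def forward(x0, num_layers):
    h = x0
    outputs = []
    for i in range(num_layers):
        out = sublayer(i, h)
        outputs.append(out)
        h = h + out  # residual connection: implicit weight = 1 per layer
    return h, outputs

h, outs = forward(1.0, 4)
# Final hidden state = input + equal-weighted sum of all sub-layer outputs.
assert abs(h - (1.0 + sum(outs))) < 1e-12
```

Because every `out` enters the sum with the same weight of 1, no layer's contribution can dominate or fade by design, which is the setup for the dilution problem the post describes.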