X @Avi Chawla
Avi Chawla·2026-03-16 20:41

RT Avi Chawla (@_avichawla)

Big release from Kimi! They just released a new way to handle residual connections in Transformers.

In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection. Across 40+ layers, the hidden state at any layer is just the equal-weighted sum of all previous layer outputs. Every layer contributes with weight = 1, so every layer gets equal importance.

This creates a problem called PreNorm dilution ...
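A minimal sketch of the point being made: in a standard residual stream, each sub-layer's output is added back with an implicit weight of 1, so the final hidden state is exactly the input embedding plus the unweighted sum of every sub-layer output. The code below illustrates this identity with a toy stand-in for attention/MLP (LayerNorm omitted for brevity; the function `sublayer` and all shapes are illustrative assumptions, not Kimi's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sublayer(h, W):
    # Toy stand-in for an attention or MLP sub-layer (assumption:
    # any function of the current hidden state works for this identity).
    return np.tanh(h @ W)

d, n_layers = 8, 4
x0 = rng.normal(size=d)  # input embedding
weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(2 * n_layers)]

# Standard residual stream: h <- h + F_i(h), i.e. every sub-layer
# output is added back with weight = 1.
h = x0.copy()
outputs = []
for W in weights:
    out = sublayer(h, W)
    outputs.append(out)
    h = h + out

# The final hidden state equals the input plus the equal-weighted
# sum of all sub-layer outputs.
reconstructed = x0 + np.sum(outputs, axis=0)
assert np.allclose(h, reconstructed)
```

Because every term enters this sum with the same weight, no single layer's contribution can dominate, which is the dilution issue the thread goes on to describe.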
