Hidden Dynamics of Massive Activations in Transformer Training
arXiv:2508.03616v2 Abstract: We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed, and we release our full dataset publicly to support further research. Through systematic analysis of multiple model sizes across many training checkpoints, we demonstrate that massive activation emergence follows highly predictable mathematical patterns that can be accurately modeled with an exponentially-modulated logarithmic function with five key parameters. Additionally, we develop a machine learning framework to predict these parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These results enable architects to anticipate, and potentially control, key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. In short, the emergence of massive activations is governed by model design and can be predicted, and potentially controlled, before training begins. Code is available at https://github.com/Aimpoint-Digital/massive-activations-fork
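
To make the modeling claim concrete, here is a minimal sketch of fitting a five-parameter exponentially-modulated logarithmic curve to activation magnitudes measured across training checkpoints. The abstract does not give the exact functional form, so the form `exp_mod_log` below, its parameter names (A, lam, gamma, delta, c), and the sample data are illustrative assumptions, not the paper's definition.

```python
# Sketch: fit a hypothetical five-parameter exponentially-modulated
# logarithmic curve to peak activation magnitudes over training steps.
import numpy as np
from scipy.optimize import curve_fit

def exp_mod_log(t, A, lam, gamma, delta, c):
    # Assumed form: exponential term modulating logarithmic growth,
    # plus an offset c capturing steady-state magnitude.
    return A * np.exp(-lam * t) * np.log(gamma * t + delta) + c

# Illustrative (made-up) checkpoint data: training step vs. peak activation.
steps = np.array([1e3, 5e3, 1e4, 5e4, 1e5, 1.4e5])
magnitudes = np.array([12.0, 480.0, 950.0, 1100.0, 1050.0, 1020.0])

params, _ = curve_fit(
    exp_mod_log, steps, magnitudes,
    p0=[1000.0, 1e-6, 1e-3, 1.0, 100.0],  # rough initial guesses
    maxfev=20000,
)
print(dict(zip(["A", "lam", "gamma", "delta", "c"], params)))
```

Given fitted parameters per model, the paper's second step (predicting them from architectural specifications such as depth and width) would reduce to a standard supervised regression problem over the five-dimensional parameter vectors.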