February 24 news: Moonshot AI (Kimi) yesterday released a new technical report, "Muon is Scalable for LLM Training," and announced the launch of "Moonlight": a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with Muon. Trained on 5.7 trillion tokens, it achieves better performance with fewer floating-point operations (FLOPs), thereby advancing the Pareto efficiency frontier.
Moonshot AI says the team found that the Muon optimizer can be scaled up by adding weight decay, carefully adjusting the update magnitude of each parameter, and other techniques (a minimal code sketch of the first two ideas follows the list below), with the following highlights:
- These techniques allow Muon to be used out of the box for large-scale training without hyperparameter tuning. Scaling-law experiments show that Muon achieves roughly 2x the computational efficiency of compute-optimal AdamW training.
- The model used in the report is Moonlight-16B-A3B, with 15.29B total parameters and 2.24B activated parameters; it was trained with the Muon optimizer on 5.7T tokens to obtain the results above.
- Our model not only pushes past the current Pareto frontier, but also achieves better performance than previous models while requiring significantly fewer training FLOPs.
- We open-source a distributed version of our Muon implementation that is optimized for both memory usage and communication efficiency. We have also released pre-trained models, instruction-tuned models, and intermediate training checkpoints designed to support future research.
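
To make the two headline changes concrete, here is a minimal single-device sketch of a Muon-style step that combines decoupled weight decay with RMS-matched update scaling. It is an illustration only: the Newton-Schulz coefficients and the 0.2 * sqrt(max(rows, cols)) scale factor follow the publicly described Muon recipe, while the function names, hyperparameter defaults, and optimizer loop are assumptions for this sketch rather than the distributed implementation Moonshot open-sourced.

```python
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix via Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315           # quintic iteration coefficients
    X = G.bfloat16()                             # low-precision iterate, as in the reference recipe
    transposed = X.size(0) > X.size(1)
    if transposed:                               # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)                    # keep the spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)


@torch.no_grad()
def muon_step(params, momentum_buffers, lr=2e-2, mu=0.95, weight_decay=0.1):
    """One Muon-style update with decoupled weight decay and RMS-matched scaling.

    `params` is an iterable of 2-D weight matrices with .grad populated, and
    `momentum_buffers` is a parallel list of zero-initialized tensors. The
    hyperparameter values are illustrative defaults, not the Moonlight settings.
    """
    for p, buf in zip(params, momentum_buffers):
        buf.mul_(mu).add_(p.grad)                # momentum accumulation
        update = newton_schulz_orthogonalize(buf)
        # Scale so the update RMS roughly matches AdamW's (~0.2) regardless of
        # matrix shape -- the "adjust per-parameter update magnitude" idea.
        update = update * (0.2 * max(p.size(0), p.size(1)) ** 0.5)
        p.mul_(1 - lr * weight_decay)            # decoupled (AdamW-style) weight decay
        p.add_(update, alpha=-lr)
```

In this sketch, the weight decay is applied directly to the weights rather than folded into the gradient, and the shape-dependent scale factor keeps update magnitudes consistent across matrices of different sizes, which is what lets Muon reuse AdamW-style learning rates without per-model retuning.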
The relevant links are attached below:
GitHub: Click here to go
Hugging Face: Click here to go
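
For readers who want to try the released checkpoints, the following sketch loads the model through Hugging Face transformers. The repository id and the need for trust_remote_code are assumptions based on the announcement, so verify them against the linked Hugging Face page.

```python
# Minimal sketch of loading the released checkpoint with Hugging Face transformers.
# The repo id "moonshotai/Moonlight-16B-A3B" is an assumption; check the link above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Moonlight-16B-A3B"      # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",                         # keep the checkpoint's native precision
    device_map="auto",                          # spread the MoE weights across available GPUs
    trust_remote_code=True,                     # assumed: custom MoE modeling code ships with the repo
)

prompt = "Explain the Muon optimizer in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```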