February 24 news: Moonshot AI (Kimi) yesterday released a new technical report, "Muon is Scalable for LLM Training," and announced the launch of "Moonlight": a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with Muon. Trained on 5.7 trillion tokens, it achieves better performance with fewer floating-point operations (FLOPs), thereby advancing the Pareto efficiency frontier.
Moonshot AI says the team found that the Muon optimizer can be scaled up by adding weight decay, carefully adjusting the update magnitude of each parameter, and other techniques (a minimal code sketch of the first two ideas follows the list below), with the following highlights:
- These techniques allow Muon to be used out of the box for large-scale training without hyperparameter tuning. Scaling-law experiments show that Muon achieves roughly 2x the computational efficiency of compute-optimal AdamW training.
- The model used in the report is Moonlight-16B-A3B, with 15.29B total parameters and 2.24B activated parameters; it was trained with the Muon optimizer on 5.7T tokens to obtain the results above.
- Our model not only pushes past the current Pareto frontier, but also achieves better performance than previous models while requiring significantly fewer training FLOPs.
- We open-source a distributed version of our Muon implementation that is optimized for both memory usage and communication efficiency. We have also released pre-trained models, instruction-tuned models, and intermediate training checkpoints designed to support future research.
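
To make the two headline changes concrete, here is a minimal single-device sketch of a Muon-style step that combines decoupled weight decay with RMS-matched update scaling. It is an illustration only: the Newton-Schulz coefficients and the 0.2 * sqrt(max(rows, cols)) scale factor follow the publicly described Muon recipe, while the function names, hyperparameter defaults, and optimizer loop are assumptions for this sketch rather than the distributed implementation Moonshot open-sourced.

```python
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix via Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315           # quintic iteration coefficients
    X = G.bfloat16()                             # low-precision iterate, as in the reference recipe
    transposed = X.size(0) > X.size(1)
    if transposed:                               # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)                    # keep the spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)


@torch.no_grad()
def muon_step(params, momentum_buffers, lr=2e-2, mu=0.95, weight_decay=0.1):
    """One Muon-style update with decoupled weight decay and RMS-matched scaling.

    `params` is an iterable of 2-D weight matrices with .grad populated, and
    `momentum_buffers` is a parallel list of zero-initialized tensors. The
    hyperparameter values are illustrative defaults, not the Moonlight settings.
    """
    for p, buf in zip(params, momentum_buffers):
        buf.mul_(mu).add_(p.grad)                # momentum accumulation
        update = newton_schulz_orthogonalize(buf)
        # Scale so the update RMS roughly matches AdamW's (~0.2) regardless of
        # matrix shape -- the "adjust per-parameter update magnitude" idea.
        update = update * (0.2 * max(p.size(0), p.size(1)) ** 0.5)
        p.mul_(1 - lr * weight_decay)            # decoupled (AdamW-style) weight decay
        p.add_(update, alpha=-lr)
```

In this sketch, the weight decay is applied directly to the weights rather than folded into the gradient, and the shape-dependent scale factor keeps update magnitudes consistent across matrices of different sizes, which is what lets Muon reuse AdamW-style learning rates without per-model retuning.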
The relevant links are attached below:
GitHub: Click here to go
Hugging Face: Click here to go
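
For readers who want to try the released checkpoints, the following sketch loads the model through Hugging Face transformers. The repository id and the need for trust_remote_code are assumptions based on the announcement, so verify them against the linked Hugging Face page.

```python
# Minimal sketch of loading the released checkpoint with Hugging Face transformers.
# The repo id "moonshotai/Moonlight-16B-A3B" is an assumption; check the link above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Moonlight-16B-A3B"      # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",                         # keep the checkpoint's native precision
    device_map="auto",                          # spread the MoE weights across available GPUs
    trust_remote_code=True,                     # assumed: custom MoE modeling code ships with the repo
)

prompt = "Explain the Muon optimizer in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```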