Against the backdrop of rapid development in large models, Kunlun Wanwei has open-sourced Skywork-MoE, a landmark sparse large language model. The model not only delivers strong performance but also significantly reduces inference cost, offering an effective answer to the challenges posed by large dense LLMs.
Skywork-MoE model features:
Open source and free for commercial use: Skywork-MoE's model weights and technical report are fully open source and free for commercial use, with no application required.
Reduced inference cost: This model significantly reduces the inference cost while maintaining strong performance.
Sparse model: Skywork-MoE is a mixture-of-experts (MoE) model that offers a more economical alternative by routing computation to specialized sub-models, or "experts".
Supports inference on a single 4090 server: It is the first open-source MoE large model that supports inference on a single server equipped with RTX 4090 GPUs.
Technical details:
Model weights and open source repository: Model weights can be downloaded from Hugging Face, and the open source repository is located on GitHub.
Inference code: Code is provided for loading the model with 8-bit quantization and running inference on an 8x RTX 4090 server (see the sketch after this list).
Performance: On an 8x RTX 4090 server, using the non-uniform Tensor Parallel inference method pioneered by the Kunlun Wanwei team, Skywork-MoE achieves a throughput of 2,200 tokens/s.
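To illustrate the 8-bit quantized loading path, here is a minimal sketch using the Hugging Face transformers and bitsandbytes libraries. The exact arguments, the multi-GPU device map, and the generation call are assumptions made for illustration; the inference code in the official repository should be treated as authoritative.

```python
# Minimal sketch of 8-bit quantized loading, assuming the standard
# transformers + bitsandbytes path; not the official Skywork-MoE script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Skywork/Skywork-MoE-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",        # spread layers across the available GPUs
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```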
Model performance and technological innovation:
Parameter size: Skywork-MoE has 146B total parameters and 22B activated parameters, with 16 experts of 13B each.
Performance comparison: At the same activated parameter count, Skywork-MoE sits at the forefront of the industry, approaching the capability of a 70B dense model while cutting inference cost by nearly 3x.
Training optimization algorithms: Skywork-MoE introduces two training optimization techniques, gating logits normalization and an adaptive auxiliary (aux) loss, to address the difficulty of training MoE models and their poor generalization (a sketch of both ideas follows below).
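The following is an illustrative sketch of the two ideas named above, not the official implementation: (1) standardizing the gating logits before the softmax, and (2) a standard MoE load-balancing auxiliary loss whose weight would be adapted during training. Tensor shapes, the scale factor `gamma`, and the adaptation rule are assumptions made for illustration.

```python
# Sketch under assumptions stated in the lead-in; see the technical report
# for the actual formulation used in Skywork-MoE.
import torch
import torch.nn.functional as F

def normalized_gate_probs(logits: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Standardize gate logits per token, rescale, then apply softmax."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True) + 1e-6
    return F.softmax(gamma * (logits - mean) / std, dim=-1)

def load_balance_aux_loss(probs: torch.Tensor, expert_index: torch.Tensor,
                          num_experts: int) -> torch.Tensor:
    """Classic MoE auxiliary loss: encourages uniform expert utilization."""
    routed = F.one_hot(expert_index, num_experts).float().mean(dim=0)  # hard routing fraction
    avg_prob = probs.mean(dim=0)                                       # soft routing fraction
    return num_experts * torch.sum(routed * avg_prob)

# Toy usage: 8 tokens routed over 16 experts with top-1 selection.
logits = torch.randn(8, 16)
probs = normalized_gate_probs(logits, gamma=1.0)
aux = load_balance_aux_loss(probs, probs.argmax(dim=-1), num_experts=16)
# In training, the aux-loss weight would be adjusted over time (e.g. raised
# when routing becomes unbalanced), which is the "adaptive" part described above.
```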
Large-scale distributed training:
Expert Data Parallel: A new parallel design scheme is proposed that partitions the model efficiently even when the number of experts is small.
Non-uniform splitting and pipeline parallelism: A non-uniform pipeline-parallel splitting and recomputation-layer allocation method is proposed to balance the compute and GPU-memory load across stages (see the sketch after this list).
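Below is an illustrative sketch of what a non-uniform pipeline split can look like, not Kunlun Wanwei's implementation. The idea shown is that the first and last stages also carry the embedding and LM head, so they are assigned fewer transformer layers than the middle stages; the function name and layer counts are hypothetical.

```python
# Hypothetical helper for illustration only; the real partition in
# Skywork-MoE's training stack is not described at this level in the article.
def non_uniform_pipeline_split(num_layers: int, num_stages: int,
                               light_first_last: int = 1) -> list[list[int]]:
    """Assign layer indices to pipeline stages, giving the first and last
    stages `light_first_last` fewer layers each to offset embedding/head cost."""
    base = num_layers // num_stages
    counts = [base] * num_stages
    counts[0] -= light_first_last
    counts[-1] -= light_first_last
    # Redistribute the removed (and any leftover) layers across middle stages.
    leftover = num_layers - sum(counts)
    for i in range(leftover):
        counts[1 + i % max(num_stages - 2, 1)] += 1
    stages, start = [], 0
    for c in counts:
        stages.append(list(range(start, start + c)))
        start += c
    return stages

# Example: 52 layers over 8 stages -> stages 0 and 7 hold fewer layers.
print(non_uniform_pipeline_split(52, 8))
```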
Experiments and Rules of Thumb:
Scaling Law experiments: Explore the factors that constrain the quality of MoE models trained via Upcycling versus From Scratch.
Training rule of thumb: If the FLOPs budget for training the MoE model is more than 2x that of training the Dense model, train the MoE From Scratch; otherwise, use Upcycling to reduce training cost (see the snippet below).
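The rule of thumb above can be encoded in a few lines. The 2x threshold is the one stated in the article; the function name and its inputs are a hypothetical helper for illustration.

```python
# Minimal encoding of the stated rule of thumb; not an official utility.
def choose_moe_training_strategy(moe_train_flops: float,
                                 dense_train_flops: float) -> str:
    """Return 'from_scratch' when the MoE training budget exceeds 2x the
    dense model's training FLOPs, otherwise 'upcycling'."""
    return "from_scratch" if moe_train_flops > 2 * dense_train_flops else "upcycling"

# Example: a budget of 3x the dense model's FLOPs -> train from scratch.
print(choose_moe_training_strategy(moe_train_flops=3.0, dense_train_flops=1.0))
```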
The open sourcing of Skywork-MoE brings a powerful new tool to the large model community, helping to advance the field of artificial intelligence, especially in scenarios that require processing large amounts of data and where computational resources are limited.
Project page: https://top.aibase.com/tool/skywork-moe
Model download address: https://huggingface.co/Skywork/Skywork-MoE-Base