Kunlun Wanwei announces the open-sourcing of Skywork-MoE, a 200-billion-scale sparse model with strong performance and lower cost

Against the backdrop of rapid development in large-model technology, Kunlun Wanwei has open-sourced Skywork-MoE, a landmark sparse large language model. The model not only excels in performance but also significantly reduces inference cost, offering an effective answer to the challenges posed by large-scale dense LLMs.


Skywork-MoE model features:

Open source and free for commercial use: Skywork-MoE's model weights and technical report are fully open source and free for commercial use, with no application required.

Reduced inference cost: This model significantly reduces the inference cost while maintaining strong performance.

Sparse model: Skywork-MoE is a Mixture-of-Experts (MoE) model that offers a more economical alternative to dense models by routing computation to specialized sub-models, or "experts" (see the sketch after this list).

Supports inference on a single RTX 4090 server: It is the first open-source MoE large model that can run inference on a single server equipped with RTX 4090 GPUs.
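
For readers unfamiliar with the sparse-expert idea, the following is a minimal PyTorch sketch of generic top-k expert routing. It illustrates the general MoE mechanism only; the class name, shapes, and routing loop are illustrative and are not taken from the Skywork-MoE codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k mixture-of-experts layer (illustrative, not Skywork-MoE's actual code)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One simple feed-forward "expert" per slot.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # Gating network produces one logit per expert for each token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.gate(x)                                  # (tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 8 tokens of width 1024 through 16 experts, activating 2 per token.
layer = TopKMoELayer(d_model=1024, d_ff=4096)
y = layer(torch.randn(8, 1024))
print(y.shape)  # torch.Size([8, 1024])
```

Because only a few experts run per token, the activated parameter count (and hence inference cost) stays far below the total parameter count, which is the economic advantage the article describes.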

Technical details:

Model weights and open source repository: Model weights can be downloaded from Hugging Face, and the open source repository is located on GitHub.

Inference code: Provides code supporting 8-bit quantized inference on a server with 8x RTX 4090 GPUs (a minimal loading sketch follows this list).

Performance: On an 8x RTX 4090 server, using the non-uniform Tensor Parallel inference method pioneered by the Kunlun Wanwei team, Skywork-MoE achieves a throughput of 2,200 tokens/s.
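
The official GitHub repository ships its own supported inference scripts. For orientation only, here is a minimal sketch of 8-bit quantized loading via the Hugging Face transformers and bitsandbytes stack, assuming the Skywork/Skywork-MoE-Base checkpoint can be loaded through AutoModelForCausalLM with trust_remote_code; treat it as an illustration, not the project's documented path.

```python
# Minimal sketch: 8-bit quantized loading with Hugging Face transformers + bitsandbytes.
# Illustrative only; the Skywork-MoE GitHub repo provides its own supported inference scripts.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Skywork/Skywork-MoE-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights to fit 4090-class GPUs
    device_map="auto",            # shard layers across the available GPUs (e.g. 8x RTX 4090)
    trust_remote_code=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```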

Model performance and technological innovation:

Parameter size: Skywork-MoE has 146B total parameters and 22B activated parameters, with 16 experts of 13B each.

Performance comparison: At the same activated-parameter count, Skywork-MoE is at the forefront of the industry, with capability close to a 70B Dense model and nearly a 3-fold reduction in inference cost.

Training optimization algorithms: Skywork-MoE introduces two training optimization techniques, Gating Logits normalization and an adaptive Aux Loss, to address the difficulty of training MoE models and their tendency toward poor generalization.
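
The exact formulation of these techniques is given in the technical report. As a rough, hedged sketch of what gating-logit normalization could look like, the function below standardizes each token's gating logits before the softmax; the scale factor and epsilon here are assumptions for illustration, not values from the report.

```python
import torch
import torch.nn.functional as F

def normalized_gating_probs(logits: torch.Tensor, scale: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Standardize per-token gating logits before the softmax.

    Illustrative sketch only: Skywork-MoE's exact formulation is in its technical report.
    Zero-centering and rescaling each token's logits keeps the routing distribution's
    scale stable during training and sharpens expert selection.
    """
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    normalized = scale * (logits - mean) / (std + eps)
    return F.softmax(normalized, dim=-1)

# Example: gating distribution over 16 experts for a batch of 4 tokens.
gate_logits = torch.randn(4, 16)
probs = normalized_gating_probs(gate_logits, scale=2.0)
print(probs.sum(dim=-1))  # each row sums to 1
```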

Large-scale distributed training:

Expert Data Parallel: A new parallelism design that partitions the model efficiently even when the number of experts is small.

Non-uniform pipeline parallelism: A non-uniform pipeline splitting and recomputation-layer allocation method that balances the compute and GPU-memory load across pipeline stages.
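
To make the "non-uniform splitting" idea concrete, here is a small greedy sketch that assigns consecutive layers to pipeline stages by cost rather than by count, so heavier stages (e.g. those carrying embeddings or the loss) get fewer layers. It is a hypothetical illustration under assumed per-layer costs, not Skywork-MoE's actual partitioning algorithm.

```python
def nonuniform_pipeline_split(layer_costs: list[float], num_stages: int) -> list[list[int]]:
    """Greedy illustration of non-uniform pipeline partitioning: close a stage once its
    accumulated cost reaches the per-stage average, instead of giving every stage the
    same number of layers. Skywork-MoE's real splitting and recomputation-layer
    placement are described in its technical report."""
    target = sum(layer_costs) / num_stages
    stages, current, current_cost = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        remaining_stages = num_stages - len(stages)
        remaining_layers = len(layer_costs) - i
        # Close the stage once it reaches the target, keeping enough layers for later stages.
        if current and current_cost >= target and remaining_layers >= remaining_stages:
            stages.append(current)
            current, current_cost = [], 0.0
        current.append(i)
        current_cost += cost
    stages.append(current)
    return stages

# Example: first/last layers carry extra cost (embeddings, loss head), so their stages get fewer layers.
costs = [3.0] + [1.0] * 14 + [3.0]
print(nonuniform_pipeline_split(costs, num_stages=4))
# [[0, 1, 2], [3, 4, 5, 6, 7], [8, 9, 10, 11, 12], [13, 14, 15]] -- each stage totals cost 5.0
```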

Experiments and Rules of Thumb:

Scaling Law experiments: Explore the constraints that determine the quality of MoE models trained via Upcycling versus From Scratch.

Training rule of thumb: If the FLOPs budget for training the MoE model is more than 2x that of training the corresponding Dense model, it is better to train the MoE From Scratch; otherwise, Upcycling the Dense model reduces training cost.
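
As a worked example of that rule of thumb, the helper below simply compares the two training budgets against the 2x threshold; the FLOPs numbers in the usage lines are placeholders, not figures from the report.

```python
def choose_moe_training_strategy(moe_train_flops: float, dense_train_flops: float) -> str:
    """Apply the rule of thumb from the Skywork-MoE report: if the MoE training budget
    exceeds roughly 2x the cost of training the corresponding Dense model, train the MoE
    From Scratch; otherwise Upcycle the Dense checkpoint to save training cost."""
    ratio = moe_train_flops / dense_train_flops
    return "from_scratch" if ratio > 2.0 else "upcycling"

# Placeholder budgets, just to show the comparison.
print(choose_moe_training_strategy(moe_train_flops=6.0e23, dense_train_flops=2.0e23))  # from_scratch (ratio 3.0)
print(choose_moe_training_strategy(moe_train_flops=3.0e23, dense_train_flops=2.0e23))  # upcycling   (ratio 1.5)
```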

The open sourcing of Skywork-MoE brings a powerful new tool to the large model community, helping to advance the field of artificial intelligence, especially in scenarios that require processing large amounts of data and where computational resources are limited.

Project page: https://top.aibase.com/tool/skywork-moe

Model download address: https://huggingface.co/Skywork/Skywork-MoE-Base
