AMD Integrates DeepSeek-V3 Models on Instinct MI300X GPUs, Revolutionizing AI Development with SGLang

AMD announced on January 25 that the DeepSeek-V3 model has been integrated on Instinct MI300X GPUs, with SGLang-based optimizations for AI inference.


A query by 1AI revealed that AMD had already announced support for the DeepSeek-V3 model in SGLang v0.4.1 on GitHub back on December 26 of last year.
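
For readers who want to try this, SGLang exposes an OpenAI-compatible endpoint once the server is running. The sketch below is a minimal illustration: the launch command follows SGLang's documented pattern, but the tensor-parallel degree (`--tp 8`) and port (30000, SGLang's default) are assumptions to adapt to the actual MI300X node, not AMD-specified values.

```python
# Minimal sketch: query DeepSeek-V3 served by SGLang via its
# OpenAI-compatible API. Assumes the server was started with, e.g.:
#   python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
#       --tp 8 --trust-remote-code --port 30000
# (tensor-parallel degree and port are illustrative, not prescriptive)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # SGLang's default local port
    api_key="EMPTY",                       # a local server needs no real key
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Briefly explain FP8 inference."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```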

AMD said that DeepSeek-V3 is the strongest open-source LLM available today, surpassing even GPT-4o. AMD also revealed that the SGLang and DeepSeek teams worked together to get DeepSeek-V3 FP8 running on both NVIDIA and AMD GPUs from the day of its debut, and thanked the Meituan Search and Recommendation Platform team and DataCrunch for providing GPU resources.

The DeepSeek-V3 model is described as a powerful Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token.

For efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures.
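
To make the 671B-total / 37B-active distinction concrete, the toy sketch below shows the basic mechanics of top-k expert routing in an MoE layer: every token scores all experts, but only the k highest-scoring experts actually run, so per-token compute is a small fraction of the total parameter count. The dimensions and expert counts here are illustrative, not DeepSeek-V3's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: many experts exist, only top-k run per token.
# Sizes are illustrative; DeepSeek-V3's real config is far larger.
d_model, n_experts, top_k = 64, 16, 2
tokens = rng.normal(size=(4, d_model))                     # 4 input tokens
router_w = rng.normal(size=(d_model, n_experts))           # router weights
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # toy expert FFNs

def moe_forward(x):
    scores = x @ router_w                              # (tokens, experts)
    top = np.argsort(scores, axis=-1)[:, -top_k:]      # top-k expert indices
    gates = np.take_along_axis(scores, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * (x[t] @ expert_w[e])  # only k experts run
    return out

print(moe_forward(tokens).shape)  # (4, 64): each token used 2 of 16 experts
```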

In addition, DeepSeek-V3 pioneers an auxiliary-loss-free load-balancing strategy and sets a multi-token prediction training objective for more robust performance.
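
As described in the DeepSeek-V3 technical report, the auxiliary-loss-free strategy keeps a per-expert bias that is added to routing scores only when selecting experts: overloaded experts have their bias nudged down and underloaded ones up, steering future tokens without adding a balancing term to the loss. The sketch below is a simplified illustration of that update rule; the step size and batch handling are assumptions, not the report's exact values.

```python
import numpy as np

rng = np.random.default_rng(1)

n_experts, top_k, gamma = 8, 2, 0.01   # gamma: bias step size (assumed value)
bias = np.zeros(n_experts)             # per-expert routing bias, starts neutral

def route(scores):
    """Pick top-k experts by score + bias (bias affects selection only)."""
    return np.argsort(scores + bias, axis=-1)[:, -top_k:]

for step in range(200):
    scores = rng.normal(size=(32, n_experts))  # toy affinity scores for a batch
    chosen = route(scores)
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    # Aux-loss-free rule: push bias down for overloaded experts and up for
    # underloaded ones, instead of adding a balancing loss term.
    bias -= gamma * np.sign(load - load.mean())

print("final per-expert load:",
      np.bincount(route(scores).ravel(), minlength=n_experts))
```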

DeepSeek-V3 gives developers broad access to advanced models capable of processing both textual and visual data, along with richer functionality.

AMD Instinct GPU Accelerator and DeepSeek-V3

AMD said that the extensive FP8 support in ROCm significantly improves the process of running AI models, especially for inference. It helps address key issues such as memory bottlenecks and the high latency associated with higher-precision read/write formats, enabling the platform to handle larger models or batches within the same hardware constraints and leading to more efficient training and inference.

In addition, FP8 reduced-precision computation cuts latency in both data transfer and compute. AMD ROCm extends its ecosystem with FP8 support, improving performance and efficiency across the stack, from frameworks to libraries.
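
The memory argument is easy to see in numbers: FP8 stores one byte per element versus two for FP16/BF16, halving weight and KV-cache traffic. The sketch below illustrates this with PyTorch's `torch.float8_e4m3fn` dtype (available in recent PyTorch builds); it shows storage size only, not ROCm's actual FP8 kernels.

```python
import torch

# Illustrative only: compare per-element storage for BF16 vs FP8.
# Requires a PyTorch build with float8 dtypes (2.1+); real FP8
# inference on MI300X goes through ROCm's FP8 kernels, not this cast.
w16 = torch.randn(4096, 4096, dtype=torch.bfloat16)
w8 = w16.to(torch.float8_e4m3fn)   # lossy cast to 8-bit floating point

bytes16 = w16.numel() * w16.element_size()
bytes8 = w8.numel() * w8.element_size()
print(f"BF16: {bytes16 / 2**20:.1f} MiB, FP8: {bytes8 / 2**20:.1f} MiB")
# Halving the bytes per weight means fewer reads and writes per token,
# which is why FP8 eases the memory bottlenecks mentioned above.
```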
