A research report released by Meta shows that its cluster of 16,384 Nvidia H100 GPUs experienced 419 unexpected failures over 54 days, an average of one every three hours, more than half of which were caused by the GPUs themselves or their high-bandwidth memory (HBM3).
Because of the system's enormous scale and the tight synchronization of training tasks, a single GPU failure can interrupt the entire training job and force a restart. Despite this, the Meta team maintained an effective training time of more than 90%.
During the 54-day pre-training period there were 466 job interruptions, of which 47 were planned and 419 were unplanned. Planned interruptions stemmed from automated maintenance, while unplanned interruptions were mainly caused by hardware problems. GPU issues were the leading cause, accounting for 58.7% of unexpected interruptions. Only three of these incidents required significant human intervention; the rest were handled by automation.
Of the 419 unexpected interruptions, 148 (30.1%) were attributed to various GPU failures (including NVLink failures), and 72 (17.2%) to GPU HBM3 memory failures. Interestingly, only two CPU failures occurred in the 54 days. The remaining 41.3% of unexpected interruptions were caused by a mix of factors, including software bugs, network cables, and network adapters.
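The headline figures above can be sanity-checked with simple arithmetic. The counts below are taken directly from the article; everything else is a back-of-the-envelope illustration, not part of Meta's report:

```python
# Back-of-the-envelope check of the failure statistics reported above.
# All counts come from the article; the variable names are illustrative.

unexpected_failures = 419
training_days = 54

hours_between_failures = training_days * 24 / unexpected_failures
print(f"one unexpected failure every {hours_between_failures:.1f} hours")
# → one unexpected failure every 3.1 hours

gpu_related = 148 + 72  # faulty GPUs (incl. NVLink) + HBM3 memory failures
share = gpu_related / unexpected_failures
print(f"GPU + HBM3 share of unexpected failures: {share:.1%}")
# → GPU + HBM3 share of unexpected failures: 52.5%
```

The 52.5% figure is what the article summarizes as "more than half" of failures being caused by the GPUs or their HBM3 memory.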
To improve efficiency, the Meta team developed a series of tools and optimization strategies, including shortening job startup and checkpointing time, using PyTorch's NCCL flight recorder to diagnose performance issues, and identifying straggling GPUs. Meta also paid attention to environmental factors, such as the slight impact of midday temperature fluctuations on GPU performance, and the enormous strain that tens of thousands of GPUs operating simultaneously place on the data center power grid.
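Meta's internal tooling is not public, but the straggler-detection idea can be sketched in a few lines: compare each rank's step time against the median and flag outliers. The function name, timings, and threshold below are invented for illustration:

```python
# Hedged sketch: one simple way to flag straggling GPUs from per-rank step
# timings, in the spirit of the tooling described above. The data and the
# 10% tolerance are made up; Meta's actual detection logic is not public.
from statistics import median

def find_stragglers(step_times_s, tolerance=1.10):
    """Return ranks whose step time exceeds the median by more than `tolerance`x."""
    m = median(step_times_s.values())
    return sorted(rank for rank, t in step_times_s.items() if t > m * tolerance)

# Hypothetical per-rank timings for one training step (seconds).
timings = {0: 1.02, 1: 0.99, 2: 1.01, 3: 1.41, 4: 1.00}
print(find_stragglers(timings))  # → [3]  (rank 3 is ~40% slower than the median)
```

In a synchronous data-parallel job, the whole step runs at the pace of the slowest rank, which is why spotting a single lagging GPU matters so much at this scale.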
However, as AI model parameter counts continue to grow, so do the computing resources required. Taking the 100,000-GPU H100 cluster planned by xAI as an example, the failure rate can be expected to rise many times over, posing even greater challenges for future AI training.
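To get a feel for the scale of that challenge, one can linearly extrapolate Meta's observed rate: if the per-GPU failure rate stays roughly constant, the expected interruption rate grows in proportion to cluster size. This is a rough assumption for illustration, not a prediction from either Meta's or xAI's data:

```python
# Rough linear extrapolation, assuming the per-GPU failure rate of Meta's
# 16,384-GPU cluster carries over unchanged to a 100,000-GPU cluster.

observed_failures = 419
observed_days = 54
observed_gpus = 16_384

per_gpu_daily_rate = observed_failures / observed_days / observed_gpus

planned_gpus = 100_000  # xAI's planned cluster size mentioned above
daily_failures = per_gpu_daily_rate * planned_gpus
print(f"~{daily_failures:.0f} expected failures per day "
      f"(one every ~{24 * 60 / daily_failures:.0f} minutes)")
# → ~47 expected failures per day (one every ~30 minutes)
```

Even under this simple linear model, a failure every half hour would leave little room for manual recovery, which is why the automation described above becomes essential at that scale.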