Meta's Llama 3 training ran into frequent failures: the 16,384-GPU H100 cluster "went on strike" every 3 hours

A research report released by Meta shows that its cluster of 16,384 Nvidia H100 GPUs suffered 419 unexpected failures over 54 days, an average of one every three hours. More than half of these failures were attributed to the GPUs or their high-bandwidth memory (HBM3).

Because of the system's enormous scale and the tight synchronization of training tasks, a single GPU failure can interrupt the entire training job and force a restart. Despite this, the Meta team maintained an effective training time of more than 90%.

During the 54-day pre-training period there were 466 job interruptions, of which 47 were planned and 419 were unplanned. Planned interruptions were caused by automated maintenance, while unplanned interruptions were mainly caused by hardware problems. GPU issues were the leading cause, accounting for 58.7% of unexpected interruptions. Only three incidents required significant human intervention; the rest were handled by automation.
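For context, the reported figures imply a mean time between unplanned interruptions of roughly three hours. A minimal back-of-the-envelope check in plain Python, using only the numbers quoted above:

```python
# Back-of-the-envelope check of the reported failure cadence.
TRAINING_DAYS = 54
UNPLANNED_INTERRUPTIONS = 419

total_hours = TRAINING_DAYS * 24                      # 1,296 hours of pre-training
mean_hours_between = total_hours / UNPLANNED_INTERRUPTIONS

print(f"Total wall-clock hours: {total_hours}")
print(f"Mean time between unplanned interruptions: {mean_hours_between:.2f} h")
# -> about 3.1 hours, matching the "one failure every three hours" figure
```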

Of the 419 unexpected interruptions, 148 (30.1%) were caused by various GPU failures (including NVLink failures), and 72 (17.2%) were caused by GPU HBM3 memory failures. Interestingly, only two CPU failures occurred in the 54 days. The remaining 41.3% of unexpected interruptions stemmed from a mix of factors, including software errors, network cables, and network adapters.

To improve efficiency, the Meta team developed a series of tools and optimization strategies, including reducing job startup and checkpointing time, using PyTorch's NCCL flight recorder to diagnose performance problems, and identifying lagging GPUs. Meta also paid attention to environmental factors, such as the slight effect of midday temperature swings on GPU performance and the enormous strain that tens of thousands of GPUs running simultaneously place on the data-center power grid.
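Meta's internal tooling is not public, so purely as an illustration of the straggler-detection idea, the sketch below gathers per-rank step times with `torch.distributed` and flags ranks that run noticeably slower than the median. The helper name `find_stragglers` and the 1.2x threshold are assumptions for this example, not details from Meta's report.

```python
import time
import statistics
import torch.distributed as dist

def find_stragglers(step_time_s: float, slack: float = 1.2):
    """Gather every rank's last step time and flag ranks slower than
    `slack` times the median. Illustrative only; the threshold is made up."""
    world_size = dist.get_world_size()
    times = [None] * world_size
    dist.all_gather_object(times, step_time_s)   # every rank sees all step times
    median = statistics.median(times)
    return [rank for rank, t in enumerate(times) if t > slack * median]

# Usage inside a training loop (assumes dist.init_process_group was called):
#   start = time.perf_counter()
#   ... forward / backward / optimizer step ...
#   lagging = find_stragglers(time.perf_counter() - start)
#   if dist.get_rank() == 0 and lagging:
#       print("Ranks lagging behind the median:", lagging)
```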

However, as AI models grow in parameter count, the computing resources they require expand as well. Taking the 100,000-GPU H100 cluster planned by xAI as an example, the failure rate can be expected to rise roughly in proportion to the number of components, posing an even greater challenge for future AI training.
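As a rough illustration only, and assuming failures scale linearly with GPU count (an assumption, not a figure from the report), extrapolating Meta's observed rate to a 100,000-GPU cluster looks like this:

```python
# Naive linear extrapolation of the observed failure rate to a larger cluster.
# Assumes failures scale with GPU count; real clusters will differ.
observed_gpus = 16_384
observed_failures = 419
observed_hours = 54 * 24

failures_per_gpu_hour = observed_failures / (observed_gpus * observed_hours)

target_gpus = 100_000
hours_between_failures = 1 / (failures_per_gpu_hour * target_gpus)
print(f"Expected time between failures: {hours_between_failures:.2f} h")
# -> roughly 0.5 hours, i.e. a failure about every 30 minutes
```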
