A research report released by Meta shows that its cluster of 16,384 Nvidia H100 GPUs experienced 419 unexpected failures over 54 days, an average of one every three hours, more than half of which were caused by the GPUs themselves or their high-bandwidth memory (HBM3).
Because of the system's enormous scale and the tight synchronization of training tasks, a single GPU failure can interrupt the entire training job and force a restart. Despite this, the Meta team maintained an effective training time of more than 90%.
During the 54-day pre-training period there were 466 job interruptions, of which 47 were planned and 419 were unplanned. Planned interruptions stemmed from automated maintenance, while unplanned interruptions were mainly caused by hardware problems. GPU issues were the leading cause, accounting for 58.7% of unexpected interruptions. Only three of these incidents required significant human intervention; the rest were handled by automation.
Of the 419 unexpected interruptions, 148 (30.1%) were attributed to various GPU failures (including NVLink failures), and 72 (17.2%) to GPU HBM3 memory failures. Interestingly, only two CPU failures occurred in the 54 days. The remaining 41.3% of unexpected interruptions were caused by a mix of factors, including software bugs, network cables, and network adapters.
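The headline figures above can be sanity-checked with simple arithmetic. The counts below are taken directly from the article; everything else is a back-of-the-envelope illustration, not part of Meta's report:

```python
# Back-of-the-envelope check of the failure statistics reported above.
# All counts come from the article; the variable names are illustrative.

unexpected_failures = 419
training_days = 54

hours_between_failures = training_days * 24 / unexpected_failures
print(f"one unexpected failure every {hours_between_failures:.1f} hours")
# → one unexpected failure every 3.1 hours

gpu_related = 148 + 72  # faulty GPUs (incl. NVLink) + HBM3 memory failures
share = gpu_related / unexpected_failures
print(f"GPU + HBM3 share of unexpected failures: {share:.1%}")
# → GPU + HBM3 share of unexpected failures: 52.5%
```

The 52.5% figure is what the article summarizes as "more than half" of failures being caused by the GPUs or their HBM3 memory.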
To improve efficiency, the Meta team developed a series of tools and optimization strategies, including shortening job startup and checkpointing time, using PyTorch's NCCL flight recorder to diagnose performance issues, and identifying straggling GPUs. Meta also paid attention to environmental factors, such as the slight impact of midday temperature fluctuations on GPU performance, and the enormous strain that tens of thousands of GPUs operating simultaneously place on the data center power grid.
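Meta's internal tooling is not public, but the straggler-detection idea can be sketched in a few lines: compare each rank's step time against the median and flag outliers. The function name, timings, and threshold below are invented for illustration:

```python
# Hedged sketch: one simple way to flag straggling GPUs from per-rank step
# timings, in the spirit of the tooling described above. The data and the
# 10% tolerance are made up; Meta's actual detection logic is not public.
from statistics import median

def find_stragglers(step_times_s, tolerance=1.10):
    """Return ranks whose step time exceeds the median by more than `tolerance`x."""
    m = median(step_times_s.values())
    return sorted(rank for rank, t in step_times_s.items() if t > m * tolerance)

# Hypothetical per-rank timings for one training step (seconds).
timings = {0: 1.02, 1: 0.99, 2: 1.01, 3: 1.41, 4: 1.00}
print(find_stragglers(timings))  # → [3]  (rank 3 is ~40% slower than the median)
```

In a synchronous data-parallel job, the whole step runs at the pace of the slowest rank, which is why spotting a single lagging GPU matters so much at this scale.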
However, as AI model parameter counts continue to grow, so do the computing resources required. Taking the 100,000-GPU H100 cluster planned by xAI as an example, the failure rate can be expected to rise many times over, posing even greater challenges for future AI training.
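To get a feel for the scale of that challenge, one can linearly extrapolate Meta's observed rate: if the per-GPU failure rate stays roughly constant, the expected interruption rate grows in proportion to cluster size. This is a rough assumption for illustration, not a prediction from either Meta's or xAI's data:

```python
# Rough linear extrapolation, assuming the per-GPU failure rate of Meta's
# 16,384-GPU cluster carries over unchanged to a 100,000-GPU cluster.

observed_failures = 419
observed_days = 54
observed_gpus = 16_384

per_gpu_daily_rate = observed_failures / observed_days / observed_gpus

planned_gpus = 100_000  # xAI's planned cluster size mentioned above
daily_failures = per_gpu_daily_rate * planned_gpus
print(f"~{daily_failures:.0f} expected failures per day "
      f"(one every ~{24 * 60 / daily_failures:.0f} minutes)")
# → ~47 expected failures per day (one every ~30 minutes)
```

Even under this simple linear model, a failure every half hour would leave little room for manual recovery, which is why the automation described above becomes essential at that scale.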