It’s time to upgrade storage, and the top supercomputers give the answer
Supercomputers are the crown jewel of the computing industry and a vessel for humankind's exploration of the unknown. Their development not only reflects the technological competitiveness of countries and regions, but also serves as a trend indicator that shapes the direction of the entire digital ecosystem.
At this stage, the convergence of supercomputing and AI computing is an inevitable trend. To bring AI models and AI workloads into supercomputing systems, a new round of supercomputing change is brewing. Alongside it, a key question has emerged: do we need to build a new, independent storage system to match the trend of large AI models?
Oak Ridge National Laboratory, which enjoys a formidable reputation in supercomputing, has given a clear answer to this question: yes.
In the plan Oak Ridge recently released for building its next-generation data center by 2027, it explicitly proposes that, to cope with the introduction of large models at the billion to tens-of-billions scale, a separate AOS (AI-optimized storage) system should be set up in addition to the PFS (parallel file system) used for traditional HPC scenarios, together with detailed category definitions and specification constraints.
Why is this information important, and how will it impact the continued development of the computing and storage industries?
Let us read this beacon of the intelligent era together.
A supercomputing answer from the pinnacle of science
Not long ago, Christopher Nolan's film "Oppenheimer" was a global hit, and its depiction of the Manhattan Project left a deep impression.
In fact, the Manhattan Project's impact reaches far beyond the scope of the film. The U.S. Department of Energy's Oak Ridge National Laboratory was originally part of the Manhattan Project. As one of the most representative national laboratories in the United States and the world, its purpose is to solve the most pressing scientific problems and develop technologies of epoch-making significance.
From developing nuclear reactors in the 1940s, to pioneering neutron scattering and materials research, to supplying the semiconductor industry with a stream of knowledge and related technologies, Oak Ridge National Laboratory has been deeply involved in major scientific advances at every stage of the information age and is regarded as standing at the pinnacle of human science.
Today, Oak Ridge National Laboratory's most famous capability is supercomputing. On the 2022 Top500 list of global supercomputers, its Frontier system took first place. With an HPL score of 1.102 exaflop/s, Frontier became the first exascale ("E-class") supercomputer in human history. In other words, Frontier achieved generation-defining computing power, with performance greater than that of the 468 systems ranked behind it combined. Frontier is also one of the world's most powerful systems for AI computing, and its AI capabilities have been put to work exploring fields such as smart transportation and smart healthcare.
It is clear that Oak Ridge National Laboratory sits at the leading edge of supercomputing and can be regarded as an authority in the field in the broadest sense. While building supercomputing systems such as Frontier, the laboratory has also been looking further ahead at AI computing and storage.
The answer it gives on AI storage can therefore serve as a reference for other supercomputing systems, and for digital infrastructure more broadly.
A clear definition of the AI storage base
For a long time we have understood the importance of AI-specific computing power. Is it also necessary to build AI-specific storage power? This has long been a hotly debated question in the industry, and the answer from Oak Ridge National Laboratory may settle it. Its plan for the next-generation data center in 2027 states clearly that, to accommodate the introduction of large models, a separate AOS (AI-optimized storage) category should be established alongside the storage system for traditional supercomputing scenarios. In other words, two I/O storage systems, PFS and AOS, should be built, one for traditional supercomputing services and one for AI workloads: dedicated storage that is better adapted and better matched to AI loads.
This is because Oak Ridge National Laboratory has recognized that future supercomputers will face more and more AI processing tasks, which requires upgrading not only the computing system but also the storage system. Customizing a new storage subsystem for AI workloads is therefore crucial.
Comparing the two I/O storage systems makes the differences easy to see.
Traditional PFS targets a single POSIX file namespace. Its workload I/O is dominated by large transfers and operations on large files; it emphasizes aggregate cluster bandwidth and places few demands on the creation or read performance of small files.
By contrast, the AI workloads placed on AOS involve more complex file mixes of widely varying sizes, with a larger share of data-intensive analysis, generating heavy random reads and writes of both data and metadata throughout the pipeline. This requires the storage system to deliver on the order of 10 million IOPS and metadata OPS, as well as ultra-high bandwidth of around 10 TB/s for high-speed sequential reads and writes.
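A rough back-of-the-envelope calculation shows how figures of that magnitude can arise. The numbers below (cluster size, sample rate, sample and checkpoint sizes) are hypothetical assumptions for illustration only, not values from the Oak Ridge plan:

```python
# Back-of-the-envelope arithmetic with hypothetical numbers, for illustration only.

accelerators = 10_000             # assumed size of the training cluster
samples_per_sec_per_acc = 1_000   # assumed samples consumed per accelerator per second
sample_size_bytes = 200 * 1024    # assumed average size of a small training sample

random_reads_per_sec = accelerators * samples_per_sec_per_acc
print(f"random small-file reads/s: {random_reads_per_sec:,}")   # 10,000,000 -> ~10M IOPS

read_bandwidth_tb = random_reads_per_sec * sample_size_bytes / 1e12
print(f"sample read bandwidth: {read_bandwidth_tb:.1f} TB/s")   # ~2 TB/s from samples alone

# Sequential-write side: flushing a multi-terabyte checkpoint quickly enough
# that the accelerators are not left idle for long.
checkpoint_size_tb = 10           # assumed checkpoint size for a very large model
target_stall_seconds = 1          # assumed acceptable write stall
print(f"checkpoint write bandwidth: {checkpoint_size_tb / target_stall_seconds:.0f} TB/s")
```

Even under these modest assumptions, random small-file reads alone approach the ten-million-IOPS mark, and flushing a large checkpoint without stalling the cluster pushes sequential bandwidth toward the terabytes-per-second range.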
In short, new AI workloads bring enormous storage performance requirements that traditional PFS systems cannot meet. Only by substantially upgrading storage performance can AI computing power be better utilized and the training efficiency of entire models improved.
Second, and critically, compute nodes fail far more often in AI scenarios, on average every day or even every hour, so training must frequently resume from breakpoints, and large amounts of intermediate model state and process data may need to be saved at regular intervals. Compared with traditional supercomputing tasks, AI tasks therefore demand storage with larger capacity and higher efficiency.
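Checkpointing of this kind is typically a simple save-and-restore loop. The sketch below is a minimal illustration assuming a PyTorch-style training job; the path and interval are hypothetical, and a production system would write to the shared AI storage tier described here:

```python
import torch

CKPT_PATH = "/aos/checkpoints/model_latest.pt"   # hypothetical path on shared AI storage
CKPT_INTERVAL = 500                              # hypothetical: save every 500 steps

def save_checkpoint(step, model, optimizer):
    # Persist everything needed to resume: weights, optimizer state, step counter.
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def maybe_resume(model, optimizer):
    # On restart after a node failure, reload the last checkpoint and continue.
    try:
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
    except FileNotFoundError:
        return 0
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

# Inside the training loop (sketch):
#   if step % CKPT_INTERVAL == 0:
#       save_checkpoint(step, model, optimizer)
```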
Next comes the necessity of shared storage. Oak Ridge National Laboratory requires that computing tasks be able to randomly access any file from any compute node, ensuring strongly consistent access for AI tasks no matter which node they run on.
In addition, AOS must be able to move data in parallel and efficiently between itself and the underlying file system, ensuring that files can be scheduled across storage tiers.
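As a rough illustration of such cross-tier staging, the sketch below copies a dataset from a parallel-file-system path to an AI-storage path with many concurrent workers. The /pfs and /aos paths and the thread count are hypothetical; a real system would rely on dedicated, far faster data movers:

```python
import shutil
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

PFS_DIR = Path("/pfs/project/dataset")   # hypothetical source tier (parallel file system)
AOS_DIR = Path("/aos/staging/dataset")   # hypothetical destination tier (AI-optimized storage)

def stage_one(src: Path) -> Path:
    # Copy a single file across tiers, preserving its relative path.
    dst = AOS_DIR / src.relative_to(PFS_DIR)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst

def stage_dataset(workers: int = 32) -> None:
    # Issue many copies concurrently so that aggregate bandwidth,
    # not a single stream, limits the staging time.
    files = [p for p in PFS_DIR.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(stage_one, files))
```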
To protect precious AI data assets, AOS also raises the requirements for storage reliability considerably. Because AI training is largely distributed, data must remain highly available and tasks uninterrupted after a single point of failure, which calls for cross-node EC (erasure coding). This differs from some traditional parallel file systems that implement EC only within a node, where a node going down can mean lost data and broken data integrity. The plan also sets time requirements on how quickly data must be reconstructed after a failure.
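Simple arithmetic shows why cross-node EC matters. The 8+2 layout below is a hypothetical example, not a figure from the plan: striping 8 data fragments and 2 parity fragments across 10 different nodes tolerates 2 simultaneous node failures at only 1.25x storage overhead, whereas node-local EC loses the whole stripe when its node goes down:

```python
# Hypothetical 8+2 erasure-coding layout, for illustration only.
k, m = 8, 2                      # data fragments, parity fragments per stripe

storage_overhead = (k + m) / k   # raw capacity used per unit of user data
print(f"EC {k}+{m}: tolerates {m} simultaneous node failures, "
      f"overhead {storage_overhead:.2f}x (vs 3.00x for 3-way replication)")

# If the 10 fragments live on 10 different nodes, losing one node removes at
# most one fragment per stripe, so every stripe is still recoverable.
# If all fragments of a stripe sat inside one node (node-local EC),
# that node's failure would destroy the whole stripe.
```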
Finally, AOS also needs the ability to clean and process data in place, including removing sensitive information, filtering out private information, and even transcoding and deduplication, which streamlines data preparation before training and improves the overall efficiency of AI tasks.
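A minimal sketch of two of these steps, assuming plain-text samples: exact deduplication by content hash and a crude pattern-based scrub of email-like strings. The regular expression and hashing choice are illustrative only, not the laboratory's specification:

```python
import hashlib
import re
from typing import Iterable, Iterator

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")   # crude, illustrative PII pattern

def clean_corpus(samples: Iterable[str]) -> Iterator[str]:
    seen: set[str] = set()
    for text in samples:
        scrubbed = EMAIL_RE.sub("[EMAIL]", text)       # redact email-like strings
        digest = hashlib.sha256(scrubbed.encode("utf-8")).hexdigest()
        if digest in seen:                             # drop exact duplicates
            continue
        seen.add(digest)
        yield scrubbed

# Example: three samples, two of which collapse into duplicates after scrubbing.
print(list(clean_corpus(["contact a@b.com", "hello world", "contact c@d.org"])))
```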
In summary, Oak Ridge National Laboratory has stated clearly that the wave of large AI models requires not only dedicated computing power but also dedicated storage power. Traditional parallel file systems can no longer meet the needs of AI tasks; the bar for AI storage is rising and its definition is becoming clearer.
Starting from Oak Ridge National Laboratory's supercomputing exploration, the concept of AI storage will ripple across the entire industry.
A beacon of storage development
Oak Ridge National Laboratory's conclusion can be called a beacon of the times: it will radiate outward and send a clear signal for the upgrading and development of the storage industry.
First, the industry can reach a consensus: AI requires dedicated computing power and dedicated storage. The concept of AI storage power will become a backbone of the storage industry in the era of large models.
Second, the supercomputing field will be the first to act on it. Across countries and regions, supercomputing is a strategic asset and a key node in scientific and technological competition. As supercomputing and AI blend ever more tightly, supercomputing scenarios must actively introduce AI storage upgrades, deploy dedicated external storage, and use storage to strengthen computing, improving AI compute utilization through storage upgrades. For example, before the intensive computation of a large AI model begins, part of the data preprocessing can be pushed down to the storage layer to reduce the ratio of compute and communication overhead and save AI computing power. Ultimately, storage can be used to improve the advancement and autonomy of the supercomputing system.
Next, this trend will spread beyond supercomputing. As large AI models enter thousands of industries, every field will need to consider whether its storage can keep pace with AI models and computing systems. Upgrading storage in time, so that storage, computing, and AI complement one another, is key to the development of intelligence.
These revelations are of vital importance to the development of China's storage industry.
Storage power on the rise: the choice of the times
In the development of large models, storage power is both a prerequisite and a pillar of the industry. This is especially true for China's drive toward technological self-reliance and the integration of the digital and real economies. The AI wave is precisely the opportunity to upgrade the storage industry comprehensively, at the lowest cost and with the highest value.
Judging from current global trends, the storage upgrades that support AI are multi-faceted and comprehensive: high-throughput, shareable, large-capacity, and highly reliable storage systems are key to the intelligent development of industry and the economy.
Under this trend, China needs to seize the following opportunities in building its storage power:
1. Expand storage capacity and increase the proportion of advanced storage.
With the rise of large AI models and the penetration of AI into supercomputing, large-scale government and enterprise digitalization, and other scenarios, more organizations will tend to run AI training locally and store the related data themselves. In this process, the overall scale of storage capacity must expand, and the proportion of advanced storage, represented by all-flash storage, must increase to meet the needs of intelligent development.
2. Improve storage technology innovation to cope with data complexity in the AI era.
AI brings a series of challenges such as data complexity and diverse application processes, so storage must become more advanced. For example, when building a data lake, collecting data from multiple data centers and multiple business systems is slow and complex, and moving data across business domains is inefficient and cumbersome, which challenges storage. Storage therefore needs stronger capabilities in protocol interoperability, cross-domain data scheduling, and cross-system visualized data management, innovating storage technology to meet the technical challenges of the AI era.
3. Improve storage security and operation and maintenance capabilities to ensure worry-free development of AI.
Large AI models bring not only data complexity but also a series of new security risks and increasingly heavy storage operations and maintenance burdens. Storage therefore needs to build in proactive security and automated operations and maintenance capabilities to safeguard the healthy development of AI systems.
With sustained effort, AI storage will develop rapidly. Just as AI computing power is productivity, AI storage will become the key to releasing that productivity and an engine of industrial intelligence in the future.
In summary, an industry upgrade or a technological advance first needs to find its beacons and understand the trends. If there was once controversy over the definition and development of AI-specific storage, Oak Ridge National Laboratory's definition of the future data center has put an end to that debate.
Given the laboratory's standing in supercomputing and in the global research community, its plan first establishes the necessity of AI storage itself, and then sets out detailed requirements for its definition, thresholds, and development specifications. With ever more evidence of this kind, the inevitability of storage upgrades in the era of large AI models becomes clear.
The value of AI storage can be proven in the demonstrations and explorations of top laboratories; it can be proven in the storage industry's years-long progress toward autonomy and advancement; and it can be proven each time model developers come away from an AI training run appreciating the value of storage.
Seizing the opportunities of AI and advancing the development of storage power is the choice, and the gift, of the times.