NVIDIA Releases Nemotron-CC, a 6.3 Trillion Token Large-Scale AI Training Database

According to NVIDIA's official blog on January 13, NVIDIA has announced Nemotron-CC, a large-scale English AI training database totaling 6.3 trillion tokens, of which 1.9 trillion are synthetic data. NVIDIA claims the database can help advance the training of large language models in both academia and industry.


At present, the performance of AI models largely depends on the data they are trained on. Existing public datasets, however, are often limited in both size and quality, and NVIDIA says Nemotron-CC is designed to address this bottleneck. The 6.3-trillion-token database contains a large amount of high-quality, verified data and is claimed to be "ideal material for training large-scale language models".

As for the data source, Nemotron-CC was built from Common Crawl data, with a high-quality subset, Nemotron-CC-HQ, extracted through a rigorous processing pipeline.

In terms of performance, NVIDIA says that compared with DCLM (DataComp for Language Models), currently the industry's leading publicly available English training dataset, models trained on Nemotron-CC-HQ scored 5.6 points higher on the MMLU (Massive Multitask Language Understanding) benchmark.

Further testing showed that an 8-billion-parameter model trained on Nemotron-CC outperformed the Llama 3.1 8B model, which was trained on the Llama 3 dataset: its score improved by 5 points on the MMLU benchmark, 3.1 points on the ARC-Challenge benchmark, and 0.5 points on average across 10 different tasks.

NVIDIA officials said that the development of Nemotron-CC used model classifiers, synthetic data rephrasing, and other techniques to maximize the quality and diversity of the data. At the same time, the team reduced the weight given to traditional heuristic filters for certain high-quality data, further increasing the number of high-quality tokens in the database without harming model accuracy.
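The general idea behind classifier-based quality filtering can be illustrated with a minimal sketch. The scoring function below is a toy heuristic stand-in invented for this example; NVIDIA's pipeline uses trained model classifiers, whose details are not given here.

```python
# Minimal sketch of classifier-based quality filtering for a text corpus.
# quality_score is a toy proxy (fraction of alphabetic characters, scaled by
# average word length), NOT NVIDIA's actual classifier.

def quality_score(doc: str) -> float:
    """Return a rough quality score in [0, 1] for a document."""
    if not doc:
        return 0.0
    words = doc.split()
    alpha_fraction = sum(c.isalpha() for c in doc) / len(doc)
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    # Cap the word-length factor at 1 so the score stays in [0, 1].
    return alpha_fraction * min(avg_word_len / 5.0, 1.0)

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep documents scoring at or above the threshold. In a real pipeline,
    lower-scoring documents might instead be routed to a rephrasing step."""
    return [d for d in docs if quality_score(d) >= threshold]
```

A real pipeline would replace `quality_score` with a learned classifier and add the synthetic rephrasing stage the article mentions; the filtering loop itself stays structurally similar.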

1AI notes that NVIDIA has made the Nemotron-CC training database publicly available via the Common Crawl website, and NVIDIA says accompanying documentation will be published on the company's GitHub page at a later date.
