Chinese Internet Corpus AI Resource Platform Released: 27 Datasets, Total 2.7T

January 11 news: On January 9, the China Cyberspace Security Association issued a notice announcing the public release of the Chinese Internet Corpus Resource Platform. The platform supports labeling and classification by categories such as industry sector, content modality, and data volume, making corpora easy for users to download and use.

The Association said that, under the guidance of the Central Internet Information Office and together with the National Internet Emergency Response Center, it built on the earlier release of the Chinese Internet Basic Corpus 1.0 and relied on the corpus construction and sharing mechanism established by its Special Committee to gather a new batch of high-quality, credible data. After rigorous processing, including source screening, content filtering, and data deduplication, it compiled and publicly released the Chinese Internet Basic Corpus 2.0, which is 120GB in size and contains 38 million data items.

The platform currently hosts 27 corpus datasets with a total data volume of about 2.7T, divided into three main categories:

  • First, the Chinese Internet Basic Corpus built by the China Cyberspace Security Association together with the National Internet Emergency Response Center and others;
  • Second, Internet corpora shared by People's Daily, Beijing Zhiyuan Research Institute, and Shanghai Artificial Intelligence Laboratory;
  • Third, high-quality Chinese basic corpus samples contributed by the China Institute of Cyberspace Research, the National Version Library of China, the Encyclopedia of China Publishing House, and the Library of the Chinese Academy of Social Sciences.

Users can visit the website of the China Cyberspace Security Association (https://www.cybersac.cn/newhome), click the "Chinese Internet Corpus Resource Platform" link, and complete registration and authentication to download the relevant corpora.

The head of the Association's Special Committee on Artificial Intelligence Security Governance said that data is a key resource for the development of artificial intelligence, and that the Chinese Internet Basic Corpus 2.0 is another important result of cross-sector collaboration to build a high-quality Chinese corpus. The Special Committee will continue to strengthen the construction of the Chinese Internet Basic Corpus to provide strong support for technological innovation and industrial development in artificial intelligence.
