Chinese Internet Corpus AI Resource Platform Released: 27 Datasets, Total 2.7T

January 11: On January 9, the China Cyberspace Security Association issued a public notice announcing the release of the Chinese Internet Corpus Resource Platform. The platform supports labeling by categories such as industry sector, content modality, and data volume, making it easy for users to download and use the corpora.


The Association said that, under the guidance of the Central Internet Information Office and together with the National Internet Emergency Response Center, it built on the earlier release of the Chinese Internet Basic Corpus 1.0 and relied on the corpus construction and sharing mechanism established by the Special Committee to gather a new batch of high-quality, credible data. After a series of rigorous processing steps, including source screening, content filtering, and data deduplication, it formed and publicly released the Chinese Internet Basic Corpus 2.0, which is 120GB in size and contains 38 million data items.
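The notice names the processing steps but does not describe how they were implemented. As a rough illustration only, the three steps could be sketched as follows; the allowlist `TRUSTED_SOURCES`, the threshold `MIN_CHARS`, the record layout, and the function names are all hypothetical, and hash-based exact matching stands in for whatever deduplication method was actually used.

```python
import hashlib
import re

# Hypothetical allowlist and threshold; the notice does not publish the real criteria.
TRUSTED_SOURCES = {"example-news.cn", "example-encyclopedia.cn"}
MIN_CHARS = 20


def normalize(text: str) -> str:
    """Collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(records):
    """Yield records that pass source screening, content filtering, and exact deduplication."""
    seen = set()
    for record in records:
        # Step 1: source screening (keep only allowlisted sources).
        if record["source"] not in TRUSTED_SOURCES:
            continue
        text = normalize(record["text"])
        # Step 2: content filtering (here, only a minimum-length check).
        if len(text) < MIN_CHARS:
            continue
        # Step 3: deduplication by SHA-256 of the normalized text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield {"source": record["source"], "text": text}
```

A production pipeline would likely use richer quality filters and near-duplicate detection rather than exact hashing; the sketch only shows the order of the steps.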

Reportedly, the platform currently hosts 27 corpus datasets with a total data volume of about 2.7T, divided into three main categories:

  • First, the Chinese Internet Basic Corpus built by the China Cyberspace Security Association together with the National Internet Emergency Response Center and others;
  • Second, Internet corpora shared by People's Daily, Beijing Zhiyuan Research Institute, and Shanghai Artificial Intelligence Laboratory;
  • Third, high-quality Chinese basic corpus samples contributed by the China Institute of Cyberspace Research, the National Version Library of China, the Encyclopedia of China Publishing House, and the Library of the Chinese Academy of Social Sciences.

Users can visit the website of the China Cyberspace Security Association (https://www.cybersac.cn/newhome), click the "Chinese Internet Corpus Resource Platform" link, and complete registration and authentication to download the relevant corpora.

The head of the Association's Special Committee on Artificial Intelligence Security Governance said that data is a key resource for the development of artificial intelligence, and that the Chinese Internet Basic Corpus 2.0 is another important result of cross-sector collaboration to build a high-quality Chinese corpus. The Special Committee will continue to strengthen the construction of the Chinese Internet Basic Corpus to provide strong support for technological innovation and industrial development in artificial intelligence.
