January 11 news: On January 9, the China Cyberspace Security Association issued a notice announcing the public release of the Chinese Internet Corpus Resource Platform. The platform labels its corpora by categories such as industry sector, content modality, and data volume, making them easy for users to find, download, and use.
The Association said that, under the guidance of the Cyberspace Administration of China and together with the National Internet Emergency Response Center, it built on the earlier release of the Chinese Internet Basic Corpus 1.0 and relied on the corpus construction and sharing mechanism established by its Special Committee to gather a new batch of high-quality, trustworthy data. After rigorous processing steps, including source screening, content filtering, and data deduplication, it compiled and publicly released the Chinese Internet Basic Corpus 2.0, which is 120 GB in size and contains 38 million data items.
Note: The platform currently hosts 27 corpus datasets totaling about 2.7 TB, which fall into three main categories:
- The Chinese Internet basic corpus built by the China Cyberspace Security Association together with the National Internet Emergency Response Center and other organizations;
- Internet corpora shared by People's Daily, the Beijing Academy of Artificial Intelligence (Zhiyuan), and the Shanghai Artificial Intelligence Laboratory;
- High-quality Chinese basic corpus samples contributed by the China Institute of Cyberspace Research, the National Version Library of China, the Encyclopedia of China Publishing House, and the Library of the Chinese Academy of Social Sciences.
Users can visit the website of the China Cyberspace Security Association (https://www.cybersac.cn/newhome), click the "Chinese Internet Corpus Resource Platform" link, and download the corpora after completing registration and authentication.
The head of the Association's Special Committee on Artificial Intelligence Security Governance said that data is a key resource for the development of artificial intelligence, and that the Chinese Internet Basic Corpus 2.0 is another important result of cross-sector collaboration to build high-quality Chinese corpora. The Special Committee will continue to strengthen the construction of the Chinese Internet Basic Corpus to support technological innovation and industrial development in artificial intelligence.