April 2: 1AI has learned from the China Patent Publication Announcement Network of the State Intellectual Property Office of China that a patent for "a method and system for broad data collection," filed by DeepSeek affiliate Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., was published on April 1.
The patent abstract shows:
- The beneficial effects of the invention are: it discovers as many web links as possible while reducing the traffic impact on the target website; it analyzes content that has already been downloaded to infer the quality of links that have not yet been downloaded, and allocates download quota on merit, which cuts low-quality and repeated page downloads, improves data quality and download efficiency, and reduces network resource consumption during data collection; and it uses a separate information backfill queue to ensure the atomicity and stability of modifications to the web page meta-information database.
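The abstract does not disclose how quality is inferred or how quota is assigned, so the following is only a minimal Python sketch of the general idea, not the patented method. It assumes, purely for illustration, that an undownloaded link inherits the average quality score of already-downloaded pages sharing its host and first path segment, and that each host gets a fixed download quota per round. The function names, the grouping heuristic, the `prior` default and the `example` URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_key(url: str) -> tuple[str, str]:
    """Group a URL by host and first path segment (an assumed heuristic)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return parts.netloc, segments[0] if segments else ""


def infer_quality(downloaded: dict[str, float], candidates: list[str],
                  prior: float = 0.5) -> dict[str, float]:
    """Estimate quality of undownloaded links from downloaded pages in the same group."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [score sum, page count]
    for url, score in downloaded.items():
        entry = totals[group_key(url)]
        entry[0] += score
        entry[1] += 1

    estimates = {}
    for url in candidates:
        if url in downloaded:
            continue  # already fetched: never schedule a repeat download
        key = group_key(url)
        if key in totals:
            score_sum, count = totals[key]
            estimates[url] = score_sum / count
        else:
            estimates[url] = prior  # no evidence yet: fall back to the assumed prior
    return estimates


def allocate_quota(estimates: dict[str, float], per_host_quota: int) -> list[str]:
    """Pick the best-scoring links, capped per host to limit traffic on any one site."""
    by_host = defaultdict(list)
    for url, score in estimates.items():
        by_host[urlsplit(url).netloc].append((score, url))

    selected = []
    for host in sorted(by_host):
        # Highest estimated quality first; ties broken by URL for determinism.
        ranked = sorted(by_host[host], key=lambda item: (-item[0], item[1]))
        selected.extend(url for _, url in ranked[:per_host_quota])
    return selected


downloaded = {
    "https://a.example/docs/1": 0.9,
    "https://a.example/docs/2": 0.7,
    "https://a.example/ads/1": 0.1,
}
candidates = [
    "https://a.example/docs/3",
    "https://a.example/ads/2",
    "https://a.example/docs/1",  # duplicate of a downloaded page
    "https://b.example/new",
]
estimates = infer_quality(downloaded, candidates)
plan = allocate_quota(estimates, per_host_quota=1)
```

With a quota of one page per host, the plan keeps the `docs` link on `a.example` and drops the `ads` link, while the unseen host `b.example` still gets one exploratory download. A real crawler would replace the averaging heuristic with whatever quality model it actually uses.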
Background technology: In recent years, advances in artificial intelligence have driven great progress in natural language processing (NLP), the field that studies theories and methods for effective communication between humans and computers in natural language. Many large language models (LLMs) have been trained and applied in this field.
Training a large language model requires constructing high-quality, diverse datasets. This in turn requires web page data to be crawled and processed to obtain a large amount of high-quality text as training input for the model.
However, existing data collection techniques have many problems: they cannot obtain the full set of links when crawling complex sites; they are prone to over-downloading, which can crash the target website; and they perform no content quality analysis or inference, which leads to duplicate or low-quality downloads and hurts the efficiency of data collection.
Therefore, in the process of acquiring data from a large number of web pages, it becomes crucial to collect Internet data quickly, accurately, safely and efficiently.