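Study: The web is flooded with low-quality machine-translated content, and large language model training must guard against data traps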

Researchers at Amazon Web Services (AWS) AI Labs have found that a large share of the content on the web is machine translated (MT), and that the quality of this content, which often appears in many languages at once, is generally low. The research team stressed that this underscores the importance of considering data quality and provenance when training large language models (LLMs).

Image source: Pexels

The study also found that machine-generated content dominates the translations in lower-resource languages and accounts for a large share of the web content in those languages.

IT Home noted that the research team built a very large resource called Multi-Way ccMatrix (MWccMatrix) to better understand the characteristics of machine-translated content. The resource contains 6.4 billion unique sentences in 90 languages and is organized into translation tuples, which are sets of sentences in different languages that are translations of one another.
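To make the idea of a translation tuple concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the input is a list of pairwise translations and that tuples are formed by merging pairs that share a sentence. The function name `build_translation_tuples` and the sample sentences are hypothetical, and this is not necessarily how the researchers built MWccMatrix.

```python
from collections import defaultdict


def build_translation_tuples(pairs):
    """Group sentences linked by pairwise translations into tuples.

    Each sentence is a (language, text) pair. Two sentences land in the
    same tuple if a chain of pairwise translations connects them
    (an assumption made for this sketch).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    # Collect every sentence under the root of its group.
    groups = defaultdict(set)
    for sentence in list(parent):
        groups[find(sentence)].add(sentence)
    return list(groups.values())


# Hypothetical sample data: three pairwise translations.
pairs = [
    (("en", "Hello"), ("fr", "Bonjour")),
    (("fr", "Bonjour"), ("de", "Hallo")),
    (("en", "Thanks"), ("es", "Gracias")),
]

tuples = build_translation_tuples(pairs)
for t in tuples:
    print(len(t), sorted(t))
```

In this toy input the English, French, and German greetings merge into one three-way tuple, while the English-Spanish pair stays a two-way tuple. With one sentence per language, the size of a tuple indicates how many languages a piece of content appears in.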

The study found that a large amount of web content is translated into many languages, primarily by machine translation. This content is not only prevalent among the translations in lower-resource languages, but also accounts for a large portion of all web content in those languages.

The researchers also observed a selection bias in the kind of content that gets translated into many languages, which appears to be published for purposes such as generating advertising revenue.

The paper concludes: "Machine translation technology has made significant progress in the past decade, but still falls short of human quality. Over the years, machine-translated content has been added to the web using the machine translation systems available at the time, so much of the machine-translated content on the web is likely of low quality by modern standards. This can cause LLMs to produce more 'hallucinations', while the selection bias suggests that data quality may be low even without accounting for machine translation errors. Data quality is critical for LLM training, where high-quality corpora, such as books and Wikipedia articles, are often upsampled multiple times."
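The upsampling mentioned in the conclusion can be illustrated with a small Python sketch. It assumes a simple scheme in which each corpus is sampled in proportion to its size multiplied by an upsampling factor. The corpus names, token counts, and factors are hypothetical values chosen only for illustration and are not taken from the paper.

```python
def mixture_weights(corpus_tokens, upsample):
    """Return the sampling probability of each corpus.

    corpus_tokens: corpus name -> size (for example, in millions of tokens)
    upsample: corpus name -> factor; corpora not listed default to 1.0
    """
    effective = {
        name: size * upsample.get(name, 1.0)
        for name, size in corpus_tokens.items()
    }
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


# Hypothetical sizes and factors, for illustration only.
corpus_tokens = {"web": 800, "books": 100, "wikipedia": 100}
upsample = {"books": 3.0, "wikipedia": 3.0}

weights = mixture_weights(corpus_tokens, upsample)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Under these assumed numbers, web text makes up 80% of the raw data but a smaller share of what the model actually sees during training, because the two smaller, higher-quality corpora are each repeated three times.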

