Research: The Internet is full of low-quality machine-translated content, and large language model training needs to be wary of data traps

Researchers at Amazon Cloud AI Labs found that a large share of the content on the Internet is produced by machine translation (MT), and that the quality of this translated content is generally low across many languages. The research team said this underscores the importance of considering data quality and provenance when training large language models (LLMs).

The study also found that machine-generated content is prevalent among translations in lower-resource languages and accounts for a large portion of the web content in those languages.

IT Home noted that, to better understand the characteristics of machine-translated content, the research team built a large resource called Multi-Way ccMatrix (MWccMatrix). The resource contains 6.4 billion unique sentences in 90 languages and includes translation tuples, which are sets of sentences in different languages that are translations of one another.
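
To make the idea of a translation tuple concrete, here is a minimal Python sketch that merges bilingual sentence pairs sharing a sentence into multi-way groups. It is an illustration only, not the researchers' actual pipeline: the function name `build_translation_tuples`, the input format, and the sample sentences are all assumptions made for this example.

```python
from collections import defaultdict


def build_translation_tuples(pairs):
    """Group bilingual sentence pairs into multi-way translation tuples.

    `pairs` is an iterable of ((lang_a, sent_a), (lang_b, sent_b)) items
    (a hypothetical input format). Returns a list of sets; each set holds
    the (lang, sentence) entries linked through a chain of pairs.
    """
    parent = {}

    def find(x):
        # Locate the representative of x's group (union-find).
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Made-up sample data: two pairs share the French sentence, so they merge.
example_pairs = [
    (("en", "Hello"), ("fr", "Bonjour")),
    (("fr", "Bonjour"), ("de", "Hallo")),
    (("en", "Thanks"), ("es", "Gracias")),
]
tuples = build_translation_tuples(example_pairs)
```

In this sketch, the size of each resulting set is the number of sentences linked together, which is the kind of property the study examines when it looks at content translated into many languages.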

The study found that a large amount of web content is translated into many languages, primarily through machine translation. Such content is not only prevalent among translations in lower-resource languages, but also accounts for a large portion of all web content in those languages.

The researchers also noted a selection bias in the kind of content that gets translated into many languages, much of which appears to be produced for purposes such as generating advertising revenue.

The paper concludes: "Machine translation technology has made significant progress in the past decade, but still falls short of human quality. Over the years, machine-translated content has been added to the web using the machine translation systems available at the time, so much of the machine-translated content on the web is likely of low quality by modern standards. This can cause LLM models to produce more 'hallucinations', while selection bias suggests that data quality may be low even without accounting for machine translation errors. Data quality is critical for LLM training, where high-quality corpora, such as books and Wikipedia articles, are often upsampled multiple times."
