Researchers at Amazon Web Services (AWS) AI Labs found that a large share of content on the Internet is machine translated (MT), and that the quality of this translated content is generally low across the many languages it appears in. The research team emphasized that this underscores the importance of data quality and provenance considerations when training large language models (LLMs).
IT Home noted that the research team built a large-scale resource called the Multi-Way ccMatrix (MWccMatrix) to better understand the characteristics of machine-translated content. The resource contains 6.4 billion unique sentences in 90 languages, organized into translation tuples: sets of sentences that are translations of one another.
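To make the idea of a translation tuple concrete, here is a minimal sketch of grouping parallel sentences across languages into such sets. All identifiers and the clustering key are illustrative assumptions, not the actual MWccMatrix implementation.

```python
from collections import defaultdict

def build_translation_tuples(records):
    """Group (cluster_id, language, sentence) records into translation tuples.

    A translation tuple is a set of sentences, in different languages,
    that are translations of one another. The cluster_id standing in for
    "same meaning" is a hypothetical stand-in for real alignment.
    """
    tuples = defaultdict(dict)
    for cluster_id, lang, sentence in records:
        tuples[cluster_id][lang] = sentence
    # Keep only clusters that actually span more than one language.
    return {cid: langs for cid, langs in tuples.items() if len(langs) > 1}

records = [
    ("c1", "en", "The weather is nice today."),
    ("c1", "fr", "Il fait beau aujourd'hui."),
    ("c1", "de", "Das Wetter ist heute schön."),
    ("c2", "en", "A lone sentence with no translation."),
]

tuples = build_translation_tuples(records)
print(len(tuples))           # → 1 (only c1 spans multiple languages)
print(sorted(tuples["c1"]))  # → ['de', 'en', 'fr']
```

In the study's framing, sentences appearing in many languages at once (highly multi-way parallel tuples) are a strong signal of machine translation.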
The study found that much web content is translated into many languages, primarily by machine translation. Such content is not only prevalent in translations into lower-resource languages; it also accounts for a large portion of all web content in those languages.
The researchers also noted a selection bias in which content gets translated into many languages, for example to maximize advertising revenue.
The paper concludes: "Machine translation technology has made significant progress over the past decade but still falls short of human quality. Machine-translated content has been added to the web over many years using whatever MT systems were available at the time, so much of the machine-translated content on the web is likely of low quality by modern standards. This can cause LLMs to produce more 'hallucinations', and the selection bias indicates the data quality may be low even before accounting for machine translation errors. Data quality is critical for LLM training, which is why high-quality corpora, such as books and Wikipedia articles, are often upsampled multiple times."
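The upsampling mentioned in the conclusion can be sketched as follows: trusted corpora are simply repeated more times than ordinary web text when the training mixture is assembled. The corpus names and repeat counts below are illustrative assumptions, not values from the paper.

```python
def upsample(corpora, epochs):
    """Assemble a training mixture, repeating each corpus `epochs[name]` times.

    corpora: {name: list of documents}
    epochs:  {name: integer repeat count} -- hypothetical weights;
             real training mixtures tune these empirically.
    """
    mixture = []
    for name, docs in corpora.items():
        mixture.extend(docs * epochs.get(name, 1))
    return mixture

corpora = {
    "wikipedia": ["wiki_doc"],   # high-quality corpus
    "web_crawl": ["web_doc"],    # noisier web text
}
epochs = {"wikipedia": 3, "web_crawl": 1}  # upsample Wikipedia 3x

mix = upsample(corpora, epochs)
print(mix.count("wiki_doc"))  # → 3
print(mix.count("web_doc"))   # → 1
```

The design choice is simple: rather than filtering web text away entirely, high-quality sources are given more weight, shifting the effective distribution the model learns from.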