A team of artificial intelligence researchers at Amazon has announced the development of what it says is the largest text-to-speech model to date, with the most parameters and the largest training dataset. The researchers have published a paper on the arXiv preprint server detailing the model's development and training process.
In recent years, "large language models" such as ChatGPT have attracted much attention for their ability to answer questions intelligently and generate sophisticated text. However, artificial intelligence is also gradually being integrated into other mainstream application areas. In this new project, the researchers tried to improve the capabilities of text-to-speech applications by increasing the number of parameters and expanding the training dataset.
The new model, called Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), has 980 million parameters and was trained on 100,000 hours of recordings (from public websites), most of which were in English. The researchers also provided the model with examples of words and phrases in other languages, enabling it to correctly pronounce some common expressions, such as "au contraire" and "adios, amigo."
The Amazon team also tested smaller models trained on smaller datasets, hoping to discover where what the AI field calls "emergent capabilities" appear. This is the phenomenon where AI applications, whether large language models or text-to-speech models, suddenly break through to a higher level of intelligence. They found that for text-to-speech applications, this leap occurred at medium scale, with a medium-sized dataset and a model of 150 million parameters.
The researchers also noted that this leap involves a range of language attributes, such as the ability to use compound nouns, express emotions, use foreign words, apply phonetics and punctuation, and correctly emphasize key words in sentences.
The research team said that, due to concerns about potential abuse, BASE TTS will not be released to the public. They plan to use it as a learning application and hope to apply what they learn to improve the overall sound quality of text-to-speech applications.