Alibaba Cloud Tongyi Qianwen open-sources two voice base models, with better recognition performance than OpenAI Whisper

Alibaba CloudThousand Questions on TongyiOpen SourceTwo modelsVoice Base Model SenseVoice (for speech recognition) and CosyVoice (for speech generation).

Alibaba Cloud Tongyi Qianwen open-sources two voice base models, with better recognition performance than OpenAI Whisper

SenseVoice focuses onHigh-precision multi-language speech recognition, emotion recognition, and audio event detection, has the following characteristics:

  • Multi-language recognition: Using more than 400,000 hours of data training, supporting more than 50 languages,The recognition effect is better than the Whisper model

  • Rich text recognition: It has excellent emotion recognition and can be used on test dataAchieve or exceed the performance of the best emotion recognition models;Supports sound event detection capabilities, including music, applause, laughter, crying, coughing, sneezing and other common human-computer interaction events for detection

  • Efficient reasoning: The SenseVoice-Small model uses a non-autoregressive end-to-end framework with extremely low inference latency. Inference of 10s audio takes only 70ms.15 times better than Whisper-Large

  • Fine-tuning customization: It has convenient fine-tuning scripts and strategies to help users fix long-tail sample problems according to business scenarios

  • Service deployment:It has a complete service deployment link, supports multiple concurrent requests, and supports client languages such as python, c++, html, java and c#

Compared with open source emotion recognition models, the SenseVoice-Large model canAchieved the best results on almost all data, and the SenseVoice-Small model can also outperform other open source models on most data sets.

Alibaba Cloud Tongyi Qianwen open-sources two voice base models, with better recognition performance than OpenAI Whisper

The CosyVoice model also supports multilingualism, timbre, and emotion control. The model performs well in multilingual speech, zero-sample speech generation, cross-lingual voice cloning, and command following.

Related Links:

SenseVoice:https://github.com/FunAudioLLM/SenseVoice

CosyVoice:https://github.com/FunAudioLLM/CosyVoice

statement:The content is collected from various media platforms such as public websites. If the included content infringes on your rights, please contact us by email and we will deal with it as soon as possible.
Information

Meta AI develops a compact language model MobileLLM for mobile devices with only 350 million parameters

2024-7-9 10:13:05

Information

Using magic to defeat magic, telecom companies have deployed AI to curb scam calls

2024-7-9 13:31:19

Search