Alibaba Tongyi's audio generation model FunAudioLLM is open source and supports scenarios such as emotional voice dialogue and audiobooks

Alibaba's Tongyi Lab recently open-sourced FunAudioLLM, a large-model framework for audio generation and understanding that aims to improve natural voice interaction between humans and large language models (LLMs). The project consists of two core models: SenseVoice and CosyVoice.

CosyVoice focuses on natural speech generation, with multi-language support and controllable timbre and emotion. It excels at multilingual speech generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following. Trained on 150,000 hours of data, it supports five languages (Chinese, English, Japanese, Cantonese, and Korean), can quickly clone a voice from a short sample, and offers fine-grained control over emotion and prosody.

SenseVoice is dedicated to high-precision multilingual speech recognition, emotion recognition, and audio event detection. Trained on 400,000 hours of data, it supports more than 50 languages and outperforms the Whisper model in recognition accuracy, with improvements of more than 50% on Chinese and Cantonese in particular. SenseVoice also offers emotion recognition, sound event detection, and fast inference.


FunAudioLLM supports a range of human-computer interaction scenarios, including multilingual speech translation, emotional voice conversations, interactive podcasts, and audiobooks. By chaining SenseVoice, LLMs, and CosyVoice together, it enables seamless speech-to-speech translation, emotional voice chat applications, and interactive podcast radio stations.
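The cascaded design described above can be sketched as a simple three-stage pipeline. Note that this is an illustrative sketch only: the function bodies below are stubs, and the names are placeholders, not the actual FunAudioLLM API.

```python
# Hypothetical sketch of the cascaded speech-to-speech translation pipeline:
# SenseVoice (ASR) -> LLM (translation) -> CosyVoice (TTS).
# All three stages are stubbed; in a real system each would call the model.

def sensevoice_transcribe(audio: bytes) -> str:
    """Placeholder for SenseVoice ASR: audio in, source-language text out."""
    return "你好，世界"  # stubbed recognition result

def llm_translate(text: str, target_lang: str) -> str:
    """Placeholder for an LLM translation step."""
    translations = {"你好，世界": "Hello, world"}  # stubbed lookup
    return translations.get(text, text)

def cosyvoice_synthesize(text: str) -> bytes:
    """Placeholder for CosyVoice TTS: text in, synthesized audio out."""
    return text.encode("utf-8")  # stand-in for a generated waveform

def speech_to_speech_translate(audio: bytes, target_lang: str = "en") -> bytes:
    """Chain the three stages: recognize, translate, then synthesize."""
    text = sensevoice_transcribe(audio)
    translated = llm_translate(text, target_lang)
    return cosyvoice_synthesize(translated)
```

The same chaining pattern underlies the emotional voice chat use case, with the translation step replaced by a conversational LLM.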

In terms of technical principles, CosyVoice is built on speech quantization coding to produce natural, fluent speech, while SenseVoice provides comprehensive speech understanding, including automatic speech recognition, language identification, emotion recognition, and audio event detection.
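The speech quantization coding mentioned above rests on a simple idea: continuous audio frames are mapped to discrete tokens by finding the nearest entry in a codebook, and a language model then generates those tokens. The toy example below illustrates only this nearest-neighbor quantization step; real codebooks are learned and far larger.

```python
# Toy illustration of speech token quantization: map each frame vector to the
# index of its nearest codebook entry (L2 distance). The codebook and frame
# values here are made up for illustration.

def quantize(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]
tokens = [quantize(f, codebook) for f in frames]
# tokens == [0, 1, 2]: a discrete token sequence a speech LM could model
```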

The open-source models and code have been released on ModelScope and Hugging Face, and the training, inference, and fine-tuning code is also available on GitHub. Both CosyVoice and SenseVoice offer online demos on ModelScope, allowing users to try these voice technologies directly.

Project address: https://github.com/FunAudioLLM
