recently,AlibabaBased on its Qwen-Audio, it launched a new open sourceVoice Model Qwen2-Audio. This model not only performs well in speech recognition, translation, and audio analysis, but also achieves significant improvements in functionality and performance. Qwen2-Audio provides a basic version and a command fine-tuning version. Users can ask questions to the audio model through voice, and recognize and analyze the content.
For example, users can ask a woman to speak a paragraph, and Qwen2-Audio can determine her age or analyze her emotions; if a noisy sound is input, the model can analyze the various sound components in it. Qwen2-Audio supports multiple languages including Chinese, Cantonese, French, English and Japanese, which greatly facilitates the development of sentiment analysis and translation applications.
Product entrance: https://top.aibase.com/tool/qwen2-audio
Compared with the first generation of Qwen-Audio, Qwen2-Audio has been fully optimized in terms of architecture and performance. In the pre-training stage, this new model uses more natural language prompts to replace the previous complex hierarchical labels. This improvement makes the model more handy in understanding and responding to various tasks, and its generalization ability has also been significantly improved.
Qwen2-Audio's command-following ability has also been greatly improved, and it can understand user commands more accurately. For example, when a user issues a command to "analyze the emotional tendency in this audio", Qwen2-Audio can accurately judge the emotions contained in the audio. In addition, the model introduces two modes: voice chat and audio analysis, making the user's voice interaction more natural. In audio analysis mode, Qwen2-Audio can deeply analyze various types of audio and provide detailed and accurate analysis results.
To ensure that the model's output meets human expectations, Qwen2-Audio also introduces advanced techniques such as supervised fine-tuning and direct preference optimization. When interacting with humans, the model appears more natural and accurate.
In terms of performance testing, Qwen2-Audio performed well in multiple mainstream benchmarks, especially in speech recognition and translation accuracy, surpassing OpenAI's Whisper-large-v3. The performance of this new model has not only attracted widespread attention in the industry, but also heralded a new future for voice technology.