NVIDIA NeMo, a leading open-source conversational AI toolkit, has announced the Parakeet ASR model family, a series of state-of-the-art automatic speech recognition (ASR) models capable of transcribing spoken English with outstanding accuracy. Developed in partnership with Suno.ai, the Parakeet models are a significant step forward in speech recognition, paving the way for more natural and efficient human-computer interaction.
According to the developers, the models are robust to non-speech audio such as music and silence, and outperform OpenAI’s Whisper v3. They also integrate easily into projects through pre-trained checkpoints.
NVIDIA announced four Parakeet models, built on RNN Transducer (RNNT) or Connectionist Temporal Classification (CTC) decoders, with 0.6 to 1.1 billion parameters. They cope with a variety of audio environments and, after training on 64,000 hours of audio, achieve excellent word error rate (WER) results on benchmark datasets, outperforming previous models. The four models are:
- Parakeet RNNT 1.1B: Best recognition accuracy, moderate inference speed. Best used when the most accurate transcription is needed.
- Parakeet CTC 1.1B: Fast inference speed and strong recognition accuracy. A good balance between accuracy and inference speed.
- Parakeet RNNT 0.6B: Strong recognition accuracy and fast inference speed. Suitable for large-scale inference with limited resources.
- Parakeet CTC 0.6B: Fastest inference with moderate recognition accuracy. Most useful where transcription speed matters most.
The Parakeet models are robust to non-speech segments, including music and silence, which effectively prevents them from generating hallucinated transcripts. Parakeet is built on the NVIDIA NeMo toolkit, with a focus on user-friendliness and flexibility. Pre-trained checkpoints are available for direct use, making it convenient to integrate the models into your project. Whether you need immediate inference or fine-tuning for a specific task, NeMo provides a powerful and intuitive framework for getting the most out of the models.
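As a rough illustration of using a pre-trained checkpoint directly, the sketch below loads a Parakeet model through NeMo and transcribes audio files. It assumes NeMo's ASR collection is installed (for example via `pip install nemo_toolkit['asr']`) and that the four checkpoints follow the `nvidia/parakeet-<decoder>-<size>` naming pattern of `parakeet-rnnt-1.1b`. The helper names `checkpoint_name` and `transcribe_files` are hypothetical, introduced only for this example.

```python
from typing import List


def checkpoint_name(decoder: str, size: str) -> str:
    """Build a checkpoint name such as 'nvidia/parakeet-rnnt-1.1b'.

    Assumption: all four models follow the same naming pattern.
    """
    decoder = decoder.lower()
    size = size.lower()
    if decoder not in {"rnnt", "ctc"}:
        raise ValueError(f"unknown decoder: {decoder!r}")
    if size not in {"0.6b", "1.1b"}:
        raise ValueError(f"unknown model size: {size!r}")
    return f"nvidia/parakeet-{decoder}-{size}"


def transcribe_files(audio_paths: List[str], decoder: str = "rnnt", size: str = "1.1b"):
    """Download (or load from cache) a checkpoint and transcribe audio files."""
    # Imported lazily so checkpoint_name() works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name=checkpoint_name(decoder, size)
    )
    # The shape of the return value differs between NeMo versions,
    # so inspect it before relying on a particular structure.
    return model.transcribe(audio_paths)


if __name__ == "__main__":
    # "sample.wav" is a placeholder path, not a file shipped with NeMo.
    print(transcribe_files(["sample.wav"]))
```

Passing `decoder="ctc"` and `size="0.6b"` would select the fastest variant described above, while the defaults select the most accurate one.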
The main advantages of the Parakeet model include:
- State-of-the-art accuracy: Excellent WER performance across a variety of audio sources and domains, and robust to non-speech segments.
- Different model sizes: Two sizes, with 0.6B and 1.1B parameters, offer a strong grasp of complex speech patterns.
- Open source and extensible: Built on NVIDIA NeMo, it can be seamlessly integrated and customized.
- Pretrained checkpoints: plug-and-play models that can be used for inference or fine-tuning.
- Permissive License: Released under the CC-BY-4.0 license, model checkpoints can be used in any commercial application.
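Word error rate (WER), the metric behind the accuracy claims above, is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation using a standard word-level edit distance. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions made for this example, not the evaluation pipeline used for the Parakeet benchmarks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of six reference words is dropped, so WER is 1/6.
    print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.3f}")  # 0.167
```

A lower WER is better. Because insertions count as errors, the value can exceed 1.0 when a model emits many spurious words, which is one way hallucinated text on music or silence shows up in this metric.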
Parakeet is a major advancement in the development of conversational AI. Its outstanding accuracy, combined with the flexibility and ease of use provided by NeMo, enables developers to create more natural, intuitive voice applications. From improving the accuracy of virtual assistants to enabling seamless real-time communication, the range of possible applications is broad. The Parakeet family of models has achieved state-of-the-art results. Users can try parakeet-rnnt-1.1b for themselves in the Gradio demo. To access the models locally and explore the toolkit, visit the NVIDIA NeMo GitHub page.
Official blog URL: https://nvidia.github.io/NeMo/blogs/2024/2024-01-parakeet/