March 21 news: in a blog post published yesterday (March 20), OpenAI announced the launch of new speech-to-text and text-to-speech models to strengthen its speech-processing capabilities. The models are intended to help developers build more accurate, customizable voice-interaction systems and to further the commercial adoption of AI voice technology.
For speech-to-text, OpenAI launched two models, gpt-4o-transcribe and gpt-4o-mini-transcribe, which the company says outperform the existing Whisper series in word error rate (WER), language recognition, and accuracy.
Both models support more than 100 languages and were trained primarily with reinforcement learning on diverse, high-quality audio datasets. They can capture subtle speech features and reduce misrecognition, delivering more stable performance in noisy environments and across different accents and speaking speeds.
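As a rough illustration of how such a model is consumed, here is a minimal sketch against the transcription endpoint of the OpenAI Audio API. The model name comes from the announcement; the SDK usage, file name, and language hint are assumptions, and the actual network call (commented out) requires an API key:

```python
# Request parameters for the transcription endpoint; "gpt-4o-transcribe"
# is the model named in the announcement.
params = {
    "model": "gpt-4o-transcribe",
    "language": "en",  # optional hint; the models cover 100+ languages
}

print(params)

# The real call (requires `pip install openai` and an OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# with open("meeting.wav", "rb") as f:  # hypothetical audio file
#     transcript = client.audio.transcriptions.create(file=f, **params)
#     print(transcript.text)
```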
For text-to-speech, OpenAI's newest model, gpt-4o-mini-tts, lets developers control voice style through instructions such as "simulate a patient customer-service agent" or "tell a vivid story." This can be applied to customer service (synthesizing more empathetic voices to improve the user experience) and creative content (designing personalized voices for audiobooks or game characters).
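A similar sketch for the new TTS model: the `instructions` field carrying the style prompt reflects the feature described above, while the voice name, sample text, output file, and SDK calls are assumptions:

```python
# Request parameters for speech synthesis; "gpt-4o-mini-tts" and the
# style instruction mirror the announcement; the voice is a placeholder.
request = {
    "model": "gpt-4o-mini-tts",
    "voice": "alloy",  # placeholder voice name
    "input": "Thanks for your patience. Your refund has been processed.",
    "instructions": "Speak like a patient, empathetic customer-service agent.",
}

print(request["instructions"])

# The real call (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# with client.audio.speech.with_streaming_response.create(**request) as resp:
#     resp.stream_to_file("reply.mp3")  # hypothetical output file
```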
Citing the blog post, 1AI lists the costs of the three models below:
- gpt-4o-transcribe: $6.00 per million audio input tokens, $2.50 per million text input tokens, and $10.00 per million text output tokens, or roughly 0.6 cents per minute of audio.
- gpt-4o-mini-transcribe: $3.00 per million audio input tokens, $1.25 per million text input tokens, and $5.00 per million text output tokens, or roughly 0.3 cents per minute of audio.
- gpt-4o-mini-tts: $0.60 per million text input tokens and $12.00 per million audio output tokens, or roughly 1.5 cents per minute of audio.
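Based on the per-minute figures above, a quick back-of-the-envelope cost estimate can be computed; the 100-hour workload here is a hypothetical example:

```python
# Approximate per-minute prices in USD, taken from the list above.
PRICE_PER_MIN = {
    "gpt-4o-transcribe": 0.006,       # 0.6 cents/min
    "gpt-4o-mini-transcribe": 0.003,  # 0.3 cents/min
    "gpt-4o-mini-tts": 0.015,         # 1.5 cents/min
}

def estimate_cost(model: str, minutes: float) -> float:
    """Approximate cost in USD for a given number of audio minutes."""
    return PRICE_PER_MIN[model] * minutes

hours = 100  # hypothetical monthly workload
for model in PRICE_PER_MIN:
    print(f"{model}: ${estimate_cost(model, hours * 60):.2f}")
```

At this scale, 100 hours of transcription with gpt-4o-transcribe comes to about $36, the mini variant to about $18, and 100 hours of synthesized speech to about $90.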