NVIDIA dropped an audio AI that can think–about music, conversation and more. Mistral again into picture for another efficient open-source audio models. Boston University trained AI model on medical podcasts.
Audio Flamingo 3: an open-source audio language model
Audio Flamingo 3, a fully open-source Large Audio Language Model (LALM) with state-of-the-art performance in audio understanding and reasoning across 20+ benchmarks.

Some of the novel capabilities are:
- Including multi-turn, multi-audio chat
- On-demand thinking
- Voice-to-Voice interaction
- Long-context audio reasoning (up to 10 minutes).
Audio | Instruction | Ground-truth | AF3 Think |
---|---|---|---|
What sport are the people in the audio doing? (A) Running (B) Playing badminton (C) Playing football (D) Playing tennis |
(D) Playing tennis | The audio contains sounds of a ball being hit and bounced, along with the sound of shoes squeaking on a hard surface. These sounds are characteristic of a sport played on a hard court with a racket or shuttlecock. Tennis and badminton are both such sports. The presence of a ball being hit and bounced suggests a sport where a racket is used to hit a ball, which is more indicative of tennis than badminton. Therefore, the most likely sport is tennis. |
Voxtral: an open-source text-to-speech model
Open, affordable, and production-ready speech understanding for everyone. Until recently, gaining truly usable speech intelligence in production meant choosing between two trade-offs:
- Open-source ASR systems with high word error rates and limited semantic understanding
- Closed, proprietary APIs that combine strong transcription with language understanding, but at significantly higher cost and with less control over deployment.

These capabilities make the Voxtral models ideal for real-world interactions and downstream actions, such as summaries, answers, analysis, and insights. For cost-sensitive use-cases, Voxtral Mini Transcribe outperforms OpenAI Whisper for less than half the price. For premium use cases, Voxtral Small matches the performance of ElevenLabs Scribe, also for less than half the price.
Audio | Transcribe |
---|---|
देखो यार, वो client meeting आज postpone करनी पड़ेगी, क्योंकि मेरा calendar एकदम full है, तुम please उन्हें mail करके reschedule कर लो, और वो updated agenda भी share कर देना, urgently, ठीक है? |
PodGPT: AI model learns from science podcasts
Boston University introduce a newly created computer program called PodGPT that learns from science and medicine podcasts to become smarter at understanding and answering scientific questions. The work is published in the journal npj Biomedical Innovations.
“We wanted to go beyond the written word,” said Kolachama. “By integrating spoken content, we aim to enhance our model’s understanding of conversational language and extend its application to more specialized contexts within STEMM [science, technology, engineering, mathematics and medicine] disciplines.”
More than 3700 hours of audio were transcribed into 42 million text tokens, allowing the model to absorb and learn from diverse expert discussions across multiple fields. By integrating spoken content, we aimed to enhance the model’s understanding of conversational language and extend its application to more specialized contexts within STEMM disciplines.
Wondering how the latest open-source audio models can supercharge your AI solutions? If your roadmap includes smart voice assistants, healthcare transcription, or dynamic multilingual chat services, our specialists can smoothly integrate and fine-tune these state-of-the-art models for your entire product architecture.
Reach out to us today and we’ll help you deliver the future of audio AI to your customers—while you maintain full control, scale effortlessly, and keep costs optimized.