State-of-the-Art Speech Recognition with MAI-Transcribe-1
Meet MAI-Transcribe-1, the most accurate transcription model in the world across 25 languages.
Speech is the most natural way humans communicate, often in noisy environments - conference rooms, phone lines, busy streets - across many languages. Today Microsoft is introducing MAI-Transcribe-1, a robust and efficient multilingual speech-to-text model that gives developers building global products a single model that scales well across languages, accents, and production environments. MAI-Transcribe-1 is now available on Microsoft Foundry.
Best-in-class accuracy on FLEURS
MAI-Transcribe-1 achieves the lowest Word Error Rate (WER) among competing speech-to-text models. On the FLEURS benchmark (25 languages), it outperforms Scribe v2, Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite.
Word Error Rate Comparison (Lower is Better)
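For readers less familiar with the metric, Word Error Rate is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The Python sketch below computes it with a word-level edit distance. It is an illustration only: the function name is hypothetical, and it skips the text normalization (casing, punctuation, number formatting) that benchmark evaluations usually apply. The exact scoring setup behind the FLEURS numbers above is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (S + D + I) / N using word-level Levenshtein distance.

    Assumption: both strings are already normalized and whitespace-tokenized.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.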
World-class quality across 25 languages
The model maintains competitively high accuracy across all 25 supported languages, making it adaptable for global products and resilient to a wide range of accents or speaking styles.
Incredible speed and efficiency
MAI-Transcribe-1 delivers batch transcription 2.5x faster than the Microsoft Azure Fast offering. Speed and efficiency are essential for production workloads, and Microsoft has worked to ensure lightning-fast performance while maintaining state-of-the-art accuracy across 25 languages.
Outstanding performance in noisy environments
Benchmarks are only part of the story. When it comes to production use cases such as voice agents, meeting transcription, and call center analytics, audio is rarely clean. MAI-Transcribe-1 was built with challenging recording conditions in mind, reliably handling background noise, low-quality audio recordings, and overlapping speech.
Conference Rooms
Handles background chatter, side conversations, and ambient noise from air conditioning and ventilation systems.
Phone Calls
Performs reliably on low-quality phone audio, mobile connections, and compressed formats.
Best price-to-performance of any large cloud provider
Microsoft is passing efficiency gains directly to customers: MAI-Transcribe-1 is priced at $0.36 per hour of audio, setting the standard for quality, speed, and price for production automatic speech recognition (ASR).
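To put the rate in concrete terms, the sketch below estimates spend from total audio duration. It assumes cost scales linearly with audio length at $0.36 per hour, with no minimum charge, rounding increment, or volume tier. Those billing details are not stated here, so treat the result as a rough estimate. The function and constant names are illustrative.

```python
PRICE_PER_AUDIO_HOUR_USD = 0.36  # rate quoted in the announcement


def estimate_cost_usd(audio_seconds: float) -> float:
    """Estimate transcription cost, assuming purely linear per-hour billing."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / 3600 * PRICE_PER_AUDIO_HOUR_USD


if __name__ == "__main__":
    # Example workload: 1,000 hours of recorded calls.
    hours = 1_000
    print(f"${estimate_cost_usd(hours * 3600):.2f}")
```

Under the same linear assumption, one minute of audio works out to $0.006.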
Powering Microsoft Products
MAI-Transcribe-1 is in phased rollouts with Copilot's Voice mode and Microsoft Teams to provide accurate conversation transcripts that can be used for various downstream tasks.
Applications
Offline Applications
MAI-Transcribe-1 supports a wide range of applications, from media tasks such as subtitle generation, podcast transcription, and video accessibility, to enterprise needs such as meeting archives, compliance recording, and legal discovery. It also powers analytics workflows including call center QA, customer insight extraction, and searchable audio libraries.
Online Applications
Low latency also makes MAI-Transcribe-1 a good choice for real-time tasks including meeting transcription, video closed captioning, and dictation.
Voice Agents: The Complete Stack
If you're building a voice agent, MAI-Transcribe-1 is the foundational layer. Accurate transcription is what allows underlying LLMs to interpret intent effectively. It directly shapes user satisfaction and task completion rates.
TakeNote is powered by MAI-Transcribe-1
We chose MAI-Transcribe-1 as the foundation for TakeNote because of its unmatched accuracy across UK English accents, exceptional performance in noisy meeting environments, and enterprise-grade reliability. When combined with our FCA-compliant UK data residency, it delivers the accuracy financial advisers need for regulatory compliance.
