
State-of-the-Art Speech Recognition with MAI-Transcribe-1

Meet MAI-Transcribe-1, the most accurate transcription model in the world across 25 languages.

Published: April 2, 2026

Speech is the most natural way humans communicate, often in noisy environments - conference rooms, phone lines, busy streets - across many languages. Today Microsoft is introducing MAI-Transcribe-1, a robust and efficient multilingual speech-to-text model that gives developers building global products a single model that scales well across languages, accents, and production environments. MAI-Transcribe-1 is now available on Microsoft Foundry.

Best-in-class accuracy on FLEURS

MAI-Transcribe-1 achieves the lowest word error rate (WER) of the speech-to-text models compared. On FLEURS (25 languages), it outperforms Scribe v2, Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite.

Word Error Rate Comparison (Lower is Better)

MAI-Transcribe-1: 4.2%
Scribe v2: 5.1%
Whisper-large-V3: 5.8%
GPT-Transcribe: 6.2%
Gemini 3.1 Flash-Lite: 7.1%
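For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows that standard definition using a word-level edit distance. It is illustrative only: the function name is hypothetical, and the lowercasing and whitespace tokenization are assumptions, since the article does not describe the text normalization used for the FLEURS evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: text is lowercased and split on whitespace. Real benchmark
    pipelines usually apply richer normalization (punctuation, numerals).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a WER of 4.2% corresponds to roughly 4 word errors per 100 reference words. Because insertions count as errors, WER can exceed 100% for very poor transcripts.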

World-class quality across 25 languages

The model maintains competitively high accuracy across all 25 supported languages, making it adaptable for global products and resilient to a wide range of accents or speaking styles.

Incredible speed and efficiency

MAI-Transcribe-1 delivers batch transcription 2.5x faster than Microsoft's Azure Fast Transcription offering. Speed and efficiency are essential for production workloads, and Microsoft has worked to ensure lightning-fast performance while maintaining state-of-the-art accuracy across 25 languages.

2.5x faster than Azure Fast Transcription

Outstanding performance in noisy environments

Benchmarks are only part of the story. When it comes to production use cases such as voice agents, meeting transcription, and call center analytics, audio is rarely clean. MAI-Transcribe-1 was built with challenging recording conditions in mind, reliably handling background noise, low-quality audio recordings, and overlapping speech.

Conference Rooms

Handles background chatter, side conversations, and ambient noise from air conditioning and ventilation systems.

Phone Calls

Performs reliably on low-quality phone audio, mobile connections, and compressed formats.

Best price-to-performance of any large cloud provider

Microsoft is passing efficiency gains directly to customers: MAI-Transcribe-1 is priced at $0.36 per hour of audio, setting the standard for quality, speed, and price in production automatic speech recognition (ASR).

$0.36/hour: industry-leading price-to-performance
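To put the per-hour price in concrete terms, the following sketch estimates transcription cost from audio duration. It is a back-of-the-envelope calculation, not a billing reference: the function and constant names are hypothetical, and it assumes cost scales linearly with duration with no minimum charge or per-request rounding, which the article does not specify.

```python
from decimal import Decimal

# Price stated in the article: $0.36 per hour of audio.
PRICE_PER_HOUR_USD = Decimal("0.36")


def estimate_cost_usd(audio_seconds: float) -> Decimal:
    """Estimate cost in USD, rounded to the nearest cent.

    Assumption: purely linear pricing with no minimums or billing
    increments. Actual billing rules may differ.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (hours * PRICE_PER_HOUR_USD).quantize(Decimal("0.01"))


# Example: a 90-minute meeting and a 1,000-hour call archive.
meeting_cost = estimate_cost_usd(90 * 60)
archive_cost = estimate_cost_usd(1000 * 3600)
```

Under these assumptions, a 90-minute meeting comes to $0.54 and a 1,000-hour archive to $360.00.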

Powering Microsoft Products

MAI-Transcribe-1 is in phased rollouts with Copilot's Voice mode and Microsoft Teams to provide accurate conversation transcripts that can be used for various downstream tasks.

Applications

Offline Applications

MAI-Transcribe-1 supports a wide range of applications, from media tasks such as subtitle generation, podcast transcription, and video accessibility, to enterprise needs such as meeting archives, compliance recording, and legal discovery. It also powers analytics workflows including call center QA, customer insight extraction, and searchable audio libraries.

Online Applications

Low latency also makes MAI-Transcribe-1 a good choice for real-time tasks including meeting transcription, video closed captioning, and dictation.

Voice Agents: The Complete Stack

If you're building a voice agent, MAI-Transcribe-1 is the foundational layer. Accurate transcription is what allows underlying LLMs to interpret intent effectively. It directly shapes user satisfaction and task completion rates.

TakeNote is powered by MAI-Transcribe-1

We chose MAI-Transcribe-1 as the foundation for TakeNote because of its unmatched accuracy across UK English accents, exceptional performance in noisy meeting environments, and enterprise-grade reliability. When combined with our FCA-compliant UK data residency, it delivers the accuracy financial advisers need for regulatory compliance.
