
Best Speech-to-Text APIs (2026)

APIScout Team

Voice Is Eating Software

Real-time transcription. Voice agents. Meeting intelligence. Podcast search. Call center analytics. Audio content accessibility. The list of applications requiring production-grade speech-to-text has expanded dramatically in 2025-2026, and the API market has responded with genuinely impressive advances in accuracy, latency, and specialized audio intelligence features.

Three platforms lead the market for developers: Deepgram (real-time speed leader), AssemblyAI (audio intelligence and LLM integration), and OpenAI Whisper (language breadth and accuracy at scale). Each has a distinct position — the right choice depends on your use case.

TL;DR

Deepgram Nova-3 at $0.0059/minute is the fastest and cheapest for real-time voice applications (200-400ms latency, 5.26% WER). AssemblyAI at $0.37/hour leads on audio intelligence — sentiment, topic detection, auto-highlights, and the LeMUR framework for LLM-over-audio. OpenAI's gpt-4o-transcribe handles the broadest language coverage (99 languages) with the best accuracy on multilingual content. For voice agents: Deepgram. For meeting intelligence: AssemblyAI. For multilingual applications: OpenAI/Whisper.

Key Takeaways

  • Deepgram Nova-3 achieves 5.26% Word Error Rate on benchmarks with real-time streaming in 200-400ms — the fastest production STT API available.
  • AssemblyAI reduced pricing 43% to $0.37/hour and released Slam-1 (October 2025) with multilingual streaming in six languages and LLM Gateway integration.
  • OpenAI released gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025, outperforming Whisper Large-v2 on accuracy across most languages.
  • AssemblyAI's LeMUR framework applies LLMs directly to transcribed audio — summarization, Q&A, and analysis of up to 10 hours of audio in a single API call.
  • Deepgram's Nova-3 Medical reaches 1-10% WER on healthcare vocabulary — the most specialized domain model in the market.
  • Real-world WER is 3-4x higher than benchmarks on challenging audio (noise, accents, jargon) — test on your actual production audio, not published benchmarks.
  • Deepgram offers a $200 free credit on signup, while OpenAI Whisper has no free tier — Deepgram wins for experimentation budget.

Pricing Comparison

| Provider | Model | Price | Billing | Free Credit |
|---|---|---|---|---|
| Deepgram | Nova-3 | $0.0059/min ($5.90/1K min) | Per minute | $200 |
| Deepgram | Nova-3 Batch | $0.0043/min ($4.30/1K min) | Per minute | $200 |
| AssemblyAI | Universal-2 | $0.37/hour ($6.17/1K min) | Per hour | Free testing credits |
| OpenAI | gpt-4o-transcribe | $0.006/min ($6.00/1K min) | Per minute | None |
| OpenAI | Whisper-1 | $0.006/min ($6.00/1K min) | Per minute | None |
| Google Cloud | Standard | $0.004/min | Per 15 sec | $300 trial |
| Amazon Transcribe | Standard | $0.0004/sec ($0.024/min) | Per second | AWS Free Tier |
| Azure Cognitive | Standard | $1.00/hour | Per second | Azure credits |

Cost for 1,000 hours of audio:

  • Deepgram Nova-3: ~$354
  • Deepgram Batch: ~$258
  • AssemblyAI: $370
  • OpenAI Whisper: $360
  • Amazon Transcribe: $1,440

Deepgram, OpenAI, and AssemblyAI land within about 5% of each other at production volume. Amazon Transcribe costs roughly 4x as much.
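The per-provider totals above follow directly from the table's rates; a small helper makes the arithmetic reproducible for your own volume (rates are hard-coded from the pricing table, so update them as providers change pricing):

```python
def cost_per_1k_hours(rate: float, unit: str = "min") -> float:
    """Cost in USD for 1,000 hours of audio at a per-minute or per-hour rate."""
    if unit == "min":
        return rate * 1_000 * 60  # 60,000 minutes in 1,000 hours
    if unit == "hour":
        return rate * 1_000
    raise ValueError(f"unknown billing unit: {unit}")

# Rates from the pricing table
print(f"Deepgram Nova-3:   ${cost_per_1k_hours(0.0059):,.0f}")
print(f"Deepgram Batch:    ${cost_per_1k_hours(0.0043):,.0f}")
print(f"AssemblyAI:        ${cost_per_1k_hours(0.37, 'hour'):,.0f}")
print(f"OpenAI Whisper:    ${cost_per_1k_hours(0.006):,.0f}")
print(f"Amazon Transcribe: ${cost_per_1k_hours(0.024):,.0f}")
```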

Deepgram

Best for: Real-time voice agents, low-latency transcription, high-volume batch processing

Deepgram is the speed and cost leader for production speech-to-text. Nova-3, their latest model, delivers 5.26% WER on benchmark audio with real-time streaming that produces words within 200-400ms of speech ending.

Models

| Model | WER | Use Case |
|---|---|---|
| Nova-3 | 5.26% | General purpose, best accuracy |
| Nova-3 Medical | 1-10% | Healthcare vocabulary |
| Nova-3 Finance | Low | Financial terminology |
| Whisper Cloud | Variable | Whisper compatibility layer |

Real-Time Streaming

import asyncio
import json
import os

import websockets

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def transcribe_realtime():
    url = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"

    async with websockets.connect(
        url,
        # Note: websockets >= 14 renamed this parameter to additional_headers
        extra_headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"}
    ) as ws:
        # Send audio chunks as they arrive
        async def send_audio():
            # audio_source: an async iterator of raw audio chunks
            # (your microphone capture or file reader, defined elsewhere)
            async for chunk in audio_source:
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    print(f"Final: {transcript}")
                else:
                    # Interim results for immediate display
                    interim = result["channel"]["alternatives"][0]["transcript"]
                    print(f"Interim: {interim}", end="\r")

        await asyncio.gather(send_audio(), receive_transcripts())

Batch Transcription

import httpx

# audio_bytes: the raw file contents, e.g. open("audio.wav", "rb").read()
response = httpx.post(
    "https://api.deepgram.com/v1/listen",
    headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    params={
        "model": "nova-3",
        "smart_format": "true",
        "diarize": "true",
        "punctuate": "true",
        "paragraphs": "true",
    },
    content=audio_bytes,
)

result = response.json()
transcript = result["results"]["channels"][0]["alternatives"][0]["transcript"]
words = result["results"]["channels"][0]["alternatives"][0]["words"]  # Word-level timestamps

Voice Agent Features

Deepgram's Aura TTS and Flux STT combination is specifically designed for voice agent pipelines:

  • Model-integrated end-of-turn detection (knows when user stops speaking)
  • Configurable turn-taking dynamics
  • Ultra-low latency optimized for conversation
  • Voice Activity Detection (VAD) built in
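As a sketch of how these features surface in the API, turn-taking behavior is configured through query parameters on the streaming endpoint. The parameter names below (`endpointing`, `utterance_end_ms`, `vad_events`) are Deepgram streaming options as we understand them; verify the exact values and semantics against the current docs before relying on them:

```python
from urllib.parse import urlencode

# Voice-agent-oriented streaming configuration (parameter semantics per
# Deepgram's docs at time of writing -- verify before depending on them)
params = {
    "model": "nova-3",
    "smart_format": "true",
    "endpointing": "300",        # ms of trailing silence before finalizing
    "utterance_end_ms": "1000",  # emit UtteranceEnd after 1s without speech
    "vad_events": "true",        # receive SpeechStarted events
}
url = "wss://api.deepgram.com/v1/listen?" + urlencode(params)
print(url)
```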

Strengths

  • Fastest real-time transcription (200-400ms latency)
  • Cheapest at scale ($0.0059/min vs $0.006 for Whisper)
  • Domain-specific models (Medical, Finance)
  • $200 free credit on signup
  • Voice agent pipeline features (Flux, Aura)
  • 36+ languages supported
  • Self-serve model customization

When to choose Deepgram

Voice agents requiring real-time transcription, high-volume batch transcription at lowest cost, healthcare/finance applications with domain-specific vocabulary, any application where latency is the primary constraint.

AssemblyAI

Best for: Audio intelligence, meeting analytics, LLM-over-audio applications

AssemblyAI's differentiation in 2026 isn't transcription accuracy — it's what you can do with transcribed audio. The LeMUR framework and their suite of audio intelligence features (sentiment analysis, topic detection, content safety, PII redaction) make AssemblyAI the choice for applications that need to understand audio, not just transcribe it.

Models

| Model | WER (benchmark) | Notes |
|---|---|---|
| Universal-2 | 8.4% | General purpose, best intelligence features |
| Slam-1 (Oct 2025) | TBD | New architecture, multilingual streaming |

Audio Intelligence Features

AssemblyAI includes these features in the base transcription API:

import assemblyai as aai

config = aai.TranscriptionConfig(
    sentiment_analysis=True,        # Positive/negative/neutral per utterance
    auto_highlights=True,           # Key points automatically extracted
    iab_categories=True,            # IAB topic classification
    entity_detection=True,          # Named entity recognition
    speaker_labels=True,            # Speaker diarization
    content_safety=True,            # Hate speech, profanity detection
    redact_pii=True,                # Remove PII from transcript
    summarization=True,             # Automatic summary
    auto_chapters=True,             # Chapter segmentation
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://your-audio-url.com/file.mp3", config)

# Sentiment per utterance
for result in transcript.sentiment_analysis:
    print(f"{result.speaker}: {result.text} [{result.sentiment}]")

# Auto-extracted highlights
for result in transcript.auto_highlights.results:
    print(f"Highlight: {result.text} (count: {result.count})")

LeMUR Framework

LeMUR (Leveraging Large Language Models to Understand Recognized Speech) is AssemblyAI's most distinctive feature:

# Apply LLM directly to transcribed audio
lemur_response = transcript.lemur.task(
    prompt="What were the main decisions made in this meeting? Format as a bulleted list.",
    final_model=aai.LemurModel.claude3_5_sonnet,
)

# Q&A over audio
qa_response = transcript.lemur.question_answer(
    questions=[
        aai.LemurQuestion(question="What was the total deal size discussed?"),
        aai.LemurQuestion(question="Who are the key stakeholders mentioned?"),
    ]
)

# Structured output
action_items = transcript.lemur.action_items()

Process up to 10 hours of audio through LeMUR in a single API call — summarizing hours of podcast content, extracting decisions from long recordings, or generating reports from call center sessions.

Real-Time Streaming (Slam-1)

AssemblyAI's October 2025 Slam-1 model introduced:

  • Real-time streaming transcription (latency comparable to Deepgram)
  • Six language support for streaming (English, Spanish, French, German, Portuguese, Dutch)
  • Safety guardrails during transcription
  • LLM Gateway integration for immediate LLM processing

Pricing

| Feature | Cost |
|---|---|
| Transcription | $0.37/hour |
| Real-time streaming | $0.37/hour |
| LeMUR (base) | Free with transcription |
| LeMUR (LLM costs) | Model-dependent |
| Audio Intelligence | Included |

Strengths

  • Best audio intelligence suite (sentiment, topics, entities, safety)
  • LeMUR framework for LLM-over-audio
  • Content safety and PII redaction built in
  • Auto-chapters, auto-highlights, auto-summarization
  • Straightforward hourly pricing (no per-feature add-ons)
  • Free testing credits

When to choose AssemblyAI

Meeting intelligence and analytics, call center analysis, podcast intelligence, any application that needs to understand audio beyond transcription, applications requiring content moderation on audio content.

OpenAI Whisper / gpt-4o-transcribe

Best for: Language breadth, highest accuracy on multilingual audio, research/academic use

OpenAI's transcription story evolved significantly in 2025. gpt-4o-transcribe, released in March 2025, outperforms the original Whisper Large-v2 on most benchmarks. Whisper remains available as whisper-1 for legacy integrations.

Models

| Model | Languages | WER | Latency | Price |
|---|---|---|---|---|
| gpt-4o-transcribe | 99+ | Low | 1-3s (batch) | $0.006/min |
| gpt-4o-mini-transcribe | 99+ | Good | Faster | Lower |
| whisper-1 (legacy) | 99 | ~5-7% | 1-3s | $0.006/min |

API Integration

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Batch transcription
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="json",
        language="es",  # Optional: specify language for better accuracy
        # Note: word-level timestamps (timestamp_granularities=["word"])
        # require whisper-1 with response_format="verbose_json";
        # gpt-4o-transcribe does not support them
    )

print(transcription.text)

Language Coverage

Whisper/gpt-4o-transcribe supports 99 languages — significantly more than Deepgram (36+) or AssemblyAI's streaming (6 languages for Slam-1). For applications handling multilingual audio from diverse user bases, OpenAI's language breadth is the decisive factor.

Limitations

  • No real-time streaming API (batch only) — gpt-4o-realtime handles real-time audio separately but at higher cost
  • No free tier — every minute costs $0.006
  • 1-3 second latency for batch — too slow for real-time voice agents
  • No audio intelligence features built in — transcription only

When to choose OpenAI Whisper/gpt-4o-transcribe

Applications requiring 99-language support, highest accuracy on challenging multilingual audio, research and academic transcription, applications already deeply in the OpenAI ecosystem, cases where batch processing (1-3s) is acceptable.

Feature Comparison

| Feature | Deepgram | AssemblyAI | OpenAI |
|---|---|---|---|
| Real-time streaming | Yes (200-400ms) | Yes (Slam-1) | No (batch only) |
| Word-level timestamps | Yes | Yes | Yes |
| Speaker diarization | Yes | Yes | Limited |
| Sentiment analysis | No | Yes | No |
| Topic detection | No | Yes | No |
| Entity extraction | No | Yes | No |
| Content safety | No | Yes | No |
| PII redaction | No | Yes | No |
| Auto-summary | No | Yes | No |
| LLM integration | No | Yes (LeMUR) | Basic |
| Language count | 36+ | 6 (streaming), more (batch) | 99+ |
| Domain models | Medical, Finance | None | None |
| Free credits | $200 | Yes (limited) | None |
| Pricing | $0.0059/min | $0.37/hour | $0.006/min |

Choosing the Right STT API

For real-time voice applications (< 500ms latency required)

Choose Deepgram Nova-3. Nothing else delivers 200-400ms end-to-end latency for production real-time transcription. Voice agents, live captions, and interactive audio applications need Deepgram.

For meeting intelligence and audio analysis

Choose AssemblyAI. The LeMUR framework, audio intelligence features, and auto-chapters/highlights/summaries make it purpose-built for meeting analytics, podcast intelligence, and call center analysis.

For multilingual applications (> 36 languages)

Choose OpenAI gpt-4o-transcribe. 99 languages with good accuracy across all of them. Deepgram's 36 and AssemblyAI's limited streaming language support don't compare for truly multilingual applications.

For healthcare/medical applications

Choose Deepgram Nova-3 Medical. Specialized training on medical vocabulary reduces WER to 1-10% on clinical audio — significantly better than general models.

For maximum cost efficiency at batch scale

Choose Deepgram Nova-3 Batch ($0.0043/min) or Rev AI Standard ($0.002/min) if accuracy requirements are modest.

For content moderation and safety on audio

Choose AssemblyAI. Built-in content safety, profanity detection, and PII redaction are unique in the market.
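The guidance above collapses into a simple lookup. This toy helper encodes it; the requirement labels are invented for illustration, so adjust them to your own priorities:

```python
def pick_stt(needs: set[str]) -> str:
    """Map requirement labels to the provider recommendations above."""
    if needs & {"realtime", "medical", "finance", "lowest_cost"}:
        return "Deepgram"
    if needs & {"audio_intelligence", "meeting_analytics", "moderation"}:
        return "AssemblyAI"
    if "multilingual" in needs:
        return "OpenAI"
    return "Deepgram"  # default starting point for most voice projects

print(pick_stt({"realtime"}))           # Deepgram
print(pick_stt({"meeting_analytics"}))  # AssemblyAI
print(pick_stt({"multilingual"}))       # OpenAI
```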

Testing Recommendation

Published benchmarks use clean studio audio. Your production audio will have:

  • Background noise
  • Multiple overlapping speakers
  • Accents and non-native speech
  • Domain-specific terminology
  • Variable recording quality

Before committing, test all three APIs on 30-60 minutes of your actual production audio. WER on your data is the only metric that matters. A few points of benchmark difference between providers can shrink to nothing or widen dramatically under real conditions; only your own audio will tell you which way it goes.
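Measuring WER on your own audio takes only a reference transcript and a word-level edit distance. A minimal sketch (production evaluations typically add text normalization first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quack brown fox"))  # 0.25
```

Run each provider's transcript against the same human-verified reference and compare the resulting rates directly.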

Verdict

Deepgram is the default choice for real-time voice applications and cost-sensitive batch processing. The combination of speed, price, and the $200 free credit makes it the best starting point for most voice projects.

AssemblyAI is the right choice when transcription is just the beginning — when you need to understand, summarize, analyze, and extract structured insights from audio content.

OpenAI is the choice for maximum language coverage and applications already in the OpenAI ecosystem. The accuracy improvements in gpt-4o-transcribe are real, but the lack of real-time streaming and no free tier limit its appeal outside its strengths.

Testing Accuracy for Your Use Case

Published word error rate (WER) benchmarks for speech-to-text APIs use standardized test sets — typically clean studio recordings with standard American English or British English accents. Production audio rarely matches those conditions. Meeting recordings have crosstalk, variable microphone quality, and background noise. Call center audio has codec compression artifacts and telephony frequency filtering. Medical dictation involves specialized terminology that general-vocabulary models mishandle.

Before committing to an STT API, test it on your actual audio samples — not vendor-provided demos. A 200-sentence test set drawn from real recordings surfaces accuracy gaps that WER benchmarks mask. Key areas to test: names and proper nouns (brands, product names, people's names), domain-specific vocabulary (medical terms, legal language, technical jargon), accented speech representative of your user base, and audio with background noise at typical levels for your use case.

Deepgram and AssemblyAI both support custom vocabulary through keyword boosting and custom language models — if you're operating in a specialized domain, factor this training capability into your evaluation. OpenAI Whisper performs well on a broad vocabulary out of the box but doesn't currently support custom keyword boosting. Rev AI is strong on human-reviewed accuracy, but its workflows are async-only, with latency too high for real-time use.

One practical metric that benchmark comparisons often omit: disfluency handling. How the API handles 'um,' 'uh,' and false starts affects transcript readability significantly. Some APIs filter filler words by default; others include them verbatim. For meeting transcription use cases where readability matters to end users, test disfluency handling explicitly alongside WER — it affects perceived quality more than marginal WER differences in the 90-95% accuracy range.
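When providers differ in filler-word defaults, normalize both transcripts the same way before scoring, or the WER comparison measures formatting policy rather than recognition quality. A minimal sketch (the filler list here is illustrative, not exhaustive):

```python
import re

FILLERS = {"um", "uh", "erm", "hmm", "mhm"}  # extend for your domain

def normalize(transcript: str) -> str:
    """Lowercase, strip punctuation, and drop filler words before WER scoring."""
    words = re.sub(r"[^\w\s']", "", transcript.lower()).split()
    return " ".join(w for w in words if w not in FILLERS)

print(normalize("Um, so we... uh, shipped it."))  # so we shipped it
```

Apply the same normalization to the reference transcript and to every provider's output, then compute WER on the normalized strings.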


Compare speech-to-text API pricing, features, and documentation at APIScout — find the right transcription API for your application.

Evaluate Deepgram and compare alternatives on APIScout.

Related: Speech-to-Text APIs (2026), How AI Is Transforming API Design and Documentation, API Breaking Changes Without Breaking Clients
