
Best Speech-to-Text APIs (2026)

APIScout Team

Voice Is Eating Software

Real-time transcription. Voice agents. Meeting intelligence. Podcast search. Call center analytics. Audio content accessibility. The list of applications requiring production-grade speech-to-text has expanded dramatically in 2025-2026, and the API market has responded with genuinely impressive advances in accuracy, latency, and specialized audio intelligence features.

Three platforms lead the market for developers: Deepgram (real-time speed leader), AssemblyAI (audio intelligence and LLM integration), and OpenAI Whisper (language breadth and accuracy at scale). Each has a distinct position — the right choice depends on your use case.

TL;DR

Deepgram Nova-3 at $0.0059/minute is the fastest and cheapest for real-time voice applications (200-400ms latency, 5.26% WER). AssemblyAI at $0.37/hour leads on audio intelligence — sentiment, topic detection, auto-highlights, and the LeMUR framework for LLM-over-audio. OpenAI's gpt-4o-transcribe handles the broadest language coverage (99 languages) with the best accuracy on multilingual content. For voice agents: Deepgram. For meeting intelligence: AssemblyAI. For multilingual applications: OpenAI/Whisper.

Key Takeaways

  • Deepgram Nova-3 achieves 5.26% Word Error Rate on benchmarks with real-time streaming in 200-400ms — the fastest production STT API available.
  • AssemblyAI reduced pricing 43% to $0.37/hour and released Slam-1 (October 2025) with multilingual streaming in six languages and LLM Gateway integration.
  • OpenAI released gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025, outperforming Whisper Large-v2 on accuracy across most languages.
  • AssemblyAI's LeMUR framework applies LLMs directly to transcribed audio — summarization, Q&A, and analysis of up to 10 hours of audio in a single API call.
  • Deepgram's Nova-3 Medical reaches 1-10% WER on healthcare vocabulary — the most specialized domain model in the market.
  • Real-world WER is 3-4x higher than benchmarks on challenging audio (noise, accents, jargon) — test on your actual production audio, not published benchmarks.
  • Deepgram offers a $200 free credit on signup, while OpenAI Whisper has no free tier — Deepgram wins for experimentation budget.

Pricing Comparison

| Provider | Model | Price | Billing | Free Credit |
|---|---|---|---|---|
| Deepgram | Nova-3 | $0.0059/min ($5.90/1K min) | Per minute | $200 |
| Deepgram | Nova-3 Batch | $0.0043/min ($4.30/1K min) | Per minute | $200 |
| AssemblyAI | Universal-2 | $0.37/hour ($6.17/1K min) | Per hour | Free testing credits |
| OpenAI | gpt-4o-transcribe | $0.006/min ($6.00/1K min) | Per minute | None |
| OpenAI | Whisper-1 | $0.006/min ($6.00/1K min) | Per minute | None |
| Google Cloud | Standard | $0.004/min | Per 15 sec | $300 trial |
| Amazon Transcribe | Standard | $0.0004/sec ($0.024/min) | Per second | AWS Free Tier |
| Azure Cognitive | Standard | $1.00/hour | Per second | Azure credits |

Cost for 1,000 hours of audio:

  • Deepgram Nova-3: ~$354
  • Deepgram Batch: ~$258
  • AssemblyAI: $370
  • OpenAI Whisper: $360
  • Amazon Transcribe: $1,440

Deepgram, OpenAI, and AssemblyAI land within about 5% of each other at production volume. Amazon Transcribe costs roughly 4x as much.
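The per-provider totals above follow directly from the table's rates; a small helper makes the arithmetic reproducible for your own volume (rates are hard-coded from the pricing table, so update them as providers change pricing):

```python
def cost_per_1k_hours(rate: float, unit: str = "min") -> float:
    """Cost in USD for 1,000 hours of audio at a per-minute or per-hour rate."""
    if unit == "min":
        return rate * 1_000 * 60  # 60,000 minutes in 1,000 hours
    if unit == "hour":
        return rate * 1_000
    raise ValueError(f"unknown billing unit: {unit}")

# Rates from the pricing table
print(f"Deepgram Nova-3:   ${cost_per_1k_hours(0.0059):,.0f}")
print(f"Deepgram Batch:    ${cost_per_1k_hours(0.0043):,.0f}")
print(f"AssemblyAI:        ${cost_per_1k_hours(0.37, 'hour'):,.0f}")
print(f"OpenAI Whisper:    ${cost_per_1k_hours(0.006):,.0f}")
print(f"Amazon Transcribe: ${cost_per_1k_hours(0.024):,.0f}")
```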

Deepgram

Best for: Real-time voice agents, low-latency transcription, high-volume batch processing

Deepgram is the speed and cost leader for production speech-to-text. Nova-3, their latest model, delivers 5.26% WER on benchmark audio with real-time streaming that produces words within 200-400ms of speech ending.

Models

| Model | WER | Use Case |
|---|---|---|
| Nova-3 | 5.26% | General purpose, best accuracy |
| Nova-3 Medical | 1-10% | Healthcare vocabulary |
| Nova-3 Finance | Low | Financial terminology |
| Whisper Cloud | Variable | Whisper compatibility layer |

Real-Time Streaming

import asyncio
import json
import os

import websockets

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def transcribe_realtime():
    url = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"

    async with websockets.connect(
        url,
        # Note: websockets >= 14 renamed this parameter to additional_headers
        extra_headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"}
    ) as ws:
        # Send audio chunks as they arrive
        async def send_audio():
            # audio_source: an async iterator of raw audio chunks
            # (your microphone capture or file reader, defined elsewhere)
            async for chunk in audio_source:
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    print(f"Final: {transcript}")
                else:
                    # Interim results for immediate display
                    interim = result["channel"]["alternatives"][0]["transcript"]
                    print(f"Interim: {interim}", end="\r")

        await asyncio.gather(send_audio(), receive_transcripts())

Batch Transcription

import httpx

# audio_bytes: the raw file contents, e.g. open("audio.wav", "rb").read()
response = httpx.post(
    "https://api.deepgram.com/v1/listen",
    headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    params={
        "model": "nova-3",
        "smart_format": "true",
        "diarize": "true",
        "punctuate": "true",
        "paragraphs": "true",
    },
    content=audio_bytes,
)

result = response.json()
transcript = result["results"]["channels"][0]["alternatives"][0]["transcript"]
words = result["results"]["channels"][0]["alternatives"][0]["words"]  # Word-level timestamps

Voice Agent Features

Deepgram's Aura TTS and Flux STT combination is specifically designed for voice agent pipelines:

  • Model-integrated end-of-turn detection (knows when user stops speaking)
  • Configurable turn-taking dynamics
  • Ultra-low latency optimized for conversation
  • Voice Activity Detection (VAD) built in
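As a sketch of how these features surface in the API, turn-taking behavior is configured through query parameters on the streaming endpoint. The parameter names below (`endpointing`, `utterance_end_ms`, `vad_events`) are Deepgram streaming options as we understand them; verify the exact values and semantics against the current docs before relying on them:

```python
from urllib.parse import urlencode

# Voice-agent-oriented streaming configuration (parameter semantics per
# Deepgram's docs at time of writing -- verify before depending on them)
params = {
    "model": "nova-3",
    "smart_format": "true",
    "endpointing": "300",        # ms of trailing silence before finalizing
    "utterance_end_ms": "1000",  # emit UtteranceEnd after 1s without speech
    "vad_events": "true",        # receive SpeechStarted events
}
url = "wss://api.deepgram.com/v1/listen?" + urlencode(params)
print(url)
```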

Strengths

  • Fastest real-time transcription (200-400ms latency)
  • Cheapest at scale ($0.0059/min vs $0.006 for Whisper)
  • Domain-specific models (Medical, Finance)
  • $200 free credit on signup
  • Voice agent pipeline features (Flux, Aura)
  • 36+ languages supported
  • Self-serve model customization

When to choose Deepgram

Voice agents requiring real-time transcription, high-volume batch transcription at lowest cost, healthcare/finance applications with domain-specific vocabulary, any application where latency is the primary constraint.

AssemblyAI

Best for: Audio intelligence, meeting analytics, LLM-over-audio applications

AssemblyAI's differentiation in 2026 isn't transcription accuracy — it's what you can do with transcribed audio. The LeMUR framework and their suite of audio intelligence features (sentiment analysis, topic detection, content safety, PII redaction) make AssemblyAI the choice for applications that need to understand audio, not just transcribe it.

Models

| Model | WER (benchmark) | Notes |
|---|---|---|
| Universal-2 | 8.4% | General purpose, best intelligence features |
| Slam-1 (Oct 2025) | TBD | New architecture, multilingual streaming |

Audio Intelligence Features

AssemblyAI includes these features in the base transcription API:

import assemblyai as aai

config = aai.TranscriptionConfig(
    sentiment_analysis=True,        # Positive/negative/neutral per utterance
    auto_highlights=True,           # Key points automatically extracted
    iab_categories=True,            # IAB topic classification
    entity_detection=True,          # Named entity recognition
    speaker_labels=True,            # Speaker diarization
    content_safety=True,            # Hate speech, profanity detection
    redact_pii=True,                # Remove PII from transcript
    summarization=True,             # Automatic summary
    auto_chapters=True,             # Chapter segmentation
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://your-audio-url.com/file.mp3", config)

# Sentiment per utterance
for result in transcript.sentiment_analysis:
    print(f"{result.speaker}: {result.text} [{result.sentiment}]")

# Auto-extracted highlights
for result in transcript.auto_highlights.results:
    print(f"Highlight: {result.text} (count: {result.count})")

LeMUR Framework

LeMUR (Leveraging Large Language Models to Understand Recognized Speech) is AssemblyAI's most distinctive feature:

# Apply LLM directly to transcribed audio
lemur_response = transcript.lemur.task(
    prompt="What were the main decisions made in this meeting? Format as a bulleted list.",
    final_model=aai.LemurModel.claude3_5_sonnet,
)

# Q&A over audio
qa_response = transcript.lemur.question_answer(
    questions=[
        aai.LemurQuestion(question="What was the total deal size discussed?"),
        aai.LemurQuestion(question="Who are the key stakeholders mentioned?"),
    ]
)

# Structured output
action_items = transcript.lemur.action_items()

Process up to 10 hours of audio through LeMUR in a single API call — summarizing hours of podcast content, extracting decisions from long recordings, or generating reports from call center sessions.

Real-Time Streaming (Slam-1)

AssemblyAI's October 2025 Slam-1 model introduced:

  • Real-time streaming transcription (latency comparable to Deepgram)
  • Six language support for streaming (English, Spanish, French, German, Portuguese, Dutch)
  • Safety guardrails during transcription
  • LLM Gateway integration for immediate LLM processing

Pricing

| Feature | Cost |
|---|---|
| Transcription | $0.37/hour |
| Real-time streaming | $0.37/hour |
| LeMUR (base) | Free with transcription |
| LeMUR (LLM costs) | Model-dependent |
| Audio Intelligence | Included |

Strengths

  • Best audio intelligence suite (sentiment, topics, entities, safety)
  • LeMUR framework for LLM-over-audio
  • Content safety and PII redaction built in
  • Auto-chapters, auto-highlights, auto-summarization
  • Straightforward hourly pricing (no per-feature add-ons)
  • Free testing credits

When to choose AssemblyAI

Meeting intelligence and analytics, call center analysis, podcast intelligence, any application that needs to understand audio beyond transcription, applications requiring content moderation on audio content.

OpenAI Whisper / gpt-4o-transcribe

Best for: Language breadth, highest accuracy on multilingual audio, research/academic use

OpenAI's transcription story evolved significantly in 2025. gpt-4o-transcribe, released in March 2025, outperforms the original Whisper Large-v2 on most benchmarks. Whisper remains available as whisper-1 for legacy integrations.

Models

| Model | Languages | WER | Latency | Price |
|---|---|---|---|---|
| gpt-4o-transcribe | 99+ | Low | 1-3s (batch) | $0.006/min |
| gpt-4o-mini-transcribe | 99+ | Good | Faster | Lower |
| whisper-1 (legacy) | 99 | ~5-7% | 1-3s | $0.006/min |

API Integration

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Batch transcription
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="json",
        language="es",  # Optional: specify language for better accuracy
        # Note: word-level timestamps (timestamp_granularities=["word"])
        # require whisper-1 with response_format="verbose_json";
        # gpt-4o-transcribe does not support them
    )

print(transcription.text)

Language Coverage

Whisper/gpt-4o-transcribe supports 99 languages — significantly more than Deepgram (36+) or AssemblyAI's streaming (6 languages for Slam-1). For applications handling multilingual audio from diverse user bases, OpenAI's language breadth is the decisive factor.

Limitations

  • No real-time streaming API (batch only) — gpt-4o-realtime handles real-time audio separately but at higher cost
  • No free tier — every minute costs $0.006
  • 1-3 second latency for batch — too slow for real-time voice agents
  • No audio intelligence features built in — transcription only

When to choose OpenAI Whisper/gpt-4o-transcribe

Applications requiring 99-language support, highest accuracy on challenging multilingual audio, research and academic transcription, applications already deeply in the OpenAI ecosystem, cases where batch processing (1-3s) is acceptable.

Feature Comparison

| Feature | Deepgram | AssemblyAI | OpenAI |
|---|---|---|---|
| Real-time streaming | Yes (200-400ms) | Yes (Slam-1) | No (batch only) |
| Word-level timestamps | Yes | Yes | Yes |
| Speaker diarization | Yes | Yes | Limited |
| Sentiment analysis | No | Yes | No |
| Topic detection | No | Yes | No |
| Entity extraction | No | Yes | No |
| Content safety | No | Yes | No |
| PII redaction | No | Yes | No |
| Auto-summary | No | Yes | No |
| LLM integration | No | Yes (LeMUR) | Basic |
| Language count | 36+ | 6 (streaming), more (batch) | 99+ |
| Domain models | Medical, Finance | None | None |
| Free credits | $200 | Yes (limited) | None |
| Pricing | $0.0059/min | $0.37/hour | $0.006/min |

Choosing the Right STT API

For real-time voice applications (< 500ms latency required)

Choose Deepgram Nova-3. Nothing else delivers 200-400ms end-to-end latency for production real-time transcription. Voice agents, live captions, and interactive audio applications need Deepgram.

For meeting intelligence and audio analysis

Choose AssemblyAI. The LeMUR framework, audio intelligence features, and auto-chapters/highlights/summaries make it purpose-built for meeting analytics, podcast intelligence, and call center analysis.

For multilingual applications (> 36 languages)

Choose OpenAI gpt-4o-transcribe. 99 languages with good accuracy across all of them. Deepgram's 36 and AssemblyAI's limited streaming language support don't compare for truly multilingual applications.

For healthcare/medical applications

Choose Deepgram Nova-3 Medical. Specialized training on medical vocabulary reduces WER to 1-10% on clinical audio — significantly better than general models.

For maximum cost efficiency at batch scale

Choose Deepgram Nova-3 Batch ($0.0043/min) or Rev AI Standard ($0.002/min) if accuracy requirements are modest.

For content moderation and safety on audio

Choose AssemblyAI. Built-in content safety, profanity detection, and PII redaction are unique in the market.
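The guidance above collapses into a simple lookup. This toy helper encodes it; the requirement labels are invented for illustration, so adjust them to your own priorities:

```python
def pick_stt(needs: set[str]) -> str:
    """Map requirement labels to the provider recommendations above."""
    if needs & {"realtime", "medical", "finance", "lowest_cost"}:
        return "Deepgram"
    if needs & {"audio_intelligence", "meeting_analytics", "moderation"}:
        return "AssemblyAI"
    if "multilingual" in needs:
        return "OpenAI"
    return "Deepgram"  # default starting point for most voice projects

print(pick_stt({"realtime"}))           # Deepgram
print(pick_stt({"meeting_analytics"}))  # AssemblyAI
print(pick_stt({"multilingual"}))       # OpenAI
```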

Testing Recommendation

Published benchmarks use clean studio audio. Your production audio will have:

  • Background noise
  • Multiple overlapping speakers
  • Accents and non-native speech
  • Domain-specific terminology
  • Variable recording quality

Before committing, test all three APIs on 30-60 minutes of your actual production audio. WER on your data is the only metric that matters. A few points of benchmark difference between providers can shrink to nothing or widen dramatically under real conditions; only your own audio will tell you which way it goes.
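Measuring WER on your own audio takes only a reference transcript and a word-level edit distance. A minimal sketch (production evaluations typically add text normalization first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quack brown fox"))  # 0.25
```

Run each provider's transcript against the same human-verified reference and compare the resulting rates directly.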

Verdict

Deepgram is the default choice for real-time voice applications and cost-sensitive batch processing. The combination of speed, price, and the $200 free credit makes it the best starting point for most voice projects.

AssemblyAI is the right choice when transcription is just the beginning — when you need to understand, summarize, analyze, and extract structured insights from audio content.

OpenAI is the choice for maximum language coverage and applications already in the OpenAI ecosystem. The accuracy improvements in gpt-4o-transcribe are real, but the lack of real-time streaming and no free tier limit its appeal outside its strengths.

Testing Accuracy for Your Use Case

Published word error rate (WER) benchmarks for speech-to-text APIs use standardized test sets — typically clean studio recordings with standard American English or British English accents. Production audio rarely matches those conditions. Meeting recordings have crosstalk, variable microphone quality, and background noise. Call center audio has codec compression artifacts and telephony frequency filtering. Medical dictation involves specialized terminology that general-vocabulary models mishandle.

Before committing to an STT API, test it on your actual audio samples — not vendor-provided demos. A 200-sentence test set drawn from real recordings surfaces accuracy gaps that WER benchmarks mask. Key areas to test: names and proper nouns (brands, product names, people's names), domain-specific vocabulary (medical terms, legal language, technical jargon), accented speech representative of your user base, and audio with background noise at typical levels for your use case.

Deepgram and AssemblyAI both support custom vocabulary through keyword boosting and custom language models — if you're operating in a specialized domain, factor this training capability into your evaluation. OpenAI Whisper performs well on a broad vocabulary out of the box but doesn't currently support custom keyword boosting. Rev AI is strong on human-reviewed accuracy, but its workflows are async-only, with latency too high for real-time use.

One practical metric that benchmark comparisons often omit: disfluency handling. How the API handles 'um,' 'uh,' and false starts affects transcript readability significantly. Some APIs filter filler words by default; others include them verbatim. For meeting transcription use cases where readability matters to end users, test disfluency handling explicitly alongside WER — it affects perceived quality more than marginal WER differences in the 90-95% accuracy range.
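When providers differ in filler-word defaults, normalize both transcripts the same way before scoring, or the WER comparison measures formatting policy rather than recognition quality. A minimal sketch (the filler list here is illustrative, not exhaustive):

```python
import re

FILLERS = {"um", "uh", "erm", "hmm", "mhm"}  # extend for your domain

def normalize(transcript: str) -> str:
    """Lowercase, strip punctuation, and drop filler words before WER scoring."""
    words = re.sub(r"[^\w\s']", "", transcript.lower()).split()
    return " ".join(w for w in words if w not in FILLERS)

print(normalize("Um, so we... uh, shipped it."))  # so we shipped it
```

Apply the same normalization to the reference transcript and to every provider's output, then compute WER on the normalized strings.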


Compare speech-to-text API pricing, features, and documentation at APIScout — find the right transcription API for your application.

Evaluate Deepgram and compare alternatives on APIScout.

Related: Speech-to-Text APIs (2026), How AI Is Transforming API Design and Documentation, API Breaking Changes Without Breaking Clients
