ElevenLabs vs OpenAI TTS vs Deepgram Aura 2026
TL;DR
ElevenLabs for voice quality and cloning. OpenAI TTS for simplicity and ecosystem. Deepgram Aura for production-grade low latency at scale. ElevenLabs produces the most natural-sounding speech and is the only one of the three with high-quality voice cloning from a one-minute audio sample. OpenAI TTS is good enough for most use cases and has the simplest API. Deepgram Aura wins on first-byte latency (~200ms), which matters for real-time voice apps. The right choice depends on whether you're building a voice product (ElevenLabs), an AI assistant (Deepgram), or just adding audio to your app (OpenAI).
Text-to-speech has crossed from "novelty" to "infrastructure" in 2026. AI assistants, voice-enabled web apps, accessibility features, audiobook generators, and customer service bots all depend on TTS quality and reliability. Choosing the wrong provider means either overpaying (ElevenLabs at 20x OpenAI's price) or compromising on quality at exactly the moment users notice (robot-sounding voice for premium voice products). This guide gives you the code and decision framework to choose correctly.
Key Takeaways
- ElevenLabs: best voice quality, voice cloning, 32 languages, $0.30/1K chars ($0.0003/char)
- OpenAI TTS: 6 voices, simple API, $15/1M chars ($0.000015/char) — 20x cheaper
- Deepgram Aura: ~200ms first byte, streaming WebSocket, $0.015/1K chars
- Latency for streaming: Deepgram ~200ms, OpenAI ~400ms, ElevenLabs ~500ms (streaming)
- Voice cloning: ElevenLabs only (30-second to 1-minute sample needed)
- Real-time voice: Deepgram Aura + STT in same platform = low-latency voice assistant loop
OpenAI TTS: Simplest API
Best for: adding audio to an existing OpenAI app, simple narration, notifications
import OpenAI from 'openai';
import fs from 'fs';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Generate audio file:
const mp3 = await openai.audio.speech.create({
model: 'tts-1', // or 'tts-1-hd' (higher quality, ~2x cost)
voice: 'alloy', // alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer
input: 'Hello! Welcome to our platform.',
response_format: 'mp3', // mp3, opus, aac, flac, wav, pcm
speed: 1.0, // 0.25 to 4.0
});
// Save to file:
const buffer = Buffer.from(await mp3.arrayBuffer());
fs.writeFileSync('speech.mp3', buffer);
// Streaming (for long text):
const stream = await openai.audio.speech.create({
model: 'tts-1',
voice: 'nova',
input: longText,
response_format: 'mp3',
});
// Stream to file or HTTP response. The SDK returns a web ReadableStream,
// which has no .pipe() — convert it to a Node stream first:
import { Readable } from 'stream';
const dest = fs.createWriteStream('speech.mp3');
Readable.fromWeb(stream.body as any).pipe(dest);
// Next.js API route — stream audio to browser:
export async function POST(req: Request) {
const { text, voice = 'alloy' } = await req.json();
const mp3 = await openai.audio.speech.create({
model: 'tts-1',
voice,
input: text,
});
return new Response(mp3.body, {
headers: {
'Content-Type': 'audio/mpeg',
'Transfer-Encoding': 'chunked',
},
});
}
OpenAI TTS Voices
| Voice | Character | Best For |
|---|---|---|
| alloy | Neutral, balanced | General use |
| ash | Warm, conversational | Chatbots |
| coral | Clear, professional | Announcements |
| echo | Deep, authoritative | Presentations |
| nova | Bright, friendly | Customer service |
| onyx | Rich, deep | Narration |
| sage | Calm, clear | Education |
| shimmer | Warm, expressive | Stories |
OpenAI TTS Pricing
tts-1: $15/1M characters ($0.000015/char)
tts-1-hd: $30/1M characters ($0.000030/char)
For context:
Average sentence (100 chars): $0.0015
1-minute narration (~800 chars): $0.012
1-hour audiobook (~48K chars): $0.72
ElevenLabs: Premium Quality and Voice Cloning
Best for: high-quality voice content, voice cloning, multilingual apps, podcasts/audiobooks
// npm install elevenlabs
import { ElevenLabsClient } from 'elevenlabs';
const elevenlabs = new ElevenLabsClient({
apiKey: process.env.ELEVENLABS_API_KEY,
});
// Convert text to speech:
const audioStream = await elevenlabs.textToSpeech.convertAsStream('21m00Tcm4TlvDq8ikWAM', { // "Rachel" — the API takes a voice ID, not a name
text: 'Welcome to our platform.',
model_id: 'eleven_turbo_v2', // Fast model; 'eleven_multilingual_v2' for multilingual
voice_settings: {
stability: 0.5, // 0-1: lower = more expressive, higher = more consistent
similarity_boost: 0.8, // 0-1: higher = more similar to original voice
style: 0.0, // 0-1: style exaggeration
use_speaker_boost: true,
},
output_format: 'mp3_44100_128',
});
// Collect stream into buffer:
const chunks: Uint8Array[] = [];
for await (const chunk of audioStream) {
chunks.push(chunk);
}
const audio = Buffer.concat(chunks);
fs.writeFileSync('speech.mp3', audio);
// List available voices:
const voices = await elevenlabs.voices.getAll();
for (const voice of voices.voices) {
console.log(`${voice.name}: ${voice.voice_id} (${voice.labels?.accent ?? 'no accent'})`);
}
// Use a specific voice by ID:
const adamStream = await elevenlabs.textToSpeech.convertAsStream(
'pNInz6obpgDQGcFmaJgB', // Adam voice ID
{ text: 'Hello world', model_id: 'eleven_turbo_v2' }
);
ElevenLabs Voice Cloning
// Instant Voice Cloning (1 minute of audio → custom voice):
const voiceClone = await elevenlabs.voices.ivc.create({
name: 'My Custom Voice',
description: 'A custom voice for our product',
files: [
new File([fs.readFileSync('voice-sample.mp3')], 'sample.mp3', { type: 'audio/mpeg' }),
],
labels: JSON.stringify({ accent: 'American', age: 'young adult', gender: 'female' }),
});
console.log('Voice ID:', voiceClone.voice_id);
// Now use the cloned voice:
const audio = await elevenlabs.textToSpeech.convertAsStream(voiceClone.voice_id, {
text: 'This is my cloned voice.',
model_id: 'eleven_turbo_v2',
});
ElevenLabs Models
| Model | Quality | Latency | Languages | Notes |
|---|---|---|---|---|
| eleven_turbo_v2 | Good | ~500ms | 32 | Best balance |
| eleven_turbo_v2_5 | Better | ~500ms | 32 | Improved quality |
| eleven_multilingual_v2 | Best | ~800ms | 29 | Highest quality |
| eleven_flash_v2_5 | Good | ~200ms | 32 | Lowest latency |
ElevenLabs Pricing
Creator ($22/month): 100K chars/month
Pro ($99/month): 500K chars/month + commercial use + voice cloning
Scale ($330/month): 2M chars/month
Enterprise: custom
Beyond plan limits (pay-as-you-go overage): $0.30/1K chars ($0.0003/char)
vs OpenAI: $0.015/1K chars ($0.000015/char)
ElevenLabs is 20x more expensive than OpenAI TTS.
Worth it for: high-quality consumer products, voice assistants, content creation.
Not worth it for: internal tools, simple notifications, cost-sensitive apps.
Deepgram Aura: Low-Latency Production
Best for: real-time voice assistants, customer service bots, apps needing fast first-byte response
// npm install @deepgram/sdk
import { createClient } from '@deepgram/sdk';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
// TTS with streaming:
const response = await deepgram.speak.request(
{ text: 'Hello! How can I help you today?' },
{
model: 'aura-asteria-en', // Fastest English voice
encoding: 'linear16', // PCM16 for real-time playback
sample_rate: 24000,
}
);
const stream = await response.getStream();
if (stream) {
// First bytes arrive in ~200ms (vs ~400ms for OpenAI)
const audioData = await getAudioBuffer(stream);
}
// WebSocket for true real-time (lower overhead than HTTP):
const ws = deepgram.speak.live({
model: 'aura-asteria-en',
encoding: 'linear16',
sample_rate: 24000,
});
ws.on('open', () => {
ws.sendText('Hello! I am your AI assistant.');
});
ws.on('audio', (audioChunk: Buffer) => {
// Stream each chunk to audio output immediately
playAudioChunk(audioChunk);
});
ws.on('close', () => console.log('Done'));
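The `getAudioBuffer` helper used above isn't defined in the snippet (Deepgram's docs ship a similar utility). A minimal version that drains a web ReadableStream into a single Buffer:

```typescript
// Drain a web ReadableStream of audio chunks into one Buffer
async function getAudioBuffer(stream: ReadableStream<Uint8Array>): Promise<Buffer> {
  const reader = stream.getReader();
  const chunks: Uint8Array[] = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
  }
  return Buffer.concat(chunks);
}

// Usage with an in-memory stream:
const demo = new ReadableStream<Uint8Array>({
  start(controller) {
    controller.enqueue(new Uint8Array([1, 2]));
    controller.enqueue(new Uint8Array([3]));
    controller.close();
  },
});
console.log((await getAudioBuffer(demo)).length); // 3
```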
Deepgram Aura Voices
English voices:
aura-asteria-en — Female, warm (recommended)
aura-luna-en — Female, natural
aura-stella-en — Female, clear
aura-athena-en — Female, authoritative
aura-hera-en — Female, confident
aura-orion-en — Male, deep
aura-arcas-en — Male, warm
aura-perseus-en — Male, authoritative
aura-angus-en — Male, Irish accent
aura-orpheus-en — Male, American
aura-helios-en — Male, British accent
aura-zeus-en — Male, commanding
Deepgram Full Voice Pipeline (STT + TTS)
The killer use case: Deepgram handles both speech-to-text and TTS, minimizing round trips:
// Complete voice assistant loop with Deepgram:
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
// 1. Speech-to-Text (Deepgram Nova-3):
const sttConnection = deepgram.listen.live({
model: 'nova-3',
language: 'en-US',
smart_format: true,
interim_results: false,
utterance_end_ms: 1000,
});
sttConnection.on(LiveTranscriptionEvents.Transcript, async (data) => {
const transcript = data.channel?.alternatives[0]?.transcript;
if (!transcript || data.is_final === false) return;
// 2. Send to LLM:
const llmResponse = await getLLMResponse(transcript);
// 3. Text-to-Speech (Deepgram Aura):
const ttsResponse = await deepgram.speak.request(
{ text: llmResponse },
{ model: 'aura-asteria-en', encoding: 'linear16', sample_rate: 24000 }
);
const audioStream = await ttsResponse.getStream();
// Play audio immediately
playStream(audioStream);
});
Deepgram Pricing
Aura TTS:
Pay-as-you-go: $0.015/1K chars ($0.000015/char)
Same price as OpenAI TTS-1 but with better latency
For 1M chars/month:
OpenAI TTS-1: $15
Deepgram Aura: $15
ElevenLabs: $300 (on pay-as-you-go)
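The comparison reduces to simple per-character arithmetic; a small helper using the pay-as-you-go rates quoted in this guide:

```typescript
// Pay-as-you-go price per character (USD), as quoted above
const PRICE_PER_CHAR = {
  'openai-tts-1': 0.000015,
  'deepgram-aura': 0.000015,
  'elevenlabs': 0.0003,
} as const;

type Provider = keyof typeof PRICE_PER_CHAR;

// Estimated monthly TTS spend for a given character volume
function monthlyCost(provider: Provider, charsPerMonth: number): number {
  return PRICE_PER_CHAR[provider] * charsPerMonth;
}

// 1M characters/month, matching the comparison above:
console.log(monthlyCost('openai-tts-1', 1_000_000)); // ~$15
console.log(monthlyCost('elevenlabs', 1_000_000)); // ~$300
```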
Head-to-Head: When to Choose
| Need | Best Choice |
|---|---|
| Simplest setup | OpenAI TTS |
| Highest voice quality | ElevenLabs |
| Voice cloning | ElevenLabs |
| Lowest cost | OpenAI TTS or Deepgram |
| Fastest first-byte | Deepgram Aura or ElevenLabs Flash |
| Multilingual (29+ languages) | ElevenLabs |
| Full STT+TTS pipeline | Deepgram |
| Already using OpenAI | OpenAI TTS |
| Real-time voice assistant | Deepgram Aura |
| Consumer product (quality matters) | ElevenLabs |
| Internal tool | OpenAI TTS |
Code: Audio Playback in the Browser
// Play audio from TTS API in browser:
async function playText(text: string) {
const response = await fetch('/api/tts', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text }),
});
const arrayBuffer = await response.arrayBuffer();
const audioContext = new AudioContext();
// For MP3:
const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
source.start();
// Returns promise when done playing:
return new Promise<void>((resolve) => {
source.onended = () => resolve();
});
}
// Streaming playback (start playing before download complete):
async function playStreamingText(text: string) {
const response = await fetch('/api/tts', {
method: 'POST',
body: JSON.stringify({ text }),
headers: { 'Content-Type': 'application/json' },
});
// Feed MP3 chunks into an <audio> element via MediaSource as they arrive
// (sketch — assumes the browser supports an 'audio/mpeg' SourceBuffer):
const mediaSource = new MediaSource();
const audio = new Audio(URL.createObjectURL(mediaSource));
audio.play(); // requires a prior user gesture in most browsers
mediaSource.addEventListener('sourceopen', async () => {
const sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg');
const reader = response.body!.getReader();
while (true) {
const { done, value } = await reader.read();
if (done) break;
sourceBuffer.appendBuffer(value);
// Wait for each append to finish before queuing the next chunk
await new Promise((r) => sourceBuffer.addEventListener('updateend', r, { once: true }));
}
mediaSource.endOfStream();
});
}
Real-Time Voice Architecture
Building a real-time voice assistant (user speaks → AI responds in voice) requires careful latency budgeting. The full loop has four stages, each with its own latency contribution:
- Speech-to-text (STT): Transcribing user speech — typically 200-500ms for streaming STT. Deepgram Nova-3, OpenAI Whisper Streaming, and AssemblyAI Streaming all target this range.
- LLM inference: Processing the transcription and generating a response. Claude Haiku, GPT-4o-mini, and Gemini Flash are optimized for low-latency completions — 300-600ms for short responses with streaming.
- Text-to-speech: Converting the LLM response to audio. This is where TTS provider choice has the biggest impact: Deepgram Aura at ~200ms TTFB vs OpenAI TTS at ~400ms.
- Audio delivery: Streaming audio to the client — typically negligible if you start playback on first chunk.
Total round-trip: 700ms-1,600ms depending on choices. For voice assistants, anything under 1 second feels responsive; 1-2 seconds is acceptable; over 2 seconds feels broken. Deepgram's advantage of 200ms TTFB for TTS (vs 400ms OpenAI) is meaningful in this tight budget.
Optimizing the LLM stage: Start TTS synthesis before the LLM finishes. Stream the LLM output token by token, accumulate until you have a complete sentence (detect period/question mark), then send that sentence to TTS. This pipeline parallelism cuts the effective LLM + TTS latency significantly. Deepgram's WebSocket TTS API is designed for this pattern — you push text chunks and it streams audio back in real time.
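A minimal, provider-agnostic sketch of that sentence-buffering pattern — the `onSentence` callback is where you would send text to your TTS API (here it just collects strings):

```typescript
// Accumulates streamed LLM tokens and emits complete sentences, so TTS can
// start synthesizing sentence 1 while the LLM is still generating sentence 2.
// Boundary detection is naive (a period after "Dr." would split early).
class SentenceBuffer {
  private buffer = '';

  constructor(private onSentence: (sentence: string) => void) {}

  // Feed one streamed token (or text delta) at a time
  push(token: string): void {
    this.buffer += token;
    let match: RegExpMatchArray | null;
    // Flush every complete sentence ending in . ! or ?
    while ((match = this.buffer.match(/^[\s\S]*?[.!?](?=\s|$)/)) !== null) {
      this.onSentence(match[0].trim());
      this.buffer = this.buffer.slice(match[0].length);
    }
  }

  // Flush any trailing text when the LLM stream ends
  end(): void {
    const rest = this.buffer.trim();
    if (rest) this.onSentence(rest);
    this.buffer = '';
  }
}

// Usage: wire the buffer between an LLM token stream and your TTS call
const sentences: string[] = [];
const buf = new SentenceBuffer((s) => sentences.push(s));
for (const token of ['Hello', ' there!', ' How are', ' you today?', ' Bye']) {
  buf.push(token);
}
buf.end();
console.log(sentences); // ['Hello there!', 'How are you today?', 'Bye']
```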
Deepgram vs. separate STT + TTS providers: Using Deepgram for both STT and TTS simplifies the architecture (one API key, one SDK, one billing relationship) and Deepgram's WebSocket APIs for both can share a single connection. The quality of Deepgram Nova-3 for STT is excellent; Aura's voice quality is good but not ElevenLabs-tier. For voice assistants where naturalness is important (therapy apps, premium AI companions), combining Deepgram STT + ElevenLabs TTS gives the best quality at the cost of slightly higher latency and complexity.
Caching and Cost Optimization
TTS costs compound quickly in high-volume applications. A customer service bot handling 10,000 conversations per month, each with 5 responses averaging 200 characters, generates 10M characters/month — that's $150/month on OpenAI TTS or $3,000/month on ElevenLabs pay-as-you-go.
Aggressive caching: Cache TTS audio output by text hash. Common phrases like "How can I help you today?", "I'll transfer you to a specialist", or menu prompts are generated thousands of times but never change. Hash the (text, voice, model) tuple, look up in Redis or CDN, and only call the TTS API on cache miss. For a typical customer service bot, caching common phrases achieves 40-70% cache hit rates, directly reducing API costs.
Pre-generation for known content: For applications where the text is known in advance (product descriptions, educational content, navigation prompts), pre-generate all audio files and store them in R2 or S3. The TTS API cost is a one-time upfront expense; serving pre-generated audio is essentially free. Build a pipeline that detects content changes and regenerates only the affected audio files.
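One way to sketch the change-detection step: keep a manifest of content hashes from the last run and diff against current content (the content IDs and text below are hypothetical):

```typescript
import { createHash } from 'node:crypto';

// Manifest maps content ID → hash of the text that was last synthesized
type Manifest = Record<string, string>;

const hashText = (text: string) =>
  createHash('sha256').update(text).digest('hex');

// IDs whose text is new or changed since the last run — only these need
// fresh synthesis; everything else keeps its stored audio in R2/S3
function staleEntries(content: Record<string, string>, manifest: Manifest): string[] {
  return Object.keys(content).filter((id) => manifest[id] !== hashText(content[id]));
}

// Usage with hypothetical product copy:
const manifest: Manifest = {
  welcome: hashText('Welcome!'),
  promo: hashText('Old promo text'),
};
const content = {
  welcome: 'Welcome!', // unchanged — skip
  promo: 'New promo text', // changed — regenerate
  help: 'Need a hand?', // new — generate
};
console.log(staleEntries(content, manifest)); // ['promo', 'help']
```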
Model selection by use case within ElevenLabs: ElevenLabs' turbo models are 60% cheaper than multilingual_v2 with slightly lower quality. For applications where users won't notice the quality difference (short notifications, system prompts), use turbo. Reserve multilingual_v2 for long-form content where quality matters. This alone can cut ElevenLabs costs by 40%.
Batch synthesis: OpenAI TTS doesn't have a batch API, but you can parallelize requests with Promise.all() for non-realtime workloads. ElevenLabs has rate limits (request-level, not character-level), so parallel requests hit limits faster — implement a queue with concurrency control.
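A minimal concurrency-limited map, as an alternative to unbounded Promise.all() (the limit of 3 and the fake delay below are illustrative, not provider limits):

```typescript
// Run async jobs with at most `limit` in flight — useful when a provider
// rate-limits by request count rather than by characters
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index (safe: JS is single-threaded)
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, () => worker()));
  return results;
}

// Usage: 10 fake synthesis jobs, at most 3 in flight at once
let inFlight = 0;
let peak = 0;
const out = await mapWithConcurrency([...Array(10).keys()], 3, async (n) => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise((r) => setTimeout(r, 5)); // stand-in for a TTS request
  inFlight--;
  return n * 2;
});
console.log(peak, out[9]); // peak is at most 3; out[9] === 18
```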
Storage format optimization: MP3 is the most compatible format, but WAV/PCM is better for real-time playback (no decode delay) and Opus gives smaller files at the same quality. For pre-generated content served via CDN, use MP3 at 128 kbps — the best compatibility/size tradeoff for general web use. For real-time voice assistant audio, use linear16 PCM (raw) at 16 kHz or 24 kHz — larger, but zero decoding latency, and it works natively with the Web Audio API and most audio pipelines. ElevenLabs supports MP3 at 44.1 kHz/128 kbps as its recommended format; Deepgram Aura defaults to linear16 for streaming use. OpenAI TTS offers MP3, Opus, AAC, FLAC, WAV, and PCM — use PCM for real-time playback and MP3 for stored audio.
Monitoring TTS quality regression: Both ElevenLabs and OpenAI update their TTS models regularly. Set up a monthly smoke test that generates audio from a set of canonical test phrases and listens for obvious quality issues. Use MOS (Mean Opinion Score) automation tools like ViSQOL or DNSMOS to score audio quality programmatically. A quality regression in TTS can affect your product's user experience in ways that aren't visible in standard error rate monitoring — you need specific TTS quality checks.
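A coarse sketch of such a smoke test. It only catches empty or truncated audio (byte count vs. text length), not perceptual drift — the phrases and threshold are placeholders, and a real setup would plug a ViSQOL/DNSMOS score in where the size check is:

```typescript
// Canonical phrases to re-synthesize on a schedule (placeholders — use
// phrases that actually occur in your product):
const CANONICAL_PHRASES = [
  'How can I help you today?',
  'Your order number is 12345.',
  'Is there anything else I can do?',
];

type Synth = (text: string) => Promise<Buffer>;

// Flags phrases whose audio is suspiciously small for the text length
// (catches empty/truncated output, not subtle quality regressions)
async function smokeTest(synth: Synth, minBytesPerChar = 10): Promise<string[]> {
  const failures: string[] = [];
  for (const phrase of CANONICAL_PHRASES) {
    const audio = await synth(phrase);
    if (audio.length < phrase.length * minBytesPerChar) failures.push(phrase);
  }
  return failures;
}

// Usage with a stand-in synth function:
const fakeSynth: Synth = async (text) => Buffer.alloc(text.length * 50);
console.log(await smokeTest(fakeSynth)); // [] — all phrases pass
```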
ElevenLabs Voice Quality Deeper Dive
Voice quality is subjective, but ElevenLabs consistently ranks highest in blind listening tests for naturalness, prosody (rhythm and emphasis), and emotional expressiveness. The gap is most noticeable in:
- Emotional text: ElevenLabs handles exclamations, questions, and emotional context naturally. OpenAI TTS and Deepgram produce more monotone output on the same text.
- Long-form content: Over several paragraphs, ElevenLabs varies pacing and emphasis in ways that sound more like a skilled human narrator. OpenAI TTS maintains consistent but somewhat robotic delivery.
- Non-English languages: ElevenLabs' multilingual_v2 model produces natural-sounding accents in 29 languages. OpenAI TTS has limited multilingual support and tends to anglicize pronunciation. Deepgram Aura is English-only.
The quality gap has narrowed in 2026 as OpenAI improved TTS-1-HD. For casual use cases (notifications, summaries, navigation), the quality difference is negligible. For consumer-facing voice products where users judge the experience on voice quality, ElevenLabs remains the clear leader.
Voice cloning ethics: ElevenLabs requires users to confirm they have rights to the voice being cloned and comply with their terms of service regarding prohibited uses (impersonation, fraud, non-consensual cloning). In production systems, add logging for voice clone creation and usage to maintain an audit trail. The EU AI Act and emerging US state legislation increasingly regulate synthetic voice technology; consult legal counsel if you're building voice cloning features into a commercial product, particularly in contexts where users might confuse synthetic speech for a real person's voice.
Methodology
Latency figures (Deepgram ~200ms, OpenAI ~400ms, ElevenLabs ~500ms) are measured from request initiation to first audio byte received, over a typical internet connection from US-based infrastructure. Results vary significantly with input text length, server region, and network conditions — ElevenLabs has data centers in the US and EU; routing to the nearest region is automatic but can be configured via API endpoint selection. Run your own latency benchmarks against your target region and typical input lengths before making architecture decisions based on these figures.
Pricing data is from each provider's public pricing pages as of early 2026. ElevenLabs pricing is per-character on pay-as-you-go; subscription plans provide cost-effective bundled character allowances.
OpenAI added the ash, coral, and sage voices in late 2024; the full voice list (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer) is current as of 2026. Deepgram's Aura-2 is its latest TTS model; the aura-asteria-en style model IDs used in this guide are the original Aura voices, which remain available — Aura-2 voices use the aura-2- prefix and are recommended for new integrations. The code examples use the elevenlabs npm package with the v1 API endpoints; ElevenLabs has renamed its SDK packages over time, so check its documentation for the current package name before starting a new integration.
Compare all voice and speech APIs at APIScout.
Compare Deepgram and OpenAI on APIScout.
Related: Deepgram vs OpenAI Whisper API: Speech-to-Text Compared, Anthropic MCP vs OpenAI Plugins vs Gemini Extensions, Cloudflare Workers AI vs AWS Bedrock vs Azure OpenAI