ElevenLabs vs Cartesia: Best Voice AI API 2026
TL;DR
Cartesia for real-time voice agents — 199ms TTFA (vs ElevenLabs' 832ms on its standard self-serve model), roughly 12–20x cheaper at $0.011/1K chars, and the SSM architecture makes it the best latency choice for conversational AI. ElevenLabs for quality-first audio production — 70+ languages, the broadest voice library, dubbing, sound effects, and a complete platform for content creators and multilingual apps. The practical split in 2026: Cartesia for AI phone agents and voice assistants; ElevenLabs for narration, dubbing, and premium voice experiences.
Key Takeaways
- Cartesia pricing: $0.011/1K chars (~12–20x cheaper than ElevenLabs' self-serve plans)
- ElevenLabs pricing: ~$0.132–$0.22/1K chars, depending on plan
- Cartesia TTFA: 199ms (Sonic model, self-serve tier)
- ElevenLabs TTFA: 832ms (standard model, self-serve tier), ~300ms with Flash on enterprise tier
- Architecture: Cartesia uses State Space Models (SSMs); ElevenLabs uses transformers
- Language support: ElevenLabs 70+ languages; Cartesia 15 languages
- Voice cloning: Cartesia requires 3 seconds; ElevenLabs requires 30 seconds
- Platform scope: ElevenLabs is full audio platform; Cartesia is API-only TTS
Why Voice AI Latency Matters
For voice agents (AI phone calls, real-time assistants, customer support bots), latency is the bottleneck. A 200ms TTFA feels like a natural conversation. An 800ms TTFA creates an awkward pause that feels broken.
User speaks → STT transcription → LLM inference → TTS → User hears response
Full turn latency budget:
STT: ~200ms (Deepgram/Whisper real-time)
LLM: ~400ms (streaming first token)
TTS: target <300ms TTFA
Total: ~900ms for natural conversation
Cartesia TTFA: 199ms → Total ~799ms (below 1s threshold)
ElevenLabs TTFA: 832ms → Total ~1432ms (above 1s, feels slow)
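The budget above works out to a quick sanity check (the component figures are this section's illustrative p50 values, not benchmarks — substitute your own measurements):

```python
# Rough end-to-end turn model: time from end of user speech to first audio.
# STT and LLM figures are the illustrative values from the budget above.

STT_MS = 200   # streaming transcription (Deepgram/Whisper real-time)
LLM_MS = 400   # LLM time to first streamed token

def turn_latency(tts_ttfa_ms: int) -> int:
    """Total perceived latency for one conversational turn, in ms."""
    return STT_MS + LLM_MS + tts_ttfa_ms

print(turn_latency(199))  # Cartesia Sonic -> 799
print(turn_latency(832))  # ElevenLabs standard, self-serve -> 1432
```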
This is why Cartesia has dominated new voice agent deployments in 2026 — the latency advantage directly translates to better conversation quality.
Cartesia
Architecture: State Space Models
Cartesia's Sonic model is built on State Space Models (SSMs) — a fundamentally different architecture from transformer-based TTS. SSMs maintain a compact recurrent state that updates incrementally as text arrives, enabling streaming synthesis before the full sentence is processed.
# Cartesia Python SDK
import os

import pyaudio
from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

# Stream audio for low-latency playback
p = pyaudio.PyAudio()
rate = 44100
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=rate, output=True)

# Generate and stream immediately
output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": rate,
}

for output in client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, how can I help you today?",
    voice={"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
    output_format=output_format,
    stream=True,
):
    buffer = output.get("audio")
    if buffer:
        stream.write(buffer)

stream.stop_stream()
stream.close()
p.terminate()
WebSocket API for Real-Time Agents
For voice agents, use the WebSocket API to send text chunks as they arrive from the LLM:
import asyncio
import base64
import json
import os

import websockets

async def voice_agent_response(llm_text_stream, voice_id: str):
    """Stream LLM output directly to Cartesia for ultra-low latency."""
    uri = "wss://api.cartesia.ai/tts/websocket"
    headers = {
        "Cartesia-Version": "2024-06-10",
        "X-API-Key": os.environ["CARTESIA_API_KEY"],
    }
    async with websockets.connect(uri, additional_headers=headers) as ws:
        context_id = "ctx-001"
        # Send text chunks as they arrive from LLM streaming
        async for text_chunk in llm_text_stream:
            await ws.send(json.dumps({
                "context_id": context_id,
                "model_id": "sonic-2",
                "transcript": text_chunk,
                "voice": {"mode": "id", "id": voice_id},
                "output_format": {
                    "container": "raw",
                    "encoding": "pcm_f32le",
                    "sample_rate": 16000,
                },
                "continue": True,  # More chunks coming
            }))
        # Signal end of utterance
        await ws.send(json.dumps({
            "context_id": context_id,
            "transcript": "",
            "continue": False,
        }))
        # Receive audio chunks and play/send to telephony
        async for message in ws:
            data = json.loads(message)
            if audio := data.get("audio"):
                yield base64.b64decode(audio)
Voice Cloning (3 seconds of audio)
# Clone a voice from 3 seconds of audio
import os

import requests

with open("sample.wav", "rb") as clip:
    response = requests.post(
        "https://api.cartesia.ai/voices/clone/clip",
        headers={
            "Cartesia-Version": "2024-06-10",
            "X-API-Key": os.environ["CARTESIA_API_KEY"],
        },
        files={"clip": clip},
        data={"name": "Custom Voice"},
    )
voice_id = response.json()["id"]

# Use immediately in generation
for output in client.tts.sse(
    model_id="sonic-2",
    transcript="Your cloned voice is ready.",
    voice={"mode": "id", "id": voice_id},
    output_format={"container": "mp3", "bit_rate": 128000, "sample_rate": 44100},
):
    pass  # Process audio chunks
ElevenLabs
The Full Audio Platform
ElevenLabs is more than TTS — it's a complete audio production platform. Beyond the API, it includes:
- Conversational AI: Pre-built voice agent framework with turn detection, interruption handling, and telephony integrations
- AI Dubbing: Automatically dub content into 29 languages while preserving the original speaker's voice
- Text to Sound Effects: Generate custom SFX from text descriptions
- Studio: Long-form audio editor for narration and audiobooks
- ElevenReader: iOS/Android app that reads any content aloud
For developers, the API covers TTS, speech-to-speech, voice cloning, and the Conversational AI framework.
TTS API
import os

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Basic TTS with a strong latency/quality balance
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # George — deep British narrator
    model_id="eleven_turbo_v2_5",  # Best latency/quality balance
    text="The quick brown fox jumps over the lazy dog.",
    output_format="mp3_44100_128",
    voice_settings={
        "stability": 0.5,
        "similarity_boost": 0.75,
        "style": 0.0,
        "use_speaker_boost": True,
    },
)

with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
Streaming for Low-Latency Apps
# Streaming TTS for voice agents
for audio_chunk in client.text_to_speech.convert_as_stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # Fastest ElevenLabs model (~300ms enterprise)
    text="How can I help you today?",
    output_format="pcm_16000",  # Raw PCM for telephony
):
    # Send to telephony / WebSocket / audio buffer
    send_audio(audio_chunk)
Multilingual TTS (70+ Languages)
# ElevenLabs handles non-English natively.
# language_code enforcement is supported on the Turbo/Flash models.
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_turbo_v2_5",
    text="Bonjour, comment puis-je vous aider aujourd'hui?",  # French
    language_code="fr",
    output_format="mp3_44100_128",
)

# Multilingual v2 auto-detects the language (no language_code needed)
audio_ja = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="こんにちは、本日はどのようなご用件でしょうか?",  # Japanese
)
Conversational AI (Voice Agent Framework)
ElevenLabs includes a full voice agent SDK — not just TTS:
import os

from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

conversation = Conversation(
    client=client,
    agent_id=os.environ["ELEVENLABS_AGENT_ID"],
    requires_auth=False,
    audio_interface=DefaultAudioInterface(),
    callback_agent_response=lambda response: print(f"Agent: {response}"),
    callback_user_transcript=lambda transcript: print(f"User: {transcript}"),
)

conversation.start_session()
# Real-time two-way voice conversation — handles STT + LLM + TTS
Pricing Comparison
Cartesia (2026 pricing):
Free: 1,000 characters/month
Scale: $0.011 per 1,000 characters
Enterprise: Custom (volume discounts)
Example: 10M characters/month → $110/month
ElevenLabs (2026 pricing):
Free: 10,000 chars/month
Starter: $5/month — 30,000 chars ($0.167/K chars)
Creator: $22/month — 100,000 chars ($0.22/K chars)
Pro: $99/month — 500,000 chars ($0.198/K chars)
Scale: $330/month — 2,000,000 chars ($0.165/K chars)
Business: $1,320/month — 10,000,000 chars ($0.132/K chars)
Enterprise: Custom
Example: 10M characters/month → $1,320/month (vs Cartesia $110)
Cost ratio at 10M chars/month: ElevenLabs costs ~12x more. At 100M chars, Cartesia wins by an even larger margin. ElevenLabs' per-character rate improves with volume but never approaches Cartesia's pricing.
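The comparison can be made concrete with a small cost estimator. Rates and quotas are the published figures listed above; the plan-selection logic is a simplification that picks the cheapest plan covering the volume and ignores overage billing:

```python
# Monthly TTS cost estimator using the per-plan figures listed above.
# Simplification (assumption): pick the cheapest plan whose included
# characters cover the volume; real plans meter overages differently.

CARTESIA_PER_1K = 0.011  # Scale tier, $ per 1,000 characters

# ElevenLabs plans: name -> (monthly fee in $, included characters)
ELEVENLABS_PLANS = {
    "starter": (5, 30_000),
    "creator": (22, 100_000),
    "pro": (99, 500_000),
    "scale": (330, 2_000_000),
    "business": (1320, 10_000_000),
}

def cartesia_cost(chars: int) -> float:
    """Pure usage-based pricing."""
    return chars / 1000 * CARTESIA_PER_1K

def elevenlabs_cost(chars: int) -> float:
    """Cheapest plan whose quota covers the volume (ignoring overages)."""
    fits = [fee for fee, quota in ELEVENLABS_PLANS.values() if quota >= chars]
    return min(fits)

chars = 10_000_000
print(f"Cartesia:   ${cartesia_cost(chars):,.2f}/mo")
print(f"ElevenLabs: ${elevenlabs_cost(chars):,.2f}/mo")
```

At 10M characters this reproduces the ~12x ratio quoted above ($110 vs $1,320).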
Latency Benchmarks
Time-to-First-Audio (TTFA) — p50 measurements:
Self-serve tier:
Cartesia Sonic: 199ms ← best for voice agents
ElevenLabs Turbo v2.5: ~450ms
ElevenLabs Flash v2.5: ~350ms
ElevenLabs Standard: ~832ms
Enterprise tier (dedicated infra):
Cartesia Sonic: ~150ms
ElevenLabs Flash v2.5: ~280ms
ElevenLabs Turbo: ~320ms
For context:
<300ms: Natural-feeling real-time conversation
300-600ms: Slight but noticeable delay
>600ms: Clearly perceptible pause, breaks conversational flow
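A helper that buckets a measured TTFA into the perceptual bands above (the thresholds are this section's rules of thumb, not an industry standard):

```python
# Classify a measured time-to-first-audio into the perceptual bands
# described in this section. Thresholds are rules of thumb, not a spec.

def ttfa_band(ttfa_ms: float) -> str:
    if ttfa_ms < 300:
        return "natural"     # real-time conversation feels fluid
    if ttfa_ms <= 600:
        return "noticeable"  # slight but perceptible delay
    return "broken"          # pause clearly disrupts conversational flow

print(ttfa_band(199))  # Cartesia Sonic, self-serve -> natural
print(ttfa_band(450))  # ElevenLabs Turbo v2.5, self-serve -> noticeable
print(ttfa_band(832))  # ElevenLabs standard, self-serve -> broken
```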
Feature Comparison
| Feature | Cartesia | ElevenLabs |
|---|---|---|
| Price (per 1K chars) | ~$0.011 | ~$0.132–$0.22 |
| Best TTFA | 199ms | ~280ms (enterprise Flash) |
| Architecture | SSMs (recurrent) | Transformer |
| Languages | 15 | 70+ |
| Voice cloning speed | 3 seconds | 30 seconds |
| Voice cloning slots | Unlimited | 10–660 (plan-dependent) |
| WebSocket streaming | ✅ | ✅ |
| Conversational AI SDK | ❌ | ✅ full framework |
| AI dubbing | ❌ | ✅ (29 languages) |
| Sound effects | ❌ | ✅ |
| Voice design | ✅ | ✅ |
| Voice library | Limited | Massive (thousands) |
| Speech-to-speech | ❌ | ✅ |
| SOC 2 | ✅ | ✅ |
| HIPAA | Enterprise | Enterprise |
| SDKs | Python, JS/TS, Go | Python, JS/TS |
| Free tier | 1K chars/month | 10K chars/month |
Decision Guide
Choose Cartesia when:
- Building real-time voice agents (phone bots, voice assistants)
- Latency is critical — you need TTFA under 300ms
- Cost efficiency matters — high character volume
- English-primary, or Cartesia's 15 supported languages cover your needs
- API-only, no platform features needed
Choose ElevenLabs when:
- You need 70+ language support
- Building multilingual dubbing pipelines
- Quality and voice variety matter more than latency
- Using the Conversational AI framework (built-in STT + LLM + TTS orchestration)
- Content creation, audiobooks, narration (not just real-time voice agents)
- You want the full audio platform (sound effects, studio)
Integrating with Voice Agent Pipelines
Neither Cartesia nor ElevenLabs exists in isolation — in production, they sit inside a pipeline: speech-to-text (STT) → LLM → TTS. The TTS choice affects the whole pipeline's end-to-end latency, not just the TTS step in isolation.
A production voice agent architecture with Cartesia:
User audio → Deepgram (STT, ~200ms) → streaming transcript
↓
Claude/GPT-4o (LLM, first token ~200ms)
↓
Cartesia Sonic (WebSocket, ~199ms TTFA)
↓
Audio to telephony/browser
The critical optimization is chaining the streams: as the LLM produces tokens, send them directly to Cartesia's WebSocket API without waiting for the full LLM response. Cartesia's architecture handles incomplete sentences and starts generating audio from the first few words. TTS then overlaps LLM generation rather than running after it, so perceived latency approaches STT plus the LLM's time-to-first-token plus the TTS TTFA, instead of the full sequential sum.
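One common way to implement that chaining is to buffer streamed tokens and flush at clause boundaries, so the TTS engine receives prosodically sensible chunks. A sketch — the boundary heuristic here is an illustrative assumption, not a Cartesia requirement (Sonic accepts arbitrary partial text):

```python
# Buffer streamed LLM tokens and flush at clause boundaries before
# handing each chunk to the TTS WebSocket. The punctuation set is an
# illustrative heuristic, not part of either provider's API.

BOUNDARIES = {".", "!", "?", ",", ";", ":"}

def chunk_tokens(tokens):
    """Yield clause-sized text chunks from a stream of LLM tokens."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok and tok[-1] in BOUNDARIES:
            yield "".join(buf)
            buf = []
    if buf:  # flush any trailing partial clause at end of stream
        yield "".join(buf)

chunks = list(chunk_tokens(["Hello", ",", " how", " can", " I", " help", "?"]))
print(chunks)  # ['Hello,', ' how can I help?']
```

Each yielded chunk would become one `transcript` message on the TTS WebSocket, with `continue: True` until the stream ends.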
ElevenLabs' Conversational AI framework handles this chaining internally — it manages the STT, LLM, and TTS orchestration so you don't implement the streaming chain yourself. The trade-off: you configure the agent via ElevenLabs' dashboard and their LLM options, rather than bringing your own LLM choice. For teams that want full control over the LLM (model selection, system prompt, tool use), the raw TTS API with custom streaming is the right path.
Telephony integration is a common production scenario. Both Cartesia and ElevenLabs support Twilio integration — Cartesia via WebSocket with pcm_mulaw 8kHz output format (Twilio's format), ElevenLabs via their Conversational AI telephony integration. LiveKit (the leading WebRTC SFU for voice agents) has official integrations with both providers and handles the audio transport, room management, and participant coordination so you focus on the AI logic rather than WebRTC internals.
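For the Twilio case, the Cartesia output format would look like this — a minimal sketch based on the formats named above (Twilio's media streams expect base64-encoded 8 kHz mu-law frames):

```python
# Cartesia output_format for Twilio media streams: raw 8 kHz mu-law,
# matching Twilio's native telephony format as noted in this section.

twilio_output_format = {
    "container": "raw",
    "encoding": "pcm_mulaw",
    "sample_rate": 8000,
}
```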
For browser-based voice interfaces (not telephony), both APIs work via the Web Audio API. Cartesia's raw PCM output requires client-side decoding; ElevenLabs' MP3 streaming is easier to play directly. The additional decode step with Cartesia adds negligible latency in browser contexts.
Methodology
Latency figures (TTFA) sourced from Cartesia's published benchmark page and ElevenLabs' model documentation as of March 2026. Cartesia's 199ms TTFA is measured for the Sonic 2 model at the self-serve tier from US East; latency varies by region and infrastructure tier. ElevenLabs' enterprise-tier latency (~280–300ms for Flash v2.5) is self-reported; independent measurements vary. Pricing sourced from both providers' published pricing pages as of March 2026. Character-to-cost ratio comparisons assume the standard plan pricing — ElevenLabs enterprise pricing may differ. Language counts: ElevenLabs 70+ as of March 2026, Cartesia 15 as of March 2026; both providers expand language support on a regular cadence. SSM architecture details sourced from Cartesia's technical blog. The Conversational AI framework comparison is based on ElevenLabs' documented framework capabilities; Cartesia's roadmap may include similar orchestration tooling in future releases. LiveKit integration availability verified from both providers' documentation pages as of March 2026.
Browse all voice AI and TTS APIs at APIScout.
Related: ElevenLabs vs OpenAI TTS vs Deepgram Aura · Best Voice and Speech APIs 2026 · OpenAI Realtime API: Building Voice Applications 2026 · How AI Is Transforming API Design and Documentation · Best AI Agent APIs 2026: Building Autonomous Workflows