
ElevenLabs vs Cartesia: Best Voice AI API 2026

APIScout Team

TL;DR

Cartesia for real-time voice agents — 199ms TTFA (vs ElevenLabs' 832ms), roughly 12–19x cheaper at $0.011/1K chars depending on the ElevenLabs plan, and an SSM architecture that makes it the best latency choice for conversational AI. ElevenLabs for quality-first audio production — 70+ languages, the broadest voice library, dubbing, sound effects, and a complete platform for content creators and multilingual apps. The practical split in 2026: Cartesia for AI phone agents and voice assistants; ElevenLabs for narration, dubbing, and premium voice experiences.

Key Takeaways

  • Cartesia pricing: $0.011/1K chars (roughly 12–19x cheaper than ElevenLabs, plan-dependent)
  • ElevenLabs pricing: ~$0.132–$0.22/1K chars (effective rate on self-serve plans)
  • Cartesia TTFA: 199ms (Sonic model, self-serve tier)
  • ElevenLabs TTFA: 832ms (self-serve tier), ~300ms on enterprise tier
  • Architecture: Cartesia uses State Space Models (SSMs); ElevenLabs uses transformers
  • Language support: ElevenLabs 70+ languages; Cartesia 15 languages
  • Voice cloning: Cartesia requires 3 seconds; ElevenLabs requires 30 seconds
  • Platform scope: ElevenLabs is full audio platform; Cartesia is API-only TTS

Why Voice AI Latency Matters

For voice agents (AI phone calls, real-time assistants, customer support bots), latency is the bottleneck. A 200ms TTFA feels like a natural conversation. An 800ms TTFA creates an awkward pause that feels broken.

User speaks → STT transcription → LLM inference → TTS → User hears response

Full turn latency budget:
  STT:    ~200ms (Deepgram/Whisper real-time)
  LLM:    ~400ms (streaming first token)
  TTS:    target <300ms TTFA
  Total:  ~900ms for natural conversation

Cartesia TTFA:   199ms → Total ~799ms (below 1s threshold)
ElevenLabs TTFA: 832ms → Total ~1432ms (above 1s, feels slow)
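The budget above is easy to sanity-check with a short script — the component figures below are this article's illustrative numbers, not fresh measurements:

```python
# End-to-end turn latency = STT + LLM first token + TTS time-to-first-audio.
STT_MS = 200
LLM_FIRST_TOKEN_MS = 400
TTS_TTFA_MS = {"cartesia_sonic": 199, "elevenlabs_self_serve": 832}

for provider, ttfa in TTS_TTFA_MS.items():
    total = STT_MS + LLM_FIRST_TOKEN_MS + ttfa
    feel = "natural" if total < 1000 else "feels slow"
    print(f"{provider}: {total}ms -> {feel}")
```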

This latency gap is why Cartesia has become a common pick for new voice agent deployments in 2026 — the advantage translates directly into better conversation quality.


Cartesia

Architecture: State Space Models

Cartesia's Sonic model is built on State Space Models (SSMs) — a fundamentally different architecture from transformer-based TTS. SSMs maintain a compact recurrent state that updates incrementally as text arrives, enabling streaming synthesis before the full sentence is processed.
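To see why this matters for streaming, here is a toy scalar recurrence — purely illustrative, not Cartesia's actual Sonic model — showing that each incoming token costs constant work against a fixed-size state:

```python
def ssm_step(h: float, x: float, a: float = 0.9, b: float = 0.1) -> float:
    """One state-space recurrence step: h' = a*h + b*x.
    Work and memory are constant per token, so synthesis can begin
    as soon as the first tokens arrive — no full-sentence attention pass."""
    return a * h + b * x

h = 0.0
for x in [0.5, -0.2, 0.8]:  # stand-ins for incoming token features
    h = ssm_step(h, x)
    # an audio frame could be emitted from h at every step
```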

# Cartesia Python SDK
import os

import pyaudio
from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

# Stream audio for low-latency playback
p = pyaudio.PyAudio()
rate = 44100
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=rate, output=True)

# Generate and stream immediately
output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": rate,
}

for output in client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, how can I help you today?",
    voice={"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
    output_format=output_format,
    stream=True,
):
    buffer = output.get("audio")
    if buffer:
        stream.write(buffer)

stream.stop_stream()
stream.close()
p.terminate()

WebSocket API for Real-Time Agents

For voice agents, use the WebSocket API to send text chunks as they arrive from the LLM:

import asyncio
import base64
import json
import os

import websockets

async def voice_agent_response(llm_text_stream, voice_id: str):
    """Stream LLM output directly to Cartesia for ultra-low latency."""
    uri = "wss://api.cartesia.ai/tts/websocket"
    headers = {
        "Cartesia-Version": "2024-06-10",
        "X-API-Key": os.environ["CARTESIA_API_KEY"],
    }

    async with websockets.connect(uri, additional_headers=headers) as ws:
        context_id = "ctx-001"

        # Send text chunks as they arrive from LLM streaming
        async for text_chunk in llm_text_stream:
            await ws.send(json.dumps({
                "context_id": context_id,
                "model_id": "sonic-2",
                "transcript": text_chunk,
                "voice": {"mode": "id", "id": voice_id},
                "output_format": {
                    "container": "raw",
                    "encoding": "pcm_f32le",
                    "sample_rate": 16000,
                },
                "continue": True,  # More chunks coming
            }))

        # Signal end of utterance
        await ws.send(json.dumps({
            "context_id": context_id,
            "transcript": "",
            "continue": False,
        }))

        # Receive audio chunks and play/send to telephony
        async for message in ws:
            data = json.loads(message)
            if audio := data.get("audio"):
                yield base64.b64decode(audio)

Voice Cloning (3 seconds of audio)

# Clone a voice from 3 seconds of audio
import os

import requests

response = requests.post(
    "https://api.cartesia.ai/voices/clone/clip",
    headers={
        "Cartesia-Version": "2024-06-10",
        "X-API-Key": os.environ["CARTESIA_API_KEY"],
    },
    files={"clip": open("sample.wav", "rb")},
    data={"name": "Custom Voice"},
)
voice_id = response.json()["id"]

# Use immediately in generation
for output in client.tts.sse(
    model_id="sonic-2",
    transcript="Your cloned voice is ready.",
    voice={"mode": "id", "id": voice_id},
    output_format={"container": "mp3", "bit_rate": 128000, "sample_rate": 44100},
):
    pass  # Process audio chunks

ElevenLabs

The Full Audio Platform

ElevenLabs is more than TTS — it's a complete audio production platform. Beyond the API, it includes:

  • Conversational AI: Pre-built voice agent framework with turn detection, interruption handling, and telephony integrations
  • AI Dubbing: Automatically dub content into 29 languages while preserving the original speaker's voice
  • Text to Sound Effects: Generate custom SFX from text descriptions
  • Studio: Long-form audio editor for narration and audiobooks
  • ElevenReader: iOS/Android app that reads any content aloud

For developers, the API covers TTS, speech-to-speech, voice cloning, and the Conversational AI framework.

TTS API

import os

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Basic TTS request
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # George — deep British narrator
    model_id="eleven_turbo_v2_5",       # Best latency/quality balance
    text="The quick brown fox jumps over the lazy dog.",
    output_format="mp3_44100_128",
    voice_settings={
        "stability": 0.5,
        "similarity_boost": 0.75,
        "style": 0.0,
        "use_speaker_boost": True,
    },
)

with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

Streaming for Low-Latency Apps

# Streaming TTS for voice agents
for audio_chunk in client.text_to_speech.convert_as_stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # Fastest ElevenLabs model (~300ms enterprise)
    text="How can I help you today?",
    output_format="pcm_16000",  # Raw PCM for telephony
):
    # Send to telephony / WebSocket / audio buffer
    send_audio(audio_chunk)

Multilingual TTS (70+ Languages)

# ElevenLabs handles non-English natively
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="Bonjour, comment puis-je vous aider aujourd'hui?",  # French
    language_code="fr",
    output_format="mp3_44100_128",
)

# Auto-detect language (no language_code needed)
audio_ja = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="こんにちは、本日はどのようなご用件でしょうか?",  # Japanese
)

Conversational AI (Voice Agent Framework)

ElevenLabs includes a full voice agent SDK — not just TTS:

import os

from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

conversation = Conversation(
    client=client,
    agent_id=os.environ["ELEVENLABS_AGENT_ID"],
    requires_auth=False,
    audio_interface=DefaultAudioInterface(),
    callback_agent_response=lambda response: print(f"Agent: {response}"),
    callback_user_transcript=lambda transcript: print(f"User: {transcript}"),
)

conversation.start_session()
# Real-time two-way voice conversation — handles STT + LLM + TTS

Pricing Comparison

Cartesia (2026 pricing):
  Free:        1,000 characters/month
  Scale:       $0.011 per 1,000 characters
  Enterprise:  Custom (volume discounts)

  Example: 10M characters/month → $110/month

ElevenLabs (2026 pricing):
  Free:        10,000 chars/month
  Starter:     $5/month — 30,000 chars ($0.167/K chars)
  Creator:     $22/month — 100,000 chars ($0.22/K chars)
  Pro:         $99/month — 500,000 chars ($0.198/K chars)
  Scale:       $330/month — 2,000,000 chars ($0.165/K chars)
  Business:    $1,320/month — 10,000,000 chars ($0.132/K chars)
  Enterprise:  Custom

  Example: 10M characters/month → $1,320/month (vs Cartesia $110)

Cost ratio at 10M chars/month: ElevenLabs costs ~12x more. At 100M chars, Cartesia wins by an even larger margin. ElevenLabs' per-character rate improves with volume but never approaches Cartesia's pricing.


Latency Benchmarks

Time-to-First-Audio (TTFA) — p50 measurements:

Self-serve tier:
  Cartesia Sonic:          199ms ← best for voice agents
  ElevenLabs Turbo v2.5:  ~450ms
  ElevenLabs Flash v2.5:  ~350ms
  ElevenLabs Standard:    ~832ms

Enterprise tier (dedicated infra):
  Cartesia Sonic:          ~150ms
  ElevenLabs Flash v2.5:  ~280ms
  ElevenLabs Turbo:        ~320ms

For context:
  <300ms:   Natural-feeling real-time conversation
  300-600ms: Slight but noticeable delay
  >600ms:   Clearly perceptible pause, breaks conversational flow
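Those perceptual bands are straightforward to encode when selecting a model programmatically — the cutoffs below simply restate the context table above:

```python
def latency_feel(ttfa_ms: int) -> str:
    """Map a TTFA measurement onto the perceptual bands above."""
    if ttfa_ms < 300:
        return "natural"
    if ttfa_ms <= 600:
        return "noticeable delay"
    return "breaks conversational flow"

print(latency_feel(199))  # Cartesia Sonic, self-serve
print(latency_feel(832))  # ElevenLabs Standard, self-serve
```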

Feature Comparison

Feature                 Cartesia            ElevenLabs
Price (per 1K chars)    ~$0.011             ~$0.132–$0.22
Best TTFA               199ms               ~280ms (enterprise Flash)
Architecture            SSMs (recurrent)    Transformer
Languages               15                  70+
Voice cloning speed     3 seconds           30 seconds
Voice cloning slots     Unlimited           10–660 (plan-dependent)
WebSocket streaming     ✅                  ✅
Conversational AI SDK   ❌                  ✅ (full framework)
AI dubbing              ❌                  ✅ (29 languages)
Sound effects           ❌                  ✅
Voice design            ❌                  ✅
Voice library           Limited             Massive (thousands)
Speech-to-speech        ❌                  ✅
SOC 2                   ✅                  ✅
HIPAA                   Enterprise          Enterprise
SDKs                    Python, JS/TS, Go   Python, JS/TS
Free tier               1K chars/month      10K chars/month

Decision Guide

Choose Cartesia when:

  • Building real-time voice agents (phone bots, voice assistants)
  • Latency is critical — you need TTFA under 300ms
  • Cost efficiency matters — high character volume
  • English-primary use, or Cartesia's 15 supported languages cover your needs
  • API-only, no platform features needed

Choose ElevenLabs when:

  • You need 70+ language support
  • Building multilingual dubbing pipelines
  • Quality and voice variety matter more than latency
  • Using the Conversational AI framework (built-in STT + LLM + TTS orchestration)
  • Content creation, audiobooks, narration (not just real-time voice agents)
  • You want the full audio platform (sound effects, studio)

Integrating with Voice Agent Pipelines

Neither Cartesia nor ElevenLabs exists in isolation — in production, they sit inside a pipeline: speech-to-text (STT) → LLM → TTS. The TTS choice affects the whole pipeline's end-to-end latency, not just the TTS step in isolation.

A production voice agent architecture with Cartesia:

User audio → Deepgram (STT, ~200ms) → streaming transcript
                              ↓
                 Claude/GPT-4o (LLM, first token ~200ms)
                              ↓
               Cartesia Sonic (WebSocket, ~199ms TTFA)
                              ↓
                     Audio to telephony/browser

The critical optimization is chaining the streams: as the LLM produces tokens, send them directly to Cartesia's WebSocket API without waiting for the full LLM response. Cartesia's architecture handles incomplete sentences and starts generating audio from the first few words. This cuts perceived latency to roughly the LLM's time-to-first-token plus the TTS TTFA — the TTS step runs in parallel with LLM generation rather than after it.

ElevenLabs' Conversational AI framework handles this chaining internally — it manages the STT, LLM, and TTS orchestration so you don't implement the streaming chain yourself. The trade-off: you configure the agent via ElevenLabs' dashboard and their LLM options, rather than bringing your own LLM choice. For teams that want full control over the LLM (model selection, system prompt, tool use), the raw TTS API with custom streaming is the right path.

Telephony integration is a common production scenario. Both Cartesia and ElevenLabs support Twilio integration — Cartesia via WebSocket with pcm_mulaw 8kHz output format (Twilio's format), ElevenLabs via their Conversational AI telephony integration. LiveKit (the leading WebRTC SFU for voice agents) has official integrations with both providers and handles the audio transport, room management, and participant coordination so you focus on the AI logic rather than WebRTC internals.
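For Twilio, that means requesting μ-law 8kHz from Cartesia. The dict below follows the `output_format` shape used in the SSE example earlier in this article; verify the exact field values against Cartesia's current docs:

```python
# Twilio media streams expect 8kHz μ-law audio.
twilio_output_format = {
    "container": "raw",
    "encoding": "pcm_mulaw",
    "sample_rate": 8000,
}
```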

For browser-based voice interfaces (not telephony), both APIs work via the Web Audio API. Cartesia's raw PCM output requires client-side decoding; ElevenLabs' MP3 streaming is easier to play directly. The additional decode step with Cartesia adds negligible latency in browser contexts.

Methodology

Latency figures (TTFA) sourced from Cartesia's published benchmark page and ElevenLabs' model documentation as of March 2026. Cartesia's 199ms TTFA is measured for the Sonic 2 model at the self-serve tier from US East; latency varies by region and infrastructure tier. ElevenLabs' enterprise-tier latency (~280–300ms for Flash v2.5) is self-reported; independent measurements vary. Pricing sourced from both providers' published pricing pages as of March 2026. Character-to-cost ratio comparisons assume the standard plan pricing — ElevenLabs enterprise pricing may differ. Language counts: ElevenLabs 70+ as of March 2026, Cartesia 15 as of March 2026; both providers expand language support on a regular cadence. SSM architecture details sourced from Cartesia's technical blog. The Conversational AI framework comparison is based on ElevenLabs' documented framework capabilities; Cartesia's roadmap may include similar orchestration tooling in future releases. LiveKit integration availability verified from both providers' documentation pages as of March 2026.


Browse all voice AI and TTS APIs at APIScout.

Related: ElevenLabs vs OpenAI TTS vs Deepgram Aura · Best Voice and Speech APIs 2026 · OpenAI Realtime API: Building Voice Applications 2026 · How AI Is Transforming API Design and Documentation · Best AI Agent APIs 2026: Building Autonomous Workflows

The API Integration Checklist (Free PDF)

Step-by-step checklist: auth setup, rate limit handling, error codes, SDK evaluation, and pricing comparison for 50+ APIs. Used by 200+ developers.

Join 200+ developers. Unsubscribe in one click.