OpenAI Realtime API: Building Voice Applications 2026
TL;DR
OpenAI's Realtime API lets you build voice applications where GPT-4o listens, thinks, and speaks — with sub-200ms latency. Unlike the older STT → LLM → TTS pipeline (which had 2-4 second lag), the Realtime API is end-to-end: raw audio in, GPT-4o processes it natively, audio out. It supports function calling mid-conversation, interruption handling, and multiple voice personas. As of 2026, this is the fastest way to build AI voice assistants. Here's everything you need to ship one.
Key Takeaways
- Latency: ~200ms end-to-end vs 2-4 seconds for STT+LLM+TTS pipeline
- Protocol: WebSocket (server-to-server) or WebRTC (browser-direct)
- Modalities: audio+text simultaneously — transcripts included with audio
- Function calling: works mid-conversation, model pauses and resumes speech
- Cost: ~$0.10/min audio input + $0.20/min audio output, plus text tokens billed separately — significantly more expensive than text-only
- Voices: alloy, ash, ballad, coral, echo, sage, shimmer, verse (8 options)
Two Connection Modes
Mode 1: WebSocket (Server-to-Server)
Browser → Your Server → OpenAI Realtime API
Your server relays audio streams
Full control, works with any backend
Mode 2: WebRTC (Browser-Direct)
Browser → OpenAI Realtime API directly
Ephemeral tokens (short-lived, roughly one-minute TTL)
Lower latency, less server infrastructure
Most production apps use WebSocket mode for full server-side control and logging. WebRTC mode still keeps the long-lived API key on the server (the browser only ever sees a short-lived ephemeral token) and is the better fit for latency-sensitive browser apps with minimal backend.
WebSocket: Server-Side Setup
// server/realtime.ts — WebSocket relay server:
import WebSocket, { WebSocketServer } from 'ws';
import type { IncomingMessage } from 'http';
const OPENAI_API_KEY = process.env.OPENAI_API_KEY!;
const OPENAI_REALTIME_URL = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview';
export function createRealtimeRelay(wss: WebSocketServer) {
wss.on('connection', (clientWs: WebSocket, req: IncomingMessage) => {
console.log('Client connected');
// Connect to OpenAI Realtime API:
const openaiWs = new WebSocket(OPENAI_REALTIME_URL, {
headers: {
Authorization: `Bearer ${OPENAI_API_KEY}`,
'OpenAI-Beta': 'realtime=v1',
},
});
// Forward client → OpenAI:
clientWs.on('message', (message: Buffer) => {
if (openaiWs.readyState === WebSocket.OPEN) {
openaiWs.send(message);
}
});
// Forward OpenAI → client:
openaiWs.on('message', (message: Buffer) => {
if (clientWs.readyState === WebSocket.OPEN) {
clientWs.send(message);
}
});
// Handle disconnections:
clientWs.on('close', () => openaiWs.close());
openaiWs.on('close', () => clientWs.close());
openaiWs.on('open', () => {
console.log('Connected to OpenAI Realtime');
});
openaiWs.on('error', (err) => {
console.error('OpenAI WebSocket error:', err);
clientWs.close();
});
});
}
// server.ts — Next.js route handlers can't hold a WebSocket open,
// so run a small custom server alongside the app:
import { createServer } from 'http';
import { WebSocketServer } from 'ws';
import { createRealtimeRelay } from '@/server/realtime';
const server = createServer();
const wss = new WebSocketServer({ server, path: '/api/realtime' });
createRealtimeRelay(wss);
server.listen(3001);
Session Configuration
After connecting, send a session.update event to configure the session:
// Send immediately after WebSocket opens:
const sessionConfig = {
type: 'session.update',
session: {
modalities: ['audio', 'text'], // Get both audio + transcript
instructions: `You are a helpful voice assistant.
Keep responses concise and conversational.
Do not use markdown in your responses.`,
voice: 'alloy', // alloy, ash, ballad, coral, echo, sage, shimmer, verse
input_audio_format: 'pcm16', // 24kHz, 16-bit, mono PCM
output_audio_format: 'pcm16',
input_audio_transcription: {
model: 'whisper-1', // Get text transcript of user speech
},
turn_detection: {
type: 'server_vad', // Server-side Voice Activity Detection
threshold: 0.5, // Sensitivity (0-1)
prefix_padding_ms: 300, // Audio before speech detected
silence_duration_ms: 500, // Silence before model responds
},
tools: [
{
type: 'function',
name: 'get_weather',
description: 'Get weather for a location',
parameters: {
type: 'object',
properties: {
location: { type: 'string', description: 'City name' },
},
required: ['location'],
},
},
],
tool_choice: 'auto',
temperature: 0.8,
max_response_output_tokens: 'inf', // or a number
},
};
ws.send(JSON.stringify(sessionConfig));
Browser: Capturing and Streaming Audio
// client/VoiceChat.tsx
'use client';
import { useEffect, useRef, useState } from 'react';
export function VoiceChat() {
const wsRef = useRef<WebSocket | null>(null);
const audioContextRef = useRef<AudioContext | null>(null);
const processorRef = useRef<ScriptProcessorNode | null>(null);
const [isConnected, setIsConnected] = useState(false);
const [transcript, setTranscript] = useState('');
const connect = async () => {
// Get microphone access:
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// Connect to relay server:
const ws = new WebSocket('ws://localhost:3001/api/realtime');
wsRef.current = ws;
ws.onopen = () => {
setIsConnected(true);
// Set up audio capture (24kHz PCM16):
const audioContext = new AudioContext({ sampleRate: 24000 });
audioContextRef.current = audioContext;
const source = audioContext.createMediaStreamSource(stream);
// ScriptProcessorNode captures raw PCM. It is deprecated in favor of
// AudioWorklet, but remains the simplest cross-browser option:
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processorRef.current = processor;
processor.onaudioprocess = (e) => {
if (ws.readyState !== WebSocket.OPEN) return;
const inputData = e.inputBuffer.getChannelData(0);
// Convert float32 → int16 PCM:
const pcm16 = float32ToInt16(inputData);
// Base64 encode and send:
const base64Audio = btoa(
String.fromCharCode(...new Uint8Array(pcm16.buffer))
);
ws.send(JSON.stringify({
type: 'input_audio_buffer.append',
audio: base64Audio,
}));
};
source.connect(processor);
processor.connect(audioContext.destination);
};
// Handle incoming events:
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
handleServerEvent(data);
};
};
const handleServerEvent = (event: Record<string, unknown>) => {
switch (event.type) {
case 'conversation.item.input_audio_transcription.completed':
// User speech transcribed:
setTranscript(`You: ${event.transcript}`);
break;
case 'response.audio.delta':
// Incremental audio chunk from model — play it:
playAudioChunk(event.delta as string);
break;
case 'response.audio_transcript.delta':
// Incremental transcript of the model's speech (audio responses stream
// audio_transcript deltas rather than text deltas):
setTranscript((prev) => prev + (event.delta as string));
break;
case 'response.function_call_arguments.done':
// Model wants to call a function:
handleFunctionCall(
event.name as string,
JSON.parse(event.arguments as string),
event.call_id as string
);
break;
case 'response.done':
console.log('Response complete');
break;
}
};
const handleFunctionCall = async (name: string, args: unknown, callId: string) => {
let result: unknown;
if (name === 'get_weather') {
result = { temperature: 22, condition: 'sunny', location: (args as any).location };
}
// Return function result to model:
wsRef.current?.send(JSON.stringify({
type: 'conversation.item.create',
item: {
type: 'function_call_output',
call_id: callId,
output: JSON.stringify(result),
},
}));
// Tell model to continue responding:
wsRef.current?.send(JSON.stringify({ type: 'response.create' }));
};
// Audio playback queue, kept in refs so the queue and playing flag
// survive React re-renders triggered by setTranscript:
const audioQueue = useRef<AudioBuffer[]>([]).current;
const isPlayingRef = useRef(false);
const playAudioChunk = (base64Audio: string) => {
if (!audioContextRef.current) return;
const binary = atob(base64Audio);
const bytes = new Uint8Array(binary.length);
for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
// Decode PCM16 to float32:
const int16 = new Int16Array(bytes.buffer);
const float32 = new Float32Array(int16.length);
for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;
const buffer = audioContextRef.current.createBuffer(1, float32.length, 24000);
buffer.getChannelData(0).set(float32);
audioQueue.push(buffer);
if (!isPlayingRef.current) playNext();
};
const playNext = () => {
if (!audioContextRef.current || audioQueue.length === 0) {
isPlayingRef.current = false;
return;
}
isPlayingRef.current = true;
const buffer = audioQueue.shift()!;
const source = audioContextRef.current.createBufferSource();
source.buffer = buffer;
source.connect(audioContextRef.current.destination);
source.onended = playNext;
source.start();
};
return (
<div className="flex flex-col items-center gap-4 p-8">
<h1 className="text-2xl font-bold">Voice Assistant</h1>
{!isConnected ? (
<button
onClick={connect}
className="px-6 py-3 bg-black text-white rounded-lg"
>
Start Conversation
</button>
) : (
<div className="text-center">
<div className="w-4 h-4 bg-green-500 rounded-full animate-pulse mx-auto mb-2" />
<p className="text-sm text-gray-500">Listening...</p>
<p className="mt-4 max-w-md">{transcript}</p>
</div>
)}
</div>
);
}
function float32ToInt16(buffer: Float32Array): Int16Array {
const int16 = new Int16Array(buffer.length);
for (let i = 0; i < buffer.length; i++) {
const s = Math.max(-1, Math.min(1, buffer[i]));
int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
}
return int16;
}
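A side note on the capture path: the inline `btoa(String.fromCharCode(...bytes))` pattern works for 4096-sample buffers, but spreading a large byte array into one call can overflow the JavaScript call stack on bigger chunks. A chunked encoder avoids that; this is a sketch, and the helper name is ours:

```typescript
// Encode an Int16Array of PCM samples to base64 without spreading the whole
// byte array into a single String.fromCharCode call (stack-overflow risk).
function base64EncodePcm(int16: Int16Array): string {
  const bytes = new Uint8Array(int16.buffer, int16.byteOffset, int16.byteLength);
  let binary = '';
  const CHUNK = 0x8000; // 32K bytes per call keeps the argument list safe
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}
```

Drop it in wherever the inline conversion appears; the output is identical for small buffers.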
Handling Interruptions
One of the Realtime API's key features: users can interrupt the model mid-sentence:
// When VAD detects user started speaking during model response:
// Server automatically sends: { type: 'input_audio_buffer.speech_started' }
// The model's audio is cut off — you should also stop playback:
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === 'input_audio_buffer.speech_started') {
// User interrupted — stop current audio playback:
audioQueue.length = 0; // Clear queued chunks
if (audioContextRef.current) {
// Closing the context halts all playing sources instantly, but it also
// tears down the mic capture graph. Re-create the source/processor
// wiring on the new context afterwards:
audioContextRef.current.close();
audioContextRef.current = new AudioContext({ sampleRate: 24000 });
}
}
};
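Beyond stopping local playback, the API also lets the client tell the server how much of the assistant's audio the user actually heard, so the model's conversation context matches reality. A minimal builder for that client event; the helper name is ours, and the event shape is assumed from the beta Realtime API docs (`conversation.item.truncate` with `item_id`, `content_index`, `audio_end_ms`):

```typescript
// Build a conversation.item.truncate client event. playedMs should be the
// milliseconds of assistant audio actually played before the interruption.
// (Event field names assumed from the beta Realtime API documentation.)
function buildTruncateEvent(itemId: string, playedMs: number) {
  return {
    type: 'conversation.item.truncate',
    item_id: itemId,
    content_index: 0,
    audio_end_ms: Math.max(0, Math.round(playedMs)),
  };
}
```

Send it over the same WebSocket right after clearing the playback queue; tracking the current item ID and elapsed playback time is left to your playback code.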
WebRTC Mode (Browser-Direct)
For lower-latency browser apps, WebRTC bypasses your server:
// Step 1: Get an ephemeral token from your server:
// GET /api/realtime/token → { token: "..." }
// server: app/api/realtime/token/route.ts
export async function GET() {
const res = await fetch('https://api.openai.com/v1/realtime/sessions', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'gpt-4o-realtime-preview',
voice: 'alloy',
}),
});
const session = await res.json();
return Response.json({ token: session.client_secret.value });
}
// Step 2: Use ephemeral token in browser for WebRTC:
const { token } = await fetch('/api/realtime/token').then((r) => r.json());
const pc = new RTCPeerConnection();
// Play the model's audio when its remote track arrives. Without this
// handler, the model's responses never reach the speakers:
pc.ontrack = (e) => {
const audioEl = document.createElement('audio');
audioEl.autoplay = true;
audioEl.srcObject = e.streams[0];
document.body.appendChild(audioEl);
};
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(stream.getTracks()[0], stream);
const dc = pc.createDataChannel('oai-events'); // For sending/receiving JSON events
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
const sdpResponse = await fetch(
'https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
{
method: 'POST',
headers: {
Authorization: `Bearer ${token}`,
'Content-Type': 'application/sdp',
},
body: offer.sdp,
}
);
const answer = { type: 'answer' as const, sdp: await sdpResponse.text() };
await pc.setRemoteDescription(answer);
// Audio flows directly browser ↔ OpenAI
Pricing Reality
Realtime API (gpt-4o-realtime-preview):
Audio input: $0.10/min → 10 minutes of user audio = $1.00
Audio output: $0.20/min → 10 minutes of model speech = $2.00
Text tokens: $2.50/1M input, $10/1M output
vs. Traditional STT + LLM + TTS pipeline:
Whisper (STT): $0.006/min
GPT-4o: ~$0.01-0.05 per conversation turn
OpenAI TTS: $0.015/1K chars (~$0.01-0.02/min at a typical ~150 wpm speaking pace)
Total pipeline: roughly $0.03-0.05/min
Realtime is roughly 5-10x more expensive than the pipeline approach. The tradeoff is ~200ms latency versus 2-4 seconds; for voice assistants where responses need to feel conversational, the Realtime API is worth the cost.
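The arithmetic above is easy to fold into a helper for capacity planning. This is a sketch using the per-minute rates quoted here; the function name is ours, and text-token cost is deliberately excluded:

```typescript
// Estimate the audio cost of one Realtime call from the quoted rates
// ($0.10/min audio input, $0.20/min audio output).
function estimateCallCostUsd(inputMinutes: number, outputMinutes: number): number {
  const AUDIO_IN_PER_MIN = 0.10;
  const AUDIO_OUT_PER_MIN = 0.20;
  return inputMinutes * AUDIO_IN_PER_MIN + outputMinutes * AUDIO_OUT_PER_MIN;
}
```

A 10-minute call split evenly between listening and speaking comes to $1.50; the $1.00/$2.00 figures above are the per-direction worst cases for 10 full minutes each way.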
When to Use the Realtime API
Use Realtime API if:
- You're building a voice assistant or voice chat interface
- Sub-second response time is critical to the UX
- Interruption handling matters (users will cut off responses)
- You need mid-conversation function calling
Stick with STT+LLM+TTS if:
- Cost is the primary constraint (5-10x cheaper)
- Your use case tolerates 2-4 second delays (voice memos, dictation)
- You need more control over each step (custom STT, specific TTS voices)
- You're already invested in a Deepgram/ElevenLabs/Whisper pipeline
Error Handling and Reconnection Logic
Production voice applications need robust reconnection logic as a first-class concern, not an afterthought. WebSocket connections drop — mobile networks are particularly unreliable, and even stable desktop connections will lose the relay occasionally. The Realtime API does not support session resumption: when a connection drops and you reconnect, you start a fresh session with no memory of the previous conversation. This means your application needs to handle mid-conversation drops gracefully, preserving the transcript on the client and showing clear UI feedback while reconnecting.
The standard pattern is exponential backoff: start with a 1-second delay, double on each failure, cap at 30 seconds. This prevents thundering-herd reconnection storms when a relay server restarts, and gives the network time to stabilize before retrying:
// Resolve when the socket opens; reject on a connection-level error:
function waitForOpen(ws: WebSocket): Promise<void> {
return new Promise((resolve, reject) => {
ws.addEventListener('open', () => resolve(), { once: true });
ws.addEventListener('error', () => reject(new Error('connection failed')), { once: true });
});
}
async function connectWithRetry(maxAttempts = 5): Promise<WebSocket> {
for (let attempt = 0; attempt < maxAttempts; attempt++) {
try {
const ws = new WebSocket(RELAY_URL);
await waitForOpen(ws);
return ws;
} catch {
// Exponential backoff: 1s, 2s, 4s, ... capped at 30s
const delay = Math.min(1000 * 2 ** attempt, 30000);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
throw new Error('Failed to connect after retries');
}
Handle the session.expired event explicitly — sessions have a maximum duration and will terminate, not just drop silently. When you receive this event, show the user a "Session ended — tap to start a new conversation" message rather than attempting to reconnect indefinitely. Store the transcript in component state so it persists across reconnections, giving users a record of what was said before the drop. A "Reconnecting..." status indicator with a spinner beats a frozen UI while the retry logic runs in the background.
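That UI policy (retry on drops, stop on expiry) can be captured in a small pure function. A sketch; the status names are ours:

```typescript
type UiStatus = 'connected' | 'reconnecting' | 'ended';

// Map connection-level events to the status the user should see.
// session.expired is terminal: prompt for a new conversation, don't retry.
function nextStatus(
  current: UiStatus,
  event: 'open' | 'close' | 'session.expired'
): UiStatus {
  if (current === 'ended') return 'ended'; // stay ended until the user restarts
  switch (event) {
    case 'session.expired':
      return 'ended';
    case 'close':
      return 'reconnecting';
    case 'open':
      return 'connected';
  }
}
```

Drive your spinner and "Session ended" message from this status rather than from raw WebSocket readyState, which flaps during retries.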
Scaling Realtime API Applications
The WebSocket relay architecture that keeps your API key safe on the server also creates the central scaling challenge: each active voice session requires a persistent, stateful WebSocket connection on your relay server. Unlike HTTP, you can't route WebSocket traffic through a standard round-robin load balancer — the relay that opened the connection to OpenAI must continue handling that session for its entire duration. At 100 concurrent users, a single relay server handles this easily. At 1,000+ concurrent users, you need to design for it explicitly.
The two main approaches are managed relay layers and sticky-session load balancing. Cloudflare Durable Objects provide a natural fit: each Durable Object holds one session's state and its connection to OpenAI, and Cloudflare handles routing subsequent messages to the correct object. AWS API Gateway's WebSocket APIs offer similar behavior with AWS infrastructure. Both remove the scaling constraint from your own servers. For teams already running their own infrastructure, sticky sessions (where the load balancer consistently routes a client to the same backend server based on a session cookie or connection ID) combined with Redis Pub/Sub for cross-server message routing is the standard pattern — messages from OpenAI arrive at the relay server, get published to Redis, and any server that needs them can subscribe.
Monitor concurrent active sessions as your primary scaling metric — not requests per second, which is an HTTP concept. Before designing for high concurrency, check OpenAI's current documentation for concurrent session rate limits, since those limits impose a ceiling regardless of your relay architecture. At high scale, the relay connection count and the OpenAI session limit will both constrain you.
Evaluating and Selecting Voices
The Realtime API offers eight voices as of early 2026: alloy, ash, ballad, coral, echo, sage, shimmer, and verse. The difference between them is not just aesthetic — tone affects user perception of competence, warmth, and authority in ways that matter for specific applications.
Alloy is neutral and professional, making it the safe default for business applications. Coral and sage read as warmer and more conversational, better for consumer-facing assistants where approachability matters. Ballad and verse have more expressive range, which works for entertainment or emotional support use cases but can feel mismatched in technical or transactional contexts. Echo and shimmer are higher-pitched and tend to read as more energetic.
The practical advice is to test all eight voices with a sample of your actual use cases — not synthetic demos — before committing. Record 2-3 representative conversations with each voice, then have your team (and ideally a small group of target users) rate them blind on clarity, trustworthiness, and appropriateness for the context. Voice selection is one of the highest-ROI product decisions in a voice application; users form strong opinions quickly and consistently, and the wrong voice creates friction that no amount of backend optimization can fix.
One important technical note: voice cannot be changed mid-session. Set the voice parameter in session.update when the connection opens, and it persists for the full session duration. If your application serves multiple user segments that would benefit from different voices (formal enterprise users vs. casual consumer users), handle voice selection server-side at session initialization based on user profile or context.
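Since the voice is fixed for the session, segment-based selection has to happen where the session is configured. A sketch; the segment names and the voice mapping are illustrative, not recommendations:

```typescript
type UserSegment = 'enterprise' | 'consumer' | 'support';

// Pick a session voice per user segment at session-creation time.
// The mapping below is illustrative; validate choices with blind testing.
function voiceForSegment(segment: UserSegment): string {
  switch (segment) {
    case 'enterprise': return 'alloy'; // neutral, professional
    case 'consumer':   return 'coral'; // warmer, conversational
    case 'support':    return 'sage';  // calm, approachable
  }
}
```

Pass the result into the session body when creating the session (or in the initial session.update), then leave it alone for the session's lifetime.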
Testing Voice Applications Without Speaking
Automated testing for voice applications requires a different approach than testing standard APIs. You can't reliably run CI tests that require a human to speak into a microphone, but you also can't ship a voice product without systematic testing.
The most effective approach is to separate the layers: test the WebSocket event handling and function call dispatch using pre-recorded audio clips as fixtures, and test the audio capture/playback UI with WebSocket mocks. For the event-handling layer, you can feed base64-encoded PCM audio from a file into your input_audio_buffer.append handler and simulate the full conversation state machine without any live audio.
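As a concrete sketch of that layer separation, the conversation state machine can be written as a pure reducer and driven with scripted events in tests, with no audio and no network. The reducer below is a test harness of ours; the event names follow the API's transcript and function-call events:

```typescript
interface ConvoState {
  transcript: string;
  functionCalls: string[];
}

// Pure reducer over server events: trivially testable with scripted fixtures.
function reduceEvent(
  state: ConvoState,
  event: { type: string; delta?: string; name?: string }
): ConvoState {
  switch (event.type) {
    case 'response.audio_transcript.delta':
      return { ...state, transcript: state.transcript + (event.delta ?? '') };
    case 'response.function_call_arguments.done':
      return { ...state, functionCalls: [...state.functionCalls, event.name ?? ''] };
    default:
      return state;
  }
}
```

In a test, replay a scripted array of events through `reduceEvent` with `Array.prototype.reduce` and assert on the final state; the production handler can delegate to the same reducer.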
Maintain a library of test audio clips covering: normal speech, background noise, interruptions mid-response, incomplete sentences, and edge-case inputs (very long utterances, multiple questions in one turn). Run these against your relay server in staging with a real OpenAI connection to catch regressions in function call handling and session configuration. For the audio rendering layer, mock the WebSocket and emit pre-scripted server events (audio deltas, function call arguments, transcripts) to test that your playback queue, interrupt handling, and UI state all respond correctly.
Integration tests against the live Realtime API are expensive ($0.10/min) but necessary for pre-release verification. Keep a small suite of 3-5 scripted conversation flows that exercise the critical paths — session setup, a successful function call round-trip, and an interruption — and run them on every merge to main.
Cost Control and Monitoring
At $0.10 per minute for audio input and $0.20 per minute for audio output, cost accumulates quickly in a busy voice application. A voice assistant averaging 5 minutes per session at 500 daily sessions runs to roughly $375/day (assuming talk time splits evenly between user and model), meaningful money that warrants active monitoring and control. The good news is that most real-world voice conversations are short: users tend to ask a question, get an answer, and hang up. Real-world averages cluster around 2-3 minutes per session, putting per-call cost in the $0.30-$0.45 range under the same assumptions.
The most effective cost controls are session time limits and token caps. Automatically disconnect sessions after 5-10 minutes — most legitimate voice interactions complete well within that window, and open sessions left running (browser tab in background, forgotten assistant) are pure waste. Set max_response_output_tokens to a specific number rather than inf; this caps how long the model's spoken responses can run, which directly reduces audio output cost. Implement per-user session budgets in your relay server — track cumulative session time per user ID and reject new connections once they've hit their daily limit.
For monitoring, OpenAI's usage dashboard provides per-minute cost breakdowns that let you identify which time periods drive the most cost and correlate usage spikes with product events. Compare total Realtime API spend against the alternative pipeline (Deepgram STT + GPT-4o + ElevenLabs TTS runs roughly $0.05-$0.10 per minute for comparable quality) to confirm the latency improvement is worth the cost premium at your current scale. At low volume, the Realtime API's premium is negligible. At high volume, that 5-10x cost difference may justify building the pipeline even with its additional complexity.
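The per-user budget idea above reduces to a small tracker in the relay. A sketch; in production the counters would live in Redis (with a daily expiry) so all relay servers share them:

```typescript
// Per-user daily session-time budget. canStart gates new connections;
// record is called when a session ends, with its duration in seconds.
class SessionBudget {
  private usedSeconds = new Map<string, number>();

  constructor(private dailyLimitSeconds: number) {}

  canStart(userId: string): boolean {
    return (this.usedSeconds.get(userId) ?? 0) < this.dailyLimitSeconds;
  }

  record(userId: string, seconds: number): void {
    this.usedSeconds.set(userId, (this.usedSeconds.get(userId) ?? 0) + seconds);
  }
}
```

In the relay, check `canStart` before opening the upstream OpenAI connection and reject with a clear error the client can surface as "daily limit reached".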
Find and compare voice and speech APIs at APIScout.
Related: Building an AI Agent in 2026, Building a Communication Platform, Building Real-Time APIs