
How to Build a Multi-Provider AI App in 2026

APIScout Team

TL;DR

Multi-provider setup is roughly 2-3 hours of work upfront, but it's the kind of work that prevents a single LLM outage from taking down your product entirely. OpenRouter and LiteLLM can give you multi-provider support with minimal code — they abstract all providers behind a single OpenAI-compatible interface. But a hand-rolled abstraction gives you more control over routing logic, cost attribution, and fallback behavior. This guide takes the hand-rolled approach so you understand what's happening under the hood.

Why Single-Provider LLM Apps Are Brittle

OpenAI recorded 47 status incidents in 2024 — roughly one every eight days. Some were minor degraded-performance events, but several caused complete API unavailability for periods ranging from 30 minutes to several hours. If your application is a customer-facing product, those outages are your outages. Your users don't care which vendor is responsible.

The reliability problem is compounded by pricing volatility. Anthropic made significant pricing changes mid-2024 that effectively doubled costs for some high-volume use cases. Google Gemini restructured its free tier and rate limits multiple times across the same period. Any application that hard-codes a single provider is exposed not just to downtime risk, but to cost surprises that can materially affect your unit economics with little warning.

A multi-provider architecture addresses both problems simultaneously. You get automatic failover when a provider goes down, and you get the ability to route traffic based on cost, quality requirements, and per-task suitability. Done right, you can also A/B test models on real traffic without changing application code. If you want a deeper comparison of the providers themselves before choosing your primary, see How to Choose an LLM API in 2026.

What You'll Build

  • Unified interface for OpenAI, Anthropic, and Google Gemini
  • Automatic fallback when a provider is down
  • Task-based routing (use the best model for each task)
  • Cost optimization (route to cheapest provider that meets quality needs)
  • Streaming support across all providers

Prerequisites: Node.js 18+, API keys from at least 2 providers.

1. Setup

The three major providers each have their own SDK with different APIs, different message formats, and different response shapes. Before we write any routing logic, we need clean adapters that normalize all three into a single interface. This is the most important architectural decision in the whole guide — if the adapter layer is clean, everything built on top of it becomes simple.

Install SDKs

npm install openai @anthropic-ai/sdk @google/generative-ai

Environment Variables

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_AI_API_KEY=AIza...

Initialize Clients

// lib/ai-providers.ts
import OpenAI from 'openai';
import Anthropic from '@anthropic-ai/sdk';
import { GoogleGenerativeAI } from '@google/generative-ai';

export const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
export const google = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY!);
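
A missing key here only surfaces later, at request time, as a confusing SDK error. One way to fail fast at startup is a small guard like the sketch below (the `assertEnv` name and module path are illustrative, not part of the original setup):

```typescript
// lib/assert-env.ts — fail fast at startup if required keys are absent
export function assertEnv(
  keys: string[],
  env: Record<string, string | undefined> = process.env as Record<string, string | undefined>
): string[] {
  const missing = keys.filter((k) => !env[k]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return keys;
}
```

Call it once before initializing the clients, e.g. `assertEnv(['OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'GOOGLE_AI_API_KEY'])`.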

2. Unified Interface

The biggest impedance mismatch between providers is how they handle system messages. OpenAI puts the system message in the messages array with role: "system". Anthropic separates it into a top-level system parameter. Google uses systemInstruction. None of them agree on response shapes either — OpenAI returns choices[0].message.content, Anthropic returns content[0].text, Google returns result.response.text().

The adapter layer below absorbs all of this. Your application code only ever sees AIMessage[] in and AIResponse out, regardless of which provider handles the request.

Define Common Types

// lib/ai-types.ts
export interface AIMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

export interface AIResponse {
  content: string;
  provider: 'openai' | 'anthropic' | 'google';
  model: string;
  usage: {
    inputTokens: number;
    outputTokens: number;
  };
  latencyMs: number;
}

export interface AIOptions {
  messages: AIMessage[];
  maxTokens?: number;
  temperature?: number;
  stream?: boolean;
}

Provider Adapters

// lib/adapters/openai-adapter.ts
import { openai } from '../ai-providers';
import { AIResponse, AIOptions } from '../ai-types';

export async function callOpenAI(options: AIOptions): Promise<AIResponse> {
  const start = Date.now();

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: options.messages,
    max_tokens: options.maxTokens ?? 1024,
    temperature: options.temperature ?? 0.7,
  });

  return {
    content: response.choices[0].message.content ?? '',
    provider: 'openai',
    model: 'gpt-4o',
    usage: {
      inputTokens: response.usage?.prompt_tokens ?? 0,
      outputTokens: response.usage?.completion_tokens ?? 0,
    },
    latencyMs: Date.now() - start,
  };
}

// lib/adapters/anthropic-adapter.ts
import { anthropic } from '../ai-providers';
import { AIResponse, AIOptions } from '../ai-types';

export async function callAnthropic(options: AIOptions): Promise<AIResponse> {
  const start = Date.now();
  const systemMessage = options.messages.find(m => m.role === 'system');
  const chatMessages = options.messages.filter(m => m.role !== 'system');

  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    system: systemMessage?.content,
    messages: chatMessages.map(m => ({
      role: m.role as 'user' | 'assistant',
      content: m.content,
    })),
    max_tokens: options.maxTokens ?? 1024,
    temperature: options.temperature ?? 0.7,
  });

  const textBlock = response.content.find(b => b.type === 'text');

  return {
    content: textBlock?.text ?? '',
    provider: 'anthropic',
    model: 'claude-sonnet-4-20250514',
    usage: {
      inputTokens: response.usage.input_tokens,
      outputTokens: response.usage.output_tokens,
    },
    latencyMs: Date.now() - start,
  };
}

// lib/adapters/google-adapter.ts
import { google } from '../ai-providers';
import { AIResponse, AIOptions } from '../ai-types';

export async function callGoogle(options: AIOptions): Promise<AIResponse> {
  const start = Date.now();
  const model = google.getGenerativeModel({ model: 'gemini-2.0-flash' });

  const systemMessage = options.messages.find(m => m.role === 'system');
  const chatMessages = options.messages.filter(m => m.role !== 'system');

  const chat = model.startChat({
    systemInstruction: systemMessage?.content,
    history: chatMessages.slice(0, -1).map(m => ({
      role: m.role === 'assistant' ? 'model' : 'user',
      parts: [{ text: m.content }],
    })),
  });

  const lastMessage = chatMessages[chatMessages.length - 1];
  if (!lastMessage) throw new Error('At least one non-system message is required');
  const result = await chat.sendMessage(lastMessage.content);

  return {
    content: result.response.text(),
    provider: 'google',
    model: 'gemini-2.0-flash',
    usage: {
      inputTokens: result.response.usageMetadata?.promptTokenCount ?? 0,
      outputTokens: result.response.usageMetadata?.candidatesTokenCount ?? 0,
    },
    latencyMs: Date.now() - start,
  };
}

3. Router

With clean adapters in place, the router itself is straightforward. The core idea is an ordered list of providers to try: attempt the first, catch any error, move to the next. The order of that list is what encodes your routing strategy — quality-first, cost-first, or task-aware.

One thing to get right here: not all errors should trigger a fallback. A 400 Bad Request (malformed prompt) should be returned immediately — there's no point trying other providers with the same bad input. A 429 or 503 is a good signal to fall through to the next provider. In production you'd add error code inspection to the try/catch; the simplified version below catches everything to illustrate the pattern clearly.

Fallback Chain

// lib/ai-router.ts
import { callOpenAI } from './adapters/openai-adapter';
import { callAnthropic } from './adapters/anthropic-adapter';
import { callGoogle } from './adapters/google-adapter';
import { AIOptions, AIResponse } from './ai-types';

type Provider = 'openai' | 'anthropic' | 'google';

const providerMap = {
  openai: callOpenAI,
  anthropic: callAnthropic,
  google: callGoogle,
};

export async function callAI(
  options: AIOptions,
  providers: Provider[] = ['anthropic', 'openai', 'google']
): Promise<AIResponse> {
  let lastError: Error | null = null;

  for (const provider of providers) {
    try {
      const result = await providerMap[provider](options);
      return result;
    } catch (error: any) {
      console.error(`${provider} failed:`, error.message);
      lastError = error;
      // Continue to next provider
    }
  }

  throw new Error(`All providers failed. Last error: ${lastError?.message}`);
}
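
The error classification mentioned above can be sketched as a small predicate. The `isRetryable` name is an assumption, and how you extract the HTTP status varies by SDK (the OpenAI and Anthropic Node SDKs expose a `status` field on their API error classes; check your versions):

```typescript
// Decide whether an error from one provider justifies falling through
// to the next one, based on the HTTP status (if any).
export function isRetryable(status: number | undefined): boolean {
  if (status === undefined) return true; // network error or timeout: try elsewhere
  if (status === 429) return true;      // rate limited: another provider may have capacity
  if (status >= 500) return true;       // provider-side failure: fall through
  return false;                         // 400/401/403/404: our input or config, don't retry
}
```

Inside the router's catch block, `if (!isRetryable(error.status)) throw error;` before continuing to the next provider.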

Task-Based Routing

// lib/ai-router.ts
type TaskType = 'code' | 'analysis' | 'creative' | 'simple' | 'long-context';

const taskRouting: Record<TaskType, Provider[]> = {
  code: ['anthropic', 'openai', 'google'],      // Claude excels at code
  analysis: ['anthropic', 'openai', 'google'],   // Claude for careful analysis
  creative: ['openai', 'anthropic', 'google'],   // GPT-4o for creative tasks
  simple: ['google', 'openai', 'anthropic'],     // Gemini Flash for simple/cheap
  'long-context': ['google', 'anthropic', 'openai'], // Gemini for long context
};

export async function callAIForTask(
  options: AIOptions,
  task: TaskType
): Promise<AIResponse> {
  return callAI(options, taskRouting[task]);
}
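
For callers that don't know the task type up front, a rough keyword heuristic can pick one. This is purely illustrative — it is not part of the router above, and production systems usually let the feature or endpoint declare its task type explicitly rather than guess:

```typescript
// Illustrative heuristic only: guess a TaskType from the prompt text.
export function guessTask(
  prompt: string
): 'code' | 'analysis' | 'creative' | 'simple' | 'long-context' {
  if (prompt.length > 50_000) return 'long-context';
  if (/\b(function|class|bug|refactor|typescript|python)\b/i.test(prompt)) return 'code';
  if (/\b(analyze|compare|evaluate|tradeoffs?)\b/i.test(prompt)) return 'analysis';
  if (/\b(story|poem|slogan|brainstorm)\b/i.test(prompt)) return 'creative';
  return 'simple';
}
```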

Cost-Optimized Routing

// Approximate costs per 1M tokens (input/output)
const providerCosts = {
  openai: { input: 2.50, output: 10.00 },     // GPT-4o
  anthropic: { input: 3.00, output: 15.00 },  // Claude Sonnet
  google: { input: 0.075, output: 0.30 },     // Gemini Flash
};

export async function callAICheap(options: AIOptions): Promise<AIResponse> {
  // Try cheapest first, fall back to more expensive
  return callAI(options, ['google', 'openai', 'anthropic']);
}

export async function callAIBest(options: AIOptions): Promise<AIResponse> {
  // Try highest quality first
  return callAI(options, ['anthropic', 'openai', 'google']);
}

4. API Route

The Next.js API route is a thin layer on top of the router. It accepts a task type from the client, which lets your frontend signal the quality-vs-cost trade-off it needs without knowing anything about which provider is used. The metadata returned in the response is important for debugging — log it or store it alongside the request so you can reconstruct what happened when something goes wrong.

// app/api/ai/route.ts
import { NextResponse } from 'next/server';
import { callAIForTask } from '@/lib/ai-router';

export async function POST(req: Request) {
  const { messages, task = 'simple' } = await req.json();

  try {
    const response = await callAIForTask({ messages }, task);

    return NextResponse.json({
      content: response.content,
      metadata: {
        provider: response.provider,
        model: response.model,
        latencyMs: response.latencyMs,
        usage: response.usage,
      },
    });
  } catch (error: any) {
    return NextResponse.json(
      { error: 'All AI providers failed', details: error.message },
      { status: 503 }
    );
  }
}

5. Cost Comparison

The price gap between Gemini Flash and the premium models is dramatic enough to drive real product decisions. For a classification or extraction task that doesn't require deep reasoning, Gemini Flash at $0.075/M input tokens is 33x cheaper than GPT-4o and 40x cheaper than Claude Sonnet. That difference matters when you're processing millions of documents. For a coding assistant or complex multi-step analysis, the quality difference justifies the premium — but many applications have a mix of task types, and intelligent routing lets you pay the right price for each one.

Note that these prices are for the models listed. Both OpenAI and Anthropic offer budget-tier models (GPT-4o mini, Claude Haiku) that sit in the middle of the range and are worth benchmarking for your specific use case before defaulting to the premium models.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | General purpose, creative |
| Claude Sonnet | $3.00 | $15.00 | Code, analysis, instruction-following |
| Gemini Flash | $0.075 | $0.30 | High-volume, cost-sensitive |
| GPT-4o mini | $0.15 | $0.60 | Budget alternative to GPT-4o |
| Claude Haiku | $0.25 | $1.25 | Budget alternative to Sonnet |

Example: 1M input + 200K output tokens/month:

  • Gemini Flash: $0.14
  • GPT-4o mini: $0.27
  • Claude Haiku: $0.50
  • GPT-4o: $4.50
  • Claude Sonnet: $6.00
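
The bullet figures come straight from the rate table. A helper sketch makes the arithmetic explicit (the `estimateCost` name and rate map are taken from the table above; prices change, so treat them as a snapshot):

```typescript
// Per-1M-token rates from the comparison table above (USD).
const ratesPerMillion = {
  'gemini-flash': { input: 0.075, output: 0.30 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
  'claude-haiku': { input: 0.25, output: 1.25 },
  'gpt-4o': { input: 2.50, output: 10.00 },
  'claude-sonnet': { input: 3.00, output: 15.00 },
} as const;

// Cost in USD for one call, given token counts.
export function estimateCost(
  model: keyof typeof ratesPerMillion,
  inputTokens: number,
  outputTokens: number
): number {
  const r = ratesPerMillion[model];
  return (inputTokens / 1_000_000) * r.input + (outputTokens / 1_000_000) * r.output;
}
```

For the 1M input + 200K output example, `estimateCost('gemini-flash', 1_000_000, 200_000)` is ≈ 0.135, which rounds to the $0.14 shown above.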

6. Monitoring

One of the underappreciated benefits of a multi-provider architecture is the observability it forces you to build. When every call goes through a single router, it's natural to attach metrics there. Track success rate per provider, average latency, token consumption, and estimated cost. Over time, this data tells you which provider is performing best for which task types, and whether your routing rules are still optimal.

The snippet below is intentionally minimal — in production you'd want to persist these metrics to a time-series store (Datadog, Grafana, or even a simple Postgres table) rather than keeping them in memory. But the shape of the data is what matters: per-provider, per-call records that you can aggregate into dashboards. For a fuller discussion of how to evaluate and compare providers using real traffic, see OpenRouter vs LiteLLM: Which Multi-Provider Gateway Is Right for You?.

// lib/ai-monitor.ts
interface ProviderMetrics {
  totalCalls: number;
  failures: number;
  avgLatencyMs: number;
  totalCost: number;
}

const metrics: Record<string, ProviderMetrics> = {};

export function recordCall(provider: string, latencyMs: number, costUsd: number, failed: boolean) {
  if (!metrics[provider]) {
    metrics[provider] = { totalCalls: 0, failures: 0, avgLatencyMs: 0, totalCost: 0 };
  }
  const m = metrics[provider];
  m.totalCalls++;
  if (failed) m.failures++;
  m.totalCost += costUsd;
  // Incremental mean: weight the previous average by the old call count
  m.avgLatencyMs = (m.avgLatencyMs * (m.totalCalls - 1) + latencyMs) / m.totalCalls;
}
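
Raw per-call records only become useful once aggregated. A sketch of the two aggregates worth alerting on (hypothetical helper names; the P95 version assumes you keep individual latency samples rather than only a running mean):

```typescript
// Fraction of successful calls; treat "no data" as healthy.
export function successRate(totalCalls: number, failures: number): number {
  return totalCalls === 0 ? 1 : (totalCalls - failures) / totalCalls;
}

// P95 latency from raw samples: sort ascending, take the value at the
// 95th-percentile rank (nearest-rank method).
export function p95(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) return 0;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.95) - 1);
  return sorted[idx];
}
```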

Common Mistakes

The most common failure mode is implementing the fallback chain but not actually testing it. You deploy, everything looks fine, and then a provider outage hits and you discover that your fallback logic has a bug — maybe an environment variable is missing, or the adapter throws a different error type than expected. Test your fallbacks deliberately: temporarily revoke an API key and verify the router falls through cleanly.

The second most common issue is treating all errors as fallback-worthy. If your prompt exceeds the context window for one provider, it will almost certainly exceed it for all of them. Adding smarter error classification — returning immediately on 4xx errors that aren't rate limits, only falling through on 5xx and 429 — makes your fallback behavior more predictable and avoids unnecessary latency on genuinely broken requests.

For teams evaluating whether to build this themselves vs. using a managed gateway, the tradeoff analysis in Vercel AI SDK vs LangChain: Which Framework Fits Your Stack in 2026? covers the managed vs DIY dimension in depth.

| Mistake | Impact | Fix |
|---|---|---|
| No fallback chain | App breaks when one provider is down | Always have 2+ providers configured |
| Same model for every task | Overpaying for simple tasks | Route by task complexity |
| Not tracking costs per provider | Budget surprises | Log tokens + cost per request |
| Not handling streaming differences | Inconsistent UX | Unified streaming adapter |
| Ignoring rate limits | 429 errors cascade | Per-provider rate limiting |

Provider SLA monitoring is the operational counterpart to fallback logic in code. Implement health checks that test each provider's API every 60-120 seconds with a minimal test request — a 5-token completion against your primary model. Route results to your monitoring dashboard and alert when any provider's success rate drops below 95%. The most common production failure isn't total provider unavailability — it's degraded performance where responses succeed but latency spikes to 10-30 seconds. Your fallback logic should trigger on latency thresholds, not just on explicit 4xx or 5xx errors. A provider serving 30-second responses is functionally unavailable for most user-facing applications even if it's technically returning HTTP 200. P95 latency is the metric to watch; P50 can look healthy while the tail experience is unacceptable.
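
A minimal way to treat a slow response as a failure is to race the provider call against a timer, so the router's catch block handles degraded providers the same way it handles errors. A sketch — the `withTimeout` name and the 15-second threshold are assumptions, not part of the router above:

```typescript
// Reject if the wrapped promise takes longer than `ms`, so a degraded
// provider is treated like a failed one by the caller's catch block.
export function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Always clear the timer so it doesn't keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

In the router loop, `await withTimeout(providerMap[provider](options), 15_000)` turns a 30-second straggler into a fallback instead of a stuck request.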


Choosing an AI API? Compare OpenAI vs Anthropic vs Google Gemini on APIScout — pricing, quality, and performance benchmarks.
