
AI Gateway APIs: LiteLLM, Portkey, and Beyond 2026

APIScout Team


Managing multiple AI providers is a mess. Different SDKs, different response formats, different error codes, different rate limits. AI gateways solve this with a unified API layer — one interface to call any model from any provider, with built-in fallbacks, caching, and cost tracking.

The AI gateway category is new and evolving quickly. Two years ago, most teams managed provider diversity with a thin in-house wrapper. Today, the category has matured enough that purpose-built gateways provide features that would take weeks to replicate internally: semantic caching, intelligent model routing, guardrails, and observability dashboards with token-level cost attribution. Whether to build or buy a gateway layer is now a real architectural decision with clear tradeoffs — this guide covers both the landscape and that decision.

Why AI Gateways Exist

The Multi-Model Problem

Most production apps use multiple AI models:

Simple queries    → Gemini Flash ($0.075/1M tokens) — cheap, fast
Complex reasoning → Claude Opus ($15/1M input) — highest quality
Code generation   → Claude Sonnet ($3/1M input) — good balance
Embeddings        → Cohere Embed ($0.10/1M tokens) — specialized
Image analysis    → GPT-4o ($5/1M input) — best multimodal

Without a gateway, you need 5 different SDKs, 5 different auth mechanisms, 5 different error handling patterns, and manual routing logic.
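What that hand-rolled routing layer looks like in practice: a dispatch table plus per-provider branching. This is a hypothetical sketch (task names, model IDs, and the provider-call stubs are illustrative), shown only to make the maintenance burden concrete.

```python
# Hand-rolled model routing: the kind of dispatch table a gateway replaces.
# Every entry below still needs its own SDK, auth, and error handling.
ROUTES = {
    "simple_query": ("gemini", "gemini-2.0-flash"),
    "complex_reasoning": ("anthropic", "claude-opus-4"),
    "code_generation": ("anthropic", "claude-sonnet-4"),
    "embeddings": ("cohere", "embed-english-v3.0"),
    "image_analysis": ("openai", "gpt-4o"),
}

def route(task: str) -> tuple[str, str]:
    """Pick (provider, model) for a task; fall back to the cheap default."""
    return ROUTES.get(task, ROUTES["simple_query"])

# Downstream, each provider gets its own branch:
# provider, model = route("code_generation")
# if provider == "anthropic": ...call the Anthropic SDK...
# elif provider == "openai": ...call the OpenAI SDK...
```

The table itself is easy; the five divergent SDK call sites, auth schemes, and exception hierarchies behind it are what gateways collapse into one interface.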

What Gateways Provide

| Feature | Without Gateway | With Gateway |
|---|---|---|
| API interface | 5 different SDKs | 1 unified API |
| Fallback | Manual try/catch chains | Automatic failover |
| Cost tracking | Parse 5 different billing pages | Single dashboard |
| Caching | Build your own | Built-in semantic cache |
| Rate limiting | Handle per-provider | Unified rate management |
| Observability | 5 logging integrations | Single observability layer |

The Gateway Landscape

Open Source

| Gateway | Type | Key Feature | Stars |
|---|---|---|---|
| LiteLLM | Python proxy | 100+ model support, OpenAI-compatible | 15K+ |
| Portkey Gateway | Node.js proxy | Reliability, guardrails | 5K+ |
| Jan | Desktop app | Local + cloud models | 20K+ |
| AI Gateway (CF) | Edge proxy | Cloudflare-integrated | N/A |

Managed Platforms

| Platform | Focus | Pricing |
|---|---|---|
| Portkey | Reliability + observability | Free tier, then usage-based |
| Helicone | Observability + analytics | Free tier, then $50+/month |
| Braintrust | Evaluation + gateway | Free tier, then usage-based |
| Martian | Smart model routing | Usage-based |
| Not Diamond | Intelligent model selection | Per-request |

Cloud Provider Gateways

| Provider | Product | Models Available |
|---|---|---|
| AWS | Bedrock | Claude, Llama, Cohere, Mistral |
| Azure | AI Studio | GPT-4o, o3, Llama, Mistral |
| GCP | Vertex AI | Gemini, Claude, Llama |

How AI Gateways Work

LiteLLM Example

from litellm import completion

# Provider API keys are read from environment variables
# (ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, ...).
# Same interface for any provider:
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello"}],
)

# Switch provider — same code
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

# Or Gemini
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello"}],
)

LiteLLM Proxy (OpenAI-Compatible Server)

# Start an OpenAI-compatible proxy server (defaults to port 4000)
litellm --model anthropic/claude-sonnet-4-20250514

# Now any OpenAI-compatible client can connect
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-20250514",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
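Because the proxy speaks the OpenAI wire format, any HTTP client can call it. A stdlib-only Python sketch, assuming the proxy above is running on its default port (4000); the payload builder is split out so the request shape is visible on its own:

```python
import json
import urllib.request

def build_chat_request(model: str, content: str) -> dict:
    """Build an OpenAI-format chat completion payload."""
    return {"model": model, "messages": [{"role": "user", "content": content}]}

def chat(model: str, content: str, base_url: str = "http://localhost:4000") -> dict:
    """POST a chat completion to a local OpenAI-compatible proxy."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# reply = chat("anthropic/claude-sonnet-4-20250514", "Hello")
# print(reply["choices"][0]["message"]["content"])
```

In practice you would use the official OpenAI SDK pointed at the proxy's base URL; the raw request above just shows there is nothing gateway-specific on the wire.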

Portkey Example

import Portkey from 'portkey-ai';

const portkey = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
});

// Unified call with automatic retry + fallback
const response = await portkey.chat.completions.create({
  model: 'claude-sonnet-4-20250514',
  messages: [{ role: 'user', content: 'Hello' }],
  // Portkey-specific config
  config: {
    retry: { attempts: 3, on_status_codes: [429, 500] },
    cache: { mode: 'semantic', max_age: 3600 },
  },
});

Key Gateway Features

1. Automatic Fallback

// If primary model fails, try fallbacks automatically
const config = {
  strategy: {
    mode: 'fallback',
    on_status_codes: [429, 500, 503],
  },
  targets: [
    { provider: 'anthropic', model: 'claude-sonnet-4-20250514' },
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'google', model: 'gemini-2.0-flash' },
  ],
};
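The same fallback behavior is a small loop once the provider call is abstracted. A Python sketch with the calling function injected, so the chain logic stands alone:

```python
# Try each (provider, model) target in order until one succeeds.
# `call` is any function that raises on 429/500/503-style failures.
def complete_with_fallback(call, targets, prompt):
    last_error = None
    for provider, model in targets:
        try:
            return call(provider, model, prompt)
        except Exception as err:  # in practice, catch provider-specific errors
            last_error = err
    raise RuntimeError("all providers failed") from last_error

targets = [
    ("anthropic", "claude-sonnet-4-20250514"),
    ("openai", "gpt-4o"),
    ("google", "gemini-2.0-flash"),
]
```

A gateway adds the parts this sketch omits: distinguishing retryable status codes from hard errors, per-target timeouts, and logging which fallback actually served each request.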

2. Load Balancing

// Distribute requests across providers
const config = {
  strategy: {
    mode: 'loadbalance',
  },
  targets: [
    { provider: 'anthropic', weight: 60 },
    { provider: 'openai', weight: 30 },
    { provider: 'google', weight: 10 },
  ],
};
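Weighted load balancing is just a draw proportional to the weights. A sketch using the standard library:

```python
import random

def pick_target(targets):
    """Choose a provider with probability proportional to its weight."""
    providers = [t["provider"] for t in targets]
    weights = [t["weight"] for t in targets]
    return random.choices(providers, weights=weights, k=1)[0]

targets = [
    {"provider": "anthropic", "weight": 60},
    {"provider": "openai", "weight": 30},
    {"provider": "google", "weight": 10},
]
# Over many requests: ~60% anthropic, ~30% openai, ~10% google.
```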

3. Semantic Caching

// Cache similar queries to save money
// "What's the capital of France?" and "capital of France?"
// → Same cached response
const config = {
  cache: {
    mode: 'semantic',
    max_age: 3600,
    similarity_threshold: 0.95,
  },
};
// Savings: 40-60% on repeated/similar queries
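Under the hood, a semantic cache embeds each prompt and returns a stored response when a new prompt's embedding is close enough to a cached one. A toy Python sketch: the bag-of-words `embed` here is a deliberately crude stand-in for a real embedding model, and the structure (not the embedding) is the point.

```python
import math

def embed(text: str) -> dict:
    """Toy embedding: a word-count vector. Real caches use an embedding model."""
    vec = {}
    for word in text.lower().replace("?", "").split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, prompt: str):
        query = embed(prompt)
        for vec, response in self.entries:
            if cosine(query, vec) >= self.threshold:
                return response
        return None  # cache miss: call the model, then put()

    def put(self, prompt: str, response: str):
        self.entries.append((embed(prompt), response))
```

Production caches replace the linear scan with a vector index and add TTL eviction, but the hit/miss logic is the same: similarity above threshold means the saved response is served without a model call.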

4. Cost Tracking

// Track spending per model, per user, per feature
// (illustrative API; exact method names vary by gateway)
const analytics = await gateway.getUsage({
  timeRange: 'last_30_days',
  groupBy: ['model', 'user', 'feature'],
});

// Result:
// {
//   total_cost: $342.50,
//   by_model: {
//     'claude-sonnet': { tokens: 5M, cost: $180 },
//     'gpt-4o': { tokens: 2M, cost: $100 },
//     'gemini-flash': { tokens: 10M, cost: $62.50 },
//   },
//   by_feature: {
//     'chat': $200,
//     'search': $80,
//     'summarization': $62.50,
//   }
// }
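The aggregation behind a dashboard like this is simple once all usage flows through one place. A Python sketch; the per-1M-token rates below are illustrative blended rates chosen to match the totals above, whereas real tracking prices input and output tokens separately:

```python
from collections import defaultdict

# Illustrative blended rates per 1M tokens (not official pricing).
PRICE_PER_1M = {"claude-sonnet": 36.0, "gpt-4o": 50.0, "gemini-flash": 6.25}

def cost_by_model(usage_log, prices):
    """Sum token counts and cost per model from (model, tokens) records."""
    totals = defaultdict(lambda: {"tokens": 0, "cost": 0.0})
    for model, tokens in usage_log:
        totals[model]["tokens"] += tokens
        totals[model]["cost"] += tokens / 1_000_000 * prices[model]
    return dict(totals)
```

Grouping by user or feature works the same way once each request is tagged with that metadata at call time, which is exactly what gateway request headers are for.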

5. Guardrails

// Block harmful content, PII, or off-topic responses
const config = {
  guardrails: {
    input: {
      block_pii: true,
      block_topics: ['violence', 'illegal'],
      max_tokens: 4000,
    },
    output: {
      block_pii: true,
      require_citation: true,
      max_tokens: 2000,
    },
  },
};

Guardrails at the gateway level are easier to maintain than guardrails in application code: you configure them once and they apply to every model call, regardless of which part of your application made the request. This is particularly valuable for PII detection — rather than auditing every prompt template in your codebase to ensure it doesn't leak user data to the LLM, you configure the gateway to redact PII before it ever reaches the model. The tradeoff: gateway-level guardrails are less context-aware than application-level checks, which can see the full request context and business logic.
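A gateway-level PII filter is conceptually a scan-and-redact pass over the prompt before it is forwarded. A minimal sketch covering only emails and US-style phone numbers; real guardrails use NER models and far broader pattern sets, and nothing here is any particular gateway's API:

```python
import re

# Minimal PII patterns; production guardrails cover far more.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before the LLM call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Running this once at the gateway means every prompt from every code path is filtered, which is the maintainability argument made above.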

Choosing a Gateway

| If You Need... | Choose | Why |
|---|---|---|
| Maximum model support | LiteLLM | 100+ models |
| Production reliability | Portkey | Enterprise-grade fallback + retry |
| Observability focus | Helicone | Best analytics and logging |
| Smart routing | Martian / Not Diamond | AI selects best model per request |
| AWS ecosystem | Bedrock | Native integration |
| Self-hosted, open-source | LiteLLM Proxy | Full control |
| Edge deployment | Cloudflare AI Gateway | Global edge, no origin |

Cost Impact

Before Gateway

Dev time managing 3 providers: 2 hours/week
Wasted spend from no caching: ~30% of AI budget
Downtime from single-provider dependency: 2-3 incidents/quarter
No visibility into per-feature costs: over-spending on non-critical features

After Gateway

Provider management: automated
Cache hit rate: 40-60% → direct cost savings
Uptime: automatic failover prevents most outages
Cost visibility: per-feature, per-user, per-model tracking

Typical savings: 30-50% on AI API costs through caching + model routing alone. The return on investment compounds: a team spending $5,000/month on AI APIs that achieves 40% cache hit rate saves $2,000/month — enough to justify a paid gateway tier within a month of implementation. For teams spending $50,000+ per month, the savings from even a modest cache hit rate cover the infrastructure cost of a self-hosted LiteLLM deployment many times over.

When Not to Use a Gateway

AI gateways add value, but they also add a layer of complexity and a potential single point of failure. Not every application needs one. The decision should be driven by concrete problems you're solving — cost visibility, provider resilience, or multi-model routing — not the idea that gateways are best practice. A simple application with a single AI provider and modest API spend has no meaningful problem to solve with a gateway.

Skip the gateway if: you use exactly one AI provider and have no plans to switch; your AI calls are low-volume (under 10,000/day) and cost visibility isn't a priority; your requests are latency-sensitive and the gateway adds more latency than you can accept; or you're building a prototype and adding the gateway would slow down iteration. In these cases, calling the provider's SDK directly is simpler, easier to debug, and doesn't add a dependency that could fail.

Add the gateway when: you're actively using multiple providers and want unified cost tracking; you've had provider outages that impacted your users and want automatic failover; your AI API costs are significant enough that 30-50% cache savings would be material; or you need per-team or per-user spending controls that the provider's dashboard doesn't support natively.

The risk most teams don't consider: the gateway itself becomes a critical dependency. A self-hosted LiteLLM instance that goes down takes all your AI features with it. Managed gateways (Portkey, Helicone) handle this with their own SLAs, but they add a vendor dependency. Design your fallback to fail open — if the gateway is unreachable, fall back to calling the provider directly rather than returning an error to users.
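The fail-open pattern is worth making explicit. A Python sketch with both the gateway call and the direct provider call injected as functions:

```python
def complete_fail_open(via_gateway, direct, prompt):
    """Prefer the gateway; if it is unreachable, call the provider directly.

    Losing caching and analytics for one request beats showing users an error.
    """
    try:
        return via_gateway(prompt)
    except Exception:  # in practice: connection and timeout errors only
        return direct(prompt)
```

The design choice hiding in the `except` clause matters: only infrastructure errors (timeouts, connection refused) should trigger the direct path, because a 4xx from the gateway usually means the direct call would fail the same way.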

Self-Hosted vs Managed Gateway

The self-hosted vs managed tradeoff for AI gateways is sharper than for most infrastructure choices because AI API traffic carries sensitive data (user conversations, documents, prompts) that some organizations need to keep on their own infrastructure.

Self-hosted (LiteLLM Proxy): Deploy as a Docker container, configure your provider API keys, and you have a local OpenAI-compatible proxy. All traffic stays in your network — no third-party ever sees your prompts or responses. You're responsible for uptime, updates, and scaling. LiteLLM's configuration is YAML-based and reasonably simple for basic setups; it gets complex quickly for advanced routing rules with multiple providers and fallback chains. Works well for small-to-medium teams with a DevOps function.

Managed (Portkey, Helicone): Zero infrastructure to manage. Traffic routes through their servers, which is the core tradeoff: they do see your requests. Both providers have SOC 2 Type II certifications and data processing agreements, but regulated industries (healthcare, finance) or applications processing sensitive user data may still prefer self-hosted. Managed gateways typically have better UIs for analytics and are faster to set up — Portkey's dashboard shows per-model cost breakdowns, error rates, and latency percentiles out of the box.

Cloudflare AI Gateway is a distinct option: part of Cloudflare's edge infrastructure, with the latency benefits of Cloudflare's global network, and primarily for teams already on Cloudflare. Unlike LiteLLM (which proxies to your providers), Cloudflare AI Gateway is positioned as an observability and caching layer rather than a full multi-model routing solution.

Smart Model Routing in Practice

Model routing — sending different types of requests to different models — is the highest-value gateway feature, but requires careful implementation to get right.

The naive approach routes by task type defined in your code: "this is a summarization request, use Gemini Flash; this is a reasoning request, use Claude Sonnet." This works but requires you to label every request at call time, and gets brittle as your use cases evolve. The more sophisticated approach uses the request content and metadata to route automatically.

LLM-based routing: Tools like Martian and Not Diamond use a smaller, fast model to classify each request and select the best model for it. Because the classifier runs on a small model (typically under 1B parameters), the extra routing hop adds on the order of 10-20ms of latency — imperceptible next to the main model call — while reducing costs by 40-60% by sending simple requests to cheaper models without manual labeling. The tradeoff: you're now running an LLM to route to your LLM, which adds cost and another failure point.

Confidence-based routing: Route based on the complexity or confidence of the request. A simple pattern: if your cheap model returns a low-confidence response (high token count, hedging language, requests for clarification), automatically retry with a more capable model. This is harder to implement than static routing but adapts to request complexity automatically.
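A sketch of that escalation pattern, with the confidence check reduced to a keyword heuristic; real implementations use stronger signals such as token logprobs, and both model calls are injected stubs here:

```python
HEDGING_MARKERS = ("i'm not sure", "it depends", "could you clarify", "i cannot")

def looks_low_confidence(answer: str, max_words: int = 400) -> bool:
    """Crude confidence check: hedging language or an unusually long answer."""
    text = answer.lower()
    return any(m in text for m in HEDGING_MARKERS) or len(answer.split()) > max_words

def complete_with_escalation(cheap, capable, prompt):
    """Try the cheap model first; retry on a capable model if confidence is low."""
    answer = cheap(prompt)
    if looks_low_confidence(answer):
        return capable(prompt)  # pay for quality only when needed
    return answer
```

Note the cost shape: confident answers cost one cheap call, while escalated requests cost a cheap call plus a capable one, so this only pays off when most traffic is genuinely simple.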

Latency-based routing: In real-time applications where response speed matters, route to whichever provider currently has the lowest latency. AI APIs have variable latency depending on server load — Groq may be faster than Claude at 3pm on a Tuesday but slower during a traffic spike. Gateway-level latency routing can improve p95 response times by 20-40% for latency-sensitive applications.
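Latency routing needs only a rolling latency estimate per provider. A sketch using an exponential moving average, which weights recent observations more heavily than old ones:

```python
class LatencyRouter:
    """Route to the provider with the lowest smoothed observed latency."""

    def __init__(self, providers, alpha=0.3):
        self.alpha = alpha  # weight given to the newest observation
        self.latency_ms = {p: 0.0 for p in providers}

    def record(self, provider: str, observed_ms: float) -> None:
        """Fold a new latency measurement into the moving average."""
        prev = self.latency_ms[provider]
        if prev == 0.0:  # first observation seeds the average
            self.latency_ms[provider] = observed_ms
        else:
            self.latency_ms[provider] = (
                self.alpha * observed_ms + (1 - self.alpha) * prev
            )

    def pick(self) -> str:
        return min(self.latency_ms, key=self.latency_ms.get)
```

A production version would also decay stale estimates and occasionally probe slow providers, so a transient spike doesn't exile a provider forever.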

Methodology

LiteLLM supports 100+ providers as documented in its litellm/providers module; the actual support depth varies — major providers (OpenAI, Anthropic, Google, Cohere) are well-tested while smaller providers may have edge cases. GitHub star counts are approximate as of early 2026 and change frequently. The 30-50% cache savings figure assumes semantic caching with a 0.95 cosine similarity threshold on conversational or support use cases; savings are lower for unique-per-user queries (creative tasks, personalized recommendations) and higher for FAQ-style queries where many users ask the same question with slightly different wording. Portkey's free tier includes 10,000 requests/month; Helicone's free tier includes 100,000 requests/month.

Common Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Not caching identical requests | 30-50% wasted spend | Enable semantic caching |
| Using frontier model for all tasks | 10x overspend | Route simple tasks to cheap models |
| No fallback configured | Outage when primary provider goes down | Set up at least 2 fallback providers |
| Ignoring token usage by feature | Can't optimize | Track per-feature costs |
| Gateway as single point of failure | Gateway down = everything down | Self-host or use multiple gateway instances |

Compare AI gateways and model providers on APIScout — pricing, model support, reliability, and developer experience.

Related: How Open-Source AI Models Are Disrupting Closed APIs, Rise of Developer-First APIs: What Makes Them Different, API Cost Optimization
