The Rise of AI Gateway APIs: LiteLLM, Portkey, and Beyond
Managing multiple AI providers is a mess. Different SDKs, different response formats, different error codes, different rate limits. AI gateways solve this with a unified API layer — one interface to call any model from any provider, with built-in fallbacks, caching, and cost tracking.
The AI gateway category is new and evolving quickly. Two years ago, most teams managed provider diversity with a thin in-house wrapper. Today, the category has matured enough that purpose-built gateways provide features that would take weeks to replicate internally: semantic caching, intelligent model routing, guardrails, and observability dashboards with token-level cost attribution. Whether to build or buy a gateway layer is now a real architectural decision with clear tradeoffs — this guide covers both the landscape and that decision.
Why AI Gateways Exist
The Multi-Model Problem
Most production apps use multiple AI models:
Simple queries → Gemini Flash ($0.075/1M tokens) — cheap, fast
Complex reasoning → Claude Opus ($15/1M input) — highest quality
Code generation → Claude Sonnet ($3/1M input) — good balance
Embeddings → Cohere Embed ($0.10/1M tokens) — specialized
Image analysis → GPT-4o ($5/1M input) — best multimodal
Without a gateway, you need 5 different SDKs, 5 different auth mechanisms, 5 different error handling patterns, and manual routing logic.
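To make the pain concrete, here is a sketch in Python of the manual fallback chain teams end up writing without a gateway. The provider-call functions are hypothetical stand-ins for real SDK calls, not actual APIs:

```python
# Hypothetical stand-ins for three provider SDK calls.
def call_anthropic(prompt: str) -> str:
    raise RuntimeError("rate limited")  # simulate a 429 from the primary

def call_openai(prompt: str) -> str:
    return "openai: " + prompt

def call_google(prompt: str) -> str:
    return "google: " + prompt

def complete_with_manual_fallback(prompt: str) -> str:
    # Every new provider means another branch, another error type,
    # another retry policy. This is the code a gateway centralizes.
    for provider_call in (call_anthropic, call_openai, call_google):
        try:
            return provider_call(prompt)
        except RuntimeError:
            continue
    raise RuntimeError("all providers failed")

print(complete_with_manual_fallback("Hello"))  # falls through to openai
```

Multiply this by real per-provider error taxonomies, auth schemes, and rate limits, and the maintenance cost becomes clear.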
What Gateways Provide
| Feature | Without Gateway | With Gateway |
|---|---|---|
| API interface | 5 different SDKs | 1 unified API |
| Fallback | Manual try/catch chains | Automatic failover |
| Cost tracking | Parse 5 different billing pages | Single dashboard |
| Caching | Build your own | Built-in semantic cache |
| Rate limiting | Handle per-provider | Unified rate management |
| Observability | 5 logging integrations | Single observability layer |
The Gateway Landscape
Open Source
| Gateway | Type | Key Feature | Stars |
|---|---|---|---|
| LiteLLM | Python proxy | 100+ model support, OpenAI-compatible | 15K+ |
| Portkey Gateway | Node.js proxy | Reliability, guardrails | 5K+ |
| Jan | Desktop app | Local + cloud models | 20K+ |
| AI Gateway (CF) | Edge proxy | Cloudflare-integrated | N/A |
Managed Platforms
| Platform | Focus | Pricing |
|---|---|---|
| Portkey | Reliability + observability | Free tier, then usage-based |
| Helicone | Observability + analytics | Free tier, then $50+/month |
| Braintrust | Evaluation + gateway | Free tier, then usage-based |
| Martian | Smart model routing | Usage-based |
| Not Diamond | Intelligent model selection | Per-request |
Cloud Provider Gateways
| Provider | Product | Models Available |
|---|---|---|
| AWS | Bedrock | Claude, Llama, Cohere, Mistral |
| Azure | AI Studio | GPT-4o, o3, Llama, Mistral |
| GCP | Vertex AI | Gemini, Claude, Llama |
How AI Gateways Work
LiteLLM Example
from litellm import completion
# Provider API keys are read from environment variables
# (ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY)
# Same interface for any provider
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello"}],
)
# Switch provider — same code
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
# Or Gemini
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello"}],
)
LiteLLM Proxy (OpenAI-Compatible Server)
# Start an OpenAI-compatible proxy on port 4000 (install with: pip install 'litellm[proxy]')
litellm --model anthropic/claude-sonnet-4-20250514
# Now any OpenAI-compatible client can connect
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-20250514",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
Portkey Example
import Portkey from 'portkey-ai';
const portkey = new Portkey({
apiKey: process.env.PORTKEY_API_KEY,
});
// Unified call with automatic retry + fallback
const response = await portkey.chat.completions.create({
model: 'claude-sonnet-4-20250514',
messages: [{ role: 'user', content: 'Hello' }],
// Portkey-specific config
config: {
retry: { attempts: 3, on_status_codes: [429, 500] },
cache: { mode: 'semantic', max_age: 3600 },
},
});
Key Gateway Features
1. Automatic Fallback
// If primary model fails, try fallbacks automatically
const config = {
  strategy: {
    mode: 'fallback',
    on_status_codes: [429, 500, 503],
  },
  targets: [
    { provider: 'anthropic', model: 'claude-sonnet-4-20250514' },
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'google', model: 'gemini-2.0-flash' },
  ],
};
2. Load Balancing
// Distribute requests across providers
const config = {
  strategy: {
    mode: 'loadbalance',
  },
  targets: [
    { provider: 'anthropic', weight: 60 },
    { provider: 'openai', weight: 30 },
    { provider: 'google', weight: 10 },
  ],
};
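Under the hood, weighted load balancing is just weighted random selection per request. A minimal Python sketch matching the 60/30/10 split above:

```python
import random

# Provider weights matching the config above.
TARGETS = [("anthropic", 60), ("openai", 30), ("google", 10)]

def pick_target(rng=random):
    names = [name for name, _ in TARGETS]
    weights = [weight for _, weight in TARGETS]
    # Weighted random choice: anthropic gets ~60% of requests, etc.
    return rng.choices(names, weights=weights, k=1)[0]

# Simulate 10,000 requests; counts land roughly proportional to 60/30/10.
counts = {name: 0 for name, _ in TARGETS}
for _ in range(10_000):
    counts[pick_target()] += 1
print(counts)
```

Production gateways layer health checks and per-provider rate limits on top of this, but the core distribution mechanism is this simple.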
3. Semantic Caching
// Cache similar queries to save money
// "What's the capital of France?" and "capital of France?"
// → Same cached response
const config = {
  cache: {
    mode: 'semantic',
    max_age: 3600,
    similarity_threshold: 0.95,
  },
};
// Savings: 40-60% on repeated/similar queries
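The mechanics of a semantic cache can be sketched in a few lines of Python. Real gateways embed queries with an embedding model; the bag-of-words vectorizer here is a toy stand-in, so the similarity threshold is lowered accordingly:

```python
import math
import re

def embed(text: str) -> dict:
    # Toy bag-of-words "embedding"; real caches use an embedding model.
    vec: dict = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    # Threshold lowered to 0.8 because the toy vectorizer is far cruder
    # than real embeddings (where 0.95 is a typical setting).
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def get(self, query: str):
        query_vec = embed(query)
        for entry_vec, response in self.entries:
            if cosine(query_vec, entry_vec) >= self.threshold:
                return response  # cache hit: no model call, no cost
        return None

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("What's the capital of France"))  # similar enough → "Paris"
print(cache.get("weather in Tokyo"))              # unrelated → None
```

This linear scan is O(n) per lookup; production caches use a vector index so lookups stay fast at scale.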
4. Cost Tracking
// Track spending per model, per user, per feature
const analytics = await gateway.getUsage({
  timeRange: 'last_30_days',
  groupBy: ['model', 'user', 'feature'],
});
// Result:
// {
//   total_cost: $342.50,
//   by_model: {
//     'claude-sonnet': { tokens: 5M, cost: $180 },
//     'gpt-4o': { tokens: 2M, cost: $100 },
//     'gemini-flash': { tokens: 10M, cost: $62.50 },
//   },
//   by_feature: {
//     'chat': $200,
//     'search': $80,
//     'summarization': $62.50,
//   }
// }
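The aggregation behind a usage report like this is straightforward: attribute each call's token cost to its model and feature, then sum. A Python sketch (prices and call logs are illustrative, not quotes):

```python
from collections import defaultdict

# Illustrative per-million-token input prices.
PRICE_PER_M = {"claude-sonnet": 3.00, "gpt-4o": 5.00, "gemini-flash": 0.075}

# Illustrative call log; a gateway records one entry per request.
calls = [
    {"model": "claude-sonnet", "feature": "chat", "tokens": 2_000_000},
    {"model": "gemini-flash", "feature": "search", "tokens": 4_000_000},
    {"model": "gpt-4o", "feature": "chat", "tokens": 1_000_000},
]

def usage_report(calls):
    by_model = defaultdict(float)
    by_feature = defaultdict(float)
    for call in calls:
        cost = call["tokens"] / 1_000_000 * PRICE_PER_M[call["model"]]
        by_model[call["model"]] += cost
        by_feature[call["feature"]] += cost
    return {
        "total": sum(by_model.values()),
        "by_model": dict(by_model),
        "by_feature": dict(by_feature),
    }

report = usage_report(calls)
# claude-sonnet: 2M * $3/1M = $6.00; gpt-4o: 1M * $5/1M = $5.00;
# gemini-flash: 4M * $0.075/1M = $0.30; total $11.30
print(report)
```

Grouping by feature (chat vs search) is what lets you spot over-spending on non-critical features, which raw provider bills cannot show.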
5. Guardrails
// Block harmful content, PII, or off-topic responses
const config = {
  guardrails: {
    input: {
      block_pii: true,
      block_topics: ['violence', 'illegal'],
      max_tokens: 4000,
    },
    output: {
      block_pii: true,
      require_citation: true,
      max_tokens: 2000,
    },
  },
};
Guardrails at the gateway level are easier to maintain than guardrails in application code: you configure them once and they apply to every model call, regardless of which part of your application made the request. This is particularly valuable for PII detection — rather than auditing every prompt template in your codebase to ensure it doesn't leak user data to the LLM, you configure the gateway to redact PII before it ever reaches the model. The tradeoff: gateway-level guardrails are less context-aware than application-level checks, which can see the full request context and business logic.
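As a rough illustration of what an input guardrail does, here is a regex-based PII redactor in Python. Production gateways use ML-based detectors; simple patterns like these catch only the obvious cases:

```python
import re

# Patterns for obvious PII: email addresses and US-style phone numbers.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
]

def redact_pii(prompt: str) -> str:
    # Replace each match before the prompt reaches any model.
    for pattern, placeholder in PII_PATTERNS:
        prompt = pattern.sub(placeholder, prompt)
    return prompt

print(redact_pii("Email jane@example.com or call 555-123-4567"))
# → "Email [EMAIL] or call [PHONE]"
```

Because this runs at the gateway, it applies uniformly to every prompt template in every service, which is the maintainability win described above.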
Choosing a Gateway
| If You Need... | Choose | Why |
|---|---|---|
| Maximum model support | LiteLLM | 100+ models |
| Production reliability | Portkey | Enterprise-grade fallback + retry |
| Observability focus | Helicone | Best analytics and logging |
| Smart routing | Martian / Not Diamond | AI selects best model per request |
| AWS ecosystem | Bedrock | Native integration |
| Self-hosted, open-source | LiteLLM Proxy | Full control |
| Edge deployment | Cloudflare AI Gateway | Global edge, no origin |
Cost Impact
Before Gateway
Dev time managing 3 providers: 2 hours/week
Wasted spend from no caching: ~30% of AI budget
Downtime from single-provider dependency: 2-3 incidents/quarter
No visibility into per-feature costs: over-spending on non-critical features
After Gateway
Provider management: automated
Cache hit rate: 40-60% → direct cost savings
Uptime: automatic failover prevents most outages
Cost visibility: per-feature, per-user, per-model tracking
Typical savings: 30-50% on AI API costs through caching + model routing alone. The return on investment scales with spend: a team spending $5,000/month on AI APIs that achieves a 40% cache hit rate saves $2,000/month, enough to justify a paid gateway tier within the first month. For teams spending $50,000+ per month, the savings from even a modest cache hit rate cover the infrastructure cost of a self-hosted LiteLLM deployment many times over.
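The savings arithmetic is worth making explicit: with semantic caching, a cached request costs approximately nothing, so gross savings scale linearly with hit rate, and the net benefit is what remains after the gateway's own cost:

```python
def net_monthly_savings(monthly_spend: float,
                        cache_hit_rate: float,
                        gateway_cost: float) -> float:
    # Cached requests cost ~$0, so gross savings = spend * hit rate;
    # subtract what the gateway itself costs to run or subscribe to.
    return monthly_spend * cache_hit_rate - gateway_cost

# $5,000/month spend, 40% hit rate, $500/month gateway tier
print(net_monthly_savings(5_000, 0.40, 500))  # → 1500.0
```

The gateway cost figure here is a placeholder; actual pricing depends on the tier or self-hosting footprint you choose.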
When Not to Use a Gateway
AI gateways add value, but they also add a layer of complexity and a potential single point of failure. Not every application needs one. The decision should be driven by concrete problems you're solving — cost visibility, provider resilience, or multi-model routing — not the idea that gateways are best practice. A simple application with a single AI provider and modest API spend has no meaningful problem to solve with a gateway.
Skip the gateway if: you use exactly one AI provider and have no plans to switch; your AI calls are low-volume (under 10,000/day) and cost visibility isn't a priority; your requests are latency-sensitive and the gateway adds more latency than you can accept; or you're building a prototype and adding the gateway would slow down iteration. In these cases, calling the provider's SDK directly is simpler, easier to debug, and doesn't add a dependency that could fail.
Add the gateway when: you're actively using multiple providers and want unified cost tracking; you've had provider outages that impacted your users and want automatic failover; your AI API costs are significant enough that 30-50% cache savings would be material; or you need per-team or per-user spending controls that the provider's dashboard doesn't support natively.
The risk most teams don't consider: the gateway itself becomes a critical dependency. A self-hosted LiteLLM instance that goes down takes all your AI features with it. Managed gateways (Portkey, Helicone) handle this with their own SLAs, but they add a vendor dependency. Design your fallback to fail open — if the gateway is unreachable, fall back to calling the provider directly rather than returning an error to users.
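A fail-open wrapper can be sketched in a few lines. Both call functions below are hypothetical stand-ins for real SDK calls:

```python
class GatewayDown(Exception):
    pass

def call_via_gateway(prompt: str) -> str:
    raise GatewayDown()  # simulate a gateway outage

def call_provider_directly(prompt: str) -> str:
    return "direct: " + prompt

def complete(prompt: str) -> str:
    try:
        return call_via_gateway(prompt)
    except GatewayDown:
        # Fail open: lose caching and observability for this request,
        # but keep the AI feature working for users.
        return call_provider_directly(prompt)

print(complete("hi"))  # → "direct: hi"
```

The direct path skips everything the gateway provides, so it should be paired with alerting: failing open silently for days defeats the purpose of having a gateway.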
Self-Hosted vs Managed Gateway
The self-hosted vs managed tradeoff for AI gateways is sharper than for most infrastructure choices because AI API traffic carries sensitive data (user conversations, documents, prompts) that some organizations need to keep on their own infrastructure.
Self-hosted (LiteLLM Proxy): Deploy as a Docker container, configure your provider API keys, and you have a local OpenAI-compatible proxy. All traffic stays in your network — no third-party ever sees your prompts or responses. You're responsible for uptime, updates, and scaling. LiteLLM's configuration is YAML-based and reasonably simple for basic setups; it gets complex quickly for advanced routing rules with multiple providers and fallback chains. Works well for small-to-medium teams with a DevOps function.
Managed (Portkey, Helicone): Zero infrastructure to manage. Traffic routes through their servers, which is the core tradeoff: they do see your requests. Both providers have SOC 2 Type II certifications and data processing agreements, but regulated industries (healthcare, finance) or applications processing sensitive user data may still prefer self-hosted. Managed gateways typically have better UIs for analytics and are faster to set up — Portkey's dashboard shows per-model cost breakdowns, error rates, and latency percentiles out of the box.
Cloudflare AI Gateway: A distinct option that sits on Cloudflare's edge infrastructure, with all the latency benefits of Cloudflare's global network. It's primarily for teams already on Cloudflare. Unlike LiteLLM (which proxies to your providers), Cloudflare AI Gateway is positioned as an observability and caching layer rather than a full multi-model routing solution.
Smart Model Routing in Practice
Model routing — sending different types of requests to different models — is the highest-value gateway feature, but requires careful implementation to get right.
The naive approach routes by task type defined in your code: "this is a summarization request, use Gemini Flash; this is a reasoning request, use Claude Sonnet." This works but requires you to label every request at call time, and gets brittle as your use cases evolve. The more sophisticated approach uses the request content and metadata to route automatically.
LLM-based routing: Tools like Martian and Not Diamond use a smaller, fast model to classify each request and select the best model for it. The classifier runs on a small model quick enough that the routing hop adds only a few tens of milliseconds, negligible next to the latency of the main model call, and it can reduce costs by 40-60% by routing simple requests to cheaper models without manual labeling. The tradeoff: you're now running an LLM to route to your LLM, which adds cost and another failure point.
Confidence-based routing: Route based on the complexity or confidence of the request. A simple pattern: if your cheap model returns a low-confidence response (high token count, hedging language, requests for clarification), automatically retry with a more capable model. This is harder to implement than static routing but adapts to request complexity automatically.
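A minimal version of confidence-based escalation in Python. The model functions and the hedging-signal list are illustrative stand-ins, not a real routing API:

```python
# Illustrative signals that a cheap model's answer is low-confidence.
HEDGING_SIGNALS = ("i'm not sure", "could you clarify", "it depends")

def looks_low_confidence(answer: str) -> bool:
    text = answer.lower()
    return any(signal in text for signal in HEDGING_SIGNALS)

def cheap_model(prompt: str) -> str:
    # Stand-in for a fast, inexpensive model call.
    return "I'm not sure. Could you clarify the question?"

def strong_model(prompt: str) -> str:
    # Stand-in for a slower, more capable model call.
    return "Definitive answer."

def complete(prompt: str) -> str:
    answer = cheap_model(prompt)
    if looks_low_confidence(answer):
        return strong_model(prompt)  # escalate to the stronger model
    return answer

print(complete("hard question"))  # → "Definitive answer."
```

The cost model matters here: escalated requests pay for two model calls, so this only saves money when most traffic is answerable by the cheap model.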
Latency-based routing: In real-time applications where response speed matters, route to whichever provider currently has the lowest latency. AI APIs have variable latency depending on server load — Groq may be faster than Claude at 3pm on a Tuesday but slower during a traffic spike. Gateway-level latency routing can improve p95 response times by 20-40% for latency-sensitive applications.
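Latency-based routing reduces to tracking a rolling window of observed latencies per provider and picking the current fastest. A Python sketch (provider names and sample latencies are illustrative):

```python
from collections import deque

class LatencyRouter:
    def __init__(self, providers, window: int = 20):
        # Rolling window of recent latency samples per provider.
        self.latencies = {p: deque(maxlen=window) for p in providers}

    def record(self, provider: str, latency_ms: float):
        self.latencies[provider].append(latency_ms)

    def pick(self) -> str:
        def avg(provider):
            samples = self.latencies[provider]
            # Providers with no samples yet average to 0, so they get
            # tried first; that doubles as a simple exploration policy.
            return sum(samples) / len(samples) if samples else 0.0
        return min(self.latencies, key=avg)

router = LatencyRouter(["groq", "anthropic"])
for ms in (120, 135, 900):   # groq spikes under load
    router.record("groq", ms)
for ms in (300, 310, 305):   # anthropic stays steady
    router.record("anthropic", ms)
print(router.pick())  # → "anthropic" (avg 305ms vs groq's 385ms)
```

A production version would age out stale samples and use p95 rather than the mean, since tail latency is usually what users notice.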
Methodology
LiteLLM supports 100+ providers as documented in its litellm/providers module; the actual support depth varies — major providers (OpenAI, Anthropic, Google, Cohere) are well-tested while smaller providers may have edge cases. GitHub star counts are approximate as of early 2026 and change frequently. The 30-50% cache savings figure assumes semantic caching with a 0.95 cosine similarity threshold on conversational or support use cases; savings are lower for unique-per-user queries (creative tasks, personalized recommendations) and higher for FAQ-style queries where many users ask the same question with slightly different wording. Portkey's free tier includes 10,000 requests/month; Helicone's free tier includes 100,000 requests/month.
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Not caching identical requests | 30-50% wasted spend | Enable semantic caching |
| Using frontier model for all tasks | 10x overspend | Route simple tasks to cheap models |
| No fallback configured | Outage when primary provider goes down | Set up at least 2 fallback providers |
| Ignoring token usage by feature | Can't optimize | Track per-feature costs |
| Gateway as single point of failure | Gateway down = everything down | Self-host or use multiple gateway instances |
Compare AI gateways and model providers on APIScout — pricing, model support, reliability, and developer experience.
Related: How Open-Source AI Models Are Disrupting Closed APIs, Rise of Developer-First APIs: What Makes Them Different, API Cost Optimization