Portkey vs Kong AI Gateway: LLM Routing APIs 2026
TL;DR
Portkey if you're building LLM-first — it understands tokens, models, and prompts natively. Fallbacks, retries, semantic caching, cost routing, and per-model observability are all built in. Kong if you're already running Kong for traditional APIs — it's significantly faster (228% in Kong's own benchmark), and you can extend familiar API gateway patterns to AI workloads, though you sacrifice LLM-aware features. LiteLLM is the open-source alternative that competes with both. For most AI-native teams in 2026, Portkey wins on features and developer experience despite the latency overhead.
Key Takeaways
- Portkey latency overhead: ~20–40ms (AI-native features cost latency)
- Kong latency advantage: 228% faster than Portkey in Kong's own benchmark (independent tests show smaller gaps)
- Portkey pricing: Free (10K requests/month) → $49/month (100K requests) → Enterprise
- Kong AI Gateway pricing: Complex multi-dimensional model — gateway + request + plugin fees; $30+ per million requests at scale
- Semantic caching: Portkey has it natively (exact + semantic); Kong requires custom plugin
- Token observability: Portkey (native); Kong (treats requests as opaque blobs, no token-level data)
- Model fallbacks: Portkey (native, configured in JSON); Kong (requires custom Lua/Python plugin)
What AI Gateways Do
An AI gateway sits between your application and LLM providers (OpenAI, Anthropic, Gemini, etc.). Instead of calling providers directly, your app calls the gateway, which handles:
- Load balancing across providers/models
- Fallbacks when a provider is down or rate-limited
- Caching to avoid redundant LLM calls
- Observability (cost tracking, latency, error rates)
- Rate limiting and quota management
- Authentication and API key management
The question is whether you want a gateway that understands LLMs (tokens, models, prompts) or a gateway that treats LLM calls as generic HTTP requests.
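The fallback behaviour listed above is the heart of what every gateway does. A minimal, provider-agnostic sketch — the `call` function stands in for a provider-specific HTTP request, and is not any gateway's actual implementation:

```typescript
// Minimal sketch of gateway-style fallback: try providers in order,
// moving to the next when one is down or rate-limited.
type CallFn = (provider: string, prompt: string) => Promise<string>;

async function withFallback(
  providers: string[],
  prompt: string,
  call: CallFn,
): Promise<string> {
  let lastError: unknown = new Error('no providers configured');
  for (const provider of providers) {
    try {
      return await call(provider, prompt);
    } catch (err) {
      lastError = err; // provider failed → try the next one
    }
  }
  throw lastError;
}
```

Portkey and Kong differ in where this loop lives (managed config vs plugin) and in what they know about each request while running it.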
Portkey: AI-Native Gateway
Getting Started
```typescript
import Portkey from 'portkey-ai';

// Drop-in replacement for the OpenAI SDK
const portkey = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
  virtualKey: 'openai-key-abc123', // Your stored OpenAI key
});

// Works exactly like the OpenAI SDK
const completion = await portkey.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
});
```
For existing OpenAI SDK code, add Portkey as a base URL:
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.PORTKEY_API_KEY,
  baseURL: 'https://api.portkey.ai/v1',
  defaultHeaders: {
    'x-portkey-virtual-key': 'openai-key-abc123',
  },
});

// Your existing OpenAI code works unchanged
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write a haiku.' }],
});
```
Fallback Configuration (Configs)
Portkey's most powerful feature is declarative routing via JSON configs:
```json
{
  "strategy": {
    "mode": "fallback"
  },
  "targets": [
    {
      "virtual_key": "openai-key-abc123",
      "model": "gpt-4o",
      "override_params": { "max_tokens": 4096 }
    },
    {
      "virtual_key": "anthropic-key-abc123",
      "model": "claude-3-5-sonnet-20241022"
    },
    {
      "virtual_key": "groq-key-abc123",
      "model": "llama-3.3-70b-versatile"
    }
  ]
}
```

Reference the config by ID in your app:

```typescript
const portkey = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
  config: 'config-id-from-dashboard',
});

// If OpenAI is down → falls back to Anthropic → then Groq
const response = await portkey.chat.completions.create({
  messages: [{ role: 'user', content: 'Help me debug this code.' }],
  // model is determined by the config's primary target
});
```
Load Balancing and Cost Routing
```jsonc
// Load balance between GPT-4o and GPT-4o-mini by cost
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "virtual_key": "openai-key-abc123",
      "model": "gpt-4o-mini",
      "weight": 0.8 // 80% of traffic
    },
    {
      "virtual_key": "openai-key-abc123",
      "model": "gpt-4o",
      "weight": 0.2 // 20% for complex queries
    }
  ]
}
```
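Under the hood, weighted load balancing reduces to weighted random selection across targets. A minimal sketch of the idea — an illustration, not Portkey's actual implementation:

```typescript
// Pick one target at random, proportional to its weight.
interface Target {
  model: string;
  weight: number;
}

function pickTarget(targets: Target[], rand: () => number = Math.random): Target {
  const total = targets.reduce((sum, t) => sum + t.weight, 0);
  let r = rand() * total;
  for (const t of targets) {
    r -= t.weight;
    if (r <= 0) return t;
  }
  // Guard against floating-point drift
  return targets[targets.length - 1];
}
```

With weights 0.8/0.2 as above, roughly four of five requests land on the cheaper model.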
Semantic Caching
Portkey's caching goes beyond exact-match:
```typescript
const portkey = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
  config: {
    cache: {
      mode: 'semantic', // Fuzzy match on similar prompts
      maxAge: 3600, // Cache TTL in seconds
      // "What is the capital of France?" and
      // "Tell me the capital city of France" both hit the cache
    },
  },
});

// Or exact-match for deterministic prompts
const portkeyExact = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
  config: {
    cache: {
      mode: 'exact',
      maxAge: 86400,
    },
  },
});
```
Token and Cost Observability
```typescript
// Portkey tracks token usage, costs, and latency automatically;
// access the data via the dashboard or the API
const response = await portkey.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: prompt }],
});

// Attach a trace ID and metadata for filtering in the dashboard
const portkeyWithMeta = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
  traceId: `trace-${Date.now()}`, // Custom trace ID
  metadata: JSON.stringify({
    userId: user.id,
    featureName: 'code-review',
    environment: 'production',
  }),
});

// Query costs programmatically
const analytics = await portkey.analytics.list({
  startDate: '2026-03-01',
  endDate: '2026-03-15',
  groupBy: 'model',
});
// Returns a cost breakdown by model, user, feature, etc.
```
Guardrails
```typescript
// Add content safety without changing your app code
const config = {
  guardrails: {
    input: [
      {
        type: 'regex',
        pattern: '(credit card|SSN|social security)',
        action: 'block',
        message: 'PII detected in input',
      },
    ],
    output: [
      {
        type: 'pii_detection',
        action: 'redact',
      },
    ],
  },
};
```
Kong AI Gateway
Kong is the enterprise API gateway that added AI capabilities. Its architecture treats LLM calls as enhanced HTTP requests through the same plugin infrastructure used for traditional APIs.
Setup
```yaml
# kong.yml
_format_version: "3.0"
services:
  - name: openai-service
    url: https://api.openai.com
    routes:
      - name: chat-completions
        paths:
          - /v1/chat/completions
plugins:
  - name: ai-proxy
    service: openai-service
    config:
      route_type: llm/v1/chat
      auth:
        header_name: Authorization
        header_value: Bearer ${OPENAI_API_KEY}
      model:
        provider: openai
        name: gpt-4o
        options:
          max_tokens: 4096
          temperature: 0.7
```
AI Proxy Plugin
```yaml
# AI proxy with provider fallback
plugins:
  - name: ai-proxy-advanced
    config:
      targets:
        - model:
            provider: openai
            name: gpt-4o
          auth:
            header_name: Authorization
            header_value: Bearer ${OPENAI_API_KEY}
          weight: 100
        - model:
            provider: anthropic
            name: claude-3-5-sonnet-20241022
          auth:
            header_name: x-api-key
            header_value: ${ANTHROPIC_API_KEY}
          weight: 0 # Failover only
      balancer:
        algorithm: round-robin
```
Rate Limiting
Kong's rate limiting (its traditional strength):
```yaml
plugins:
  - name: rate-limiting
    config:
      second: 10
      minute: 100
      hour: 1000
      policy: redis
      redis:
        host: redis-host
        port: 6379
```
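Conceptually, the counting behind a rate limit like this is simple. A minimal fixed-window sketch — an illustration of the idea, not Kong's implementation, which also supports Redis-backed cluster-wide counters as configured above:

```typescript
// Allow up to `limit` requests per fixed time window; reject the rest.
class FixedWindowLimiter {
  private count = 0;
  private windowStart = 0;

  constructor(private limit: number, private windowMs: number) {}

  allow(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      // New window: reset the counter
      this.windowStart = nowMs;
      this.count = 0;
    }
    return ++this.count <= this.limit;
  }
}
```

Kong layers several such windows (second, minute, hour) and shares the counters via Redis so every gateway node enforces the same limit.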
What Kong Lacks for LLMs
Traditional API Gateway metrics Kong tracks:
✅ HTTP status codes
✅ Request/response latency
✅ Requests per second
✅ Bandwidth bytes
LLM-specific metrics Kong does NOT track:
❌ Token count (input/output)
❌ Token cost (no model pricing knowledge)
❌ Prompt content (opaque blob)
❌ Model-aware routing (no semantic understanding)
❌ Cache hits based on semantic similarity
❌ Cost per request to different providers
Kong sees: POST /v1/chat/completions with 2KB body
Portkey sees: 850 input tokens → gpt-4o → 320 output tokens → $0.0053 total
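The per-request dollar figure in that example is just token counts multiplied by per-token prices. A sketch with illustrative prices — the USD-per-million-token rates below are assumptions; check the provider's pricing page for current numbers:

```typescript
// Per-request cost = tokens × per-token price, summed over input and output.
// Prices are illustrative (USD per 1M tokens), not authoritative.
const PRICING = {
  'gpt-4o': { input: 2.5, output: 10.0 },
} as const;

function requestCost(
  model: keyof typeof PRICING,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICING[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

console.log(requestCost('gpt-4o', 850, 320).toFixed(4)); // → "0.0053"
```

Portkey does this bookkeeping for you on every request; with Kong you would compute it yourself from provider responses.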
Performance Benchmark
Kong's published benchmark (Kong vs Portkey vs LiteLLM):
Environment: AWS, same region as the OpenAI API.

Latency overhead (p50):
- Kong: 12ms
- Portkey: 27ms (+125% vs Kong)
- LiteLLM: 95ms (+692% vs Kong)

Throughput:
- Kong: 8,200 req/s
- Portkey: 2,100 req/s
- LiteLLM: 890 req/s

For real-time chat applications (200ms total budget):
- 12ms gateway overhead = 6% of budget (acceptable)
- 27ms gateway overhead = 13.5% of budget (meaningful)
- 95ms gateway overhead = 47.5% of budget (problematic)
Context: This benchmark was published by Kong. Independent benchmarks show smaller differences. For most SaaS applications with 500ms+ round-trip LLM calls, 15ms extra overhead is irrelevant.
Pricing Comparison
Portkey pricing:
- Free: 10,000 requests/month, 30-day logs
- Starter: $49/month — 100K requests, 90-day logs
- Business: $249/month — 1M requests, advanced guardrails
- Enterprise: custom — unlimited requests, HIPAA, dedicated infra
- Overage: $9 per additional 100K requests (Starter)

Kong pricing:
- Free (self-hosted): open-source, no request limits
- Konnect (cloud): from $0.016 per unit (complex pricing)
- Enterprise: multi-dimensional — $0.068/unit in the Starter tier

Note: Kong Konnect has five or more pricing dimensions (gateway services, request units, paid plugins, premium plugins), making total cost unpredictable at scale. Users report $30+/million requests for the full feature stack.
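To compare the two models at a given volume, here is a back-of-the-envelope sketch using the published figures above. It is deliberately simplified — it ignores Portkey's overage pricing and Kong's multi-dimensional units, and the $30/million Kong figure is the reported estimate, not a list price:

```typescript
// Rough monthly cost by request volume, from the published tiers above.
function portkeyMonthly(requests: number): number {
  if (requests <= 10_000) return 0;       // Free tier
  if (requests <= 100_000) return 49;     // Starter
  if (requests <= 1_000_000) return 249;  // Business
  return NaN;                             // Enterprise: custom pricing
}

function kongKonnectMonthly(requests: number, perMillion = 30): number {
  // Uses the reported "$30+/million requests" full-stack estimate
  return (requests / 1_000_000) * perMillion;
}

console.log(portkeyMonthly(500_000));     // → 249
console.log(kongKonnectMonthly(500_000)); // → 15
```

At moderate volumes Kong's per-request model can undercut Portkey's tiers; the crossover depends heavily on which Kong plugins and dimensions you actually pay for.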
Feature Comparison
| Feature | Portkey | Kong AI Gateway |
|---|---|---|
| Latency overhead | ~20–40ms | ~8–15ms |
| Throughput | ~2,100 req/s | ~8,200 req/s |
| Model fallbacks | ✅ Native (JSON config) | ⚠️ Via custom plugin |
| Token observability | ✅ Native | ❌ Not available |
| Cost tracking | ✅ Per-request, per-model | ❌ Not available |
| Semantic caching | ✅ Native | ⚠️ ai-semantic-cache plugin (Enterprise) |
| Exact-match caching | ✅ | ✅ (ai-semantic-cache plugin) |
| Provider normalization | ✅ (OpenAI-compatible for all) | ⚠️ Per-provider config |
| Prompt management | ✅ Prompt hub + versioning | ❌ |
| Guardrails | ✅ Native | ❌ Custom plugin required |
| RBAC / teams | ✅ | ✅ Enterprise |
| SOC 2 Type II | ✅ | ✅ |
| HIPAA | ✅ Enterprise | ✅ Enterprise |
| Self-hosted | ✅ (open-source version) | ✅ (open-source) |
| Traditional API routing | ⚠️ Limited | ✅ Full-featured |
| Learning curve | Low (JSON configs) | High (Lua/YAML/Admin API) |
LiteLLM: The Open-Source Option
Worth mentioning: LiteLLM is the open-source alternative that both Portkey and Kong compete against:
```python
from litellm import completion

# Unified interface across 100+ providers
response = completion(
    model='openai/gpt-4o',
    messages=[{'role': 'user', 'content': 'Hello!'}],
    # Fallback order if the primary provider fails
    fallbacks=['anthropic/claude-3-5-sonnet-20241022', 'groq/llama-3.3-70b'],
)

# Or run the LiteLLM proxy server:
#   litellm --model openai/gpt-4o
# Any OpenAI-compatible client works against http://localhost:4000
```
LiteLLM is slower than both Portkey and Kong (95ms+ overhead), but self-hosted deployments cost $0 and include the full feature set.
Decision Guide
Choose Portkey if:
- Your primary concern is LLM-specific features (fallbacks, cost tracking, semantic cache)
- You're building AI-first applications where understanding token costs matters
- You want a fast setup — JSON configs, no DevOps
- HIPAA or SOC 2 is required on the managed platform
- Budget: $49–$249/month is acceptable
Choose Kong if:
- You're already running Kong for traditional APIs — unify AI and HTTP routing
- Latency performance at extreme scale (10K+ req/s) is critical
- You have DevOps capacity to write plugins for LLM-specific needs
- You want open-source self-hosting with no per-request fees
Choose LiteLLM if:
- You're cost-sensitive and have DevOps capacity
- You want full LLM-aware features (fallbacks, load balancing) at $0 cost
- Latency overhead is acceptable in your use case
Browse all AI gateway and LLM infrastructure APIs at APIScout.
Related: OpenRouter vs LiteLLM: API Gateway for Multiple AI Models · The Rise of AI Gateway APIs · Kong vs Envoy vs Tyk vs AWS API Gateway 2026 · OpenRouter API: One Key for 500+ LLM Models · How AI Is Transforming API Design and Documentation
Prompt Management and Versioning
One capability with no Kong equivalent is Portkey's Prompt Hub — a central repository for storing, versioning, and A/B testing prompts across your application. When the system prompt for your customer support bot needs updating, you edit it in the Portkey dashboard and every deployment picks up the change without a code deploy. You can also A/B test prompt variants against production traffic and measure output quality differences before committing to a change.
This matters more than it sounds. Prompt engineering in production is an ongoing process: prompts that work well during development often degrade over time as model providers update their underlying models. Having prompt versioning separate from application code means prompt changes don't require engineering involvement. Marketing and content teams can iterate on prompt templates within guardrails you've defined — model, max tokens, guardrail rules — while the engineering team owns the routing and reliability layer.
Kong has no native equivalent. Building similar functionality on top of Kong means maintaining a separate prompt registry (database, versioning system, deployment pipeline) yourself. For LLM-first teams, that operational overhead is a concrete argument for Portkey despite the latency difference.
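To make the capability concrete, here is a toy sketch of what a minimal prompt registry provides — versioned templates resolved at request time, independent of code deploys. This is an illustration of the concept, not Portkey's Prompt Hub API:

```typescript
// Toy prompt registry: publish new versions, resolve the latest at runtime.
interface PromptVersion {
  version: number;
  template: string;
}

class PromptRegistry {
  private prompts = new Map<string, PromptVersion[]>();

  publish(name: string, template: string): number {
    const versions = this.prompts.get(name) ?? [];
    const version = versions.length + 1;
    versions.push({ version, template });
    this.prompts.set(name, versions);
    return version;
  }

  latest(name: string): string | undefined {
    const versions = this.prompts.get(name);
    return versions?.[versions.length - 1]?.template;
  }
}
```

A managed prompt hub adds the parts worth paying for: access control, audit history, A/B assignment, and rollback.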
Observability and Cost Attribution
The observability gap between Portkey and Kong is most visible in production cost attribution. Portkey tracks cost at the request level — model, token count, user ID, feature name, or any metadata you attach — allowing you to answer "how much does our AI-powered search feature cost per user per day?" Kong's observability is request-count and latency based: it can tell you that 10,000 requests went through in the last hour, but not what they cost or which features drove the expense.
For SaaS products that charge usage-based fees or need to attribute AI costs to customer accounts, Portkey's per-request cost tracking is a direct revenue and margin management tool. Without it, teams resort to estimating AI costs from cloud bills — which provide aggregate data with a 24–48 hour delay, not the real-time per-request granularity needed to implement per-user rate limits or usage-based billing.
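Once each request carries a cost and metadata, per-user attribution is a simple aggregation. A sketch — the record shape below is hypothetical, not Portkey's actual analytics schema:

```typescript
// Sum per-request costs by user for one feature, to answer
// "what does our AI search cost per user?"
interface CostRecord {
  userId: string;
  featureName: string;
  costUsd: number;
}

function costByUser(records: CostRecord[], feature: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    if (r.featureName !== feature) continue;
    totals.set(r.userId, (totals.get(r.userId) ?? 0) + r.costUsd);
  }
  return totals;
}
```

The same aggregation drives per-user rate limits and usage-based billing; the hard part is getting accurate per-request cost records in the first place, which is exactly what Kong doesn't emit.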
Methodology
Benchmark figures (the 228% latency advantage, throughput numbers) are from Kong's published AI Gateway performance report, conducted in AWS us-east-1 against a same-region OpenAI endpoint. Independent benchmarks from the LLM infrastructure community show smaller latency gaps (50–100% rather than 228%) — Kong's published figures represent best-case conditions. Pricing is sourced from the Portkey and Kong Konnect pricing pages as of March 2026; both providers change pricing frequently. The feature comparison was verified against the Portkey docs (portkey.ai/docs) and Kong's AI Gateway plugin documentation.
Compare OpenAI and Anthropic on APIScout.