API Rate Limiting Best Practices for 2026

APIScout Team

Rate limiting is not optional. Every production API needs it — to protect infrastructure, ensure fair access, control costs, and prevent abuse. The question is not whether to rate limit, but which algorithm to use, how to communicate limits to clients, and how to implement it at scale. This guide compares five rate limiting algorithms head-to-head, walks through a Redis-based implementation, and covers the headers, patterns, and client-side handling that separate well-designed APIs from frustrating ones.

TL;DR

Use token bucket for most APIs — it handles bursts gracefully while enforcing an average rate, and it is the algorithm behind Stripe and AWS. Communicate limits with X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After headers on every response. Implement with Redis for distributed systems. Design tier-based limits from day one, and always return 429 Too Many Requests (not 403 or 500) when a client exceeds its quota.

Key Takeaways

  • Five algorithms exist, each with distinct tradeoffs around accuracy, memory, burst handling, and implementation complexity.
  • Token bucket is the industry standard for a reason: it allows controlled bursts without sacrificing average rate enforcement.
  • Standard headers (X-RateLimit-* and Retry-After) are non-negotiable. Clients cannot self-throttle without them.
  • Redis is the go-to backend for distributed rate limiting — atomic operations, built-in TTL, and sub-millisecond latency.
  • Tier-based limits enable monetization. Free, Pro, and Enterprise plans should have distinct rate allocations.
  • Exponential backoff with jitter is the correct client-side pattern. Immediate retries cause thundering herd problems.

Algorithm Comparison

All five algorithms solve the same problem — controlling request throughput — but they make different tradeoffs.

| Algorithm | Burst Handling | Accuracy | Memory Usage | Complexity | Used By |
|---|---|---|---|---|---|
| Fixed Window | Poor (burst at edges) | Approximate | Low | Very low | Simple internal APIs |
| Sliding Window Log | Excellent | Exact | High | Medium | High-precision systems |
| Sliding Window Counter | Good | Approximate | Low | Medium | Cloudflare |
| Token Bucket | Excellent (configurable) | Exact | Low | Medium | Stripe, AWS |
| Leaky Bucket | None (smoothed) | Exact | Low | Medium | Traffic shaping, Shopify |

Deep Dive: Each Algorithm

Fixed Window

The simplest approach. Divide time into fixed intervals (e.g., 1-minute windows starting at :00, :01, :02) and count requests per window. When the count hits the limit, reject until the next window.

How it works: A counter increments for each request within a window. At the window boundary, the counter resets to zero.

The burst problem: A client can send 100 requests at 12:00:59 and 100 more at 12:01:00 — 200 requests in 2 seconds while technically respecting a 100-per-minute limit. This happens because the counter resets at the boundary, allowing a double-burst.

When to use it: Internal APIs or prototypes where simplicity outweighs precision. Not recommended for public APIs where burst behavior matters.

Sliding Window Log

Track the timestamp of every request. To check if a new request is allowed, count all timestamps within the last N seconds from now.

How it works: Each request's timestamp is stored in a sorted set. On each new request, remove entries older than the window, count remaining entries, and compare against the limit.

The accuracy advantage: No boundary problem. The window always represents exactly the last N seconds relative to the current moment. A request at 12:00:59 and a request at 12:01:00 are both measured against a window that looks back exactly 60 seconds.

The memory cost: Every request timestamp must be stored. At 10,000 requests per minute per client, that is 10,000 entries per client per window. For APIs with thousands of clients, this scales quickly.

When to use it: When exact rate enforcement is critical and the request volume per client is moderate — billing APIs, authentication endpoints, or compliance-sensitive systems.

Sliding Window Counter

A hybrid of fixed window and sliding window. Instead of storing every timestamp, keep counters for the current and previous fixed windows. Estimate the sliding window count using a weighted average.

How it works: If the current window is 40% elapsed, the estimated count is (previous_window_count * 0.6) + current_window_count. This approximates a true sliding window with the memory footprint of a fixed window.

The tradeoff: The count is an approximation, not an exact number. In practice, the error is small enough for most use cases. Cloudflare uses this approach across their network — it handles billions of requests per day with minimal memory overhead.

When to use it: When you need better accuracy than fixed window but cannot afford the memory overhead of sliding window log. This is the sweet spot for most medium-scale APIs.

Token Bucket

A bucket holds tokens, filled at a steady rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which controls burst size.

How it works: Two parameters define the behavior — the refill rate (tokens per second) and the bucket capacity (maximum tokens). A client can burst up to the capacity, then must wait for tokens to refill. Over time, the average rate converges to the refill rate.

Why Stripe uses it: Token bucket allows a client to send a burst of requests (e.g., creating 25 subscriptions in rapid succession during a batch operation) without being throttled, as long as the client's average rate stays within limits. This matches real-world API usage patterns where traffic is bursty, not uniform.

Configuration example:

  • Refill rate: 100 tokens per second
  • Bucket capacity: 250 tokens
  • Result: A client can burst up to 250 requests instantly, then sustains 100 requests per second

When to use it: Most public APIs. It is the default recommendation because real traffic is bursty and token bucket handles that naturally.
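A sketch of the two-parameter behavior described above, with an injectable clock for clarity (class and parameter names are illustrative; a distributed version would keep the token count and last-refill time in shared storage):

```python
class TokenBucket:
    """Illustrative token bucket: refills at refill_rate tokens/sec up to capacity."""

    def __init__(self, refill_rate: float, capacity: float, now: float = 0.0):
        self.refill_rate = refill_rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With refill_rate=100 and capacity=250, this reproduces the configuration example: a 250-request burst drains the bucket, after which the sustained rate converges to 100 requests per second.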

Leaky Bucket

Requests enter a queue (the bucket) and are processed at a fixed, constant rate. If the queue is full, new requests are dropped.

How it works: The bucket has a fixed size (queue depth) and a fixed leak rate (processing rate). Incoming requests fill the bucket; the bucket drains at a constant rate. If a burst arrives, requests queue up and are processed smoothly. If the queue overflows, excess requests are rejected.

The smoothing effect: Unlike token bucket, leaky bucket does not allow bursts in output. The processing rate is always constant. This is ideal for systems where downstream services cannot handle traffic spikes — database write endpoints, webhook delivery systems, or APIs fronting legacy infrastructure.

The latency cost: Queued requests experience additional latency. A request that arrives when the queue has 50 items ahead of it must wait for all 50 to be processed before it is served. This makes leaky bucket unsuitable for latency-sensitive endpoints.

When to use it: Traffic shaping, webhook delivery, or any system where a constant output rate is more important than low latency. Shopify uses leaky bucket for their REST Admin API.
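For admission control (as opposed to actual queueing), the leaky bucket is often implemented as a meter: the level rises by one per request and drains at the leak rate, and a request that would overflow the bucket is rejected. This sketch shows that variant; the names are illustrative:

```python
class LeakyBucket:
    """Leaky bucket as a meter: level rises per request, drains at leak_rate/sec.
    A request that would overflow the bucket is rejected."""

    def __init__(self, capacity: float, leak_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0
        self.last_leak = now

    def allow(self, now: float) -> bool:
        # Drain according to elapsed time, never below empty
        elapsed = now - self.last_leak
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_leak = now
        if self.level + 1.0 > self.capacity:
            return False
        self.level += 1.0
        return True
```

A full traffic-shaping deployment would also need the queue and worker that process admitted requests at the constant leak rate.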

Implementation: Redis-Based Sliding Window Counter

Redis is the standard backend for distributed rate limiting. Its atomic operations, built-in key expiration (TTL), and sub-millisecond latency make it purpose-built for this workload. Here is a production-ready sliding window counter implementation:

import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def is_rate_limited(client_id: str, limit: int, window_seconds: int) -> dict:
    """
    Sliding window counter rate limiter using Redis.

    Args:
        client_id: Unique identifier (API key, user ID, etc.)
        limit: Maximum requests allowed per window
        window_seconds: Window duration in seconds

    Returns:
        dict with 'allowed' (bool), 'remaining' (int), and 'reset' (int) keys
    """
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1
    window_elapsed = (now % window_seconds) / window_seconds

    current_key = f"rate:{client_id}:{current_window}"
    previous_key = f"rate:{client_id}:{previous_window}"

    pipe = r.pipeline()
    pipe.get(previous_key)
    pipe.get(current_key)
    results = pipe.execute()

    previous_count = int(results[0] or 0)
    current_count = int(results[1] or 0)

    # Weighted estimate: blend previous window's count with current
    estimated_count = (previous_count * (1 - window_elapsed)) + current_count

    if estimated_count >= limit:
        reset_time = (current_window + 1) * window_seconds
        return {
            "allowed": False,
            "remaining": 0,
            "reset": int(reset_time),
            "retry_after": int(reset_time - now) + 1,
        }

    # Increment current window counter atomically
    pipe = r.pipeline()
    pipe.incr(current_key)
    pipe.expire(current_key, window_seconds * 2)  # TTL covers current + next window
    pipe.execute()

    remaining = max(0, int(limit - estimated_count - 1))
    reset_time = (current_window + 1) * window_seconds

    return {
        "allowed": True,
        "remaining": remaining,
        "reset": int(reset_time),
    }

Why this works at scale:

  • Two keys per client — only the current and previous window counters are stored. Memory usage is constant per client regardless of request volume.
  • Atomic operations — INCR and EXPIRE run in a pipeline, and the increment itself is atomic, so concurrent requests never lose counts. The separate read-then-increment check can overshoot the limit slightly under heavy concurrency, which is acceptable for an approximate algorithm.
  • Automatic cleanup — Redis TTL ensures expired window counters are garbage-collected without application-side cleanup logic.
  • No Lua scripting required — the weighted estimate is computed client-side, keeping the Redis interaction simple and fast.

Rate Limit Response Headers

Every API response should include rate limit headers. Clients cannot self-throttle without this information. While there is no universally ratified RFC for rate limit headers, the X-RateLimit-* convention is the de facto standard used by Stripe, GitHub, Twitter/X, and most major APIs.

Standard Headers

| Header | Description | Example | When to Include |
|---|---|---|---|
| X-RateLimit-Limit | Maximum requests allowed per window | 1000 | Every response |
| X-RateLimit-Remaining | Requests remaining in current window | 847 | Every response |
| X-RateLimit-Reset | Unix timestamp when the window resets | 1741478400 | Every response |
| Retry-After | Seconds to wait before retrying | 30 | 429 responses only |

The 429 Response

When a client exceeds its rate limit, return HTTP 429 Too Many Requests with both headers and a structured body:

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1741478400
Retry-After: 30

{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Retry after 30 seconds.",
    "retry_after": 30,
    "limit": 1000,
    "reset_at": "2026-03-09T00:00:00Z"
  }
}

Critical rules:

  1. Always use 429, not 403 (forbidden implies authorization failure) or 500 (server error triggers automatic retries in many clients).
  2. Always include Retry-After — this is the single most important header for clients. Without it, clients guess, and they guess wrong.
  3. Include both Unix timestamp and human-readable time in the response body for debugging convenience.
  4. Use a machine-readable error type so clients can programmatically distinguish rate limiting from other errors.
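The four rules above can be folded into one framework-agnostic helper. This is a sketch, not a specific framework's API — the function name and return shape are assumptions:

```python
import json

def rate_limit_response(limit: int, reset_unix: int, retry_after: int,
                        reset_iso: str) -> tuple[int, dict, str]:
    """Build the status code, headers, and JSON body for a 429 response."""
    headers = {
        "Content-Type": "application/json",
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_unix),   # machine-readable Unix timestamp
        "Retry-After": str(retry_after),        # the single most important header
    }
    body = json.dumps({
        "error": {
            "type": "rate_limit_exceeded",      # machine-readable error type
            "message": f"Rate limit exceeded. Retry after {retry_after} seconds.",
            "retry_after": retry_after,
            "limit": limit,
            "reset_at": reset_iso,              # human-readable time for debugging
        }
    })
    return 429, headers, body
```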

How Major APIs Handle Rate Limiting

| API | Algorithm | Limits | Key Behavior |
|---|---|---|---|
| Stripe | Token bucket | 100 req/s (live), 25 req/s (test) | Returns Retry-After header. Separate limits per API key. Burst-friendly. |
| GitHub | Fixed window | 5,000 req/hr (authenticated) | 60 req/hr unauthenticated. Returns X-RateLimit-* headers. Secondary limits on content creation. |
| Twitter/X | Fixed window | Varies by endpoint and tier | Per-app and per-user limits. Free tier: 1,500 tweets/month. Read limits vary by endpoint. |
| OpenAI | Token bucket | Varies by model and tier | Limits on both requests-per-minute and tokens-per-minute. Separate limits per model. |
| Shopify | Leaky bucket | 2 req/s (REST), 50 cost/s (GraphQL) | Bucket size of 40 (REST). Returns remaining capacity in headers. |
| Cloudflare | Sliding window counter | Configurable per zone | Edge-enforced. Rate limiting rules configurable per URL pattern. |

Client-Side Handling: Exponential Backoff with Jitter

When your application receives a 429 response, the correct pattern is exponential backoff with full jitter:

Attempt 1: wait 1s  + random(0, 1s)   = 1.0-2.0s
Attempt 2: wait 2s  + random(0, 2s)   = 2.0-4.0s
Attempt 3: wait 4s  + random(0, 4s)   = 4.0-8.0s
Attempt 4: wait 8s  + random(0, 8s)   = 8.0-16.0s
Attempt 5: wait 16s + random(0, 16s)  = 16.0-32.0s
Max attempts: 5 (then fail with error)

Why jitter matters: Without jitter, 1,000 clients that all get rate-limited at the same time will all retry at exactly 1s, then 2s, then 4s — creating synchronized spikes that overwhelm the API again. Random jitter spreads retries across the wait interval, preventing the thundering herd problem.

Always prefer Retry-After: If the API returns a Retry-After header, use that value as the wait time instead of calculating your own backoff. The server knows its own capacity better than your client-side heuristic.

Tier-Based Rate Limits

Rate limiting enables API monetization. Define different limits per pricing tier and enforce them per API key or authentication token.

| Tier | Request Limit | Burst Capacity | Additional |
|---|---|---|---|
| Free | 100 req/hr | 10 req burst | Shared infrastructure |
| Pro | 5,000 req/hr | 100 req burst | Priority queue |
| Enterprise | 50,000 req/hr | 500 req burst | Dedicated infrastructure, custom limits |

Implementation pattern: Store the tier configuration alongside the API key. When a request arrives, look up the key, determine the tier, and apply the corresponding limit. Redis hash maps work well for this — store rate_config:{api_key} with fields for limit, window, and burst_capacity.
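A sketch of the lookup step, using an in-memory dict in place of the Redis hash (tier names and field names are illustrative, mirroring the table above):

```python
# Tier configurations; in production these would live in Redis
# (e.g., a hash per API key) or a database
TIER_LIMITS = {
    "free":       {"limit": 100,   "window_seconds": 3600, "burst": 10},
    "pro":        {"limit": 5000,  "window_seconds": 3600, "burst": 100},
    "enterprise": {"limit": 50000, "window_seconds": 3600, "burst": 500},
}

def limits_for(api_key: str, key_tiers: dict) -> dict:
    """Resolve an API key to its tier's rate configuration.
    Unknown keys fall back to the free tier."""
    tier = key_tiers.get(api_key, "free")
    return TIER_LIMITS[tier]
```

The resolved configuration then feeds directly into whichever limiter algorithm is in use.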

Cost-based limits add another dimension. Instead of counting requests equally, assign weights to operations based on their computational cost:

GET  /api/users           → 1 point
GET  /api/users?expand=*  → 5 points
POST /api/search          → 10 points
POST /api/ai/generate     → 50 points

Budget: 10,000 points/hour

This prevents a client from exhausting expensive resources (AI inference, full-text search) while staying within their raw request count.
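A minimal sketch of the cost-based check, with the routes and weights from the example above (both are hypothetical):

```python
# Point cost per operation; unknown routes default to 1 point
OPERATION_COSTS = {
    ("GET", "/api/users"): 1,
    ("GET", "/api/users?expand=*"): 5,
    ("POST", "/api/search"): 10,
    ("POST", "/api/ai/generate"): 50,
}

def charge(budget_remaining: int, method: str, path: str) -> tuple[bool, int]:
    """Deduct an operation's cost from the client's point budget.
    Returns (allowed, new_remaining)."""
    cost = OPERATION_COSTS.get((method, path), 1)
    if cost > budget_remaining:
        return False, budget_remaining
    return True, budget_remaining - cost
```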

How to Choose the Right Algorithm

Use this decision framework:

  1. Do you need to allow bursts? Yes --> Token bucket. No --> Leaky bucket.
  2. Is memory a constraint at your scale? Yes --> Fixed window or sliding window counter. No --> Sliding window log.
  3. Do you need exact counts for billing or compliance? Yes --> Sliding window log or token bucket. No --> Sliding window counter.
  4. Is implementation simplicity the priority? Yes --> Fixed window. No --> Token bucket.
  5. Are you shaping traffic for a downstream system that cannot handle spikes? Yes --> Leaky bucket.

Default recommendation: Start with token bucket. It handles the widest range of real-world traffic patterns, allows configurable burst behavior, and is battle-tested by Stripe, AWS, and most major API providers. Move to leaky bucket only if you need guaranteed smooth output, or to sliding window counter if you need Cloudflare-scale memory efficiency.

Methodology

This comparison is based on analysis of rate limiting implementations across major API providers (Stripe, GitHub, Twitter/X, Shopify, Cloudflare, AWS, OpenAI), the IETF draft on standardized RateLimit header fields (draft-ietf-httpapi-ratelimit-headers), distributed systems literature on rate limiting algorithms, and production experience implementing rate limiters with Redis. Algorithm characteristics (accuracy, memory usage, burst handling) are evaluated based on their theoretical properties and observed behavior at scale. Header conventions reflect the current industry consensus as of early 2026.


Building an API and need rate limiting? Explore API gateways, rate limiting tools, and best practices on APIScout — architecture guides, implementation comparisons, and developer resources.
