How to Handle API Rate Limits Gracefully 2026
Every API has rate limits. Hit them, and your requests fail with 429 errors. Handle them poorly, and your users see errors, your batch jobs crash, and your integrations break. Handle them well, and your app stays reliable even when you're pushing limits.
The patterns in this guide form a hierarchy: start with exponential backoff (fixes immediate failures), add client-side rate limiting (prevents most failures), add monitoring (gives visibility into your utilization), and finally consider architecture changes (distributed rate limiting, queue-based processing) only when simpler approaches aren't enough. Most apps need the first two patterns; high-volume batch processing and multi-instance deployments need all of them.
How Rate Limits Work
Common Rate Limit Types
| Type | How It Works | Example |
|---|---|---|
| Requests per second | Fixed window of requests per second | 10 req/s |
| Requests per minute | Fixed window per minute | 100 req/min |
| Token bucket | Tokens refill at steady rate, burst allowed | 100 tokens, 10/s refill |
| Sliding window | Rolling time window, no burst edge | 100 req in any 60s window |
| Concurrent | Max simultaneous requests | 5 concurrent connections |
| Daily quota | Fixed daily limit | 10,000 req/day |
| Token-based (AI) | Tokens per minute (TPM) | 100K TPM |
Rate Limit Headers
Most APIs tell you about limits in response headers:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1704067200
Retry-After: 30
# Or the newer IETF draft standard (draft-ietf-httpapi-ratelimit-headers):
RateLimit-Limit: 100
RateLimit-Remaining: 87
RateLimit-Reset: 30
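Because both header families are in the wild, it helps to normalize them before deciding whether to throttle. A minimal sketch (the `parseRateLimitHeaders` function and `RateLimitInfo` interface are illustrative names, not a library API):

```typescript
// Normalize legacy X-RateLimit-* and draft-standard RateLimit-* headers.
interface RateLimitInfo {
  limit: number | null;
  remaining: number | null;
  reset: number | null; // epoch seconds (legacy) or delta seconds (standard)
}

function parseRateLimitHeaders(headers: Headers): RateLimitInfo {
  // Prefer the draft-standard name, fall back to the legacy X- prefix
  const get = (...names: string[]): number | null => {
    for (const name of names) {
      const value = headers.get(name);
      if (value !== null) return parseInt(value, 10);
    }
    return null;
  };
  return {
    limit: get('RateLimit-Limit', 'X-RateLimit-Limit'),
    remaining: get('RateLimit-Remaining', 'X-RateLimit-Remaining'),
    reset: get('RateLimit-Reset', 'X-RateLimit-Reset'),
  };
}
```

Note that the two families report `reset` differently (epoch timestamp vs. seconds from now), so callers still need to know which form the provider uses.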
The 429 Response
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json
{
"error": {
"code": "rate_limit_exceeded",
"message": "Rate limit exceeded. Please retry after 30 seconds.",
"retry_after": 30
}
}
Pattern 1: Exponential Backoff with Jitter
The most important pattern. Retry failed requests with increasing delays.
async function fetchWithRetry<T>(
url: string,
options: RequestInit,
maxRetries = 5
): Promise<T> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
const response = await fetch(url, options);
if (response.status === 429) {
        // Respect Retry-After if present: it may be integer seconds or an HTTP date
        const retryAfter = response.headers.get('Retry-After');
        let waitMs = calculateBackoff(attempt);
        if (retryAfter) {
          const seconds = parseInt(retryAfter, 10);
          waitMs = Number.isNaN(seconds)
            ? Math.max(0, new Date(retryAfter).getTime() - Date.now())
            : seconds * 1000;
        }
console.log(`Rate limited. Waiting ${waitMs}ms before retry ${attempt + 1}`);
await sleep(waitMs);
continue;
}
if (response.status >= 500 && attempt < maxRetries) {
// Server error — also worth retrying
await sleep(calculateBackoff(attempt));
continue;
}
if (!response.ok) {
throw new Error(`HTTP ${response.status}: ${await response.text()}`);
}
return response.json();
} catch (error) {
if (attempt === maxRetries) throw error;
if (error instanceof TypeError) {
// Network error — retry
await sleep(calculateBackoff(attempt));
continue;
}
throw error;
}
}
throw new Error('Max retries exceeded');
}
function calculateBackoff(attempt: number): number {
// Exponential backoff: 1s, 2s, 4s, 8s, 16s
const baseMs = Math.pow(2, attempt) * 1000;
// Add jitter: random ±50% to prevent thundering herd
const jitter = baseMs * (0.5 + Math.random());
// Cap at 30 seconds
return Math.min(jitter, 30000);
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
Why jitter matters: Without jitter, all retry requests hit the API at the same time (thundering herd). Jitter spreads them out.
Pattern 2: Client-Side Rate Limiting
Don't wait for 429s — prevent them by throttling requests yourself.
class RateLimiter {
private queue: Array<{
execute: () => Promise<any>;
resolve: (value: any) => void;
reject: (error: any) => void;
}> = [];
private activeCount = 0;
private timestamps: number[] = [];
constructor(
private maxPerSecond: number,
private maxConcurrent: number = 10
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
return new Promise((resolve, reject) => {
this.queue.push({ execute: fn, resolve, reject });
this.processQueue();
});
}
private async processQueue() {
if (this.queue.length === 0) return;
if (this.activeCount >= this.maxConcurrent) return;
// Clean old timestamps
const now = Date.now();
this.timestamps = this.timestamps.filter(t => now - t < 1000);
if (this.timestamps.length >= this.maxPerSecond) {
// Wait until oldest timestamp expires
const waitMs = 1000 - (now - this.timestamps[0]);
setTimeout(() => this.processQueue(), waitMs);
return;
}
const item = this.queue.shift();
if (!item) return;
this.activeCount++;
this.timestamps.push(now);
try {
const result = await item.execute();
item.resolve(result);
} catch (error) {
item.reject(error);
} finally {
this.activeCount--;
this.processQueue();
}
}
}
// Usage
const limiter = new RateLimiter(10, 5); // 10 req/s, 5 concurrent
const results = await Promise.all(
userIds.map(id =>
limiter.execute(() => fetch(`/api/users/${id}`).then(r => r.json()))
)
);
Pattern 3: Token Bucket
For APIs with token-bucket rate limiting (like AI APIs with tokens-per-minute):
class TokenBucket {
private tokens: number;
private lastRefill: number;
constructor(
private maxTokens: number,
private refillRate: number, // tokens per second
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
private refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
}
async consume(count: number): Promise<void> {
this.refill();
if (this.tokens >= count) {
this.tokens -= count;
return;
}
// Wait for enough tokens
const deficit = count - this.tokens;
const waitMs = (deficit / this.refillRate) * 1000;
await new Promise(resolve => setTimeout(resolve, waitMs));
this.refill();
this.tokens -= count;
}
}
// Usage with AI API (tokens per minute)
const bucket = new TokenBucket(100000, 100000 / 60); // 100K TPM
async function callAI(prompt: string) {
  const estimatedTokens = Math.ceil(prompt.length / 4); // rough estimate: ~4 chars per token
await bucket.consume(estimatedTokens);
return openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: prompt }],
});
}
Pattern 4: Queue-Based Processing
For batch jobs that need to process thousands of items:
class BatchProcessor<T, R> {
private queue: T[] = [];
private results: Map<number, R> = new Map();
constructor(
private processFn: (item: T) => Promise<R>,
private options: {
maxPerSecond: number;
maxConcurrent: number;
onProgress?: (completed: number, total: number) => void;
}
) {}
async process(items: T[]): Promise<R[]> {
this.queue = [...items];
const total = items.length;
let completed = 0;
let active = 0;
const results: R[] = new Array(total);
    if (total === 0) return Promise.resolve(results); // avoid hanging on empty input
    return new Promise((resolve, reject) => {
const interval = setInterval(() => {
while (
active < this.options.maxConcurrent &&
this.queue.length > 0
) {
const index = total - this.queue.length;
const item = this.queue.shift()!;
active++;
this.processFn(item)
.then(result => {
results[index] = result;
completed++;
active--;
this.options.onProgress?.(completed, total);
if (completed === total) {
clearInterval(interval);
resolve(results);
}
})
.catch(error => {
clearInterval(interval);
reject(error);
});
}
}, 1000 / this.options.maxPerSecond);
});
}
}
// Usage
const processor = new BatchProcessor(
async (userId: string) => {
const response = await fetch(`/api/users/${userId}`);
return response.json();
},
{
maxPerSecond: 10,
maxConcurrent: 5,
onProgress: (done, total) => console.log(`${done}/${total}`),
}
);
const allUsers = await processor.process(userIds);
Pattern 5: Adaptive Rate Limiting
Automatically adjust your request rate based on API responses:
class AdaptiveRateLimiter {
private requestsPerSecond: number;
private consecutiveSuccesses = 0;
private consecutiveFailures = 0;
constructor(
private initialRate: number,
private maxRate: number,
private minRate: number = 1
) {
this.requestsPerSecond = initialRate;
}
onSuccess() {
this.consecutiveSuccesses++;
this.consecutiveFailures = 0;
// Increase rate after 10 consecutive successes
if (this.consecutiveSuccesses >= 10) {
this.requestsPerSecond = Math.min(
this.maxRate,
this.requestsPerSecond * 1.2
);
this.consecutiveSuccesses = 0;
}
}
onRateLimit() {
this.consecutiveFailures++;
this.consecutiveSuccesses = 0;
// Cut rate in half on rate limit
this.requestsPerSecond = Math.max(
this.minRate,
this.requestsPerSecond * 0.5
);
}
getDelayMs(): number {
return 1000 / this.requestsPerSecond;
}
}
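The adjustment rule above is a multiplicative increase / multiplicative decrease scheme. Pulled out as pure functions (the names here are illustrative), the math is easy to verify in isolation:

```typescript
// Rate adjustment from AdaptiveRateLimiter, as standalone functions.
function increaseRate(rate: number, maxRate: number): number {
  return Math.min(maxRate, rate * 1.2); // +20% after a success streak
}

function decreaseRate(rate: number, minRate: number): number {
  return Math.max(minRate, rate * 0.5); // halve on a 429
}
```

Because decreases are aggressive (halving) and increases gradual (+20% per ten successes), the rate converges just under the API's real limit without oscillating wildly.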
Provider-Specific Rate Limits
Quick Reference
| Provider | Rate Limit | Headers | Retry Strategy |
|---|---|---|---|
| Stripe | 100/s (live), 25/s (test) | Standard X-RateLimit-* | Exponential backoff |
| OpenAI | TPM + RPM per model | Standard + usage headers | Exponential backoff, token estimation |
| Anthropic | TPM + RPM per tier | Standard | Backoff + tier upgrade |
| Twilio | 100/s per account | Standard | Backoff + request queuing |
| GitHub | 5,000/hour (auth) | X-RateLimit-* | Respect reset time |
| Shopify | 2/s (REST), cost-based (GraphQL) | X-Shopify-Shop-Api-Call-Limit | Leaky bucket |
| Algolia | Varies by plan | Standard | Client-side limiting |
Monitoring Rate Limits
// Track rate limit usage
class RateLimitMonitor {
private metrics = {
totalRequests: 0,
rateLimitedRequests: 0,
totalRetries: 0,
avgRetryDelay: 0,
};
  recordRequest(wasRateLimited: boolean, retryCount: number, retryDelayMs: number) {
    this.metrics.totalRequests++;
    if (wasRateLimited) {
      this.metrics.rateLimitedRequests++;
      const previousRetries = this.metrics.totalRetries;
      this.metrics.totalRetries += retryCount;
      // Running average of per-retry delay, weighted by retry count
      if (this.metrics.totalRetries > 0) {
        this.metrics.avgRetryDelay =
          (this.metrics.avgRetryDelay * previousRetries + retryDelayMs * retryCount) /
          this.metrics.totalRetries;
      }
    }
  }
  getReport() {
    const rateLimitRate =
      this.metrics.totalRequests > 0
        ? this.metrics.rateLimitedRequests / this.metrics.totalRequests
        : 0;
    return {
      ...this.metrics,
      rateLimitRate,
      recommendation: rateLimitRate > 0.05
        ? 'Consider reducing request rate or upgrading API tier'
        : 'Rate limit handling is healthy',
    };
  }
}
Understanding Your Rate Limit Budget
Most teams only discover their rate limit budget when they hit it — which means they're learning under pressure, often in an incident. Understanding your budget proactively lets you architect around limits before they become problems.
Calculate your effective request rate: For batch jobs and high-volume operations, calculate the request rate you'll need. If you need to process 1 million records and your API allows 100 requests per minute, that's 10,000 minutes — almost 7 days. This math should happen during planning, not production. When the required rate exceeds the available rate by more than 3x, either redesign the approach (batching, fewer API calls per record) or negotiate with the provider before building.
Token-per-minute math for AI APIs: OpenAI and Anthropic rate limit by tokens per minute (TPM), not requests per minute. A single GPT-4o request can consume anywhere from 100 to 128,000 tokens depending on input length and max_tokens setting. If you're processing long documents (2,000+ tokens each) at GPT-4o tier 1 limits (30,000 TPM), you can only process about 15 documents per minute — far fewer than the RPM limit suggests. Always calculate expected token consumption, not just request count, when planning AI integrations.
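The planning math from the two paragraphs above can be sketched as two small helpers (function names are illustrative):

```typescript
// How long will a batch take at a fixed request budget?
// 1,000,000 records at 100 req/min → 10,000 minutes (~6.9 days)
function batchDurationMinutes(totalRequests: number, requestsPerMinute: number): number {
  return totalRequests / requestsPerMinute;
}

// How many documents fit under a tokens-per-minute limit?
// 30,000 TPM at ~2,000 tokens/doc → 15 documents per minute
function docsPerMinute(tpmLimit: number, tokensPerDoc: number): number {
  return Math.floor(tpmLimit / tokensPerDoc);
}
```

Running these numbers during planning is the cheapest rate limit mitigation available.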
Header monitoring in production: Read and log rate limit headers on every response. Even when requests succeed, X-RateLimit-Remaining: 3 is a warning — you're about to hit the wall. Set up an alert when remaining drops below 20% of the limit. For OpenAI, the x-ratelimit-remaining-tokens header tells you how close you are to the TPM limit before you hit the first 429.
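A sketch of the 20% alert threshold described above (the function name and default threshold are illustrative):

```typescript
// Flag responses whose remaining budget has dropped below 20% of the limit.
function isNearLimit(limit: number, remaining: number, threshold = 0.2): boolean {
  return remaining < limit * threshold;
}
```

Wire this to whatever alerting you already have; the point is to see the wall before you hit it.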
Rate Limiting in Distributed Systems
Client-side rate limiters work well for single-process applications but break down when your app runs multiple instances. If you have 10 instances of your API server, each with a local rate limiter allowing 10 requests/second to a downstream API, you're actually sending 100 requests/second total — 10x your intended limit.
Redis-based distributed rate limiting: Use Redis for a shared rate limit counter across all instances. The INCR + EXPIRE pattern is the simplest approach: increment a counter on each request, set expiry to the rate window, and reject if the counter exceeds the limit. For more sophisticated needs, Redis also supports the sliding window and token bucket patterns with Lua scripts that execute atomically.
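The INCR + EXPIRE pattern can be sketched against a minimal counter interface, so any Redis client (or an in-memory fake in tests) can back it. The `CounterStore` interface and `allowRequest` name here are illustrative abstractions, not a specific client's API:

```typescript
// Fixed-window limiter over a shared counter (the Redis INCR + EXPIRE pattern).
interface CounterStore {
  incr(key: string): Promise<number>; // atomic increment, returns new value
  expire(key: string, seconds: number): Promise<void>;
}

async function allowRequest(
  store: CounterStore,
  key: string,
  limit: number,
  windowSeconds: number
): Promise<boolean> {
  const count = await store.incr(key);
  if (count === 1) {
    // First request in this window: start the window clock
    await store.expire(key, windowSeconds);
  }
  return count <= limit;
}
```

With ioredis, `incr` and `expire` map directly onto the client's commands of the same names. Note the known weakness: if the process dies between INCR and EXPIRE, the key never expires and the window fails closed — which is why the atomic Lua-script variants mentioned above are preferred in production.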
Coordinated queue draining: For batch processing across multiple workers, use a single coordinated job queue (BullMQ, Temporal, or a cloud-managed queue) rather than independent per-instance queues. The queue coordinator enforces the global rate limit; individual workers pull from the queue at whatever rate the coordinator allows. This pattern is more complex to set up but eliminates the distributed rate limit problem entirely and provides better visibility into queue depth, processing rate, and failures.
Circuit breakers for downstream rate limits: When a downstream API starts returning 429s, a circuit breaker opens and stops requests to that service for a configured duration, then probes with a single request before re-enabling full traffic. This is healthier than retrying immediately: it protects your system from cascading failures when an upstream service is overloaded, and it reduces the load on the upstream service during its recovery. The opossum library provides circuit breaker functionality for Node.js.
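A minimal circuit breaker along the lines described above — this is an illustrative sketch, not the opossum API. The clock is injectable so the open interval is testable:

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private openedAt = 0;

  constructor(
    private cooldownMs: number,
    private now: () => number = Date.now
  ) {}

  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // cooldown elapsed: let probe requests through
    }
    return this.state !== 'open';
  }

  onRateLimited(): void {
    // A 429 opens (or re-opens) the breaker and restarts the cooldown
    this.state = 'open';
    this.openedAt = this.now();
  }

  onSuccess(): void {
    this.state = 'closed';
  }
}
```

In the half-open state this sketch admits requests until one succeeds (closing the breaker) or another 429 arrives (re-opening it); a production implementation would typically limit half-open to a single in-flight probe.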
When to Upgrade vs. Optimize
Hitting rate limits means either your usage has outgrown the tier or your code is making more requests than necessary. The right response depends on which is true.
Signs you should optimize first: Sending the same API request multiple times with the same inputs (missing caching); making API calls in loops where a single batched call would work; fetching paginated resources where you only need the first page; calling an API on every request when the response only changes hourly; fetching all fields when you only use one. Fix these before paying for a higher tier — optimization often reduces API costs by 50-80% without a plan upgrade.
Signs you should upgrade: You've profiled the code and there's no optimization left; the API doesn't offer batching; you genuinely need the data freshness that caching would compromise; or the cost of optimization exceeds the cost of a tier upgrade. Do the math: a plan upgrade from $100/month to $300/month costs $2,400/year. If optimizing would take 40+ hours of engineering time at your team's fully-loaded rate, upgrading is cheaper.
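The break-even arithmetic above fits in one function (the name and example rates are illustrative):

```typescript
// Hours of engineering time that equal one year of the tier-upgrade cost.
// $200/month delta at a $60/hour fully-loaded rate → 40 hours break-even.
function upgradeBreakEvenHours(monthlyPriceDelta: number, hourlyRate: number): number {
  return (monthlyPriceDelta * 12) / hourlyRate;
}
```

If the optimization estimate exceeds the break-even hours, buy the tier.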
Negotiating custom limits: For enterprise volumes that exceed published tiers, most API providers will negotiate custom rate limits. Stripe, OpenAI, Anthropic, and most large API businesses have enterprise teams that can accommodate high-volume customers. Come to the conversation with data: your current volume, expected growth trajectory, and which specific limits you need increased. For AI APIs, committing to a minimum monthly spend often unlocks higher rate limits without per-request cost increases.
Methodology
The exponential backoff formula Math.pow(2, attempt) * 1000 produces delays of 1s, 2s, 4s, 8s, 16s for attempts 0-4. The ±50% jitter range is a practical default; AWS recommends ±25% jitter for their SDKs. The Retry-After header may be either an integer (seconds to wait) or an HTTP date string; always check the type before parsing. The token bucket implementation above is single-process only; for distributed systems, use Redis with the rate-limiter-flexible npm package, which implements all major rate limit algorithms with Redis backends and atomic Lua scripts. OpenAI's rate limit tiers (Tier 1 through Tier 5) are documented at platform.openai.com/docs/guides/rate-limits; tiers increase based on cumulative spend history.
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Retry without backoff | Makes rate limiting worse | Add exponential delay + jitter |
| Ignoring Retry-After header | Retrying too soon | Parse and respect Retry-After |
| No client-side throttling | Hit 429s constantly | Pre-limit requests to known rate |
| Fixed delay retries | Thundering herd problem | Add jitter to retry delays |
| No monitoring of 429 rates | Don't know you have a problem | Track rate limit hit percentage |
| Retrying on all errors | Retrying permanent failures | Only retry 429 and 5xx |
Compare API rate limits across providers on APIScout — find the most generous limits and best rate limit handling documentation.
Related: How to Handle API Deprecation Notices, How to Handle Webhook Failures and Retries, Building an AI Agent in 2026