# How to Build Resilient API Integrations That Don't Break
Every API you depend on will go down. It will have bugs. It will change its response format. It will rate-limit you at the worst possible time. The question isn't whether your API integrations will face problems — it's whether your application survives them gracefully.
## The Failure Modes

### What Goes Wrong with API Integrations
| Failure Mode | Frequency | Impact |
|---|---|---|
| Timeout | Daily | Slow responses cascade through your system |
| Rate limiting (429) | Daily-weekly | Requests fail until rate resets |
| Server error (5xx) | Weekly | Temporary failures, usually recoverable |
| DNS resolution failure | Monthly | Complete inability to connect |
| Certificate expiry | Rare but devastating | HTTPS connections fail |
| Breaking API change | Quarterly | Integration stops working |
| Response format change | Quarterly | Parsing errors, data corruption |
| Deprecation | Annually | Endpoints removed, features dropped |
| Provider shutdown | Rare | Complete integration loss |
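Most of these failure modes collapse into a small decision: retry later, fix the request, or treat it as a connectivity problem. A minimal sketch of that classification (the function name and input shape are illustrative, not from any particular library), which the retry patterns below implicitly rely on:

```typescript
type FailureClass = 'retryable' | 'client-error' | 'connection';

// Map an HTTP status (or a thrown network error) to a failure class.
// 429 and 5xx are transient and worth retrying; other 4xx codes mean
// the request itself is wrong and retrying will never help.
function classifyFailure(input: { status?: number; error?: Error }): FailureClass {
  if (input.status !== undefined) {
    if (input.status === 429 || input.status >= 500) return 'retryable';
    return 'client-error';
  }
  // No status at all: DNS failure, TLS/certificate error, timeout, reset.
  return 'connection';
}
```

A helper like this keeps retry policy in one place instead of scattering `status === 429 || status >= 500` checks across every call site.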
## Pattern 1: Timeouts on Everything

The most common cause of cascading failure: no timeouts.

```typescript
// ❌ No timeout — request hangs forever if the API is slow
const response = await fetch('https://api.example.com/data');

// ✅ Always set timeouts
const response = await fetch('https://api.example.com/data', {
  signal: AbortSignal.timeout(5000), // 5 second timeout
});

// ✅ Even better — different timeouts for different operations
const TIMEOUTS = {
  read: 5000,     // 5s for reads
  write: 10000,   // 10s for writes
  upload: 60000,  // 60s for file uploads
  webhook: 3000,  // 3s for webhook delivery
};

async function apiCall(path: string, type: keyof typeof TIMEOUTS) {
  return fetch(`https://api.example.com${path}`, {
    signal: AbortSignal.timeout(TIMEOUTS[type]),
  });
}
```
Rule of thumb: set the timeout to a small multiple (2-5x) of the expected response time. If the API normally responds in 200ms, a 500ms-1s timeout absorbs tail latency while still failing fast.
## Pattern 2: Circuit Breaker

Stop calling a broken API. Let it recover instead of overwhelming it with retries.

```typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private failureThreshold: number = 5,
    private resetTimeMs: number = 30000, // 30 seconds
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      // Check if enough time has passed to try again
      if (Date.now() - this.lastFailure > this.resetTimeMs) {
        this.state = 'half-open';
      } else {
        throw new CircuitOpenError('Circuit breaker is open');
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = 'open';
    }
  }

  getState() {
    return {
      state: this.state,
      failures: this.failures,
    };
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

// Usage — stripe client and queueForRetry are assumed to exist elsewhere
const paymentCircuit = new CircuitBreaker(5, 30000);

async function processPayment(amount: number) {
  try {
    return await paymentCircuit.execute(() =>
      stripe.charges.create({ amount, currency: 'usd' })
    );
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Payment provider is down — queue for later
      await queueForRetry({ amount, type: 'payment' });
      return { status: 'queued', message: 'Payment will be processed shortly' };
    }
    throw error;
  }
}
```
## Pattern 3: Graceful Degradation

When an API is down, serve reduced functionality instead of breaking entirely.

```typescript
// Example: product page with reviews from an external API.
// db and the fetch* helpers are the app's own data-access functions.
async function getProductPage(productId: string) {
  // Core data — from your database (must succeed)
  const product = await db.products.findById(productId);

  // Enhanced data — from external APIs (can fail gracefully)
  const [reviews, recommendations, inventory] = await Promise.allSettled([
    fetchReviews(productId),         // Third-party reviews API
    fetchRecommendations(productId), // ML recommendation API
    fetchInventory(productId),       // Warehouse API
  ]);

  return {
    product,
    reviews: reviews.status === 'fulfilled'
      ? reviews.value
      : { items: [], message: 'Reviews temporarily unavailable' },
    recommendations: recommendations.status === 'fulfilled'
      ? recommendations.value
      : [],
    inventory: inventory.status === 'fulfilled'
      ? inventory.value
      : { available: true, message: 'Check store for availability' },
  };
}
```
### Degradation Levels
| Level | What Works | What's Degraded | User Experience |
|---|---|---|---|
| Full | Everything | Nothing | Normal |
| Partial | Core features | Enhancements (reviews, recommendations) | Minor loss |
| Minimal | Read operations | Write operations queued | Can browse, can't act |
| Cached | Stale data served | No fresh data | "Data as of X minutes ago" |
| Maintenance | Nothing | Everything | Maintenance page |
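The table above can be made executable: a sketch that picks a level from a snapshot of dependency health. The `HealthSnapshot` fields are hypothetical placeholders for whatever health signals your system actually tracks:

```typescript
type DegradationLevel = 'full' | 'partial' | 'minimal' | 'cached' | 'maintenance';

// Hypothetical health snapshot: which dependency groups are currently up.
interface HealthSnapshot {
  database: boolean;     // core data store
  cacheWarm: boolean;    // recent cached copies exist
  writePath: boolean;    // payment / order APIs
  enhancements: boolean; // reviews, recommendations, etc.
}

// Walk the levels from worst to best; the first failing dependency
// group determines how far the experience degrades.
function chooseLevel(h: HealthSnapshot): DegradationLevel {
  if (!h.database) return h.cacheWarm ? 'cached' : 'maintenance';
  if (!h.writePath) return 'minimal';    // browse-only, queue writes
  if (!h.enhancements) return 'partial'; // core features, no extras
  return 'full';
}
```

Making the level an explicit value (rather than implicit in scattered `if` checks) also makes it easy to show users the right banner, such as "Data as of X minutes ago".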
## Pattern 4: Caching and Stale Data

Serve cached data when the API is unavailable:

```typescript
class CachedAPIClient {
  constructor(
    private cache: Map<string, { data: any; timestamp: number }> = new Map(),
    private maxAge: number = 300000,       // 5 minutes
    private staleMaxAge: number = 3600000, // 1 hour (serve stale if API is down)
  ) {}

  async fetch<T>(url: string, options?: RequestInit): Promise<T & { _cached?: boolean; _stale?: boolean }> {
    const cached = this.cache.get(url);

    // Fresh cache — serve immediately
    if (cached && Date.now() - cached.timestamp < this.maxAge) {
      return { ...cached.data, _cached: true };
    }

    // Try a fresh fetch
    try {
      const response = await fetch(url, {
        ...options,
        signal: AbortSignal.timeout(5000),
      });
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      const data = await response.json();
      this.cache.set(url, { data, timestamp: Date.now() });
      return data;
    } catch (error) {
      // Fetch failed — serve stale cache if available
      if (cached && Date.now() - cached.timestamp < this.staleMaxAge) {
        console.warn(`Serving stale cache for ${url} (age: ${Date.now() - cached.timestamp}ms)`);
        return { ...cached.data, _cached: true, _stale: true };
      }
      throw error; // No cache available — propagate the error
    }
  }
}
```
## Pattern 5: Idempotent Retry with Deduplication

Safe to retry without duplicate side effects:

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Thrown for 4xx responses so the catch block below doesn't retry them.
class NonRetryableError extends Error {}

async function createOrderWithRetry(orderData: OrderInput): Promise<Order> {
  // Generate idempotency key BEFORE first attempt
  const idempotencyKey = `order_${orderData.userId}_${Date.now()}`;

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      const response = await fetch('https://api.payments.com/v1/orders', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // Same key for all retries
        },
        body: JSON.stringify(orderData),
        signal: AbortSignal.timeout(10000),
      });

      if (response.ok) return response.json();

      if (response.status === 429 || response.status >= 500) {
        // Retryable — same idempotency key means no duplicate charges
        await sleep(Math.pow(2, attempt) * 1000);
        continue;
      }

      // 4xx (except 429) — a client error, don't retry
      throw new NonRetryableError(`HTTP ${response.status}: ${await response.text()}`);
    } catch (error) {
      if (error instanceof NonRetryableError) throw error;
      if (attempt === 2) throw error;
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }
  throw new Error('Max retries exceeded');
}
```

Note the `NonRetryableError` class: without it, a 4xx error thrown inside the `try` block would be caught by the same `catch` that handles network failures, and the client error would be retried anyway.
## Pattern 6: Health Check Monitoring

Detect issues before they hit users:

```typescript
class APIHealthChecker {
  private healthStatus: Map<string, {
    healthy: boolean;
    lastCheck: number;
    latency: number;
    consecutiveFailures: number;
  }> = new Map();

  async check(name: string, healthUrl: string): Promise<boolean> {
    const start = Date.now();
    try {
      const response = await fetch(healthUrl, {
        signal: AbortSignal.timeout(3000),
      });
      const healthy = response.ok;
      const latency = Date.now() - start;
      this.healthStatus.set(name, {
        healthy,
        lastCheck: Date.now(),
        latency,
        consecutiveFailures: healthy
          ? 0
          : (this.healthStatus.get(name)?.consecutiveFailures ?? 0) + 1,
      });
      return healthy;
    } catch {
      const current = this.healthStatus.get(name);
      this.healthStatus.set(name, {
        healthy: false,
        lastCheck: Date.now(),
        latency: Date.now() - start,
        consecutiveFailures: (current?.consecutiveFailures ?? 0) + 1,
      });
      return false;
    }
  }

  getStatus() {
    return Object.fromEntries(this.healthStatus);
  }
}

// Usage: check every 30 seconds
const checker = new APIHealthChecker();

setInterval(async () => {
  await Promise.all([
    checker.check('stripe', 'https://api.stripe.com/v1'),
    checker.check('resend', 'https://api.resend.com/health'),
    checker.check('auth', 'https://api.clerk.com/v1/health'),
  ]);

  const status = checker.getStatus();

  // Alert if any API has 3+ consecutive failures.
  // alertOps is the app's own paging/alerting hook.
  for (const [name, state] of Object.entries(status)) {
    if (state.consecutiveFailures >= 3) {
      await alertOps(`${name} API is unhealthy: ${state.consecutiveFailures} consecutive failures`);
    }
  }
}, 30000);
```
## Pattern 7: Response Validation

Don't trust API responses — validate them:

```typescript
import { z } from 'zod';

// Define the expected response shape
const UserResponseSchema = z.object({
  id: z.string(),
  email: z.string().email(),
  name: z.string(),
  created_at: z.string().datetime(),
});

type UserResponse = z.infer<typeof UserResponseSchema>;

async function getUser(userId: string): Promise<UserResponse> {
  const response = await fetch(`/api/users/${userId}`, {
    signal: AbortSignal.timeout(5000), // Pattern 1 applies here too
  });
  const data = await response.json();

  // Validate that the response matches the expected schema
  const result = UserResponseSchema.safeParse(data);
  if (!result.success) {
    // API response format changed — log and alert
    console.error('API response validation failed:', {
      endpoint: `/api/users/${userId}`,
      errors: result.error.issues,
      received: data,
    });
    // Option 1: Throw (fail fast)
    throw new Error('API response format changed');
    // Option 2: Merge with defaults (graceful)
    // return { ...defaults, ...data };
  }
  return result.data;
}
```
## The Resilience Checklist
| Pattern | Priority | Impact |
|---|---|---|
| Timeouts on all API calls | P0 | Prevents cascading failures |
| Exponential backoff with jitter | P0 | Handles rate limits and transient errors |
| Input/output validation | P0 | Catches API changes early |
| Circuit breaker | P1 | Stops hammering failing APIs |
| Graceful degradation | P1 | Users get partial functionality vs errors |
| Response caching (stale-while-error) | P1 | Serves data during outages |
| Idempotency keys on writes | P1 | Safe retries without duplicates |
| Health check monitoring | P2 | Early detection of issues |
| Multi-provider fallback | P2 | Survive provider outages |
| Response schema validation | P2 | Detect breaking changes |
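The checklist ranks exponential backoff with jitter as P0, but the retry code earlier uses plain exponential delays. A sketch of the full-jitter variant, where each delay is drawn uniformly between 0 and the exponential ceiling so that many clients retrying at once don't synchronize into waves (function names are illustrative):

```typescript
// Full-jitter backoff: delay is random in [0, min(cap, base * 2^attempt)].
function backoffDelay(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Generic retry wrapper using the jittered delays.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
      }
    }
  }
  throw lastError;
}
```

As with Pattern 5, only retry errors that are actually retryable; wrapping a call that fails with a 400 in `withRetry` just wastes three more requests.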
## Designing for Resilience from the Start
Resilience is significantly cheaper to design in than to retrofit. An API integration that was written assuming the API always succeeds requires invasive changes to add timeouts, retries, and fallbacks — often touching dozens of call sites spread across the codebase. An integration written with resilience from day one costs 20-30% more time initially but avoids the 10x more expensive retrofit when the first production incident hits.
The single most effective architectural decision for resilience is centralization: put all calls to a given API through a single service class rather than calling the API client library directly throughout your codebase. When you need to add a timeout, you add it in one place. When you need to add a circuit breaker, you add it in one place. When the API changes its authentication scheme, you update one place. The patterns in this guide are much easier to implement and maintain when applied to a centralized service layer.
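As a sketch of that centralization, here is a minimal service class for a hypothetical billing API. Everything about it (the class name, base URL, and endpoints) is illustrative, and the injectable `fetchFn` parameter is an assumption added to make the class testable rather than part of any real SDK:

```typescript
// All calls to the billing API flow through this one class, so timeouts,
// retries, and a future circuit breaker live in exactly one place.
class BillingService {
  constructor(
    private baseUrl: string,
    private timeoutMs = 5000,
    private fetchFn: typeof fetch = fetch, // injectable for tests (assumption)
  ) {}

  private request(path: string, init: RequestInit = {}): Promise<Response> {
    return this.fetchFn(`${this.baseUrl}${path}`, {
      ...init,
      signal: AbortSignal.timeout(this.timeoutMs), // Pattern 1, applied once
    });
  }

  async getInvoice(id: string): Promise<{ id: string }> {
    const res = await this.request(`/invoices/${id}`);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  }
  // More endpoints follow; resilience changes touch only this class.
}
```

When the provider changes its auth scheme or you adopt a circuit breaker, `request()` is the single place to edit.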
Prioritize based on blast radius: Not all API failures are equal. A payment API failure loses revenue. A recommendation API failure shows generic suggestions. A CDN failure makes pages slightly slower. Invest resilience effort proportionally: circuit breakers and multi-provider fallbacks for payment and auth APIs, simple timeout + retry for enrichment APIs, basic error logging for purely optional integrations. Don't over-engineer resilience for low-blast-radius integrations.
## Multi-Provider Fallback Patterns
For business-critical APIs, a single provider creates a single point of failure. Multi-provider fallback — where you have a backup provider that can handle the same operation — is the most robust resilience strategy but also the most complex to implement.
Payment providers: Stripe + PayPal (or Braintree) is the most common combination. When Stripe's API is degraded, route new payment attempts to PayPal. This requires maintaining customer relationships in both systems (customers must have a payment method on file with the backup provider), or using payment method tokenization that works across providers. Stripe's Radar fraud detection and PayPal's buyer protection have different characteristics — test the end-to-end experience on both providers before declaring the fallback ready.
Email providers: Resend + Postmark (or SendGrid) is a standard combination. Email fallback is simpler than payment fallback because most transactional email doesn't require cross-provider customer account setup. The main consideration is keeping webhook handlers generic enough to process delivery events from either provider. Store the provider used for each sent email in your database so bounce and complaint events (which arrive via webhook from the original provider) can be matched to the correct record.
AI providers: For LLM-based features, the multi-provider story is the cleanest: Anthropic, OpenAI, and Google all provide chat completion APIs, and the Vercel AI SDK or LiteLLM abstract over the differences. Configure automatic fallback so that if Anthropic is returning 5xx errors, requests automatically route to OpenAI. The quality difference between equivalent models (Claude Sonnet vs GPT-4o) is small enough that users rarely notice.
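A generic failover sketch that applies to any of these provider pairs (the `Provider` shape and `send` signature are illustrative, not any real SDK's interface):

```typescript
type Provider<T> = { name: string; send: () => Promise<T> };

// Try providers in order; fall through on failure and report which one served.
async function withFallback<T>(
  providers: Provider<T>[],
): Promise<{ provider: string; result: T }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, result: await p.send() };
    } catch (e) {
      errors.push(`${p.name}: ${(e as Error).message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

Returning the `provider` name alongside the result matters for the webhook-matching problem described above: store it with the record so delivery events from either provider can be matched later.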
## Observability for Resilience
Resilience patterns only work if you can observe them. A circuit breaker that opens but doesn't alert anyone creates a hidden degradation that could persist for hours before a user reports it.
Instrument every pattern: Log when a circuit breaker opens and closes. Log when a request is served from stale cache. Log when a retry succeeds after failures. Log when graceful degradation activates for a specific component. These logs become your incident diagnosis toolkit — instead of searching through generic error logs, you can query "show me all circuit breaker events in the last hour" and immediately understand the failure pattern.
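A minimal sketch of what that instrumentation can look like: one structured event type covering the cases above, logged as JSON so a log query can filter on `kind`. The event shape is an assumption for illustration, not a standard:

```typescript
// One discriminated union for every resilience event the patterns emit.
type ResilienceEvent =
  | { kind: 'breaker'; api: string; transition: 'open' | 'half-open' | 'closed' }
  | { kind: 'stale-cache'; api: string; ageMs: number }
  | { kind: 'retry-recovered'; api: string; attempts: number }
  | { kind: 'degraded'; component: string };

// Emit one structured log line per event; during an incident a query like
// kind:"breaker" reconstructs the failure timeline immediately.
function logResilience(event: ResilienceEvent): string {
  const line = JSON.stringify({ ...event, at: new Date().toISOString() });
  console.log(line);
  return line;
}
```

The discriminated union also means the compiler flags any pattern that starts emitting events the dashboard doesn't know about.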
Create a resilience dashboard: Track the rate of retries, cache hit/miss ratio, circuit breaker state changes, and fallback activations over time. Spikes in retry rates or circuit breaker activations are leading indicators of provider issues — you often see them before the provider's status page is updated. Set alerts when retry rates exceed 5% of total requests to any single provider.
Conduct resilience testing: Once a quarter, simulate failure scenarios in a staging environment. Block all requests to a secondary API and verify the graceful degradation works as designed. Intentionally exhaust a rate limit and verify the retry behavior is correct. These tests reveal gaps in your resilience implementation before a real incident does.
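One lightweight way to run such drills without touching infrastructure is a fault-injection wrapper. This sketch (names are illustrative) fails a configurable fraction of calls to a dependency so staging can rehearse the degradation paths:

```typescript
// Wrap an async call so a configurable fraction of invocations fail.
// The rng parameter is injectable so tests can force deterministic behavior.
function withFaultInjection<T>(
  fn: () => Promise<T>,
  failureRate: number, // 0 = never fail, 1 = always fail
  rng: () => number = Math.random,
): () => Promise<T> {
  return async () => {
    if (rng() < failureRate) {
      throw new Error('Injected fault (chaos test)');
    }
    return fn();
  };
}
```

Wrapping, say, the reviews fetcher with a 100% failure rate in staging should produce the "Reviews temporarily unavailable" fallback from Pattern 3, not an error page; if it doesn't, the drill found a gap.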
## Methodology

- The opossum npm library (v8.x) implements the circuit breaker pattern for Node.js with configurable thresholds, half-open probing, and event emitters for state changes — more feature-complete than the custom implementation shown above.
- AbortSignal.timeout() is available in Node.js 17.3+ and is preferred over manually pairing AbortController with setTimeout: it is cleaner and handles cleanup automatically.
- The idempotency key pattern (Idempotency-Key header) is supported by Stripe, PayPal, and other payment providers. The key must be unique per operation intent, not per attempt — generating a new key on each retry would defeat the purpose.
- Zod v3.x's safeParse() method (used in Pattern 7) returns a discriminated union rather than throwing, making it composable in resilience-aware code that needs to handle validation failures gracefully.
## Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| No timeouts | One slow API freezes entire app | Set timeouts on every external call |
| Retry without backoff | Makes outages worse | Exponential backoff + jitter |
| Same code path for all errors | Retrying non-retryable errors | Handle 4xx vs 5xx vs network errors differently |
| No fallback for external APIs | Single point of failure | Cache, degrade, or use backup provider |
| Trusting API response format | Breaks when API changes | Validate responses with Zod/schemas |
| No monitoring of API health | Issues discovered by users | Health checks + alerting |
| Tight coupling to one provider | Locked in when problems arise | Abstraction layer for critical APIs |
Find the most reliable APIs on APIScout — uptime tracking, reliability scores, and resilience pattern guides for every provider.
Related: How to Build an AI Chatbot with the Anthropic API, How to Build an API Abstraction Layer in Your App, How to Build an API SDK That Developers Actually Use