# How to Build Resilient API Integrations That Don't Break
Every API you depend on will go down. It will have bugs. It will change its response format. It will rate-limit you at the worst possible time. The question isn't whether your API integrations will face problems — it's whether your application survives them gracefully.
## The Failure Modes

### What Goes Wrong with API Integrations
| Failure Mode | Frequency | Impact |
|---|---|---|
| Timeout | Daily | Slow responses cascade through your system |
| Rate limiting (429) | Daily-weekly | Requests fail until rate resets |
| Server error (5xx) | Weekly | Temporary failures, usually recoverable |
| DNS resolution failure | Monthly | Complete inability to connect |
| Certificate expiry | Rare but devastating | HTTPS connections fail |
| Breaking API change | Quarterly | Integration stops working |
| Response format change | Quarterly | Parsing errors, data corruption |
| Deprecation | Annually | Endpoints removed, features dropped |
| Provider shutdown | Rare | Complete integration loss |
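Most of these failure modes collapse into a small decision: retry later, fix the request, or treat it as a connectivity problem. A minimal sketch of that classification (the function name and input shape are illustrative, not from any particular library), which the retry patterns below implicitly rely on:

```typescript
type FailureClass = 'retryable' | 'client-error' | 'connection';

// Map an HTTP status (or a thrown network error) to a failure class.
// 429 and 5xx are transient and worth retrying; other 4xx codes mean
// the request itself is wrong and retrying will never help.
function classifyFailure(input: { status?: number; error?: Error }): FailureClass {
  if (input.status !== undefined) {
    if (input.status === 429 || input.status >= 500) return 'retryable';
    return 'client-error';
  }
  // No status at all: DNS failure, TLS/certificate error, timeout, reset.
  return 'connection';
}
```

A helper like this keeps retry policy in one place instead of scattering `status === 429 || status >= 500` checks across every call site.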
## Pattern 1: Timeouts on Everything

The most common cause of cascading failure: no timeouts.

```typescript
// ❌ No timeout — request hangs forever if the API is slow
const response = await fetch('https://api.example.com/data');

// ✅ Always set timeouts
const response = await fetch('https://api.example.com/data', {
  signal: AbortSignal.timeout(5000), // 5 second timeout
});

// ✅ Even better — different timeouts for different operations
const TIMEOUTS = {
  read: 5000,     // 5s for reads
  write: 10000,   // 10s for writes
  upload: 60000,  // 60s for file uploads
  webhook: 3000,  // 3s for webhook delivery
};

async function apiCall(path: string, type: keyof typeof TIMEOUTS) {
  return fetch(`https://api.example.com${path}`, {
    signal: AbortSignal.timeout(TIMEOUTS[type]),
  });
}
```
Rule of thumb: set the timeout to a small multiple (2-5x) of the expected response time. If the API normally responds in 200ms, a 500ms-1s timeout absorbs tail latency while still failing fast.
## Pattern 2: Circuit Breaker

Stop calling a broken API. Let it recover instead of overwhelming it with retries.

```typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private failureThreshold: number = 5,
    private resetTimeMs: number = 30000, // 30 seconds
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      // Check if enough time has passed to try again
      if (Date.now() - this.lastFailure > this.resetTimeMs) {
        this.state = 'half-open';
      } else {
        throw new CircuitOpenError('Circuit breaker is open');
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = 'open';
    }
  }

  getState() {
    return {
      state: this.state,
      failures: this.failures,
    };
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

// Usage — stripe client and queueForRetry are assumed to exist elsewhere
const paymentCircuit = new CircuitBreaker(5, 30000);

async function processPayment(amount: number) {
  try {
    return await paymentCircuit.execute(() =>
      stripe.charges.create({ amount, currency: 'usd' })
    );
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Payment provider is down — queue for later
      await queueForRetry({ amount, type: 'payment' });
      return { status: 'queued', message: 'Payment will be processed shortly' };
    }
    throw error;
  }
}
```
## Pattern 3: Graceful Degradation

When an API is down, serve reduced functionality instead of breaking entirely.

```typescript
// Example: product page with reviews from an external API.
// db and the fetch* helpers are the app's own data-access functions.
async function getProductPage(productId: string) {
  // Core data — from your database (must succeed)
  const product = await db.products.findById(productId);

  // Enhanced data — from external APIs (can fail gracefully)
  const [reviews, recommendations, inventory] = await Promise.allSettled([
    fetchReviews(productId),         // Third-party reviews API
    fetchRecommendations(productId), // ML recommendation API
    fetchInventory(productId),       // Warehouse API
  ]);

  return {
    product,
    reviews: reviews.status === 'fulfilled'
      ? reviews.value
      : { items: [], message: 'Reviews temporarily unavailable' },
    recommendations: recommendations.status === 'fulfilled'
      ? recommendations.value
      : [],
    inventory: inventory.status === 'fulfilled'
      ? inventory.value
      : { available: true, message: 'Check store for availability' },
  };
}
```
### Degradation Levels
| Level | What Works | What's Degraded | User Experience |
|---|---|---|---|
| Full | Everything | Nothing | Normal |
| Partial | Core features | Enhancements (reviews, recommendations) | Minor loss |
| Minimal | Read operations | Write operations queued | Can browse, can't act |
| Cached | Stale data served | No fresh data | "Data as of X minutes ago" |
| Maintenance | Nothing | Everything | Maintenance page |
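The table above can be made executable: a sketch that picks a level from a snapshot of dependency health. The `HealthSnapshot` fields are hypothetical placeholders for whatever health signals your system actually tracks:

```typescript
type DegradationLevel = 'full' | 'partial' | 'minimal' | 'cached' | 'maintenance';

// Hypothetical health snapshot: which dependency groups are currently up.
interface HealthSnapshot {
  database: boolean;     // core data store
  cacheWarm: boolean;    // recent cached copies exist
  writePath: boolean;    // payment / order APIs
  enhancements: boolean; // reviews, recommendations, etc.
}

// Walk the levels from worst to best; the first failing dependency
// group determines how far the experience degrades.
function chooseLevel(h: HealthSnapshot): DegradationLevel {
  if (!h.database) return h.cacheWarm ? 'cached' : 'maintenance';
  if (!h.writePath) return 'minimal';    // browse-only, queue writes
  if (!h.enhancements) return 'partial'; // core features, no extras
  return 'full';
}
```

Making the level an explicit value (rather than implicit in scattered `if` checks) also makes it easy to show users the right banner, such as "Data as of X minutes ago".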
## Pattern 4: Caching and Stale Data

Serve cached data when the API is unavailable:

```typescript
class CachedAPIClient {
  constructor(
    private cache: Map<string, { data: any; timestamp: number }> = new Map(),
    private maxAge: number = 300000,       // 5 minutes
    private staleMaxAge: number = 3600000, // 1 hour (serve stale if API is down)
  ) {}

  async fetch<T>(url: string, options?: RequestInit): Promise<T & { _cached?: boolean; _stale?: boolean }> {
    const cached = this.cache.get(url);

    // Fresh cache — serve immediately
    if (cached && Date.now() - cached.timestamp < this.maxAge) {
      return { ...cached.data, _cached: true };
    }

    // Try a fresh fetch
    try {
      const response = await fetch(url, {
        ...options,
        signal: AbortSignal.timeout(5000),
      });
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      const data = await response.json();
      this.cache.set(url, { data, timestamp: Date.now() });
      return data;
    } catch (error) {
      // Fetch failed — serve stale cache if available
      if (cached && Date.now() - cached.timestamp < this.staleMaxAge) {
        console.warn(`Serving stale cache for ${url} (age: ${Date.now() - cached.timestamp}ms)`);
        return { ...cached.data, _cached: true, _stale: true };
      }
      throw error; // No cache available — propagate the error
    }
  }
}
```
## Pattern 5: Idempotent Retry with Deduplication

Safe to retry without duplicate side effects:

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Thrown for 4xx responses so the catch block below doesn't retry them.
class NonRetryableError extends Error {}

async function createOrderWithRetry(orderData: OrderInput): Promise<Order> {
  // Generate idempotency key BEFORE first attempt
  const idempotencyKey = `order_${orderData.userId}_${Date.now()}`;

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      const response = await fetch('https://api.payments.com/v1/orders', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // Same key for all retries
        },
        body: JSON.stringify(orderData),
        signal: AbortSignal.timeout(10000),
      });

      if (response.ok) return response.json();

      if (response.status === 429 || response.status >= 500) {
        // Retryable — same idempotency key means no duplicate charges
        await sleep(Math.pow(2, attempt) * 1000);
        continue;
      }

      // 4xx (except 429) — a client error, don't retry
      throw new NonRetryableError(`HTTP ${response.status}: ${await response.text()}`);
    } catch (error) {
      if (error instanceof NonRetryableError) throw error;
      if (attempt === 2) throw error;
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }
  throw new Error('Max retries exceeded');
}
```

Note the `NonRetryableError` class: without it, a 4xx error thrown inside the `try` block would be caught by the same `catch` that handles network failures, and the client error would be retried anyway.
## Pattern 6: Health Check Monitoring

Detect issues before they hit users:

```typescript
class APIHealthChecker {
  private healthStatus: Map<string, {
    healthy: boolean;
    lastCheck: number;
    latency: number;
    consecutiveFailures: number;
  }> = new Map();

  async check(name: string, healthUrl: string): Promise<boolean> {
    const start = Date.now();
    try {
      const response = await fetch(healthUrl, {
        signal: AbortSignal.timeout(3000),
      });
      const healthy = response.ok;
      const latency = Date.now() - start;
      this.healthStatus.set(name, {
        healthy,
        lastCheck: Date.now(),
        latency,
        consecutiveFailures: healthy
          ? 0
          : (this.healthStatus.get(name)?.consecutiveFailures ?? 0) + 1,
      });
      return healthy;
    } catch {
      const current = this.healthStatus.get(name);
      this.healthStatus.set(name, {
        healthy: false,
        lastCheck: Date.now(),
        latency: Date.now() - start,
        consecutiveFailures: (current?.consecutiveFailures ?? 0) + 1,
      });
      return false;
    }
  }

  getStatus() {
    return Object.fromEntries(this.healthStatus);
  }
}

// Usage: check every 30 seconds
const checker = new APIHealthChecker();

setInterval(async () => {
  await Promise.all([
    checker.check('stripe', 'https://api.stripe.com/v1'),
    checker.check('resend', 'https://api.resend.com/health'),
    checker.check('auth', 'https://api.clerk.com/v1/health'),
  ]);

  const status = checker.getStatus();

  // Alert if any API has 3+ consecutive failures.
  // alertOps is the app's own paging/alerting hook.
  for (const [name, state] of Object.entries(status)) {
    if (state.consecutiveFailures >= 3) {
      await alertOps(`${name} API is unhealthy: ${state.consecutiveFailures} consecutive failures`);
    }
  }
}, 30000);
```
## Pattern 7: Response Validation

Don't trust API responses — validate them:

```typescript
import { z } from 'zod';

// Define the expected response shape
const UserResponseSchema = z.object({
  id: z.string(),
  email: z.string().email(),
  name: z.string(),
  created_at: z.string().datetime(),
});

type UserResponse = z.infer<typeof UserResponseSchema>;

async function getUser(userId: string): Promise<UserResponse> {
  const response = await fetch(`/api/users/${userId}`, {
    signal: AbortSignal.timeout(5000), // Pattern 1 applies here too
  });
  const data = await response.json();

  // Validate that the response matches the expected schema
  const result = UserResponseSchema.safeParse(data);
  if (!result.success) {
    // API response format changed — log and alert
    console.error('API response validation failed:', {
      endpoint: `/api/users/${userId}`,
      errors: result.error.issues,
      received: data,
    });
    // Option 1: Throw (fail fast)
    throw new Error('API response format changed');
    // Option 2: Merge with defaults (graceful)
    // return { ...defaults, ...data };
  }
  return result.data;
}
```
## The Resilience Checklist
| Pattern | Priority | Impact |
|---|---|---|
| Timeouts on all API calls | P0 | Prevents cascading failures |
| Exponential backoff with jitter | P0 | Handles rate limits and transient errors |
| Input/output validation | P0 | Catches API changes early |
| Circuit breaker | P1 | Stops hammering failing APIs |
| Graceful degradation | P1 | Users get partial functionality vs errors |
| Response caching (stale-while-error) | P1 | Serves data during outages |
| Idempotency keys on writes | P1 | Safe retries without duplicates |
| Health check monitoring | P2 | Early detection of issues |
| Multi-provider fallback | P2 | Survive provider outages |
| Response schema validation | P2 | Detect breaking changes |
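The checklist ranks exponential backoff with jitter as P0, but the retry code earlier uses plain exponential delays. A sketch of the full-jitter variant, where each delay is drawn uniformly between 0 and the exponential ceiling so that many clients retrying at once don't synchronize into waves (function names are illustrative):

```typescript
// Full-jitter backoff: delay is random in [0, min(cap, base * 2^attempt)].
function backoffDelay(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Generic retry wrapper using the jittered delays.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
      }
    }
  }
  throw lastError;
}
```

As with Pattern 5, only retry errors that are actually retryable; wrapping a call that fails with a 400 in `withRetry` just wastes three more requests.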
## Designing for Resilience from the Start
Resilience is significantly cheaper to design in than to retrofit. An API integration that was written assuming the API always succeeds requires invasive changes to add timeouts, retries, and fallbacks — often touching dozens of call sites spread across the codebase. An integration written with resilience from day one costs 20-30% more time initially but avoids the 10x more expensive retrofit when the first production incident hits.
The single most effective architectural decision for resilience is centralization: put all calls to a given API through a single service class rather than calling the API client library directly throughout your codebase. When you need to add a timeout, you add it in one place. When you need to add a circuit breaker, you add it in one place. When the API changes its authentication scheme, you update one place. The patterns in this guide are much easier to implement and maintain when applied to a centralized service layer.
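As a sketch of that centralization, here is a minimal service class for a hypothetical billing API. Everything about it (the class name, base URL, and endpoints) is illustrative, and the injectable `fetchFn` parameter is an assumption added to make the class testable rather than part of any real SDK:

```typescript
// All calls to the billing API flow through this one class, so timeouts,
// retries, and a future circuit breaker live in exactly one place.
class BillingService {
  constructor(
    private baseUrl: string,
    private timeoutMs = 5000,
    private fetchFn: typeof fetch = fetch, // injectable for tests (assumption)
  ) {}

  private request(path: string, init: RequestInit = {}): Promise<Response> {
    return this.fetchFn(`${this.baseUrl}${path}`, {
      ...init,
      signal: AbortSignal.timeout(this.timeoutMs), // Pattern 1, applied once
    });
  }

  async getInvoice(id: string): Promise<{ id: string }> {
    const res = await this.request(`/invoices/${id}`);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  }
  // More endpoints follow; resilience changes touch only this class.
}
```

When the provider changes its auth scheme or you adopt a circuit breaker, `request()` is the single place to edit.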
Prioritize based on blast radius: Not all API failures are equal. A payment API failure loses revenue. A recommendation API failure shows generic suggestions. A CDN failure makes pages slightly slower. Invest resilience effort proportionally: circuit breakers and multi-provider fallbacks for payment and auth APIs, simple timeout + retry for enrichment APIs, basic error logging for purely optional integrations. Don't over-engineer resilience for low-blast-radius integrations.
## Multi-Provider Fallback Patterns
For business-critical APIs, a single provider creates a single point of failure. Multi-provider fallback — where you have a backup provider that can handle the same operation — is the most robust resilience strategy but also the most complex to implement.
Payment providers: Stripe + PayPal (or Braintree) is the most common combination. When Stripe's API is degraded, route new payment attempts to PayPal. This requires maintaining customer relationships in both systems (customers must have a payment method on file with the backup provider), or using payment method tokenization that works across providers. Stripe's Radar fraud detection and PayPal's buyer protection have different characteristics — test the end-to-end experience on both providers before declaring the fallback ready.
Email providers: Resend + Postmark (or SendGrid) is a standard combination. Email fallback is simpler than payment fallback because most transactional email doesn't require cross-provider customer account setup. The main consideration is keeping webhook handlers generic enough to process delivery events from either provider. Store the provider used for each sent email in your database so bounce and complaint events (which arrive via webhook from the original provider) can be matched to the correct record.
AI providers: For LLM-based features, the multi-provider story is the cleanest: Anthropic, OpenAI, and Google all provide chat completion APIs, and the Vercel AI SDK or LiteLLM abstract over the differences. Configure automatic fallback so that if Anthropic is returning 5xx errors, requests automatically route to OpenAI. The quality difference between equivalent models (Claude Sonnet vs GPT-4o) is small enough that users rarely notice.
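A generic failover sketch that applies to any of these provider pairs (the `Provider` shape and `send` signature are illustrative, not any real SDK's interface):

```typescript
type Provider<T> = { name: string; send: () => Promise<T> };

// Try providers in order; fall through on failure and report which one served.
async function withFallback<T>(
  providers: Provider<T>[],
): Promise<{ provider: string; result: T }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, result: await p.send() };
    } catch (e) {
      errors.push(`${p.name}: ${(e as Error).message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

Returning the `provider` name alongside the result matters for the webhook-matching problem described above: store it with the record so delivery events from either provider can be matched later.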
## Observability for Resilience
Resilience patterns only work if you can observe them. A circuit breaker that opens but doesn't alert anyone creates a hidden degradation that could persist for hours before a user reports it.
Instrument every pattern: Log when a circuit breaker opens and closes. Log when a request is served from stale cache. Log when a retry succeeds after failures. Log when graceful degradation activates for a specific component. These logs become your incident diagnosis toolkit — instead of searching through generic error logs, you can query "show me all circuit breaker events in the last hour" and immediately understand the failure pattern.
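A minimal sketch of what that instrumentation can look like: one structured event type covering the cases above, logged as JSON so a log query can filter on `kind`. The event shape is an assumption for illustration, not a standard:

```typescript
// One discriminated union for every resilience event the patterns emit.
type ResilienceEvent =
  | { kind: 'breaker'; api: string; transition: 'open' | 'half-open' | 'closed' }
  | { kind: 'stale-cache'; api: string; ageMs: number }
  | { kind: 'retry-recovered'; api: string; attempts: number }
  | { kind: 'degraded'; component: string };

// Emit one structured log line per event; during an incident a query like
// kind:"breaker" reconstructs the failure timeline immediately.
function logResilience(event: ResilienceEvent): string {
  const line = JSON.stringify({ ...event, at: new Date().toISOString() });
  console.log(line);
  return line;
}
```

The discriminated union also means the compiler flags any pattern that starts emitting events the dashboard doesn't know about.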
Create a resilience dashboard: Track the rate of retries, cache hit/miss ratio, circuit breaker state changes, and fallback activations over time. Spikes in retry rates or circuit breaker activations are leading indicators of provider issues — you often see them before the provider's status page is updated. Set alerts when retry rates exceed 5% of total requests to any single provider.
Conduct resilience testing: Once a quarter, simulate failure scenarios in a staging environment. Block all requests to a secondary API and verify the graceful degradation works as designed. Intentionally exhaust a rate limit and verify the retry behavior is correct. These tests reveal gaps in your resilience implementation before a real incident does.
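One lightweight way to run such drills without touching infrastructure is a fault-injection wrapper. This sketch (names are illustrative) fails a configurable fraction of calls to a dependency so staging can rehearse the degradation paths:

```typescript
// Wrap an async call so a configurable fraction of invocations fail.
// The rng parameter is injectable so tests can force deterministic behavior.
function withFaultInjection<T>(
  fn: () => Promise<T>,
  failureRate: number, // 0 = never fail, 1 = always fail
  rng: () => number = Math.random,
): () => Promise<T> {
  return async () => {
    if (rng() < failureRate) {
      throw new Error('Injected fault (chaos test)');
    }
    return fn();
  };
}
```

Wrapping, say, the reviews fetcher with a 100% failure rate in staging should produce the "Reviews temporarily unavailable" fallback from Pattern 3, not an error page; if it doesn't, the drill found a gap.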
## Methodology

- The opossum npm library (v8.x) implements the circuit breaker pattern for Node.js with configurable thresholds, half-open probing, and event emitters for state changes — more feature-complete than the custom implementation shown above.
- AbortSignal.timeout() is available in Node.js 17.3+ and is preferred over manually pairing AbortController with setTimeout: it is cleaner and handles cleanup automatically.
- The idempotency key pattern (Idempotency-Key header) is supported by Stripe, PayPal, and other payment providers. The key must be unique per operation intent, not per attempt — generating a new key on each retry would defeat the purpose.
- Zod v3.x's safeParse() method (used in Pattern 7) returns a discriminated union rather than throwing, making it composable in resilience-aware code that needs to handle validation failures gracefully.
## Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| No timeouts | One slow API freezes entire app | Set timeouts on every external call |
| Retry without backoff | Makes outages worse | Exponential backoff + jitter |
| Same code path for all errors | Retrying non-retryable errors | Handle 4xx vs 5xx vs network errors differently |
| No fallback for external APIs | Single point of failure | Cache, degrade, or use backup provider |
| Trusting API response format | Breaks when API changes | Validate responses with Zod/schemas |
| No monitoring of API health | Issues discovered by users | Health checks + alerting |
| Tight coupling to one provider | Locked in when problems arise | Abstraction layer for critical APIs |
Find the most reliable APIs on APIScout — uptime tracking, reliability scores, and resilience pattern guides for every provider.
Related: How to Build an AI Chatbot with the Anthropic API, How to Build an API Abstraction Layer in Your App, How to Build an API SDK That Developers Actually Use