Build Resilient API Integrations That Don't Break 2026

APIScout Team

How to Build Resilient API Integrations That Don't Break

Every API you depend on will go down. It will have bugs. It will change its response format. It will rate-limit you at the worst possible time. The question isn't whether your API integrations will face problems — it's whether your application survives them gracefully.

The Failure Modes

What Goes Wrong with API Integrations

| Failure Mode | Frequency | Impact |
| --- | --- | --- |
| Timeout | Daily | Slow responses cascade through your system |
| Rate limiting (429) | Daily-weekly | Requests fail until rate resets |
| Server error (5xx) | Weekly | Temporary failures, usually recoverable |
| DNS resolution failure | Monthly | Complete inability to connect |
| Certificate expiry | Rare but devastating | HTTPS connections fail |
| Breaking API change | Quarterly | Integration stops working |
| Response format change | Quarterly | Parsing errors, data corruption |
| Deprecation | Annually | Endpoints removed, features dropped |
| Provider shutdown | Rare | Complete integration loss |

Pattern 1: Timeouts on Everything

The most common cause of cascading failure: no timeouts.

// ❌ No timeout: the request can hang for a long time if the API is slow
const slowResponse = await fetch('https://api.example.com/data');

// ✅ Always set timeouts
const response = await fetch('https://api.example.com/data', {
  signal: AbortSignal.timeout(5000), // 5 second timeout
});

// ✅ Even better — different timeouts for different operations
const TIMEOUTS = {
  read: 5000,      // 5s for reads
  write: 10000,    // 10s for writes
  upload: 60000,   // 60s for file uploads
  webhook: 3000,   // 3s for webhook delivery
};

async function apiCall(path: string, type: keyof typeof TIMEOUTS) {
  return fetch(`https://api.example.com${path}`, {
    signal: AbortSignal.timeout(TIMEOUTS[type]),
  });
}

Rule of thumb: Set the timeout to roughly 2-5x the expected response time, ideally based on the API's observed tail latency rather than its average. If the API normally responds in 200ms, time out at 500ms-1s.
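AbortSignal.timeout() covers fetch, but "timeouts on everything" also applies to SDK calls and other promises that do not accept a signal. A minimal sketch of a generic wrapper follows; `withTimeout` and `TimeoutError` are hypothetical names introduced here, not part of any library.

```typescript
// Hypothetical error type used to tell timeouts apart from other failures.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Hypothetical helper: rejects if `promise` does not settle within `ms`.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Because the wrapped operation keeps running after the timeout fires, prefer a real cancellation mechanism such as an AbortSignal wherever the client supports one.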

Pattern 2: Circuit Breaker

Stop calling a broken API. Let it recover instead of overwhelming it with retries.

class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private failureThreshold: number = 5,
    private resetTimeMs: number = 30000, // 30 seconds
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      // Check if enough time has passed to try again
      if (Date.now() - this.lastFailure > this.resetTimeMs) {
        this.state = 'half-open';
      } else {
        throw new CircuitOpenError('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();

    if (this.failures >= this.failureThreshold) {
      this.state = 'open';
    }
  }

  getState() {
    return {
      state: this.state,
      failures: this.failures,
    };
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

// Usage
const paymentCircuit = new CircuitBreaker(5, 30000);

async function processPayment(amount: number) {
  try {
    return await paymentCircuit.execute(() =>
      stripe.charges.create({ amount, currency: 'usd' })
    );
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Payment provider is down — queue for later
      await queueForRetry({ amount, type: 'payment' });
      return { status: 'queued', message: 'Payment will be processed shortly' };
    }
    throw error;
  }
}

Pattern 3: Graceful Degradation

When an API is down, serve reduced functionality instead of breaking entirely.

// Example: Product page with reviews from external API

async function getProductPage(productId: string) {
  // Core data — from your database (must succeed)
  const product = await db.products.findById(productId);

  // Enhanced data — from external APIs (can fail gracefully)
  const [reviews, recommendations, inventory] = await Promise.allSettled([
    fetchReviews(productId),        // Third-party reviews API
    fetchRecommendations(productId), // ML recommendation API
    fetchInventory(productId),       // Warehouse API
  ]);

  return {
    product,
    reviews: reviews.status === 'fulfilled'
      ? reviews.value
      : { items: [], message: 'Reviews temporarily unavailable' },
    recommendations: recommendations.status === 'fulfilled'
      ? recommendations.value
      : [],
    inventory: inventory.status === 'fulfilled'
      ? inventory.value
      : { available: true, message: 'Check store for availability' },
  };
}

Degradation Levels

| Level | What Works | What's Degraded | User Experience |
| --- | --- | --- | --- |
| Full | Everything | Nothing | Normal |
| Partial | Core features | Enhancements (reviews, recommendations) | Minor loss |
| Minimal | Read operations | Write operations queued | Can browse, can't act |
| Cached | Stale data served | No fresh data | "Data as of X minutes ago" |
| Maintenance | Nothing | Everything | Maintenance page |

Pattern 4: Caching and Stale Data

Serve cached data when the API is unavailable:

class CachedAPIClient {
  constructor(
    private cache: Map<string, { data: any; timestamp: number }> = new Map(),
    private maxAge: number = 300000, // 5 minutes
    private staleMaxAge: number = 3600000, // 1 hour (serve stale if API is down)
  ) {}

  // The cache is keyed by URL only, so use this for idempotent GET requests
  async fetch<T>(url: string, options?: RequestInit): Promise<T & { _cached?: boolean; _stale?: boolean }> {
    const cached = this.cache.get(url);

    // Fresh cache — serve immediately
    if (cached && Date.now() - cached.timestamp < this.maxAge) {
      return { ...cached.data, _cached: true };
    }

    // Try fresh fetch
    try {
      const response = await fetch(url, {
        ...options,
        signal: AbortSignal.timeout(5000),
      });

      if (!response.ok) throw new Error(`HTTP ${response.status}`);

      const data = await response.json();
      this.cache.set(url, { data, timestamp: Date.now() });
      return data;
    } catch (error) {
      // Fetch failed — serve stale cache if available
      if (cached && Date.now() - cached.timestamp < this.staleMaxAge) {
        console.warn(`Serving stale cache for ${url} (age: ${Date.now() - cached.timestamp}ms)`);
        return { ...cached.data, _cached: true, _stale: true };
      }

      throw error; // No cache available — propagate error
    }
  }
}

Pattern 5: Idempotent Retry with Deduplication

Safe to retry without duplicate side effects:

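import { randomUUID } from 'node:crypto'; // Node.js; browsers can use crypto.randomUUID()

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Marks errors that must not be retried (4xx other than 429)
class NonRetryableError extends Error {}

async function createOrderWithRetry(orderData: OrderInput): Promise<Order> {
  // Generate the idempotency key ONCE, before the first attempt.
  // A UUID avoids collisions when the same user submits two orders in the same millisecond.
  const idempotencyKey = `order_${orderData.userId}_${randomUUID()}`;
  const maxAttempts = 3;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await fetch('https://api.payments.com/v1/orders', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // Same key for all retries
        },
        body: JSON.stringify(orderData),
        signal: AbortSignal.timeout(10000),
      });

      if (response.ok) return response.json();

      if (response.status !== 429 && response.status < 500) {
        // 4xx (except 429): client error, retrying won't help
        throw new NonRetryableError(`HTTP ${response.status}: ${await response.text()}`);
      }
      // 429 or 5xx: retryable, and the shared idempotency key prevents duplicate charges
    } catch (error) {
      // Never retry client errors; rethrow network errors and timeouts on the last attempt
      if (error instanceof NonRetryableError || attempt === maxAttempts - 1) throw error;
    }

    // Exponential backoff before the next attempt (skipped after the last one)
    if (attempt < maxAttempts - 1) await sleep(2 ** attempt * 1000);
  }

  throw new Error('Max retries exceeded');
}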

Pattern 6: Health Check Monitoring

Detect issues before they hit users:

class APIHealthChecker {
  private healthStatus: Map<string, {
    healthy: boolean;
    lastCheck: number;
    latency: number;
    consecutiveFailures: number;
  }> = new Map();

  async check(name: string, healthUrl: string): Promise<boolean> {
    const start = Date.now();

    try {
      const response = await fetch(healthUrl, {
        signal: AbortSignal.timeout(3000),
      });

      const healthy = response.ok;
      const latency = Date.now() - start;

      this.healthStatus.set(name, {
        healthy,
        lastCheck: Date.now(),
        latency,
        consecutiveFailures: healthy ? 0 : (this.healthStatus.get(name)?.consecutiveFailures ?? 0) + 1,
      });

      return healthy;
    } catch {
      const current = this.healthStatus.get(name);
      this.healthStatus.set(name, {
        healthy: false,
        lastCheck: Date.now(),
        latency: Date.now() - start,
        consecutiveFailures: (current?.consecutiveFailures ?? 0) + 1,
      });
      return false;
    }
  }

  getStatus() {
    return Object.fromEntries(this.healthStatus);
  }
}

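// Usage: check every 30 seconds.
// The URLs below are placeholders: use each provider's documented health or status
// endpoint, since an endpoint that requires auth will always report unhealthy here.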
const checker = new APIHealthChecker();

setInterval(async () => {
  await Promise.all([
    checker.check('stripe', 'https://api.stripe.com/v1'),
    checker.check('resend', 'https://api.resend.com/health'),
    checker.check('auth', 'https://api.clerk.com/v1/health'),
  ]);

  const status = checker.getStatus();
  // Alert if any API has 3+ consecutive failures
  for (const [name, state] of Object.entries(status)) {
    if (state.consecutiveFailures >= 3) {
      await alertOps(`${name} API is unhealthy: ${state.consecutiveFailures} consecutive failures`);
    }
  }
}, 30000);

Pattern 7: Response Validation

Don't trust API responses — validate them:

import { z } from 'zod';

// Define expected response shape
const UserResponseSchema = z.object({
  id: z.string(),
  email: z.string().email(),
  name: z.string(),
  created_at: z.string().datetime(),
});

type UserResponse = z.infer<typeof UserResponseSchema>;

async function getUser(userId: string): Promise<UserResponse> {
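  const response = await fetch(`/api/users/${userId}`);
  // Check the status first so an error body isn't mistaken for a schema change
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const data: unknown = await response.json();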

  // Validate response matches expected schema
  const result = UserResponseSchema.safeParse(data);

  if (!result.success) {
    // API response format changed — log and alert
    console.error('API response validation failed:', {
      endpoint: `/api/users/${userId}`,
      errors: result.error.issues,
      received: data,
    });

    // Option 1: Throw (fail fast)
    throw new Error('API response format changed');

    // Option 2: Use with defaults (graceful)
    // return { ...defaults, ...data };
  }

  return result.data;
}

The Resilience Checklist

| Pattern | Priority | Impact |
| --- | --- | --- |
| Timeouts on all API calls | P0 | Prevents cascading failures |
| Exponential backoff with jitter | P0 | Handles rate limits and transient errors |
| Input/output validation | P0 | Catches API changes early |
| Circuit breaker | P1 | Stops hammering failing APIs |
| Graceful degradation | P1 | Users get partial functionality vs errors |
| Response caching (stale-while-error) | P1 | Serves data during outages |
| Idempotency keys on writes | P1 | Safe retries without duplicates |
| Health check monitoring | P2 | Early detection of issues |
| Multi-provider fallback | P2 | Survive provider outages |
| Response schema validation | P2 | Detect breaking changes |
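The checklist lists exponential backoff with jitter, while the retry loop in Pattern 5 uses plain exponential backoff. One common way to add jitter is the "full jitter" variant sketched below; `backoffWithJitter` and its default values are assumptions chosen for illustration.

```typescript
// Hypothetical helper: "full jitter" backoff. Picks a random delay between 0 and
// min(capMs, baseMs * 2^attempt) so that many clients don't retry in lockstep.
function backoffWithJitter(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random, // injectable so tests can be deterministic
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

In a retry loop, this would replace the fixed delay, for example `await sleep(backoffWithJitter(attempt))`. The cap keeps late attempts from waiting unreasonably long.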

Designing for Resilience from the Start

Resilience is significantly cheaper to design in than to retrofit. An API integration that was written assuming the API always succeeds requires invasive changes to add timeouts, retries, and fallbacks — often touching dozens of call sites spread across the codebase. An integration written with resilience from day one costs 20-30% more time initially but avoids the 10x more expensive retrofit when the first production incident hits.

The single most effective architectural decision for resilience is centralization: put all calls to a given API through a single service class rather than calling the API client library directly throughout your codebase. When you need to add a timeout, you add it in one place. When you need to add a circuit breaker, you add it in one place. When the API changes its authentication scheme, you update one place. The patterns in this guide are much easier to implement and maintain when applied to a centralized service layer.
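A minimal sketch of such a centralized service layer is shown below. The class name `BillingApiClient`, the base URL, and the `/invoices` path are hypothetical; the point is only that every call funnels through one `request` method.

```typescript
type FetchFn = (url: string, init?: RequestInit) => Promise<Response>;

// Hypothetical centralized client for one external API.
class BillingApiClient {
  constructor(
    private baseUrl: string,
    private timeoutMs = 5000,
    // Injectable so tests can stub the network; defaults to the global fetch.
    private fetchFn: FetchFn = (url, init) => fetch(url, init),
  ) {}

  // Every request goes through here, so timeouts, auth headers, circuit
  // breakers, and logging only need to be added in this one method.
  private async request<T>(path: string, init: RequestInit = {}): Promise<T> {
    const response = await this.fetchFn(`${this.baseUrl}${path}`, {
      ...init,
      signal: AbortSignal.timeout(this.timeoutMs),
    });
    if (!response.ok) throw new Error(`HTTP ${response.status} for ${path}`);
    return (await response.json()) as T;
  }

  // Call sites use typed methods like this instead of calling fetch directly.
  getInvoice(id: string) {
    return this.request<{ id: string; total: number }>(`/invoices/${id}`);
  }
}
```

Wrapping `request` in the circuit breaker from Pattern 2 or the cache from Pattern 4 then becomes a change to one method rather than to every call site.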

Prioritize based on blast radius: Not all API failures are equal. A payment API failure loses revenue. A recommendation API failure shows generic suggestions. A CDN failure makes pages slightly slower. Invest resilience effort proportionally: circuit breakers and multi-provider fallbacks for payment and auth APIs, simple timeout + retry for enrichment APIs, basic error logging for purely optional integrations. Don't over-engineer resilience for low-blast-radius integrations.
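One way to make this prioritization explicit is a small policy table keyed by tier, sketched below. The tier names, integration names, and every number are illustrative assumptions, not recommendations from a specific provider.

```typescript
type Tier = 'critical' | 'enrichment' | 'optional';

interface ResiliencePolicy {
  timeoutMs: number;
  maxRetries: number;
  circuitBreaker: boolean;
  fallbackProvider: boolean;
}

// Hypothetical policies: more machinery for a higher blast radius.
const POLICIES: Record<Tier, ResiliencePolicy> = {
  critical: { timeoutMs: 10_000, maxRetries: 3, circuitBreaker: true, fallbackProvider: true },
  enrichment: { timeoutMs: 3_000, maxRetries: 1, circuitBreaker: false, fallbackProvider: false },
  optional: { timeoutMs: 2_000, maxRetries: 0, circuitBreaker: false, fallbackProvider: false },
};

// Hypothetical mapping of integrations to tiers.
const INTEGRATION_TIERS: Record<string, Tier> = {
  payments: 'critical',
  auth: 'critical',
  recommendations: 'enrichment',
  analytics: 'optional',
};

// Unknown integrations get the cheapest policy until someone classifies them.
function policyFor(integration: string): ResiliencePolicy {
  return POLICIES[INTEGRATION_TIERS[integration] ?? 'optional'];
}
```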

Multi-Provider Fallback Patterns

For business-critical APIs, a single provider creates a single point of failure. Multi-provider fallback — where you have a backup provider that can handle the same operation — is the most robust resilience strategy but also the most complex to implement.

Payment providers: Stripe + PayPal (or Braintree) is the most common combination. When Stripe's API is degraded, route new payment attempts to PayPal. This requires maintaining customer relationships in both systems (customers must have a payment method on file with the backup provider), or using payment method tokenization that works across providers. Stripe Radar's fraud detection and PayPal's buyer protection have different characteristics, so test the end-to-end experience on both providers before declaring the fallback ready.

Email providers: Resend + Postmark (or SendGrid) is a standard combination. Email fallback is simpler than payment fallback because most transactional email doesn't require cross-provider customer account setup. The main consideration is keeping webhook handlers generic enough to process delivery events from either provider. Store the provider used for each sent email in your database so bounce and complaint events (which arrive via webhook from the original provider) can be matched to the correct record.

AI providers: For LLM-based features, the multi-provider story is the cleanest: Anthropic, OpenAI, and Google all provide chat completion APIs, and the Vercel AI SDK or LiteLLM abstract over the differences. Configure automatic fallback so that if Anthropic is returning 5xx errors, requests automatically route to OpenAI. The quality difference between equivalent models (Claude Sonnet vs GPT-4o) is small enough that users rarely notice.
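The fallback routing described in this section can be sketched generically as below. The `Provider` interface and `callWithFallback` are hypothetical names; a real setup would wrap each vendor's SDK behind the `call` function.

```typescript
// Hypothetical shape: each provider exposes the same operation behind one interface.
interface Provider<In, Out> {
  name: string;
  call: (input: In) => Promise<Out>;
}

// Tries providers in order and returns the first success, along with the name
// of the provider that handled it (store this to match webhook events later).
async function callWithFallback<In, Out>(
  providers: Provider<In, Out>[],
  input: In,
): Promise<{ provider: string; result: Out }> {
  const failures: string[] = [];
  for (const provider of providers) {
    try {
      return { provider: provider.name, result: await provider.call(input) };
    } catch (error) {
      // Record the failure and fall through to the next provider.
      failures.push(`${provider.name}: ${error instanceof Error ? error.message : String(error)}`);
    }
  }
  throw new Error(`All providers failed (${failures.join('; ')})`);
}
```

This sketch falls back on any error. In practice you would likely fall back only on 5xx responses, timeouts, and open circuits, because a 4xx validation error will usually fail on the backup provider too.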

Observability for Resilience

Resilience patterns only work if you can observe them. A circuit breaker that opens but doesn't alert anyone creates a hidden degradation that could persist for hours before a user reports it.

Instrument every pattern: Log when a circuit breaker opens and closes. Log when a request is served from stale cache. Log when a retry succeeds after failures. Log when graceful degradation activates for a specific component. These logs become your incident diagnosis toolkit — instead of searching through generic error logs, you can query "show me all circuit breaker events in the last hour" and immediately understand the failure pattern.
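As a rough illustration, the events listed above could be recorded as structured entries with a fixed set of kinds, which is what makes a query like "all circuit breaker events in the last hour" possible. `ResilienceLog` and the event shapes are assumptions; in production the entries would go to your log aggregator rather than an in-memory array.

```typescript
// Hypothetical event shapes, one per pattern mentioned above.
type ResilienceEvent =
  | { kind: 'circuit_state_change'; api: string; from: string; to: string }
  | { kind: 'stale_cache_served'; api: string; ageMs: number }
  | { kind: 'retry_succeeded'; api: string; attempts: number }
  | { kind: 'degraded'; api: string; component: string };

type LoggedEvent = ResilienceEvent & { timestamp: number };

class ResilienceLog {
  private events: LoggedEvent[] = [];

  // `sink` receives one JSON line per event; defaults to stdout.
  constructor(private sink: (line: string) => void = console.log) {}

  record(event: ResilienceEvent, now: number = Date.now()): void {
    const entry: LoggedEvent = { ...event, timestamp: now };
    this.events.push(entry);
    this.sink(JSON.stringify(entry));
  }

  // Returns events of one kind recorded within the last `windowMs` milliseconds.
  query(kind: ResilienceEvent['kind'], windowMs: number, now: number = Date.now()): LoggedEvent[] {
    return this.events.filter((e) => e.kind === kind && now - e.timestamp <= windowMs);
  }
}
```

A circuit breaker would call `record` whenever its state changes, for example `log.record({ kind: 'circuit_state_change', api: 'stripe', from: 'closed', to: 'open' })`.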

Create a resilience dashboard: Track the rate of retries, cache hit/miss ratio, circuit breaker state changes, and fallback activations over time. Spikes in retry rates or circuit breaker activations are leading indicators of provider issues — you often see them before the provider's status page is updated. Set alerts when retry rates exceed 5% of total requests to any single provider.
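The 5% alert rule can be sketched as a per-provider counter. `RetryRateTracker` is a hypothetical name, and the sketch assumes every attempt counts as a request, with retries being any attempt after the first.

```typescript
// Hypothetical counter: tracks attempts and retries per provider within one
// time window and flags providers whose retry rate exceeds a threshold.
class RetryRateTracker {
  private counts = new Map<string, { requests: number; retries: number }>();

  // Call once per attempt; pass wasRetry = true for any attempt after the first.
  record(provider: string, wasRetry: boolean): void {
    const c = this.counts.get(provider) ?? { requests: 0, retries: 0 };
    c.requests++;
    if (wasRetry) c.retries++;
    this.counts.set(provider, c);
  }

  retryRate(provider: string): number {
    const c = this.counts.get(provider);
    return c && c.requests > 0 ? c.retries / c.requests : 0;
  }

  // Providers whose retry rate is strictly above the threshold (5% by default).
  providersOverThreshold(threshold = 0.05): string[] {
    return Array.from(this.counts.keys()).filter((p) => this.retryRate(p) > threshold);
  }

  // Call at the start of each measurement window.
  reset(): void {
    this.counts.clear();
  }
}
```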

Conduct resilience testing: Once a quarter, simulate failure scenarios in a staging environment. Block all requests to a secondary API and verify the graceful degradation works as designed. Intentionally exhaust a rate limit and verify the retry behavior is correct. These tests reveal gaps in your resilience implementation before a real incident does.
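For these drills, one lightweight option is to inject faults at the fetch layer instead of at the network. The sketch below is a hypothetical test helper, `withFaults`, which assumes your API clients accept an injectable fetch function.

```typescript
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

// Hypothetical test helper: wraps a fetch function and forces failures for URLs
// matching `blocked`, simulating an outage or an exhausted rate limit.
// Pass a RegExp without the `g` flag, since `test` is stateful with it.
function withFaults(
  inner: FetchLike,
  blocked: RegExp,
  mode: 'network_error' | 'rate_limited',
): FetchLike {
  return async (url, init) => {
    if (!blocked.test(url)) return inner(url, init); // unaffected traffic passes through
    if (mode === 'rate_limited') {
      return new Response('Too Many Requests', {
        status: 429,
        headers: { 'Retry-After': '1' },
      });
    }
    throw new TypeError('fetch failed (simulated outage)');
  };
}
```

In a staging test you would build the client with `withFaults(fetch, /reviews\.example\.test/, 'network_error')` and assert that the page still renders with the degraded fallback values.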

Methodology

The opossum npm library (v8.x) implements the circuit breaker pattern for Node.js with configurable thresholds, half-open probing, and event emitters for state changes, and is more feature-complete than the custom implementation shown above. AbortSignal.timeout() is available in Node.js 17.3+ (and 16.14+) and is preferred over manually pairing AbortController with setTimeout, as it is cleaner and handles cleanup automatically. The idempotency key pattern is supported by Stripe via the Idempotency-Key header and by PayPal via the PayPal-Request-Id header; other payment providers offer similar mechanisms under their own header names. The key must be unique per operation intent (not per attempt): generating a new key on each retry would defeat the purpose. Zod v3.x's safeParse() method (used in Pattern 7) returns a discriminated union rather than throwing, making it composable in resilience-aware code that needs to handle validation failures gracefully.

Common Mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| No timeouts | One slow API freezes entire app | Set timeouts on every external call |
| Retry without backoff | Makes outages worse | Exponential backoff + jitter |
| Same code path for all errors | Retrying non-retryable errors | Handle 4xx vs 5xx vs network errors differently |
| No fallback for external APIs | Single point of failure | Cache, degrade, or use backup provider |
| Trusting API response format | Breaks when API changes | Validate responses with Zod/schemas |
| No monitoring of API health | Issues discovered by users | Health checks + alerting |
| Tight coupling to one provider | Locked in when problems arise | Abstraction layer for critical APIs |

Find the most reliable APIs on APIScout — uptime tracking, reliability scores, and resilience pattern guides for every provider.

Related: How to Build an AI Chatbot with the Anthropic API, How to Build an API Abstraction Layer in Your App, How to Build an API SDK That Developers Actually Use
