
How to Handle Webhook Failures and Retries 2026

APIScout Team


From the sender's side, a webhook is a single HTTP request with a short timeout. If your handler crashes, times out, or returns an error, the provider retries, sometimes for days. Delivery is therefore at-least-once: the same event can arrive more than once. Handling this correctly means your app applies the effect of every event exactly once, even when things go wrong.

How Webhook Retries Work

Provider Retry Policies

| Provider | Max Retries | Retry Schedule | Timeout |
|---|---|---|---|
| Stripe | ~15 over 3 days | Exponential backoff | 20 seconds |
| GitHub | 3 | 10s, 60s, 360s | 10 seconds |
| Twilio | Up to 14 | Exponential | 15 seconds |
| Shopify | 19 over 48 hours | Exponential | 5 seconds |
| PayPal | 15 over 3 days | Exponential | 30 seconds |
| Clerk | Multiple over 3 days | Exponential | 30 seconds |

What Triggers a Retry

| Response | Provider Action |
|---|---|
| 2xx (200-299) | ✅ Success, no retry |
| 3xx (redirect) | ❌ Treated as failure, retries |
| 4xx (client error) | ⚠️ Varies: some providers stop, others retry |
| 5xx (server error) | ❌ Retry with backoff |
| Timeout | ❌ Retry with backoff |
| Connection refused | ❌ Retry with backoff |

Pattern 1: Fast Acknowledgment

Return 200 immediately, process asynchronously:

// ❌ Bad: Process synchronously (can timeout)
app.post('/webhooks/stripe', async (req, res) => {
  const event = verifySignature(req);
  await updateDatabase(event);        // 500ms
  await sendNotification(event);      // 300ms
  await updateAnalytics(event);       // 200ms
  res.status(200).send('OK');         // Total: 1s+ (might timeout)
});

// ✅ Good: Acknowledge fast, process async
app.post('/webhooks/stripe', async (req, res) => {
  // 1. Verify signature (fast — <10ms)
  const event = verifySignature(req);

  // 2. Store raw event (fast — <50ms)
  await db.webhookEvents.create({
    id: event.id,
    type: event.type,
    payload: event,
    status: 'pending',
    receivedAt: new Date(),
  });

  // 3. Acknowledge immediately
  res.status(200).send('OK');

  // 4. Process asynchronously
  processWebhookAsync(event).catch(error => {
    console.error('Webhook processing failed:', error);
  });
});

Pattern 2: Idempotent Processing

Webhooks can be delivered multiple times. Process each event exactly once:

async function processWebhookEvent(event: WebhookEvent): Promise<void> {
  // Check if already processed
  const existing = await db.webhookEvents.findById(event.id);

  if (existing?.status === 'processed') {
    console.log(`Event ${event.id} already processed, skipping`);
    return;
  }

  // Use a transaction to prevent race conditions
  await db.transaction(async (tx) => {
    // Double-check inside transaction (another worker might have started)
    const locked = await tx.webhookEvents.findByIdForUpdate(event.id);
    if (locked?.status === 'processed') return;

    // Process the event
    await handleEvent(event, tx);

    // Mark as processed
    await tx.webhookEvents.update(event.id, {
      status: 'processed',
      processedAt: new Date(),
    });
  });
}

async function handleEvent(event: WebhookEvent, tx: Transaction) {
  switch (event.type) {
    case 'payment_intent.succeeded':
      // Use idempotency key for downstream operations too
      await fulfillOrder(event.data.object.id, tx);
      break;
    case 'customer.subscription.deleted':
      await deactivateSubscription(event.data.object.id, tx);
      break;
    // ... other event types
  }
}

Pattern 3: Signature Verification

Always verify webhook signatures to prevent forgery:

import crypto from 'crypto';

// Stripe signature verification
function verifyStripeSignature(
  payload: string, // Raw body string, NOT parsed JSON
  signature: string,
  secret: string
): boolean {
  const elements = signature.split(',');
  const timestamp = elements.find(e => e.startsWith('t='))?.slice(2);
  const v1Signature = elements.find(e => e.startsWith('v1='))?.slice(3);

  if (!timestamp || !v1Signature) return false;

  // Prevent replay attacks (reject if the timestamp is unparseable or older than 5 minutes)
  const now = Math.floor(Date.now() / 1000);
  const signedAt = Number.parseInt(timestamp, 10);
  if (Number.isNaN(signedAt) || now - signedAt > 300) return false;

  const signedPayload = `${timestamp}.${payload}`;
  const expected = crypto
    .createHmac('sha256', secret)
    .update(signedPayload)
    .digest('hex');

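  // timingSafeEqual throws if the buffers differ in length, so compare lengths first
  const receivedBuf = Buffer.from(v1Signature);
  const expectedBuf = Buffer.from(expected);
  if (receivedBuf.length !== expectedBuf.length) return false;
  return crypto.timingSafeEqual(receivedBuf, expectedBuf);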
}

// Generic HMAC verification (works for most providers)
function verifyHmacSignature(
  payload: string,
  signature: string,
  secret: string,
  algorithm: string = 'sha256'
): boolean {
  const expected = crypto
    .createHmac(algorithm, secret)
    .update(payload)
    .digest('hex');

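  // timingSafeEqual throws if the buffers differ in length, so compare lengths first
  const receivedBuf = Buffer.from(signature);
  const expectedBuf = Buffer.from(expected);
  if (receivedBuf.length !== expectedBuf.length) return false;
  return crypto.timingSafeEqual(receivedBuf, expectedBuf);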
}

Critical: Read the raw request body as a string, NOT parsed JSON. Parsing then re-stringifying changes the payload and breaks signature verification.

// Next.js App Router — get raw body
export async function POST(request: Request) {
  const rawBody = await request.text();
  const signature = request.headers.get('stripe-signature');

  // Reject requests with a missing or invalid signature
  if (!signature || !verifyStripeSignature(rawBody, signature, WEBHOOK_SECRET)) {
    return new Response('Invalid signature', { status: 401 });
  }

  const event = JSON.parse(rawBody);
  // ... process event

  return new Response('OK', { status: 200 });
}

Pattern 4: Dead Letter Queue

When processing fails after all retries, don't lose the event:

class WebhookProcessor {
  async process(event: WebhookEvent): Promise<void> {
    const MAX_INTERNAL_RETRIES = 3;

    for (let attempt = 0; attempt < MAX_INTERNAL_RETRIES; attempt++) {
      try {
        await this.handleEvent(event);
        await this.markProcessed(event.id);
        return;
      } catch (error) {
        console.error(`Attempt ${attempt + 1} failed for event ${event.id}:`, error);

        if (attempt < MAX_INTERNAL_RETRIES - 1) {
          await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
        }
      }
    }

    // All retries failed — move to dead letter queue
    await this.moveToDeadLetter(event);
  }

  private async moveToDeadLetter(event: WebhookEvent) {
    await db.deadLetterQueue.create({
      eventId: event.id,
      eventType: event.type,
      payload: event,
      status: 'failed', // retryDeadLetterEvents() selects rows by this status
      failedAt: new Date(),
      retryCount: 0,
    });

    // Alert team
    await alertSlack(`⚠️ Webhook event failed permanently: ${event.type} (${event.id})`);
  }
}

// Admin tool: retry dead letter events
async function retryDeadLetterEvents() {
  const failed = await db.deadLetterQueue.findAll({ status: 'failed' });

  for (const item of failed) {
    try {
      await processor.handleEvent(item.payload);
      await db.deadLetterQueue.update(item.id, { status: 'resolved' });
      console.log(`Resolved dead letter event: ${item.eventId}`);
    } catch (error) {
      await db.deadLetterQueue.update(item.id, {
        retryCount: item.retryCount + 1,
        lastError: String(error),
      });
    }
  }
}
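
The internal retry loop above waits `2^attempt` seconds between attempts. A minimal sketch of that delay calculation as a standalone helper follows, with two additions that are not in the original code and should be read as assumptions: an upper cap on the delay, and random "full jitter" so that many failing events do not retry at the same instant. The function name and default values are hypothetical.

```typescript
// Delay before the next retry: exponential growth, capped, with full jitter.
// `random` is injectable so the schedule can be tested deterministically.
function backoffDelayMs(
  attempt: number,              // 0 for the first retry
  baseMs: number = 1000,        // matches the 1-second base used above
  capMs: number = 30000,        // assumed cap, not from the original code
  random: () => number = Math.random
): number {
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt));
  // Full jitter: pick a delay uniformly between 0 and the exponential value
  return Math.floor(random() * exponential);
}
```

To keep the original fixed schedule, pass `() => 1` as `random`; the helper then returns 1000, 2000, 4000, and so on up to the cap.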

Pattern 5: Event Ordering

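Webhooks may arrive out of order. Compare event timestamps against what you have stored, and ignore events that are older than your current state: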

// Problem: "subscription.updated" arrives before "subscription.created"
// Solution: Use event timestamps and idempotent operations

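async function handleSubscriptionEvent(event: WebhookEvent) {
  const subscription = event.data.object;
  const eventTime = new Date(event.created * 1000);

  // Skip stale events: if the stored row is at least as new as this event, do nothing.
  // (In many ORMs, putting the timestamp condition in the upsert's `where` would fall
  // through to `create` and hit a unique-constraint error when a newer row exists.)
  const existing = await db.subscriptions.findById(subscription.id);
  if (existing && existing.updatedAt >= eventTime) return;

  // With multiple workers, run the check and the write in one transaction
  // (or under a row lock) so they cannot interleave.
  await db.subscriptions.upsert({
    where: { id: subscription.id },
    create: {
      id: subscription.id,
      status: subscription.status,
      customerId: subscription.customer,
      updatedAt: eventTime,
    },
    update: {
      status: subscription.status,
      updatedAt: eventTime,
    },
  });
}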
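
The ordering rule itself can be isolated from the database. The sketch below is a pure function, with hypothetical type and function names, that decides whether an incoming event should overwrite stored state. It assumes `created` is a Unix timestamp in seconds, as in the Stripe-style events used in this article.

```typescript
interface SubscriptionState {
  status: string;
  updatedAt: number; // Unix seconds of the event that produced this state
}

interface IncomingSubscriptionEvent {
  created: number;   // Unix seconds, taken from the event
  status: string;
}

// Returns the state to keep: the stored one if it is at least as new,
// otherwise a new state built from the incoming event.
function applyIfNewer(
  current: SubscriptionState | undefined,
  event: IncomingSubscriptionEvent
): SubscriptionState {
  if (current && current.updatedAt >= event.created) {
    return current; // stale or same-second event: ignore
  }
  return { status: event.status, updatedAt: event.created };
}
```

Because timestamps have one-second resolution, two events created in the same second are treated as a tie and the first one applied wins. If that matters for your data, fetching the object's current state from the provider's API is a safer tiebreaker than guessing.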

Pattern 6: Monitoring Webhook Health

class WebhookMonitor {
  async recordEvent(eventId: string, type: string, status: 'received' | 'processed' | 'failed') {
    await db.webhookMetrics.create({
      eventId,
      type,
      status,
      timestamp: new Date(),
    });
  }

  async getHealth(hours: number = 24) {
    const since = new Date(Date.now() - hours * 3600000);
    const events = await db.webhookMetrics.findMany({
      where: { timestamp: { gte: since } },
    });

    const received = events.filter(e => e.status === 'received').length;
    const processed = events.filter(e => e.status === 'processed').length;
    const failed = events.filter(e => e.status === 'failed').length;

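    // Guard against division by zero when no events arrived in the window
    const successRate = received > 0 ? processed / received : 1;
    const failureRate = received > 0 ? failed / received : 0;

    return {
      received,
      processed,
      failed,
      successRate,
      failureRate,
      alert: failureRate > 0.05 ? 'HIGH' : 'OK',
    };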
  }
}

Testing Webhooks

// Generate test webhook events locally
import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// Stripe CLI for local testing
// stripe listen --forward-to localhost:3000/webhooks/stripe
// stripe trigger payment_intent.succeeded

// Programmatic test
test('handles payment succeeded webhook', async () => {
  const event = {
    id: 'evt_test_123',
    type: 'payment_intent.succeeded',
    created: Math.floor(Date.now() / 1000),
    data: {
      object: {
        id: 'pi_test_456',
        amount: 2000,
        status: 'succeeded',
        customer: 'cus_test_789',
      },
    },
  };

  const payload = JSON.stringify(event);
  const signature = stripe.webhooks.generateTestHeaderString({
    payload,
    secret: WEBHOOK_SECRET,
  });

  const response = await app.inject({
    method: 'POST',
    url: '/webhooks/stripe',
    headers: {
      'stripe-signature': signature,
      'content-type': 'application/json',
    },
    body: payload,
  });

  expect(response.statusCode).toBe(200);
  const order = await db.orders.findByPaymentIntent('pi_test_456');
  expect(order.status).toBe('paid');
});

Common Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Processing synchronously | Handler timeouts, missed events | Acknowledge fast, process async |
| No idempotency | Duplicate processing on retries | Check event ID before processing |
| Parsing body before signature check | Signature verification fails | Use raw body string for verification |
| No dead letter queue | Failed events lost forever | Store failed events for manual retry |
| Assuming event order | Race conditions, data inconsistency | Use timestamps, idempotent operations |
| No webhook monitoring | Don't know when things break | Track success/failure rates |

Choosing Your Webhook Infrastructure

The patterns above work at any scale, but your infrastructure choices depend on your volume and team size.

For small teams (< 1,000 webhooks/day): The database-backed queue shown in Pattern 1 is sufficient. Use your existing PostgreSQL or MySQL instance. Add an index on (status, receivedAt) for efficient queue polling. A simple cron job every minute picks up pending events. This requires no additional infrastructure.

For medium volume (1K–100K/day): Consider a dedicated job queue like BullMQ (Redis-backed) or a managed service like Inngest. BullMQ gives you priority queues, delayed retries, and job visibility with minimal configuration. Inngest provides a managed fan-out layer with a replay UI, which is particularly useful when you need to replay events after a bug fix without re-triggering the webhook source.

For high volume (100K+/day): Dedicated message queues such as AWS SQS, Google Pub/Sub, or Kafka become cost-effective. SQS FIFO queues preserve order within a message group and deduplicate messages sent within a 5-minute window, which gets you close to exactly-once processing but does not remove the need for idempotent handlers. Kafka is worth the operational overhead when you need event replay across multiple consumers or audit trails going back months.

One pattern worth adopting early: separate your webhook ingestion from your webhook processing. The ingestion layer (receive → verify → store → ack) should be a simple, fast endpoint that almost never fails. The processing layer (parse → handle → side effects) is where complexity lives. By keeping these decoupled, a processing bug never causes you to return 500 to the webhook sender, which avoids a cascade of retries from the provider.
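
A minimal sketch of this split, using an in-memory map as a stand-in for the `webhook_events` table, might look like the following. The class and function names are hypothetical, and the handler is synchronous only to keep the example short. In a real deployment `ingest` would be an INSERT guarded by a unique key on the event ID, and `claimPending` would be a locking query run by the cron job or worker.

```typescript
type EventStatus = 'pending' | 'processing' | 'processed' | 'failed';

interface StoredEvent {
  id: string;
  type: string;
  payload: unknown;
  status: EventStatus;
  receivedAt: number; // epoch milliseconds
}

class InMemoryEventStore {
  private readonly events = new Map<string, StoredEvent>();

  // Ingestion side: store the verified event. Returns false for a duplicate delivery.
  ingest(id: string, type: string, payload: unknown, nowMs: number = Date.now()): boolean {
    if (this.events.has(id)) return false;
    this.events.set(id, { id, type, payload, status: 'pending', receivedAt: nowMs });
    return true;
  }

  // Processing side: claim the oldest pending events and mark them as in progress.
  claimPending(limit: number): StoredEvent[] {
    const pending = Array.from(this.events.values())
      .filter(e => e.status === 'pending')
      .sort((a, b) => a.receivedAt - b.receivedAt)
      .slice(0, limit);
    for (const e of pending) e.status = 'processing';
    return pending;
  }

  finish(id: string, ok: boolean): void {
    const e = this.events.get(id);
    if (e) e.status = ok ? 'processed' : 'failed';
  }

  statusOf(id: string): EventStatus | undefined {
    return this.events.get(id)?.status;
  }
}

// One polling pass. A handler failure marks the event failed,
// but never reaches the HTTP endpoint that acknowledged it.
function runWorkerOnce(
  store: InMemoryEventStore,
  handle: (event: StoredEvent) => void,
  batchSize: number = 10
): number {
  const batch = store.claimPending(batchSize);
  for (const event of batch) {
    try {
      handle(event);
      store.finish(event.id, true);
    } catch (error) {
      store.finish(event.id, false);
    }
  }
  return batch.length;
}
```

The endpoint only ever calls `ingest`, so its response does not depend on whether `handle` has a bug.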

Handling Provider-Specific Quirks

Each major webhook provider has idiosyncrasies worth knowing before you're debugging at 2am.

Stripe delivers events in roughly chronological order, but "roughly" is doing a lot of work in that sentence. If a customer's card is declined and then they update their card and pay successfully, you might receive payment_intent.payment_failed after payment_intent.succeeded. Always use the created timestamp on events, not arrival order, to determine the current state of an object. Stripe also resends events if you return a non-2xx after already processing them — your idempotency check must be bulletproof.

GitHub webhooks have a 10-second timeout — the tightest of any major provider. If you're processing GitHub events (push, pull request, check run), you must return 200 within 10 seconds or the delivery is marked failed. GitHub retries only 3 times with a short backoff, so missed events don't self-heal as gracefully as Stripe's 3-day retry window.

Shopify webhooks use HMAC-SHA256 over the raw body, but base64-encode the HMAC in the X-Shopify-Hmac-SHA256 header. Don't compare it against a hex digest; compare base64 strings. Shopify also has a 5-second timeout, and a single store action can fire webhooks for several subscribed topics at once, so expect bursts rather than one request per action.
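
As an illustration of the base64 comparison described for Shopify, here is a minimal sketch using Node's built-in `crypto` module. The function name is hypothetical, and it assumes you have already read the raw request body as a string and pulled the header value out of the request.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verifies a base64-encoded HMAC-SHA256 signature over the raw request body.
function verifyShopifyHmac(rawBody: string, headerHmac: string, secret: string): boolean {
  const expected = createHmac('sha256', secret)
    .update(rawBody, 'utf8')
    .digest('base64'); // base64, not hex

  const receivedBuf = Buffer.from(headerHmac, 'utf8');
  const expectedBuf = Buffer.from(expected, 'utf8');

  // timingSafeEqual throws on length mismatch, so check lengths first
  if (receivedBuf.length !== expectedBuf.length) return false;
  return timingSafeEqual(receivedBuf, expectedBuf);
}
```

A hex digest of the same HMAC is 64 characters long and the base64 form is 44, so a hex value is rejected by the length check before any byte comparison happens.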

Svix (used by Clerk, Resend, and many others) standardizes webhook delivery with retries, portal UIs, and replay functionality. If you're building a platform where your customers receive webhooks from you, Svix is worth evaluating as a managed delivery layer rather than building retry infrastructure yourself.

Webhook Security Beyond Signature Verification

Signature verification is necessary but not sufficient for a production webhook system.

Allowlist provider IP ranges: Stripe, GitHub, and other major providers publish static IP ranges they use for outbound webhooks. Configure your firewall or load balancer to only accept webhook requests from these ranges. This prevents attackers from sending crafted payloads to your endpoint even if they somehow obtained your webhook secret.
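
If you cannot enforce the allowlist at the firewall, the same check can be sketched in application code. The example below handles IPv4 CIDR ranges only, and the ranges in the usage note are reserved documentation addresses used as placeholders, not any provider's real ranges. All function names are hypothetical. It also assumes you pass in the true client IP; behind a proxy or load balancer, that means a forwarded address you have reason to trust.

```typescript
// Converts dotted IPv4 text to a number, or null if the text is not valid IPv4.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True if `ip` falls inside the IPv4 CIDR block, e.g. "192.0.2.0/24".
function inCidr(ip: string, cidr: string): boolean {
  const [range, bitsText] = cidr.split('/');
  const bits = bitsText === undefined ? 32 : Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const rangeInt = ipv4ToInt(range);
  if (ipInt === null || rangeInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  const blockSize = Math.pow(2, 32 - bits); // number of addresses in the block
  return Math.floor(ipInt / blockSize) === Math.floor(rangeInt / blockSize);
}

function isAllowedSource(ip: string, allowlist: string[]): boolean {
  return allowlist.some(cidr => inCidr(ip, cidr));
}
```

For example, `isAllowedSource('192.0.2.17', ['192.0.2.0/24', '198.51.100.0/24'])` accepts the address, while `'203.0.113.5'` is rejected against the same list. Provider ranges change, so load them from the provider's published list rather than hard-coding them.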

Rate limiting per provider: Your webhook endpoint should have rate limiting specific to each provider. Stripe sends at most a few events per second under normal conditions; a sudden spike of 500 requests/second from "Stripe's IP range" is suspicious. A misconfigured client or an attacker can cause a DDoS via webhook flooding.
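
One common way to implement a per-provider limit is a token bucket, sketched below with one bucket per provider. The class name and the capacity and refill values in the usage note are assumptions to tune against your own traffic. The clock is passed in so the behavior can be tested without waiting.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so a normal burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be rejected.
  tryTake(nowMs: number = Date.now()): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

A handler could keep something like `new TokenBucket(20, 10)` for each provider and respond with 429 when `tryTake()` returns false. Legitimate deliveries rejected this way are usually retried by the provider later, so a slightly low limit delays events rather than dropping them.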

Secrets rotation: Rotate webhook secrets on a schedule (every 90 days is reasonable) and after any suspected breach. Most providers support rotating secrets with a brief dual-validation window so you can update your handler without dropping events. Store webhook secrets in your secrets manager (AWS Secrets Manager, HashiCorp Vault, or similar), not in environment variables baked into build artifacts.
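
On the receiving side, a dual-validation window can be as simple as accepting a signature that matches either the old or the new secret. The sketch below assumes a hex-encoded HMAC-SHA256 signature over the raw body, as in the generic verifier earlier in this article. The function name is hypothetical.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Accepts the signature if it matches ANY of the currently active secrets.
// During rotation, pass [newSecret, oldSecret]; afterwards, pass [newSecret] only.
function matchesAnySecret(rawBody: string, signatureHex: string, secrets: string[]): boolean {
  const receivedBuf = Buffer.from(signatureHex, 'utf8');
  let matched = false;

  for (const secret of secrets) {
    const expectedHex = createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex');
    const expectedBuf = Buffer.from(expectedHex, 'utf8');
    // Length check first: timingSafeEqual throws if the lengths differ
    if (expectedBuf.length === receivedBuf.length && timingSafeEqual(expectedBuf, receivedBuf)) {
      matched = true; // keep looping so every secret is always checked
    }
  }
  return matched;
}
```

Remove the old secret from the list once the provider confirms it has stopped signing with it; leaving it in place indefinitely defeats the purpose of rotating.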

Endpoint URL hygiene: Avoid putting provider names or internal routing details in your webhook URL path. /webhooks/stripe tells an attacker exactly what to craft. /api/events/inbound or /api/hooks/p1 with provider identification done via header is less informative to an attacker probing your endpoints. This is security through obscurity and should never replace signature verification, but it's a cheap layer of defense.

Replay detection window: Beyond the 5-minute timestamp tolerance in the signature verification code above, consider tracking event IDs in a cache (Redis with a 24-hour TTL works well) and rejecting duplicate IDs immediately. This is different from your idempotency check in the processing layer — this check happens at the ingestion layer and prevents even storing the event twice, which matters when you're tracking storage costs or auditing event volumes.
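
The ingestion-layer duplicate check can be sketched with an in-memory map standing in for Redis. The class name is hypothetical and the 24-hour default follows the TTL suggested above. A single-process map does not work across multiple instances; with Redis, the equivalent is one atomic `SET <key> <value> NX EX 86400` per event ID, where a nil reply means the ID was already seen.

```typescript
class SeenEventCache {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time an ID is seen within the TTL window,
  // false for a duplicate that should be acknowledged but not stored again.
  markIfNew(eventId: string, nowMs: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(eventId);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.expiresAt.set(eventId, nowMs + this.ttlMs);
    return true;
  }

  // Drops expired entries so the map does not grow without bound.
  sweep(nowMs: number = Date.now()): number {
    let removed = 0;
    this.expiresAt.forEach((expiry, id) => {
      if (expiry <= nowMs) {
        this.expiresAt.delete(id);
        removed += 1;
      }
    });
    return removed;
  }
}
```

Duplicates caught here should still get a 2xx response; returning an error would only make the provider retry the same event again.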

Observability for Webhook-Heavy Systems

If webhooks are central to your business logic (e.g., payment confirmations, subscription lifecycle), you need dedicated observability beyond application logs.

Track four key metrics over rolling 1-hour and 24-hour windows: events received, events processed successfully, events failed (after dead letter), and processing lag (time from receivedAt to processedAt). Processing lag is the most actionable metric — a rising lag means your workers are falling behind, which can compound during high-traffic periods.

Set up alerts that page on-call when: the failure rate exceeds 2% over a 1-hour window, processing lag exceeds 5 minutes for any event type, or the dead letter queue size exceeds a threshold. For payment events specifically, a zero-tolerance policy (any payment webhook in the dead letter queue triggers an alert) is appropriate — these events have direct revenue impact and manual intervention is almost always justified.
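
The metrics and thresholds above can be expressed as two small functions. This is a sketch with hypothetical names. The 2% and 5-minute thresholds come from the text; one assumption is added: an event that is still pending counts toward lag using the current time, so a stalled worker shows up before anything finishes.

```typescript
interface TrackedEvent {
  type: string;
  status: 'pending' | 'processed' | 'failed';
  receivedAt: number;   // epoch milliseconds
  processedAt?: number; // epoch milliseconds, set once processed
}

interface WindowStats {
  received: number;
  failed: number;
  failureRate: number;
  maxLagMs: number;
}

function windowStats(events: TrackedEvent[], nowMs: number, windowMs: number): WindowStats {
  const recent = events.filter(e => e.receivedAt >= nowMs - windowMs);
  const failed = recent.filter(e => e.status === 'failed').length;

  let maxLagMs = 0;
  for (const e of recent) {
    if (e.status === 'failed') continue;
    // Assumption: pending events accrue lag up to "now"
    const end = e.processedAt !== undefined ? e.processedAt : nowMs;
    maxLagMs = Math.max(maxLagMs, end - e.receivedAt);
  }

  return {
    received: recent.length,
    failed,
    failureRate: recent.length === 0 ? 0 : failed / recent.length,
    maxLagMs,
  };
}

// Evaluates the alert rules against one hour of stats.
function alertsFor(hourly: WindowStats, deadLetteredPaymentEvents: number): string[] {
  const alerts: string[] = [];
  if (hourly.failureRate > 0.02) alerts.push('failure-rate');
  if (hourly.maxLagMs > 5 * 60 * 1000) alerts.push('processing-lag');
  if (deadLetteredPaymentEvents > 0) alerts.push('payment-dead-letter');
  return alerts;
}
```

Running `windowStats` twice, once with a 1-hour window and once with 24 hours, gives the two rolling views; only the hourly one feeds the paging rules here.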

Many teams use a webhook dashboard — a simple internal page showing event type counts, recent failures, and queue depth — so support and operations teams can self-serve answers during incidents without needing database access.

Methodology

Retry schedules in the provider table are sourced from each provider's official documentation as of early 2026: Stripe's webhook documentation, GitHub's webhook payload documentation, Twilio's webhooks guide, and Shopify's webhook authentication reference. Exact retry counts and schedules change — always verify against your provider's current docs before relying on specific values. The timingSafeEqual pattern for HMAC comparison prevents timing-based side-channel attacks and is recommended by OWASP for all secret comparison operations. BullMQ v5.x and Inngest v3.x are the current major versions as of 2026.

