How to Handle Webhook Failures and Retries 2026
Webhooks are fire-and-forget from the sender's perspective: the provider sends the request and judges delivery only by your response. If your handler crashes, times out, or returns an error, the provider retries, sometimes for days. Handling this correctly means your app applies the effects of every event exactly once, even when an event is delivered several times or something goes wrong.
How Webhook Retries Work
Provider Retry Policies
| Provider | Max Retries | Retry Schedule | Timeout |
|---|---|---|---|
| Stripe | ~15 over 3 days | Exponential backoff | 20 seconds |
| GitHub | None (no automatic redelivery) | Manual or API-triggered redelivery | 10 seconds |
| Twilio | Up to 14 | Exponential | 15 seconds |
| Shopify | 19 over 48 hours | Exponential | 5 seconds |
| PayPal | 15 over 3 days | Exponential | 30 seconds |
| Clerk | Multiple over 3 days | Exponential | 30 seconds |
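To make the "Exponential" entries in the table concrete, here is a minimal sketch of how an exponential backoff schedule grows. The base delay, growth factor, and cap are illustrative assumptions, not the published schedule of any provider listed above.

```typescript
// Illustrative only: baseMs, factor, and capMs are assumed values,
// not the real schedule of any provider in the table.
function backoffSchedule(
  attempts: number,
  baseMs = 1_000,
  factor = 2,
  capMs = 3_600_000,
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < attempts; attempt++) {
    // The delay grows geometrically, then flattens at the cap.
    delays.push(Math.min(baseMs * factor ** attempt, capMs));
  }
  return delays;
}

const delays = backoffSchedule(5);
console.log(delays); // [1000, 2000, 4000, 8000, 16000]
```

The cap matters for long retry windows: without it, a schedule spanning days would quickly produce delays longer than the window itself.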
What Triggers a Retry
| Response | Provider Action |
|---|---|
| 2xx (200-299) | ✅ Success — no retry |
| 3xx (redirect) | ❌ Treated as failure, retries |
| 4xx (client error) | ⚠️ Varies — some providers stop, others retry |
| 5xx (server error) | ❌ Retry with backoff |
| Timeout | ❌ Retry with backoff |
| Connection refused | ❌ Retry with backoff |
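One way to apply the table above is to decide your response code from the outcome of the ingestion step, so you only invite a retry when a retry can help. The outcome names below are hypothetical labels for this sketch, and the choice of 400 and 503 is one reasonable convention rather than a provider requirement.

```typescript
// Hypothetical outcome labels for the ingestion step.
type IngestOutcome =
  | 'stored'
  | 'duplicate'
  | 'bad_signature'
  | 'storage_unavailable';

function statusFor(outcome: IngestOutcome): number {
  switch (outcome) {
    case 'stored':
    case 'duplicate':
      return 200; // we already hold the event, so no retry is needed
    case 'bad_signature':
      return 400; // a retry will not fix this; provider behavior on 4xx varies
    case 'storage_unavailable':
      return 503; // transient: ask the provider to retry with backoff
  }
}

console.log(statusFor('duplicate')); // 200
```

Returning 200 for duplicates is deliberate: a redelivered event that you have already stored is a success from the provider's point of view.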
Pattern 1: Fast Acknowledgment
Return 200 immediately, process asynchronously:
// ❌ Bad: Process synchronously (can timeout)
app.post('/webhooks/stripe', async (req, res) => {
const event = verifySignature(req);
await updateDatabase(event); // 500ms
await sendNotification(event); // 300ms
await updateAnalytics(event); // 200ms
res.status(200).send('OK'); // Total: 1s+ (might time out)
});
// ✅ Good: Acknowledge fast, process async
app.post('/webhooks/stripe', async (req, res) => {
// 1. Verify signature (fast — <10ms)
const event = verifySignature(req);
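// 2. Store raw event (fast, <50ms). Upsert so a redelivered event
// with the same ID does not fail on the unique key and trigger more retries.
await db.webhookEvents.upsert({
where: { id: event.id },
create: {
id: event.id,
type: event.type,
payload: event,
status: 'pending',
receivedAt: new Date(),
},
update: {}, // already stored: leave the existing row untouched
});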
// 3. Acknowledge immediately
res.status(200).send('OK');
// 4. Process asynchronously
processWebhookAsync(event).catch(error => {
console.error('Webhook processing failed:', error);
});
});
Pattern 2: Idempotent Processing
Webhooks can be delivered multiple times. Process each event exactly once:
async function processWebhookEvent(event: WebhookEvent): Promise<void> {
// Check if already processed
const existing = await db.webhookEvents.findById(event.id);
if (existing?.status === 'processed') {
console.log(`Event ${event.id} already processed, skipping`);
return;
}
// Use a transaction to prevent race conditions
await db.transaction(async (tx) => {
// Double-check inside transaction (another worker might have started)
const locked = await tx.webhookEvents.findByIdForUpdate(event.id);
if (locked?.status === 'processed') return;
// Process the event
await handleEvent(event, tx);
// Mark as processed
await tx.webhookEvents.update(event.id, {
status: 'processed',
processedAt: new Date(),
});
});
}
async function handleEvent(event: WebhookEvent, tx: Transaction) {
switch (event.type) {
case 'payment_intent.succeeded':
// Use idempotency key for downstream operations too
await fulfillOrder(event.data.object.id, tx);
break;
case 'customer.subscription.deleted':
await deactivateSubscription(event.data.object.id, tx);
break;
// ... other event types
}
}
Pattern 3: Signature Verification
Always verify webhook signatures to prevent forgery:
import crypto from 'crypto';
// Stripe signature verification
function verifyStripeSignature(
payload: string, // Raw body string, NOT parsed JSON
signature: string,
secret: string
): boolean {
const elements = signature.split(',');
const timestamp = elements.find(e => e.startsWith('t='))?.slice(2);
const v1Signature = elements.find(e => e.startsWith('v1='))?.slice(3);
if (!timestamp || !v1Signature) return false;
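// Prevent replay attacks (reject if older than 5 minutes or not a valid number)
const now = Math.floor(Date.now() / 1000);
const timestampSeconds = Number.parseInt(timestamp, 10);
if (Number.isNaN(timestampSeconds) || now - timestampSeconds > 300) return false;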
const signedPayload = `${timestamp}.${payload}`;
const expected = crypto
.createHmac('sha256', secret)
.update(signedPayload)
.digest('hex');
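// timingSafeEqual throws if the buffers differ in length, so compare lengths first
const received = Buffer.from(v1Signature);
const expectedBuffer = Buffer.from(expected);
return received.length === expectedBuffer.length &&
crypto.timingSafeEqual(received, expectedBuffer);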
}
// Generic HMAC verification (works for most providers)
function verifyHmacSignature(
payload: string,
signature: string,
secret: string,
algorithm: string = 'sha256'
): boolean {
const expected = crypto
.createHmac(algorithm, secret)
.update(payload)
.digest('hex');
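// timingSafeEqual throws if the buffers differ in length, so compare lengths first
const received = Buffer.from(signature);
const expectedBuffer = Buffer.from(expected);
return received.length === expectedBuffer.length &&
crypto.timingSafeEqual(received, expectedBuffer);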
}
Critical: Read the raw request body as a string, NOT parsed JSON. Parsing then re-stringifying changes the payload and breaks signature verification.
// Next.js App Router — get raw body
export async function POST(request: Request) {
const rawBody = await request.text();
const signature = request.headers.get('stripe-signature')!;
if (!verifyStripeSignature(rawBody, signature, WEBHOOK_SECRET)) {
return new Response('Invalid signature', { status: 401 });
}
const event = JSON.parse(rawBody);
// ... process event
}
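The raw-body warning above is easy to verify for yourself. This small sketch, using only Node's built-in crypto module, shows that parsing and re-stringifying a payload changes its bytes and therefore its HMAC. The secret and payload are made-up values for illustration.

```typescript
import { createHmac } from 'node:crypto';

const secret = 'whsec_example'; // hypothetical secret for illustration
// The bytes exactly as a sender might transmit them (note the extra spaces).
const rawBody = '{"id": "evt_1",  "amount": 2000}';

const sign = (body: string): string =>
  createHmac('sha256', secret).update(body).digest('hex');

// What you get if a JSON body parser runs first and you re-serialize.
const reserialized = JSON.stringify(JSON.parse(rawBody));

console.log(reserialized === rawBody);             // false: whitespace was normalized
console.log(sign(reserialized) === sign(rawBody)); // false: the signature no longer matches
```

The same mismatch can come from key reordering or number formatting, which is why the verification step needs the untouched request bytes.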
Pattern 4: Dead Letter Queue
When processing fails after all retries, don't lose the event:
class WebhookProcessor {
async process(event: WebhookEvent): Promise<void> {
const MAX_INTERNAL_RETRIES = 3;
for (let attempt = 0; attempt < MAX_INTERNAL_RETRIES; attempt++) {
try {
await this.handleEvent(event);
await this.markProcessed(event.id);
return;
} catch (error) {
console.error(`Attempt ${attempt + 1} failed for event ${event.id}:`, error);
if (attempt < MAX_INTERNAL_RETRIES - 1) {
await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
}
}
}
// All retries failed — move to dead letter queue
await this.moveToDeadLetter(event);
}
private async moveToDeadLetter(event: WebhookEvent) {
await db.deadLetterQueue.create({
eventId: event.id,
eventType: event.type,
payload: event,
status: 'failed', // retryDeadLetterEvents() selects on this status
failedAt: new Date(),
retryCount: 0,
});
// Alert team
await alertSlack(`⚠️ Webhook event failed permanently: ${event.type} (${event.id})`);
}
}
// Admin tool: retry dead letter events
async function retryDeadLetterEvents() {
const failed = await db.deadLetterQueue.findAll({ status: 'failed' });
for (const item of failed) {
try {
await processor.handleEvent(item.payload);
await db.deadLetterQueue.update(item.id, { status: 'resolved' });
console.log(`Resolved dead letter event: ${item.eventId}`);
} catch (error) {
await db.deadLetterQueue.update(item.id, {
retryCount: item.retryCount + 1,
lastError: String(error),
});
}
}
}
Pattern 5: Event Ordering
Webhooks may arrive out of order. Handle this:
// Problem: "subscription.updated" arrives before "subscription.created"
// Solution: Use event timestamps and idempotent operations
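async function handleSubscriptionEvent(event: WebhookEvent) {
const subscription = event.data.object;
const eventTime = new Date(event.created * 1000);
// Only apply the event if it is newer than what we have. Putting the
// timestamp filter inside the upsert's `where` would send a stale event
// down the `create` path and fail on the duplicate ID.
// Run the read and write in one transaction if workers process events concurrently.
const existing = await db.subscriptions.findById(subscription.id);
if (existing && existing.updatedAt >= eventTime) return;
await db.subscriptions.upsert({
where: { id: subscription.id },
create: {
id: subscription.id,
status: subscription.status,
customerId: subscription.customer,
updatedAt: eventTime,
},
update: {
status: subscription.status,
updatedAt: eventTime,
},
});
}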
Pattern 6: Monitoring Webhook Health
class WebhookMonitor {
async recordEvent(eventId: string, type: string, status: 'received' | 'processed' | 'failed') {
await db.webhookMetrics.create({
eventId,
type,
status,
timestamp: new Date(),
});
}
async getHealth(hours: number = 24) {
const since = new Date(Date.now() - hours * 3600000);
const events = await db.webhookMetrics.findMany({
where: { timestamp: { gte: since } },
});
const received = events.filter(e => e.status === 'received').length;
const processed = events.filter(e => e.status === 'processed').length;
const failed = events.filter(e => e.status === 'failed').length;
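// Avoid NaN when no events were received in the window
const successRate = received > 0 ? processed / received : 1;
const failureRate = received > 0 ? failed / received : 0;
return {
received,
processed,
failed,
successRate,
failureRate,
alert: failureRate > 0.05 ? 'HIGH' : 'OK',
};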
}
}
Testing Webhooks
// Generate test webhook events locally
import Stripe from 'stripe';
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
// Stripe CLI for local testing
// stripe listen --forward-to localhost:3000/webhooks/stripe
// stripe trigger payment_intent.succeeded
// Programmatic test
test('handles payment succeeded webhook', async () => {
const event = {
id: 'evt_test_123',
type: 'payment_intent.succeeded',
created: Math.floor(Date.now() / 1000),
data: {
object: {
id: 'pi_test_456',
amount: 2000,
status: 'succeeded',
customer: 'cus_test_789',
},
},
};
const payload = JSON.stringify(event);
const signature = stripe.webhooks.generateTestHeaderString({
payload,
secret: WEBHOOK_SECRET,
});
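// app.inject is Fastify's in-process request helper; with Express, a library such as supertest fills the same role
const response = await app.inject({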
method: 'POST',
url: '/webhooks/stripe',
headers: {
'stripe-signature': signature,
'content-type': 'application/json',
},
body: payload,
});
expect(response.statusCode).toBe(200);
const order = await db.orders.findByPaymentIntent('pi_test_456');
expect(order.status).toBe('paid');
});
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Processing synchronously | Handler timeouts, missed events | Acknowledge fast, process async |
| No idempotency | Duplicate processing on retries | Check event ID before processing |
| Parsing body before signature check | Signature verification fails | Use raw body string for verification |
| No dead letter queue | Failed events lost forever | Store failed events for manual retry |
| Assuming event order | Race conditions, data inconsistency | Use timestamps, idempotent operations |
| No webhook monitoring | Don't know when things break | Track success/failure rates |
Choosing Your Webhook Infrastructure
The patterns above work at any scale, but your infrastructure choices depend on your volume and team size.
For small teams (< 1,000 webhooks/day): The database-backed queue shown in Pattern 1 is sufficient. Use your existing PostgreSQL or MySQL instance. Add an index on (status, receivedAt) for efficient queue polling. A simple cron job every minute picks up pending events. This requires no additional infrastructure.
For medium volume (1K–100K/day): Consider a dedicated job queue like BullMQ (Redis-backed) or a managed service like Inngest. BullMQ gives you priority queues, delayed retries, and job visibility with minimal configuration. Inngest provides a managed fan-out layer with a replay UI, which is particularly useful when you need to replay events after a bug fix without re-triggering the webhook source.
For high volume (100K+/day): Dedicated message queues (AWS SQS, Google Pub/Sub, or Kafka) become cost-effective. SQS FIFO queues provide exactly-once processing by deduplicating messages within a 5-minute window; that is not an end-to-end guarantee, so keep your handlers idempotent. Kafka is worth the operational overhead when you need event replay across multiple consumers or audit trails going back months.
One pattern worth adopting early: separate your webhook ingestion from your webhook processing. The ingestion layer (receive → verify → store → ack) should be a simple, fast endpoint that almost never fails. The processing layer (parse → handle → side effects) is where complexity lives. By keeping these decoupled, a processing bug never causes you to return 500 to the webhook sender, which avoids a cascade of retries from the provider.
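The ingestion/processing split described above can be sketched in a few lines. This version uses an in-memory Map as a stand-in for a database table, and the function names `ingest` and `processPending` are hypothetical. The point is only that a handler failure changes a stored status and never changes the response the provider sees.

```typescript
interface StoredEvent {
  id: string;
  type: string;
  payload: unknown;
  status: 'pending' | 'processed' | 'failed';
}

// Stand-in for a database table keyed by event ID.
const store = new Map<string, StoredEvent>();

// Ingestion: verify, store, acknowledge. No business logic runs here.
function ingest(event: { id: string; type: string }, signatureOk: boolean): number {
  if (!signatureOk) return 400;
  if (!store.has(event.id)) {
    store.set(event.id, { id: event.id, type: event.type, payload: event, status: 'pending' });
  }
  return 200; // duplicates are acknowledged without being stored twice
}

// Processing: runs separately (cron job, worker). Failures stay internal.
async function processPending(handle: (e: StoredEvent) => Promise<void>): Promise<void> {
  for (const stored of store.values()) {
    if (stored.status !== 'pending') continue;
    try {
      await handle(stored);
      stored.status = 'processed';
    } catch {
      stored.status = 'failed'; // shows up in monitoring, not as a 500 to the provider
    }
  }
}
```

In a real deployment the two halves would usually live in different processes, with the stored row as the only contract between them.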
Handling Provider-Specific Quirks
Each major webhook provider has idiosyncrasies worth knowing before you're debugging at 2am.
Stripe delivers events in roughly chronological order, but "roughly" is doing a lot of work in that sentence. If a customer's card is declined and then they update their card and pay successfully, you might receive payment_intent.payment_failed after payment_intent.succeeded. Always use the created timestamp on events, not arrival order, to determine the current state of an object. Stripe also resends events if you return a non-2xx after already processing them — your idempotency check must be bulletproof.
GitHub webhooks have a 10-second timeout, among the tightest of the major providers (only Shopify's 5 seconds in the table above is shorter). If you're processing GitHub events (push, pull request, check run), you must return a 2xx within 10 seconds or the delivery is marked failed. GitHub does not redeliver failed deliveries automatically; you redeliver them from the webhook's delivery log or through the REST API, so missed events do not self-heal the way they do within Stripe's 3-day retry window.
Shopify webhooks use HMAC-SHA256 over the raw body, but the HMAC in the X-Shopify-Hmac-SHA256 header is base64-encoded. Don't compare it against a hex digest; compare base64 strings (or the decoded bytes). Shopify also has a 5-second timeout, and a single store action can fire webhooks for several topics at once, so expect bursts of related deliveries.
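A minimal sketch of the base64 comparison for Shopify-style signatures follows. It decodes the header and compares raw bytes, which is equivalent to comparing base64 strings but lets `timingSafeEqual` do the comparison. The function name and the example secret are assumptions for illustration.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

function verifyShopifyHmac(rawBody: string, headerValue: string, secret: string): boolean {
  // Compute the HMAC as raw bytes rather than a hex string.
  const expected = createHmac('sha256', secret).update(rawBody, 'utf8').digest();
  // Decode the base64 header value into bytes.
  const received = Buffer.from(headerValue, 'base64');
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Usage with a locally generated header (hypothetical secret and body)
const secret = 'shpss_example';
const body = '{"id":1}';
const header = createHmac('sha256', secret).update(body, 'utf8').digest('base64');
console.log(verifyShopifyHmac(body, header, secret)); // true
```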
Svix (used by Clerk, Resend, and many others) standardizes webhook delivery with retries, portal UIs, and replay functionality. If you're building a platform where your customers receive webhooks from you, Svix is worth evaluating as a managed delivery layer rather than building retry infrastructure yourself.
Webhook Security Beyond Signature Verification
Signature verification is necessary but not sufficient for a production webhook system.
Allowlist provider IP ranges: Stripe, GitHub, and other major providers publish the IP ranges they use for outbound webhooks. Configure your firewall or load balancer to accept webhook requests only from these ranges, and refresh the list periodically because providers update it. This blocks crafted payloads from other origins even if an attacker has somehow obtained your webhook secret.
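If you cannot enforce the allowlist at the firewall, an application-level check is a workable fallback. The sketch below uses Node's built-in `net.BlockList`, which despite its name is just a set of address rules. The ranges are documentation-only example addresses, not any provider's real ranges.

```typescript
import { BlockList } from 'node:net';

// Hypothetical ranges taken from reserved documentation address blocks;
// load the provider's published list in a real deployment.
const allowedRanges = new BlockList();
allowedRanges.addSubnet('192.0.2.0', 24, 'ipv4');
allowedRanges.addAddress('198.51.100.7', 'ipv4');

function isAllowedSource(ip: string): boolean {
  return allowedRanges.check(ip, 'ipv4');
}

console.log(isAllowedSource('192.0.2.45'));  // true
console.log(isAllowedSource('203.0.113.9')); // false
```

Behind a proxy or load balancer, make sure the address you check is the real client address and not the proxy's, otherwise every request will look like it comes from the same source.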
Rate limiting per provider: Your webhook endpoint should have rate limiting specific to each provider. Stripe sends at most a few events per second under normal conditions; a sudden spike of 500 requests/second from "Stripe's IP range" is suspicious. A misconfigured client or an attacker can cause a DDoS via webhook flooding.
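One common way to implement a per-provider limit is a token bucket, sketched here in memory. The burst size of 20 and sustained rate of 10 requests per second are assumed numbers for illustration; tune them to the traffic you actually observe from each provider.

```typescript
class TokenBucket {
  private capacity: number;
  private refillPerSecond: number;
  private tokens: number;
  private lastRefill: number;

  constructor(capacity: number, refillPerSecond: number, now: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true if the request may proceed, false if the bucket is empty.
  tryTake(now: number = Date.now()): boolean {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

const buckets = new Map<string, TokenBucket>();

function allowRequest(provider: string, now: number = Date.now()): boolean {
  let bucket = buckets.get(provider);
  if (!bucket) {
    bucket = new TokenBucket(20, 10, now); // assumed limits: burst of 20, 10 req/s sustained
    buckets.set(provider, bucket);
  }
  return bucket.tryTake(now);
}
```

When the limiter rejects a request, returning 429 or 503 lets a legitimate provider retry later, so a false positive delays an event rather than losing it.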
Secrets rotation: Rotate webhook secrets on a schedule (every 90 days is reasonable) and after any suspected breach. Most providers support rotating secrets with a brief dual-validation window so you can update your handler without dropping events. Store webhook secrets in your secrets manager (AWS Secrets Manager, HashiCorp Vault, or similar), not in environment variables baked into build artifacts.
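On the receiving side, the dual-validation window mentioned above amounts to accepting a signature that matches either the current or the previous secret. Here is a minimal sketch assuming a hex-encoded HMAC-SHA256 signature over the raw body; the function names are hypothetical, and real providers add details such as timestamps to the signed payload.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

function matchesSecret(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Accept the current secret and, during the rotation window, the previous one.
function verifyWithRotation(rawBody: string, signatureHex: string, secrets: string[]): boolean {
  return secrets.some((secret) => matchesSecret(rawBody, signatureHex, secret));
}

// Usage: an event signed with the old secret still verifies mid-rotation.
const body = '{"id":"evt_1"}';
const signedWithOld = createHmac('sha256', 'old-secret').update(body).digest('hex');
console.log(verifyWithRotation(body, signedWithOld, ['new-secret', 'old-secret'])); // true
```

Once the rotation window closes, drop the old secret from the list so a leaked value stops being accepted.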
Endpoint URL hygiene: Avoid putting provider names or internal routing details in your webhook URL path. /webhooks/stripe tells an attacker exactly what to craft. /api/events/inbound or /api/hooks/p1 with provider identification done via header is less informative to an attacker probing your endpoints. This is security through obscurity and should never replace signature verification, but it's a cheap layer of defense.
Replay detection window: Beyond the 5-minute timestamp tolerance in the signature verification code above, consider tracking event IDs in a cache (Redis with a 24-hour TTL works well) and rejecting duplicate IDs immediately. This is different from your idempotency check in the processing layer — this check happens at the ingestion layer and prevents even storing the event twice, which matters when you're tracking storage costs or auditing event volumes.
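The ingestion-layer duplicate check can be sketched with an in-memory map standing in for Redis. The class name is hypothetical, and the sketch does not prune expired entries, which a real cache would handle for you. With Redis, a single `SET` with the `NX` and `EX` options gives the same check-and-mark behavior atomically.

```typescript
class SeenEventCache {
  private expiresAt = new Map<string, number>();
  private ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time an ID is seen within the TTL, false for replays.
  markIfNew(eventId: string, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(eventId);
    if (expiry !== undefined && expiry > now) return false;
    this.expiresAt.set(eventId, now + this.ttlMs);
    return true;
  }
}

const seen = new SeenEventCache();
console.log(seen.markIfNew('evt_1')); // true
console.log(seen.markIfNew('evt_1')); // false
```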
Observability for Webhook-Heavy Systems
If webhooks are central to your business logic (e.g., payment confirmations, subscription lifecycle), you need dedicated observability beyond application logs.
Track four key metrics over rolling 1-hour and 24-hour windows: events received, events processed successfully, events failed (after dead letter), and processing lag (time from receivedAt to processedAt). Processing lag is the most actionable metric — a rising lag means your workers are falling behind, which can compound during high-traffic periods.
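Processing lag is straightforward to compute from the two timestamps. The sketch below reports a nearest-rank percentile over a window of processed events; the function name and the choice of percentile method are assumptions, and a metrics backend would normally do this aggregation for you.

```typescript
interface ProcessedEvent {
  receivedAt: Date;
  processedAt: Date;
}

function lagPercentileMs(events: ProcessedEvent[], percentile: number): number {
  if (events.length === 0) return 0;
  const lags = events
    .map((e) => e.processedAt.getTime() - e.receivedAt.getTime())
    .sort((a, b) => a - b);
  // Nearest-rank method: the smallest lag that covers the requested share of events.
  const rank = Math.ceil((percentile / 100) * lags.length);
  return lags[Math.max(0, rank - 1)];
}
```

Watching a high percentile such as p95 alongside the median helps separate a general slowdown from a few stuck events.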
Set up alerts that page on-call when: the failure rate exceeds 2% over a 1-hour window, processing lag exceeds 5 minutes for any event type, or the dead letter queue size exceeds a threshold. For payment events specifically, a zero-tolerance policy (any payment webhook in the dead letter queue triggers an alert) is appropriate — these events have direct revenue impact and manual intervention is almost always justified.
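The alert conditions above translate directly into a small rule check. In this sketch the `WindowStats` shape, the alert names, and the default dead letter threshold of 10 are assumptions; the 2% failure rate, 5-minute lag, and zero-tolerance payment rule come from the text.

```typescript
interface WindowStats {
  received: number;
  failed: number;
  maxLagMs: number;
  deadLetterSize: number;
  paymentEventsInDeadLetter: number;
}

function alertsFor(stats: WindowStats, deadLetterThreshold: number = 10): string[] {
  const alerts: string[] = [];
  // Guard against dividing by zero in a quiet window.
  const failureRate = stats.received === 0 ? 0 : stats.failed / stats.received;
  if (failureRate > 0.02) alerts.push('failure-rate');
  if (stats.maxLagMs > 5 * 60 * 1000) alerts.push('processing-lag');
  if (stats.deadLetterSize > deadLetterThreshold) alerts.push('dead-letter-size');
  // Zero tolerance: any payment event in the dead letter queue pages on-call.
  if (stats.paymentEventsInDeadLetter > 0) alerts.push('payment-dead-letter');
  return alerts;
}
```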
Many teams use a webhook dashboard — a simple internal page showing event type counts, recent failures, and queue depth — so support and operations teams can self-serve answers during incidents without needing database access.
Methodology
Retry schedules in the provider table are sourced from each provider's official documentation as of early 2026: Stripe's webhook documentation, GitHub's webhook payload documentation, Twilio's webhooks guide, and Shopify's webhook authentication reference. Exact retry counts and schedules change — always verify against your provider's current docs before relying on specific values. The timingSafeEqual pattern for HMAC comparison prevents timing-based side-channel attacks and is recommended by OWASP for all secret comparison operations. BullMQ v5.x and Inngest v3.x are the current major versions as of 2026.