The Art of API Migration: Switching Providers Without Downtime

Switching API providers is the project nobody wants. It's risky, time-consuming, and usually triggered by something painful — a price hike, an outage, a deprecation notice. But done right, a migration can be smooth, zero-downtime, and even improve your system. Here's the playbook.

Why Companies Migrate

Trigger	Frequency	Urgency
Price increase	Common	Medium — negotiate first
Better alternative exists	Common	Low — plan carefully
Reliability issues	Occasional	High — after major incident
Feature gap	Occasional	Medium — evaluate alternatives
Acquisition/deprecation	Rare	High — forced migration
Compliance requirement	Rare	High — regulatory deadline
Vendor lock-in escape	Occasional	Low — strategic decision

The Migration Playbook

Phase 1: Assessment (1-2 weeks)

Before writing any code, answer these questions:

## Migration Assessment Checklist

### Current State
- [ ] Document all endpoints you use (not all available — just what you call)
- [ ] List all data stored with the current provider
- [ ] Map all webhook handlers and event types
- [ ] Identify SDK usage across your codebase
- [ ] Check contractual obligations (notice period, data export rights)
- [ ] Measure current performance baselines (latency, uptime, error rate)

### Target State
- [ ] Verify feature parity for YOUR use cases
- [ ] Test target provider's API with your actual data shapes
- [ ] Compare pricing at your usage level (not just list price)
- [ ] Check SDK quality (types, error handling, documentation)
- [ ] Verify compliance requirements (SOC 2, GDPR, etc.)

### Migration Scope
- [ ] Estimate code changes (endpoints, models, error handling)
- [ ] Identify data migration needs (users, subscriptions, history)
- [ ] List integration points (webhooks, SDKs, admin dashboards)
- [ ] Assess team training needs
- [ ] Set rollback criteria
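
For the "Identify SDK usage across your codebase" item, a rough scan of import statements is often enough to size the work. The sketch below is illustrative only: `findSdkUsage` and the file map are hypothetical names, and the regular expression covers plain `import` and `require` forms but not dynamic imports or re-exports.

```typescript
// Hypothetical helper: given file contents keyed by path, list every import
// of the provider's packages (including subpaths like '@vendor/pkg/sub').
interface SdkUsage {
  file: string;
  specifier: string;
}

function findSdkUsage(
  files: Record<string, string>,
  packages: string[],
): SdkUsage[] {
  const usages: SdkUsage[] = [];
  for (const file of Object.keys(files)) {
    // A fresh regex per file keeps lastIndex state from leaking between files.
    const pattern = /(?:from\s+|require\(\s*|import\s+)['"]([^'"]+)['"]/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(files[file])) !== null) {
      const specifier = match[1];
      // Match the package itself or any subpath, but not lookalike names.
      const hit = packages.some(
        (pkg) => specifier === pkg || specifier.indexOf(pkg + '/') === 0,
      );
      if (hit) {
        usages.push({ file, specifier });
      }
    }
  }
  return usages;
}
```

In practice you would feed this from a directory walk; the count of distinct files it returns is a reasonable first input to the "Estimate code changes" item.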

Phase 2: Abstraction Layer (1 week)

If you don't already have one, add an abstraction layer:

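import sgMail from '@sendgrid/mail';
import { Resend } from 'resend';

// Assumption: API keys are supplied through these environment variables
sgMail.setApiKey(process.env.SENDGRID_API_KEY ?? '');
const resend = new Resend(process.env.RESEND_API_KEY);

// Shared parameter type so every implementation is checked against it
interface SendEmailParams {
  to: string;
  subject: string;
  html: string;
  from?: string;
}

type EmailStatus = 'delivered' | 'bounced' | 'pending';

// Create an interface that abstracts the provider
interface EmailProvider {
  sendEmail(params: SendEmailParams): Promise<{ id: string }>;
  getEmailStatus(id: string): Promise<EmailStatus>;
}

// Current provider implementation
class SendGridProvider implements EmailProvider {
  async sendEmail(params: SendEmailParams): Promise<{ id: string }> {
    const [response] = await sgMail.send({
      to: params.to,
      from: params.from || 'hello@company.com',
      subject: params.subject,
      html: params.html,
    });
    // SendGrid returns the message ID in a response header
    return { id: response.headers['x-message-id'] };
  }

  async getEmailStatus(id: string) { /* ... */ }
}

// New provider implementation (write alongside, don't replace yet)
class ResendProvider implements EmailProvider {
  async sendEmail(params: SendEmailParams): Promise<{ id: string }> {
    const { data, error } = await resend.emails.send({
      to: params.to,
      from: params.from || 'hello@company.com',
      subject: params.subject,
      html: params.html,
    });
    // Resend reports failures in the result object instead of throwing,
    // so surface them here rather than asserting that data is non-null
    if (error || !data) {
      throw new Error(`Resend send failed: ${error?.message ?? 'no data returned'}`);
    }
    return { id: data.id };
  }

  async getEmailStatus(id: string) { /* ... */ }
}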

Key principle: Make the switch a configuration change, not a code change.
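
One way to make that principle concrete is a small factory keyed by a configuration value. This is a minimal sketch, not a prescribed design: the `EMAIL_PROVIDER` setting, `ProviderName`, and `createEmailProvider` are hypothetical names, and the stub classes stand in for the real provider implementations.

```typescript
// Stand-in interface and stubs so the sketch runs on its own.
interface EmailProvider {
  readonly name: string;
}

class StubSendGridProvider implements EmailProvider {
  readonly name = 'sendgrid';
}

class StubResendProvider implements EmailProvider {
  readonly name = 'resend';
}

type ProviderName = 'sendgrid' | 'resend';

// Registry of constructors: adding a provider means adding one entry here.
const PROVIDERS: Record<ProviderName, () => EmailProvider> = {
  sendgrid: () => new StubSendGridProvider(),
  resend: () => new StubResendProvider(),
};

function isProviderName(value: string): value is ProviderName {
  return value === 'sendgrid' || value === 'resend';
}

// Reads the provider name from a config object (for example process.env).
// Falls back to the current provider when the setting is absent, and fails
// loudly on an unknown value so a typo cannot silently reroute traffic.
function createEmailProvider(
  config: Record<string, string | undefined>,
): EmailProvider {
  const requested = config['EMAIL_PROVIDER'] ?? 'sendgrid';
  if (!isProviderName(requested)) {
    throw new Error(`Unknown EMAIL_PROVIDER: ${requested}`);
  }
  return PROVIDERS[requested]();
}
```

Application code asks the factory for a provider once at startup and never imports a vendor SDK directly, so the switch (and the rollback) is a one-line config edit.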

Phase 3: Parallel Running (1-2 weeks)

Run both providers simultaneously to verify behavior:

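class DualEmailProvider implements EmailProvider {
  constructor(
    private primary: EmailProvider,   // Current (SendGrid)
    private secondary: EmailProvider, // New (Resend)
    private shadowPercent: number = 10, // % of traffic to shadow
  ) {}

  async sendEmail(params: Parameters<EmailProvider['sendEmail']>[0]) {
    // Always send through primary
    const result = await this.primary.sendEmail(params);

    // Shadow send through secondary for a sample of traffic.
    // Not awaited, so shadow latency and failures never affect the caller.
    if (Math.random() * 100 < this.shadowPercent) {
      void this.shadowSend(params, result);
    }

    return result;
  }

  // Status lookups go to the primary, which issued the ID callers hold
  getEmailStatus(id: string) {
    return this.primary.getEmailStatus(id);
  }

  private async shadowSend(
    params: Parameters<EmailProvider['sendEmail']>[0],
    primaryResult: { id: string },
  ) {
    try {
      const shadowResult = await this.secondary.sendEmail({
        ...params,
        to: `shadow-test+${Date.now()}@company.com`, // Don't email real users!
      });
      this.logComparison(primaryResult, shadowResult);
    } catch (error) {
      this.logShadowError(error);
    }
  }

  private logComparison(primary: { id: string }, secondary: { id: string }) {
    // Compare response times, formats, behavior
    console.log('Shadow comparison:', { primary, secondary });
  }

  private logShadowError(error: unknown) {
    // Shadow failures are logged, never rethrown
    console.error('Shadow send failed:', error);
  }
}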

Shadow testing rules:

- Never send shadow traffic to real users
- Use test/sandbox endpoints or internal addresses
- Compare response formats, latency, error handling
- Run for at least 1 week before switching
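
To compare "response formats, latency, error handling" you need to record the same facts for both sides of each shadowed call. A minimal sketch, assuming hypothetical `CallOutcome` and `ShadowDiff` shapes (neither comes from a provider SDK):

```typescript
// One side of a shadow comparison.
interface CallOutcome {
  ok: boolean;       // did the call succeed?
  latencyMs: number; // wall-clock duration of the call
  hasId: boolean;    // did the response include a message id?
}

interface ShadowDiff {
  latencyDeltaMs: number;   // shadow minus primary; positive = shadow slower
  outcomeMismatch: boolean; // one side failed while the other succeeded
  shapeMismatch: boolean;   // both succeeded but only one returned an id
}

// Wraps any send call and records success, duration, and response shape.
async function measure(
  call: () => Promise<{ id?: string }>,
): Promise<CallOutcome> {
  const start = Date.now();
  try {
    const res = await call();
    return {
      ok: true,
      latencyMs: Date.now() - start,
      hasId: typeof res.id === 'string',
    };
  } catch (error) {
    return { ok: false, latencyMs: Date.now() - start, hasId: false };
  }
}

// Pure comparison, easy to unit test and to aggregate over a week of traffic.
function compareOutcomes(primary: CallOutcome, shadow: CallOutcome): ShadowDiff {
  const bothOk = primary.ok && shadow.ok;
  return {
    latencyDeltaMs: shadow.latencyMs - primary.latencyMs,
    outcomeMismatch: primary.ok !== shadow.ok,
    shapeMismatch: bothOk && primary.hasId !== shadow.hasId,
  };
}
```

Logging one `ShadowDiff` per shadowed request gives you a dataset to review at the end of the week, rather than a stream of raw responses to eyeball.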

Phase 4: Data Migration

// Data migration depends on category:

// PAYMENT MIGRATION (Stripe → other)
// Most complex — must migrate:
// - Customer records
// - Payment methods (usually NOT portable — re-collect)
// - Subscription data
// - Transaction history (for your records, not the new provider)

// EMAIL MIGRATION (SendGrid → Resend)
// Moderate — migrate:
// - DNS records (SPF, DKIM, DMARC)
// - Sender verification
// - Template mappings
// - Suppression lists (bounces, unsubscribes, spam complaints)

// AUTH MIGRATION (Auth0 → Clerk)
// Complex — migrate:
// - User accounts (password hashes may not be portable)
// - Social connections
// - MFA settings
// - Session management
// - RBAC policies

// SEARCH MIGRATION (Algolia → Typesense)
// Moderate — migrate:
// - Index data (re-index from your database)
// - Search configuration (relevance, synonyms, filters)
// - API query format changes

Phase 5: Traffic Cutover

// Gradual traffic shift using feature flags
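class MigratingEmailProvider implements EmailProvider {
  constructor(
    private old: EmailProvider,
    private new_: EmailProvider,
  ) {}

  async sendEmail(params: Parameters<EmailProvider['sendEmail']>[0]) {
    // Feature flag controls rollout.
    // featureFlag and getPhase are placeholders for your flag service.
    const useNew = await featureFlag.isEnabled('use-resend', {
      percent: getPhase(), // 0% → 10% → 25% → 50% → 100%
    });

    if (useNew) {
      try {
        return await this.new_.sendEmail(params);
      } catch (error) {
        // Fallback to old provider on error during migration
        console.error('New provider failed, falling back:', error);
        return await this.old.sendEmail(params);
      }
    }

    return await this.old.sendEmail(params);
  }

  // During rollout an ID may have come from either provider.
  // Assumption: a provider rejects IDs it did not issue, so try new, then old.
  async getEmailStatus(id: string) {
    try {
      return await this.new_.getEmailStatus(id);
    } catch (error) {
      return await this.old.getEmailStatus(id);
    }
  }
}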

// Rollout schedule:
// Day 1: 0% (shadow testing only)
// Day 3: 10% (early adopters, monitor closely)
// Day 5: 25% (broader testing)
// Day 7: 50% (half traffic)
// Day 10: 100% (full migration)
// Day 40: Remove old provider code (after 30 days at 100% with no rollback)
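
If you implement the percentage check yourself rather than relying on a flag service, one common approach is to hash a stable key so the same recipient always lands on the same side of the split. A minimal sketch, assuming hypothetical helper names (`fnv1a`, `inRollout`) and FNV-1a as the hash; a real flag service may bucket differently.

```typescript
// FNV-1a, 32-bit, applied to UTF-16 code units. That is not byte-exact
// FNV-1a for non-ASCII input, but it is stable, which is all bucketing needs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Maps the key to a bucket in [0, 100) and admits buckets below `percent`.
// Raising the percentage only ever adds keys; nobody flips back to the old
// provider as the rollout advances.
function inRollout(key: string, percent: number): boolean {
  if (percent <= 0) return false;
  if (percent >= 100) return true;
  return fnv1a(key) % 100 < percent;
}
```

Compared with `Math.random()`, this keeps a given user's experience consistent across requests, which makes provider-specific bug reports much easier to trace.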

Phase 6: Cleanup

## Post-Migration Checklist

- [ ] Old provider SDK removed from dependencies
- [ ] Old provider env vars removed
- [ ] Feature flags cleaned up
- [ ] Old webhook endpoints decommissioned
- [ ] DNS records updated (email: SPF/DKIM)
- [ ] Monitoring updated for new provider
- [ ] Team documentation updated
- [ ] Old provider account downgraded or closed
- [ ] Data export from old provider (for records)
- [ ] Runbook updated with new provider procedures

Category-Specific Migration Guides

Payment Provider Migration

Difficulty: Very High

Key challenges:
- Payment methods are usually not portable (plan to re-collect cards unless both providers support a PCI-compliant card data transfer)
- Active subscriptions need careful handling
- PCI compliance during transition
- Financial reconciliation

Approach:
1. New users → new provider immediately
2. Existing users → dual-write during transition
3. Subscription renewal → migrate at next billing cycle
4. Communicate to customers about re-entering payment info

Auth Provider Migration

Difficulty: High

Key challenges:
- Password hashes may use different algorithms
- Social connection tokens need re-authorization
- Active sessions during cutover
- MFA device re-enrollment

Approach:
1. Bulk import users (most auth providers support this)
2. Force password reset for users with non-portable hashes
3. Social logins: re-link on next login
4. Cut over login page, not sessions (existing sessions stay valid)
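
Steps 1 and 2 amount to splitting the exported user list by whether the target can verify each stored hash. A minimal sketch with hypothetical types; the list of algorithms the target accepts is an input you would take from that provider's import documentation, not something assumed here.

```typescript
interface ExportedUser {
  id: string;
  hashAlgorithm: string; // as reported by the old provider's export
}

interface PasswordMigrationPlan {
  importAsIs: string[]; // user IDs whose hashes the target can verify
  forceReset: string[]; // user IDs that need a password reset
}

function planPasswordMigration(
  users: ExportedUser[],
  supportedByTarget: string[],
): PasswordMigrationPlan {
  // Normalise case so 'BCRYPT' and 'bcrypt' are treated alike.
  const supported = supportedByTarget.map((algo) => algo.toLowerCase());
  const plan: PasswordMigrationPlan = { importAsIs: [], forceReset: [] };
  for (const user of users) {
    if (supported.indexOf(user.hashAlgorithm.toLowerCase()) !== -1) {
      plan.importAsIs.push(user.id);
    } else {
      plan.forceReset.push(user.id);
    }
  }
  return plan;
}
```

The size of `forceReset` is worth knowing early: it determines how much customer communication the migration needs.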

Email Provider Migration

Difficulty: Medium

Key challenges:
- DNS propagation for SPF/DKIM
- IP reputation with new provider
- Suppression list transfer
- Template format differences

Approach:
1. Set up DNS records for new provider alongside old
2. Warm up new provider's sending reputation
3. Import suppression lists
4. Migrate templates
5. Switch traffic gradually
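
Step 2 (warm-up) is usually run as a schedule of daily sending caps that grows until it reaches your normal volume. The sketch below is illustrative only: the starting volume, the doubling factor, and the function name `warmUpSchedule` are assumptions, and your new provider's own warm-up guidance should take precedence.

```typescript
// Returns one daily cap per day, growing geometrically until the target
// volume is reached. The final entry is always the target itself.
function warmUpSchedule(
  startPerDay: number,
  targetPerDay: number,
  growthFactor: number = 2,
): number[] {
  if (startPerDay <= 0) throw new Error('startPerDay must be positive');
  if (growthFactor <= 1) throw new Error('growthFactor must be greater than 1');
  const caps: number[] = [];
  let cap = startPerDay;
  while (cap < targetPerDay) {
    caps.push(Math.floor(cap));
    cap *= growthFactor;
  }
  caps.push(targetPerDay);
  return caps;
}

// Example: ramp from 500/day to 10,000/day by doubling.
const exampleRamp = warmUpSchedule(500, 10000);
// exampleRamp is [500, 1000, 2000, 4000, 8000, 10000]
```

The length of the returned array is the number of days the old provider must keep carrying the remainder of your traffic.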

Search Provider Migration

Difficulty: Medium

Key challenges:
- Query syntax differences
- Relevance tuning needs re-work
- Re-indexing all data
- Search analytics continuity

Approach:
1. Re-index from your source of truth (database, not old index)
2. A/B test search quality before full switch
3. Map old query syntax to new
4. Monitor search metrics after switch
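
For step 2, a cheap first signal before a full A/B test is how much the top results from the two engines overlap for the same query. This is a rough proxy rather than a relevance metric, and `topKOverlap` is a hypothetical helper that assumes result IDs are unique within each list.

```typescript
// Fraction of the top-k results shared by both engines, ignoring order.
// Returns a value in [0, 1]; 1 means the same k documents came back.
function topKOverlap(oldIds: string[], newIds: string[], k: number): number {
  const oldTop = oldIds.slice(0, k);
  const newTop = newIds.slice(0, k);
  if (oldTop.length === 0 && newTop.length === 0) {
    return 1; // both engines agree there is nothing to show
  }
  const shared = oldTop.filter((id) => newTop.indexOf(id) !== -1).length;
  return shared / Math.max(oldTop.length, newTop.length);
}
```

Running this over a sample of real queries and sorting by lowest overlap gives you a short list of queries to inspect by hand; low overlap is not necessarily worse, only different.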

Rollback Plan

Every migration needs a rollback plan:

// Rollback criteria (define BEFORE starting)
const ROLLBACK_CRITERIA = {
  errorRate: 0.05,      // >5% error rate
  latencyP99: 2000,     // >2s P99 latency
  downtime: 60,         // >60 seconds downtime
  dataLoss: 0,          // Any data loss = immediate rollback
};

// Rollback procedure
async function rollback() {
  // 1. Switch feature flag to 0% (all traffic to old provider).
  //    Use the same flag key that gates the cutover ('use-resend' in the email example).
  await featureFlag.disable('use-resend');

  // 2. Verify old provider is handling traffic
  await healthCheck.verify('old-provider');

  // 3. Alert team
  await alert('API migration rolled back — investigating');

  // 4. Do NOT delete new provider setup (may resume later)
}
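
The criteria are only useful if something checks observed metrics against them. A minimal sketch of that check, with a hypothetical `MigrationMetrics` type and helper names; where the observed values come from (your monitoring stack) is left out.

```typescript
interface MigrationMetrics {
  errorRate: number;  // fraction of failed requests, 0..1
  latencyP99: number; // milliseconds
  downtime: number;   // seconds
  dataLoss: number;   // count of lost records
}

// Same thresholds as the criteria above.
const ROLLBACK_CRITERIA: MigrationMetrics = {
  errorRate: 0.05,
  latencyP99: 2000,
  downtime: 60,
  dataLoss: 0,
};

// Returns the names of every criterion that has been exceeded.
// Comparisons are strict, so exactly 5% errors does not trigger a rollback.
function breachedCriteria(
  observed: MigrationMetrics,
  limits: MigrationMetrics = ROLLBACK_CRITERIA,
): Array<keyof MigrationMetrics> {
  const keys: Array<keyof MigrationMetrics> = [
    'errorRate',
    'latencyP99',
    'downtime',
    'dataLoss',
  ];
  return keys.filter((key) => observed[key] > limits[key]);
}

function shouldRollback(observed: MigrationMetrics): boolean {
  return breachedCriteria(observed).length > 0;
}
```

Returning the list of breached criteria, not just a boolean, means the rollback alert can say why it fired.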

Common Mistakes

Mistake	Impact	Fix
Big-bang cutover	All-or-nothing, no rollback	Gradual traffic shift
No abstraction layer	Migration requires changing every file	Build abstraction first
Skipping parallel running	Bugs found in production	Shadow test for 1+ week
Forgetting webhook migration	Missing events after switch	Migrate webhooks BEFORE cutover
Migrating data, not re-syncing	Stale data in new provider	Re-sync from source of truth
No rollback plan	Can't recover if migration fails	Define rollback criteria upfront
Rushing to delete old provider	No fallback if issues emerge	Keep old provider active for 30 days

Measuring Migration Success

A migration isn't done when the last line of old provider code is deleted — it's done when you've confirmed the new provider is performing at least as well as the old one across every dimension that matters. Define your success metrics before you start, so you're comparing against a baseline rather than guessing.

Performance metrics to track:

- Latency (P50, P95, P99): Capture these during your parallel running phase. If the new provider's P99 is 400ms and your old provider was 150ms, you need to understand why before completing the cutover. Network topology, region selection, and connection pooling all affect this.
- Error rate: Track 4xx and 5xx separately. A spike in 4xx errors often means API contract differences (request shapes or auth formats) that weren't caught in shadow testing. 5xx spikes usually mean the new provider is having capacity issues.
- Throughput: Can the new provider handle your peak load? Load test at 2x your typical peak before going to 100%.
- Cost per unit: Track cost per email sent, cost per authentication, cost per API call. The sticker price often differs from actual cost at your specific usage pattern.
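
The latency comparison can be sketched as follows, using the nearest-rank definition of a percentile. The helper names and the `maxRatio` tolerance are assumptions for illustration; monitoring tools may compute percentiles with a different interpolation method.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile() needs at least one sample');
  }
  if (p <= 0 || p > 100) {
    throw new Error('p must be in (0, 100]');
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

interface LatencySummary {
  p50: number;
  p95: number;
  p99: number;
}

function summarize(samples: number[]): LatencySummary {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}

// True when the candidate exceeds the baseline by more than `maxRatio`
// at any tracked percentile (for example 1.5 = up to 50% slower is fine).
function regressed(
  baseline: LatencySummary,
  candidate: LatencySummary,
  maxRatio: number,
): boolean {
  return (
    candidate.p50 > baseline.p50 * maxRatio ||
    candidate.p95 > baseline.p95 * maxRatio ||
    candidate.p99 > baseline.p99 * maxRatio
  );
}
```

With the figures from the latency item (150ms baseline P99 against 400ms on the new provider), `regressed` reports a problem at any tolerance below about 2.67x, even if the median looks healthy.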

Business metrics to watch for 30 days post-migration:

- User-facing error rates (login failures, payment failures, missed emails)
- Support tickets mentioning the affected feature
- Revenue impact for payment migrations

For payment migrations specifically, track authorization rates. A 2% drop in successful authorizations on a business processing $1M/month is roughly $20K/month in lost revenue, and it might not show up in technical monitoring.

Set a 30-day window where the old provider remains partially configured (even if at 0% traffic) so you can roll back without rebuilding the entire integration from scratch. Most migration postmortems involve an issue that appeared two weeks after cutover, not two days.

The True Cost of Migration

Engineers consistently underestimate migration costs. The API integration work is visible and estimable. Everything else is hidden.

Hidden costs that appear late:

- DNS propagation delays: SPF/DKIM changes for email providers can take 24-48 hours to be seen everywhere, depending on record TTLs and resolver caching. Plan for a window where some mail servers see old records and some see new ones.
- SDK updates across your codebase: If you're using the provider SDK directly (rather than an abstraction layer), you'll find it in more places than you expect. SDK method signatures, error types, and retry behavior all differ.
- Team retraining: Your on-call engineers need to know the new provider's dashboard, error codes, and support contact process. A 2am incident is the wrong time to learn that the new provider's status page is at a different URL.
- Documentation updates: Internal runbooks, architecture diagrams, deployment checklists, and data flow diagrams all reference the old provider. If you don't update these, the next engineer will be confused.
- Parallel running costs: During the shadow testing and gradual cutover phases, you're paying for both providers. For high-volume email or payments, this can be significant.
- Suppression list and data cleanup: Moving suppression lists, migrating templates, and re-verifying sender domains each take hours of focused work that isn't captured in "write the abstraction layer."

A realistic rule of thumb: the migration takes 3-4x longer than the initial estimate, and costs 2x more when you include parallel running, team time, and the inevitable issues that surface during cutover.

This math changes the ROI calculation. If you're migrating to save $500/month, and the migration costs 40 engineer-hours at $150/hour, you need 12 months just to break even — not counting the ongoing cost of maintaining a newer integration. Sometimes the right answer is to negotiate with your current provider rather than migrate.
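
The break-even arithmetic is simple enough to keep as a small function and rerun whenever the estimate changes. A minimal sketch; `breakEvenMonths` is a hypothetical helper, and the second call applies the low end of the 3-4x overrun rule of thumb to the same figures purely as an illustration.

```typescript
// Months of savings needed to pay back the engineering cost of migrating.
function breakEvenMonths(
  engineerHours: number,
  hourlyRate: number,
  monthlySavings: number,
): number {
  if (monthlySavings <= 0) {
    return Infinity; // a migration that saves nothing never pays back
  }
  return (engineerHours * hourlyRate) / monthlySavings;
}

// The example from the text: 40 hours at $150/hour to save $500/month.
const asEstimated = breakEvenMonths(40, 150, 500); // 12 months

// The same migration if the work runs 3x over the initial estimate.
const withOverrun = breakEvenMonths(40 * 3, 150, 500); // 36 months
```

Seen this way, a saving that looked like a one-year payback can become a three-year one, which is often longer than the period over which the pricing itself stays stable.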

When migration is clearly worth it:

- Reliability problems: If your provider is causing customer-visible outages, the cost of migration is offset by incident costs and customer churn prevention.
- Forced migration (deprecation or acquisition): You have no choice; optimize for speed and safety.
- 10x cost reduction: When your usage has scaled to the point where you're on enterprise pricing and a competitor offers equivalent service at a fraction of the cost.
- Strategic: Moving to a provider with better developer tooling, better uptime SLAs, or a feature roadmap that aligns with where your product is going.

One underappreciated option: use the migration threat as negotiating leverage. Many providers will match competitor pricing or add features to retain customers when they know you're actively evaluating alternatives. Before you commit engineering resources to a migration, schedule a call with your account manager and share your evaluation criteria. The conversation alone sometimes resolves the issue in a day rather than six weeks. If the negotiation fails, you've already done the comparative evaluation and you can proceed with full confidence that migration is the right call.

Another consideration: migration windows matter. Avoid migrating during high-traffic periods (Black Friday for e-commerce, fiscal quarter-end for B2B), during major product launches, or when key engineers are on leave. Build a 30-day buffer between "migration complete" and any high-stakes event. Even a smooth migration can surface latent issues that need engineering attention, and you want capacity to respond.

Methodology

The migration patterns in this guide are drawn from common practices documented across engineering blogs (Stripe, Twilio, Auth0), postmortem databases, and production API integrations across payment, email, auth, and search categories. Feature flag implementations reference LaunchDarkly and Unleash documentation. The traffic cutover schedule (0% → 10% → 25% → 50% → 100%) follows the gradual rollout pattern common in SRE practices, with timelines adjusted for low-volume vs. high-volume services.

For payment migrations specifically, Stripe's documentation on data migration and the Stripe Atlas engineering blog provide additional context on subscription portability and card re-collection requirements. Auth0's migration guides and Clerk's bulk import documentation cover the user import flow for auth provider migrations.

Rollback criteria values (5% error rate, 2s P99 latency, 60s downtime) are conservative starting points. Adjust based on your application's specific SLAs and user sensitivity. Teams with strict uptime SLAs (99.99%) should tighten these thresholds; early-stage teams with more tolerance for brief degradation can relax them. The key is defining the criteria before starting so that rollback decisions are made on data, not in-the-moment stress.

The six-phase playbook (Assessment → Abstraction Layer → Parallel Running → Data Migration → Traffic Cutover → Cleanup) is a general framework. Tier-0 migrations (a single API endpoint with no data storage) can compress phases 1-3 into a single day. Tier-3 migrations (payment providers with active subscriptions and millions of stored cards) can stretch to six months. Calibrate accordingly.

Compare API providers for easy migration on APIScout — feature parity checks, migration guides, and vendor comparison tools.
