API Migration Playbook: Switching Providers 2026
The Art of API Migration: Switching Providers Without Downtime
Switching API providers is the project nobody wants. It's risky, time-consuming, and usually triggered by something painful — a price hike, an outage, a deprecation notice. But done right, a migration can be smooth, zero-downtime, and even improve your system. Here's the playbook.
Why Companies Migrate
| Trigger | Frequency | Urgency |
|---|---|---|
| Price increase | Common | Medium — negotiate first |
| Better alternative exists | Common | Low — plan carefully |
| Reliability issues | Occasional | High — after major incident |
| Feature gap | Occasional | Medium — evaluate alternatives |
| Acquisition/deprecation | Rare | High — forced migration |
| Compliance requirement | Rare | High — regulatory deadline |
| Vendor lock-in escape | Occasional | Low — strategic decision |
The Migration Playbook
Phase 1: Assessment (1-2 weeks)
Before writing any code, answer these questions:
## Migration Assessment Checklist
### Current State
- [ ] Document all endpoints you use (not all available — just what you call)
- [ ] List all data stored with the current provider
- [ ] Map all webhook handlers and event types
- [ ] Identify SDK usage across your codebase
- [ ] Check contractual obligations (notice period, data export rights)
- [ ] Measure current performance baselines (latency, uptime, error rate)
### Target State
- [ ] Verify feature parity for YOUR use cases
- [ ] Test target provider's API with your actual data shapes
- [ ] Compare pricing at your usage level (not just list price)
- [ ] Check SDK quality (types, error handling, documentation)
- [ ] Verify compliance requirements (SOC 2, GDPR, etc.)
### Migration Scope
- [ ] Estimate code changes (endpoints, models, error handling)
- [ ] Identify data migration needs (users, subscriptions, history)
- [ ] List integration points (webhooks, SDKs, admin dashboards)
- [ ] Assess team training needs
- [ ] Set rollback criteria
Phase 2: Abstraction Layer (1 week)
If you don't already have one, add an abstraction layer:
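import sgMail from '@sendgrid/mail';
import { Resend } from 'resend';

// API key env var names are assumptions; use whatever your deployment defines
sgMail.setApiKey(process.env.SENDGRID_API_KEY ?? '');
const resend = new Resend(process.env.RESEND_API_KEY);

// Create an interface that abstracts the provider
interface EmailProvider {
  sendEmail(params: {
    to: string;
    subject: string;
    html: string;
    from?: string;
  }): Promise<{ id: string }>;
  getEmailStatus(id: string): Promise<'delivered' | 'bounced' | 'pending'>;
}

// Parameter type shared by every implementation
type SendEmailParams = Parameters<EmailProvider['sendEmail']>[0];

// Current provider implementation
class SendGridProvider implements EmailProvider {
  async sendEmail(params: SendEmailParams) {
    const [response] = await sgMail.send({
      to: params.to,
      from: params.from || 'hello@company.com',
      subject: params.subject,
      html: params.html,
    });
    return { id: response.headers['x-message-id'] };
  }
  async getEmailStatus(id: string) { /* ... */ }
}

// New provider implementation (write alongside, don't replace yet)
class ResendProvider implements EmailProvider {
  async sendEmail(params: SendEmailParams) {
    const { data, error } = await resend.emails.send({
      to: params.to,
      from: params.from || 'hello@company.com',
      subject: params.subject,
      html: params.html,
    });
    // Resend returns API failures in `error` rather than throwing
    if (error || !data) {
      throw new Error(`Resend send failed: ${error?.message ?? 'no data returned'}`);
    }
    return { id: data.id };
  }
  async getEmailStatus(id: string) { /* ... */ }
}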
Key principle: Make the switch a configuration change, not a code change.
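One way to make that principle concrete is a small factory that picks the implementation from a config value at startup. This is a minimal sketch, not a prescribed design: `createEmailProvider`, the registry keys, and `StubProvider` (a stand-in for the real provider classes so the example runs without SDKs) are all illustrative names.

```typescript
// Shape shared by all providers (a trimmed copy of the EmailProvider interface).
interface EmailProvider {
  sendEmail(params: { to: string; subject: string; html: string; from?: string }): Promise<{ id: string }>;
}

// Stand-in for SendGridProvider / ResendProvider (hypothetical, for illustration only).
class StubProvider implements EmailProvider {
  constructor(readonly name: string) {}
  async sendEmail(params: { to: string }): Promise<{ id: string }> {
    return { id: `${this.name}:${params.to}` };
  }
}

// Registry of available providers, keyed by the value stored in configuration.
const providerRegistry = new Map<string, () => EmailProvider>([
  ["sendgrid", () => new StubProvider("sendgrid")],
  ["resend", () => new StubProvider("resend")],
]);

// Reads the configured name (for example from an environment variable)
// and returns the matching provider. Defaults to the current provider.
function createEmailProvider(configured: string | undefined): EmailProvider {
  const key = (configured ?? "sendgrid").toLowerCase();
  const factory = providerRegistry.get(key);
  if (!factory) {
    throw new Error(`Unknown email provider "${key}"`);
  }
  return factory();
}
```

With this in place, switching providers means changing one setting and redeploying; call sites only ever see `EmailProvider`.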
Phase 3: Parallel Running (1-2 weeks)
Run both providers simultaneously to verify behavior:
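class DualEmailProvider implements EmailProvider {
  constructor(
    private primary: EmailProvider, // Current (SendGrid)
    private secondary: EmailProvider, // New (Resend)
    private shadowPercent: number = 10, // % of traffic to shadow
  ) {}

  async sendEmail(params: Parameters<EmailProvider['sendEmail']>[0]) {
    // Always send through primary
    const result = await this.primary.sendEmail(params);
    // Shadow a sample of traffic without awaiting it, so the secondary
    // can never slow down or fail the real send
    if (Math.random() * 100 < this.shadowPercent) {
      void this.shadowSend(params, result);
    }
    return result;
  }

  // Status lookups stay on the primary: it is the provider that sent the email
  getEmailStatus(id: string) {
    return this.primary.getEmailStatus(id);
  }

  private async shadowSend(
    params: Parameters<EmailProvider['sendEmail']>[0],
    primaryResult: { id: string },
  ) {
    try {
      const shadowResult = await this.secondary.sendEmail({
        ...params,
        to: `shadow-test+${Date.now()}@company.com`, // Don't email real users!
      });
      this.logComparison(primaryResult, shadowResult);
    } catch (error) {
      this.logShadowError(error);
    }
  }

  private logComparison(primary: { id: string }, secondary: { id: string }) {
    // Compare response times, formats, behavior
    console.log('Shadow comparison:', { primary, secondary });
  }

  private logShadowError(error: unknown) {
    console.error('Shadow send failed:', error);
  }
}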
Shadow testing rules:
- Never send shadow traffic to real users
- Use test/sandbox endpoints or internal addresses
- Compare response formats, latency, error handling
- Run for at least 1 week before switching
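Logging each comparison is only useful if someone aggregates the results. As a rough sketch, assuming you record one sample per shadowed request, the week of shadow traffic could be summarized like this. The `ShadowSample` shape and `summarizeShadow` are hypothetical names, not part of any SDK.

```typescript
// One record per shadowed request (hypothetical shape).
interface ShadowSample {
  primaryMs: number;    // latency of the current provider
  secondaryMs: number;  // latency of the new provider
  primaryOk: boolean;
  secondaryOk: boolean;
}

interface ShadowSummary {
  samples: number;
  secondaryErrorRate: number; // fraction of shadow sends that failed
  mismatches: number;         // one provider succeeded where the other failed
  meanLatencyDeltaMs: number; // secondary minus primary, over pairs where both succeeded
}

function summarizeShadow(samples: ShadowSample[]): ShadowSummary {
  let secondaryFailures = 0;
  let mismatches = 0;
  let deltaSum = 0;
  let pairs = 0;
  for (const s of samples) {
    if (!s.secondaryOk) secondaryFailures++;
    if (s.primaryOk !== s.secondaryOk) mismatches++;
    if (s.primaryOk && s.secondaryOk) {
      deltaSum += s.secondaryMs - s.primaryMs;
      pairs++;
    }
  }
  return {
    samples: samples.length,
    secondaryErrorRate: samples.length > 0 ? secondaryFailures / samples.length : 0,
    mismatches,
    meanLatencyDeltaMs: pairs > 0 ? deltaSum / pairs : 0,
  };
}
```

A nonzero `mismatches` count is usually the first thing to investigate, since it points at behavior differences rather than plain slowness.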
Phase 4: Data Migration
// Data migration depends on category:
// PAYMENT MIGRATION (Stripe → other)
// Most complex — must migrate:
// - Customer records
// - Payment methods (usually NOT portable — re-collect)
// - Subscription data
// - Transaction history (for your records, not the new provider)
// EMAIL MIGRATION (SendGrid → Resend)
// Moderate — migrate:
// - DNS records (SPF, DKIM, DMARC)
// - Sender verification
// - Template mappings
// - Suppression lists (bounced emails)
// AUTH MIGRATION (Auth0 → Clerk)
// Complex — migrate:
// - User accounts (password hashes may not be portable)
// - Social connections
// - MFA settings
// - Session management
// - RBAC policies
// SEARCH MIGRATION (Algolia → Typesense)
// Moderate — migrate:
// - Index data (re-index from your database)
// - Search configuration (relevance, synonyms, filters)
// - API query format changes
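To make one of these items concrete, here is a sketch of preparing a suppression list export for import. It assumes the export is a plain list of addresses and that the new provider accepts imports in batches. The batch size, the lowercasing rule, and `prepareSuppressionList` itself are assumptions for illustration; check the target provider's import limits before relying on them.

```typescript
// Normalizes, de-duplicates, and batches suppressed addresses.
// Lowercasing assumes both providers treat addresses case-insensitively.
function prepareSuppressionList(raw: string[], batchSize: number = 1000): string[][] {
  if (batchSize < 1) {
    throw new Error("batchSize must be at least 1");
  }
  const seen = new Set<string>();
  for (const entry of raw) {
    const email = entry.trim().toLowerCase();
    // Very light sanity check: skip entries that are obviously not addresses.
    if (email.indexOf("@") > 0) {
      seen.add(email);
    }
  }
  const unique = Array.from(seen);
  const batches: string[][] = [];
  for (let i = 0; i < unique.length; i += batchSize) {
    batches.push(unique.slice(i, i + batchSize));
  }
  return batches;
}
```

Importing the list before any real traffic reaches the new provider matters more than the exact batching: sending to previously bounced addresses is an easy way to damage a new sender reputation.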
Phase 5: Traffic Cutover
// Gradual traffic shift using feature flags
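// featureFlag and getPhase are placeholders for your feature-flag client (hypothetical signatures)
declare const featureFlag: {
  isEnabled(flag: string, options: { percent: number }): Promise<boolean>;
};
declare function getPhase(): number; // current rollout percentage

class MigratingEmailProvider implements EmailProvider {
  constructor(
    private old: EmailProvider,
    private new_: EmailProvider,
  ) {}

  async sendEmail(params: Parameters<EmailProvider['sendEmail']>[0]) {
    // Feature flag controls rollout
    const useNew = await featureFlag.isEnabled('use-resend', {
      percent: getPhase(), // 0% → 10% → 25% → 50% → 100%
    });
    if (useNew) {
      try {
        return await this.new_.sendEmail(params);
      } catch (error) {
        // Fallback to old provider on error during migration.
        // A timeout can mean the new provider did accept the email,
        // so this path may occasionally double-send.
        console.error('New provider failed, falling back:', error);
        return await this.old.sendEmail(params);
      }
    }
    return await this.old.sendEmail(params);
  }

  // During rollout an ID may belong to either provider: try the new one,
  // then the old one. Storing the provider name with each ID is more robust.
  async getEmailStatus(id: string) {
    try {
      return await this.new_.getEmailStatus(id);
    } catch {
      return await this.old.getEmailStatus(id);
    }
  }
}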
// Rollout schedule:
// Day 1: 0% (shadow testing only)
// Day 3: 10% (early adopters, monitor closely)
// Day 5: 25% (broader testing)
// Day 7: 50% (half traffic)
// Day 10: 100% (full migration)
// Day 40+: Remove old provider code (after the 30-day fallback window)
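A percentage rollout works best when assignment is sticky, so the same recipient or tenant stays on the same provider as the percentage grows. Many feature-flag services handle this internally. If yours does not, a hash-based bucket is one common approach. The sketch below uses an FNV-1a style hash over UTF-16 code units; `bucketFor` and `useNewProvider` are illustrative names.

```typescript
// FNV-1a style 32-bit hash: small, dependency-free, stable across processes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Maps a stable key (user ID, tenant ID, recipient) to a bucket in 0..99.
function bucketFor(key: string): number {
  return fnv1a(key) % 100;
}

// A key is routed to the new provider once the rollout percentage
// exceeds its bucket. Raising the percentage only ever adds keys.
function useNewProvider(key: string, rolloutPercent: number): boolean {
  return bucketFor(key) < rolloutPercent;
}
```

Because buckets never change, anyone routed to the new provider at 10% is still there at 25% and 50%, which keeps behavior consistent per user and makes incident reports easier to correlate with the rollout.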
Phase 6: Cleanup
## Post-Migration Checklist
- [ ] Old provider SDK removed from dependencies
- [ ] Old provider env vars removed
- [ ] Feature flags cleaned up
- [ ] Old webhook endpoints decommissioned
- [ ] DNS records updated (email: SPF/DKIM)
- [ ] Monitoring updated for new provider
- [ ] Team documentation updated
- [ ] Old provider account downgraded or closed
- [ ] Data export from old provider (for records)
- [ ] Runbook updated with new provider procedures
Category-Specific Migration Guides
Payment Provider Migration
Difficulty: Very High
Key challenges:
- Payment methods can't be transferred (cards must be re-collected)
- Active subscriptions need careful handling
- PCI compliance during transition
- Financial reconciliation
Approach:
1. New users → new provider immediately
2. Existing users → dual-write during transition
3. Subscription renewal → migrate at next billing cycle
4. Communicate to customers about re-entering payment info
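The routing rule behind steps 1 and 3 can be sketched as a single function. This assumes you track, per customer, whether a payment method has already been re-collected on the new provider. The `Customer` fields and `providerForCharge` are hypothetical names.

```typescript
type BillingProvider = "old" | "new";

interface Customer {
  signedUpAt: Date;
  hasPaymentMethodOnNew: boolean; // true once the card has been re-collected
}

// Decides which provider should process a customer's next charge.
function providerForCharge(customer: Customer, cutoverDate: Date): BillingProvider {
  // Step 1: customers who signed up on or after the cutover start on the new provider.
  if (customer.signedUpAt.getTime() >= cutoverDate.getTime()) {
    return "new";
  }
  // Step 3: existing customers move at the first renewal after re-collection;
  // until then they keep renewing on the old provider.
  return customer.hasPaymentMethodOnNew ? "new" : "old";
}
```

Keeping this decision in one place also gives finance a single rule to reconcile against while both providers are processing charges.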
Auth Provider Migration
Difficulty: High
Key challenges:
- Password hashes may use different algorithms
- Social connection tokens need re-authorization
- Active sessions during cutover
- MFA device re-enrollment
Approach:
1. Bulk import users (most auth providers support this)
2. Force password reset for users with non-portable hashes
3. Social logins: re-link on next login
4. Cut over login page, not sessions (existing sessions stay valid)
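Steps 1 and 2 amount to splitting the user export by hash portability. A minimal sketch, assuming the export includes each user's hash algorithm and that you have looked up which algorithms the target provider can import. The `ExportedUser` shape, `planUserImport`, and the algorithm names passed in are placeholders, not a statement about any specific provider.

```typescript
interface ExportedUser {
  id: string;
  hashAlgorithm: string | null; // null for social-only accounts or missing hashes
}

interface ImportPlan {
  importWithHash: string[]; // user IDs whose password hash can be carried over
  forceReset: string[];     // user IDs that need a password reset after import
}

// Splits users by whether the target provider can verify their existing hash.
function planUserImport(users: ExportedUser[], supportedAlgorithms: string[]): ImportPlan {
  const supported = new Set(supportedAlgorithms.map((a) => a.toLowerCase()));
  const plan: ImportPlan = { importWithHash: [], forceReset: [] };
  for (const user of users) {
    const algorithm = user.hashAlgorithm ? user.hashAlgorithm.toLowerCase() : null;
    if (algorithm !== null && supported.has(algorithm)) {
      plan.importWithHash.push(user.id);
    } else {
      plan.forceReset.push(user.id);
    }
  }
  return plan;
}
```

Running this against a real export early in the assessment phase tells you how many users will hit a forced reset, which is often the number that decides how the migration is communicated.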
Email Provider Migration
Difficulty: Medium
Key challenges:
- DNS propagation for SPF/DKIM
- IP reputation with new provider
- Suppression list transfer
- Template format differences
Approach:
1. Set up DNS records for new provider alongside old
2. Warm up new provider's sending reputation
3. Import suppression lists
4. Migrate templates
5. Switch traffic gradually
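For step 2, a warm-up plan is essentially a list of daily sending caps that grows until it reaches your normal volume. The sketch below generates one. The starting volume and the doubling factor are illustrative assumptions, not provider guidance; follow the new provider's own warm-up recommendations where they exist.

```typescript
// Returns daily sending caps, growing by `growthFactor` each day
// until the target volume is reached.
function warmupSchedule(startPerDay: number, targetPerDay: number, growthFactor: number = 2): number[] {
  if (startPerDay <= 0 || targetPerDay <= 0 || growthFactor <= 1) {
    throw new Error("volumes must be positive and growthFactor must exceed 1");
  }
  const caps: number[] = [];
  let cap = startPerDay;
  while (cap < targetPerDay) {
    caps.push(Math.floor(cap));
    cap *= growthFactor;
  }
  caps.push(targetPerDay); // final day runs at full volume
  return caps;
}

// Example: 500/day doubling up to 10,000/day
// warmupSchedule(500, 10000) → [500, 1000, 2000, 4000, 8000, 10000]
```

The caps line up naturally with the gradual traffic shift in step 5: the share of mail routed to the new provider on a given day should not exceed that day's cap.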
Search Provider Migration
Difficulty: Medium
Key challenges:
- Query syntax differences
- Relevance tuning needs re-work
- Re-indexing all data
- Search analytics continuity
Approach:
1. Re-index from your source of truth (database, not old index)
2. A/B test search quality before full switch
3. Map old query syntax to new
4. Monitor search metrics after switch
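For step 2, one simple offline check before a live A/B test is to replay logged queries against both engines and measure how much the top results overlap. The function below is a sketch of that metric, where `overlapAtK` is an illustrative name. Low overlap is not automatically bad, but it flags the queries worth reviewing by hand.

```typescript
// Fraction of top-k results shared by the old and new engines, in 0..1.
// Inputs are ordered lists of document IDs returned for the same query.
function overlapAtK(oldIds: string[], newIds: string[], k: number): number {
  if (k < 1) {
    throw new Error("k must be at least 1");
  }
  const oldTop = new Set(oldIds.slice(0, k));
  const newTop = new Set(newIds.slice(0, k));
  const larger = Math.max(oldTop.size, newTop.size);
  if (larger === 0) {
    return 1; // both engines returned nothing: treat as agreement
  }
  let shared = 0;
  newTop.forEach((id) => {
    if (oldTop.has(id)) shared++;
  });
  return shared / larger;
}
```

Averaging this over a sample of real queries gives a single number to track while you re-tune relevance on the new provider.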
Rollback Plan
Every migration needs a rollback plan:
// Rollback criteria (define BEFORE starting)
const ROLLBACK_CRITERIA = {
errorRate: 0.05, // >5% error rate
latencyP99: 2000, // >2s P99 latency
downtime: 60, // >60 seconds downtime
dataLoss: 0, // Any data loss = immediate rollback
};
// Rollback procedure
async function rollback() {
// 1. Switch feature flag to 0% (all traffic to old provider)
await featureFlag.disable('use-resend'); // same flag that controls the cutover
// 2. Verify old provider is handling traffic
await healthCheck.verify('old-provider');
// 3. Alert team
await alert('API migration rolled back — investigating');
// 4. Do NOT delete new provider setup (may resume later)
}
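The criteria are only useful if something checks them. A minimal sketch of that check, assuming your monitoring can produce the four observed values for a recent window. `ObservedMetrics` and `rollbackReasons` are illustrative names, and the thresholds are copied from `ROLLBACK_CRITERIA` above.

```typescript
// Thresholds copied from the rollback criteria above.
const ROLLBACK_LIMITS = {
  errorRate: 0.05,  // fraction of failed requests
  latencyP99: 2000, // milliseconds
  downtime: 60,     // seconds
  dataLoss: 0,      // records
};

interface ObservedMetrics {
  errorRate: number;
  latencyP99: number;
  downtime: number;
  dataLoss: number;
}

// Returns the names of every breached threshold; an empty array means "keep going".
function rollbackReasons(observed: ObservedMetrics, limits = ROLLBACK_LIMITS): string[] {
  const reasons: string[] = [];
  if (observed.errorRate > limits.errorRate) reasons.push("errorRate");
  if (observed.latencyP99 > limits.latencyP99) reasons.push("latencyP99");
  if (observed.downtime > limits.downtime) reasons.push("downtime");
  if (observed.dataLoss > limits.dataLoss) reasons.push("dataLoss");
  return reasons;
}
```

Returning the reasons rather than a bare boolean means the rollback alert can say which threshold tripped, which saves time in the first minutes of the investigation.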
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Big-bang cutover | All-or-nothing, no rollback | Gradual traffic shift |
| No abstraction layer | Migration requires changing every file | Build abstraction first |
| Skipping parallel running | Bugs found in production | Shadow test for 1+ week |
| Forgetting webhook migration | Missing events after switch | Migrate webhooks BEFORE cutover |
| Migrating data, not re-syncing | Stale data in new provider | Re-sync from source of truth |
| No rollback plan | Can't recover if migration fails | Define rollback criteria upfront |
| Rushing to delete old provider | No fallback if issues emerge | Keep old provider active for 30 days |
Measuring Migration Success
A migration isn't done when the last line of old provider code is deleted — it's done when you've confirmed the new provider is performing at least as well as the old one across every dimension that matters. Define your success metrics before you start, so you're comparing against a baseline rather than guessing.
Performance metrics to track:
- Latency (P50, P95, P99): Capture these during your parallel running phase. If the new provider's P99 is 400ms and your old provider was 150ms, you need to understand why before completing the cutover. Network topology, region selection, and connection pooling all affect this.
- Error rate: Track 4xx and 5xx separately. A spike in 4xx errors often means API contract differences — request shapes or auth formats — that weren't caught in shadow testing. 5xx spikes usually mean the new provider is having capacity issues.
- Throughput: Can the new provider handle your peak load? Load test at 2x your typical peak before going to 100%.
- Cost per unit: Track cost per email sent, cost per authentication, cost per API call. The sticker price often differs from actual cost at your specific usage pattern.
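If your monitoring stack does not already report these, the first two metrics are easy to compute from raw request logs. The sketch below uses the nearest-rank method for percentiles, which is one of several common definitions, and splits error rates by status class. `percentile` and `errorRates` are illustrative helpers.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. `p` is in (0, 100].
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("no samples");
  }
  if (p <= 0 || p > 100) {
    throw new Error("p must be in (0, 100]");
  }
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Splits error rate into client (4xx) and server (5xx) shares of all requests.
function errorRates(statusCodes: number[]): { clientErrorRate: number; serverErrorRate: number } {
  if (statusCodes.length === 0) {
    return { clientErrorRate: 0, serverErrorRate: 0 };
  }
  let client = 0;
  let server = 0;
  for (const code of statusCodes) {
    if (code >= 400 && code < 500) client++;
    else if (code >= 500 && code < 600) server++;
  }
  return {
    clientErrorRate: client / statusCodes.length,
    serverErrorRate: server / statusCodes.length,
  };
}
```

Compute both on the same time window for the old and new provider; comparing a quiet hour on one against a peak hour on the other will mislead you in either direction.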
Business metrics to watch for 30 days post-migration:
- User-facing error rates (login failures, payment failures, missed emails)
- Support tickets mentioning the affected feature
- Revenue impact for payment migrations
For payment migrations specifically, track authorization rates. A 2-percentage-point drop in authorization rate on a business processing $1M/month is roughly $20K/month in lost revenue, and it might not show up in technical monitoring.
Set a 30-day window where the old provider remains partially configured (even if at 0% traffic) so you can roll back without rebuilding the entire integration from scratch. Most migration postmortems involve an issue that appeared two weeks after cutover, not two days.
The True Cost of Migration
Engineers consistently underestimate migration costs. The API integration work is visible and estimable. Everything else is hidden.
Hidden costs that appear late:
- DNS propagation delays: SPF/DKIM changes for email providers can take 24-48 hours to fully propagate worldwide. Plan for a window where some mail servers see old records and some see new ones.
- SDK updates across your codebase: If you're using the provider SDK directly (rather than an abstraction layer), you'll find it in more places than you expect. SDK method signatures, error types, and retry behavior all differ.
- Team retraining: Your on-call engineers need to know the new provider's dashboard, error codes, and support contact process. A 2am incident is the wrong time to learn that the new provider's status page is at a different URL.
- Documentation updates: Internal runbooks, architecture diagrams, deployment checklists, and data flow diagrams all reference the old provider. If you don't update these, the next engineer will be confused.
- Parallel running costs: During the shadow testing and gradual cutover phases, you're paying for both providers. For high-volume email or payments, this can be significant.
- Suppression list and data cleanup: Moving suppression lists, migrating templates, re-verifying sender domains — each takes hours of focused work that isn't captured in "write the abstraction layer."
A realistic rule of thumb: the migration takes 3-4x longer than the initial estimate, and costs 2x more when you include parallel running, team time, and the inevitable issues that surface during cutover.
This math changes the ROI calculation. If you're migrating to save $500/month, and the migration costs 40 engineer-hours at $150/hour, you need 12 months just to break even — not counting the ongoing cost of maintaining a newer integration. Sometimes the right answer is to negotiate with your current provider rather than migrate.
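The break-even arithmetic is worth writing down so you can rerun it with your own numbers. A small sketch follows, where `breakEvenMonths` is an illustrative helper. Applying the 3x overrun to engineer-hours is an assumption about how the rule of thumb above translates into cost.

```typescript
// Months of savings needed to pay back the engineering cost of a migration.
function breakEvenMonths(engineerHours: number, hourlyRate: number, monthlySavings: number): number {
  if (monthlySavings <= 0) {
    return Infinity; // no savings means the migration never pays for itself
  }
  return (engineerHours * hourlyRate) / monthlySavings;
}

// The example from the text: 40 hours at $150/hour, saving $500/month.
const estimated = breakEvenMonths(40, 150, 500); // 6000 / 500 = 12

// Same migration if the work takes 3x the initial estimate (assumption).
const withOverrun = breakEvenMonths(40 * 3, 150, 500); // 18000 / 500 = 36
```

At three years to break even, the overrun case is the one that makes negotiating with the current provider look attractive.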
When migration is clearly worth it:
- Reliability problems: If your provider is causing customer-visible outages, the cost of migration is offset by incident costs and customer churn prevention.
- Forced migration (deprecation or acquisition): You have no choice; optimize for speed and safety.
- 10x cost reduction: When your usage has scaled to the point where you're on enterprise pricing and a competitor offers equivalent service at a fraction of the cost.
- Strategic: Moving to a provider with better developer tooling, better uptime SLAs, or a feature roadmap that aligns with where your product is going.
One underappreciated option: use the migration threat as negotiating leverage. Many providers will match competitor pricing or add features to retain customers when they know you're actively evaluating alternatives. Before you commit engineering resources to a migration, schedule a call with your account manager and share your evaluation criteria. The conversation alone sometimes resolves the issue in a day rather than six weeks. If the negotiation fails, you've already done the comparative evaluation and you can proceed with full confidence that migration is the right call.
Another consideration: migration windows matter. Avoid migrating during high-traffic periods (Black Friday for e-commerce, fiscal quarter-end for B2B), during major product launches, or when key engineers are on leave. Build a 30-day buffer between "migration complete" and any high-stakes event. Even a smooth migration can surface latent issues that need engineering attention, and you want capacity to respond.
Methodology
The migration patterns in this guide are drawn from common practices documented across engineering blogs (Stripe, Twilio, Auth0), postmortem databases, and production API integrations across payment, email, auth, and search categories. Feature flag implementations reference LaunchDarkly and Unleash documentation. The traffic cutover schedule (0% → 10% → 25% → 50% → 100%) follows the gradual rollout pattern common in SRE practices, with timelines adjusted for low-volume vs. high-volume services.
For payment migrations specifically, Stripe's documentation on data migration and the Stripe Atlas engineering blog provide additional context on subscription portability and card re-collection requirements. Auth0's migration guides and Clerk's bulk import documentation cover the user import flow for auth provider migrations.
Rollback criteria values (5% error rate, 2s P99 latency, 60s downtime) are conservative starting points. Adjust based on your application's specific SLAs and user sensitivity. Teams with strict uptime SLAs (99.99%) should tighten these thresholds; early-stage teams with more tolerance for brief degradation can relax them. The key is defining the criteria before starting so that rollback decisions are made on data, not in-the-moment stress.
The six-phase playbook (Assessment → Abstraction Layer → Parallel Running → Data Migration → Traffic Cutover → Cleanup) is a general framework. Tier-0 migrations (a single API endpoint with no data storage) can compress phases 1-3 into a single day. Tier-3 migrations (payment providers with active subscriptions and millions of stored cards) can stretch to six months. Calibrate accordingly.