Data MigrationPostgreSQLDevOps

50,000+ Records: Zero-Downtime Data Migrations

January 12, 2024·10 min

How we migrated complex data between schemas with ID mapping, name splitting, enum transformations, and full audit trails—all without taking the system offline.

The Migration Problem

Schema changes in development are trivial. Add a column, rename a field, run the migration. In production with active users and fifty thousand records, the calculus changes entirely.

A naive migration that restructures a table can lock it for minutes. During that lock, every query fails. Users see error pages. Data written during the migration might be lost or corrupted. What seemed like a simple change becomes an incident.

The Fundamental Constraint

Database migrations must satisfy a constraint that feels impossible: the old code and new code must work simultaneously. You can't flip a switch and update everything atomically—deploys take time, caches need invalidation, workers are processing jobs with old schemas.

This constraint shapes everything. Every migration becomes a multi-phase operation where each phase is individually safe.

Additive Changes

The safest changes are purely additive. Adding a new table, adding a nullable column, adding an index—these don't break existing code because existing code doesn't know about them.

The principle: new structures first, then migration, then removal. Never the reverse.

The Dual-Write Pattern

When you need to move data from one structure to another, the dual-write pattern maintains consistency during transition. After deploying new tables, update your write path to write to both old and new locations within the same transaction.

This creates a period where both schemas are populated. New writes go to both. Old reads continue working. New reads can start using the new schema.

The dual-write is temporary scaffolding. Once migration is complete and reads have switched over, you remove the old write path.

Backfilling History

Dual-write handles new data, but historical data still lives only in the old schema. Backfilling moves it to the new structure without impacting production.

The key is batching with rate limiting. Process a thousand records, sleep briefly, repeat. This spreads the database load over time instead of creating a spike that affects user queries.

Idempotency matters here. The backfill might fail partway through and need restarting. Each batch should be safe to rerun—use upserts instead of inserts, track progress in a separate table.

Verification Before Switching

Before switching reads to the new schema, verify data integrity. Count comparisons catch missing records. Sample verification checks that relationships are preserved correctly. Automated integrity checks are cheaper than production incidents.

This is the phase where you discover edge cases: records that don't fit the new schema, data that was corrupted in the old system, assumptions that don't hold. Better to find them here than after the switch.

Gradual Traffic Shifting

Feature flags enable gradual rollout. Start with 10% of reads going to the new schema. Monitor error rates and latency. Increase to 25%, then 50%, then 100%.

If something goes wrong, you can instantly shift back to the old schema. The dual-write ensures both schemas have current data, so rollback is just a flag change.

Removing the Scaffolding

Once reads are fully migrated and stable—typically after a monitoring period—remove the old write path. The application now only knows about the new schema.

The final step is archiving and dropping the old tables. Keep an archive for a while in case you need to investigate historical issues. Then clean up.

Time Horizons

This pattern trades speed for safety. A migration that could be "done" in minutes instead takes weeks. But the alternative—a failed migration that corrupts data or causes downtime—is far more expensive.

The total engineering time is often similar. What changes is when you spend it: up front in careful planning, or afterward in incident response and data recovery.

The Deeper Principle

Zero-downtime migration reflects a broader philosophy about production systems. Users don't care about your internal changes—they care about the service working. Engineering effort should be invisible to them.

This requires treating the transition period as a first-class concern. The migration isn't done when the new schema exists. It's done when the old schema is removed and no code references it. Everything in between is the actual work.

Schema changes are easy. Schema changes without disruption are engineering.

Questions about this article?

Happy to dive deeper on any of these topics.

Get in Touch→