Postmortem · Postgres 14.9 · Tenants Table Pp. 14 — 41

The migration that took six weeks.

The plan said two days. We had run the dry-run twice, on a staging cluster restored from a 14-hour pg_basebackup. Six hundred million rows, one partitioned tenants table, one new normalized form, three engineers, one very long Saturday. This is the diary of what actually happened.

612.4M rows
Source table size
42days
Actual duration
2days
Planned duration
19deploys
Rollouts & reverts
1incident
User-visible (Sev-2)
3:14am CET
The phone call
§ In this issue
  1. The brief, and why it was wrongp. 15 — 17
  2. Six weeks, week by weekp. 18 — 26
  3. The row-count chart, in prosep. 27 — 28
  4. The index that did not buildp. 29 — 31
  5. Tools we used, and what they costp. 32 — 34
  6. The 3:14 am phone callp. 35 — 37
  7. Lessons, in plain languagep. 38 — 41
§01 · The BriefAnd why it was wrong on day three

The plan was a two-page memo. The reality needed forty.

The two-page memo from January was correct in every line. It described the source: a partitioned tenants_v1 table that had grown, over four years, from a quick weekend prototype into the single most-queried object in our database — touched by 73% of all read transactions, written to by every signup, every billing tick, every audit hook. It described the target: three smaller tables, each with one clear owner, joined by surrogate keys. It described the method: dual-write for a week, backfill with lhm, cut over on a Sunday night, drop the old table after a 30-day grace. Two engineering days. One DBA on-call. Done.

What the memo did not describe, because we did not yet know to describe it, was the long tail of everything else: the four reporting jobs that SELECT'd from tenants_v1 by ordinal column position; the eleven internal services that had cached the table OID in a startup ritual nobody could remember writing; the marketing-export pipeline that ran every Wednesday at 04:00 UTC and held a SHARE lock for ninety minutes; the autovacuum that, on the apac partition, had been quietly losing ground for eight months and now needed 41 GB of temporary space to finish a single pass.

We discovered most of this in the first seventy-two hours. A few items we discovered later, at unkind hours, with the deploy lead on speakerphone and a Datadog dashboard turning red in three places at once. The shape of the migration changed roughly once a week thereafter. The shape of the team — who was sleeping, who was not, who had stopped answering Slack except in single-word messages — changed too.

This issue is the diary. We have removed nothing that was embarrassing, and we have removed only one name, at the request of a former colleague who left the company in March for reasons unrelated to this work. Everything else is here, with the timestamps as they happened.

§02 · DiaryWeek-by-week, with the row counts

Six weeks. Forty-two phases, condensed.

W1Feb 03 — Feb 09

Dual-write goes in. Quietly.

Tuesday: shipped the dual-write trigger to staging. It worked. Wednesday: shipped to production behind a feature flag at 12% of writes. By Friday we were at 100% of writes mirrored into accounts_new. P99 write latency rose 11ms, which we accepted.

By Sunday we had 4.1M rows backfilled out of 612.4M and our chart said we'd finish in 19 days. Idris drew the chart on the whiteboard in marker. We kept that whiteboard photo. It is, in hindsight, the most optimistic photograph in our shared drive.

Observations Dual-write trigger added +11 ms p99 write latency.
Backfill rate: 7.4 M rows/day (target 32 M).
First problem visible: nobody had measured replication lag under sustained COPY load.
W2Feb 10 — Feb 16

The autovacuum problem.

Monday morning the apac partition crossed the 200M dead-tuple line. pg_stat_user_tables went pink. Autovacuum had been running for nine hours and was at 38% on a single partition. We paused the backfill. Hee-Won spent two days tuning autovacuum_work_mem from 256 MB to 2 GB and rewriting our cost-delay settings.

By Sunday, vacuum was healthy again, but we had lost five working days. We re-drew Idris's whiteboard chart in pencil this time. It now said 31 days. Nobody mentioned it.

Observations Dead tuples (apac): 214 M peak.
Autovacuum runtime: 31 h 14 m.
Lesson logged: never start a migration with a known-broken vacuum.
W3Feb 17 — Feb 23

Backfill is going. Until pg_repack.

Wednesday at 22:40 CET we started a pg_repack on the eu partition of tenants_v1, to reclaim 18 GB of bloat before the cutover. The repack acquired its exclusive lock at 22:41 and held it for 41 seconds while it swapped the table. That was supposed to be the worst of it.

At 22:43 our checkout service started returning 502s. A bounded connection pool, sized at 64 per replica, had filled in the four seconds the swap took. Sev-3 was opened, paged Anna, paged Idris, resolved at 22:51 once the pool drained. No customer wrote in. We are not sure if anyone noticed. We did.

Observations Lock duration: 41 s (planned: < 5 s).
Saturated pools: 3 services.
First Sev-incident of the migration. No data loss.
W4Feb 24 — Mar 02

The index that would not build.

We started a CREATE INDEX CONCURRENTLY on memberships_new(workspace_id, account_id) INCLUDE (role) on Tuesday at 09:00. It ran for 34 hours and then it failed — not crashed, failed — with a NOT VALID index left behind in pg_index. The cause was a long-running transaction in a reporting job that we'd been told had been retired in October. It had not been retired. It ran every Tuesday at 14:00.

We dropped the invalid index, killed the reporting job, and started again on Thursday. The second attempt took 28 hours and succeeded. Total cost in wall-clock: 4 days and 6 hours, for a single index. We added one line to the runbook: grep every nightly cron in every repo before starting a CONCURRENTLY build.

Observations Failed build: 34 h, NOT VALID.
Second build: 28 h, valid.
Hidden cron found: billing_v1::weekly_export.rb.
W5Mar 03 — Mar 09

The almost-rollback.

Saturday night, 02:30 CET. The cutover was running. We had switched reads to accounts_new at 02:11 and the dashboards looked fine until they did not. At 02:28, P99 read latency on the auth path went from 4 ms to 612 ms. A junior on call, who had done everything right, paged Anna. Anna paged Idris. Idris paged Hee-Won, who was already awake because Hee-Won is always already awake.

By 02:46 we had decided to roll back. By 02:51 we had decided not to. The fix was a missing statistics target on memberships_new.account_id — 100 rather than 1000 — which had caused the planner to pick a nested-loop join over a 220 M-row table. ALTER TABLE, ANALYZE, hold breath, watch graphs flatten. By 03:14 we were green again. We did not sleep.

Observations Peak p99 latency: 612 ms (baseline: 4 ms).
Window between “abort” and “hold”: 5 minutes.
Root cause: a single missing ALTER TABLE ... SET STATISTICS.
W6Mar 10 — Mar 16

Drain, verify, retire.

Monday: writes cut over fully to the new tables. Wednesday: started the slow drop of tenants_v1 partitions — one per day, in case anything wanted to read from them. Friday: dropped the last one. We had run 19 deploys, opened 41 PRs, written 1,180 lines of migration code and deleted 2,260 lines of the old read-path.

Saturday afternoon, Anna closed the field log. Idris went for a walk along the Spree. Hee-Won, in Seoul, slept fourteen hours and woke up to a single Slack message from the CEO, sent at 14:02 Berlin time: thanks all, please do not do this again until we have all forgotten.

Observations Final cutover: clean, no Sev events.
Lines of code, net: −1,080.
Story points spent: 84 (planned: 13). Carry-over: 0.
We had the right plan for the database we thought we had. We did not have the right plan for the database we actually had. The difference was four years of weekend pull requests.
— Anna Vollmar, field notebook, March 09, 03:48 CET
§03 · The ChartRow counts, weeks, and what the curve hid

A curve that bent three times, only one of them upward.

Rows migrated, weekly cumulative

W14.1 M
W26.0 M
W3118.6 M
W4241.3 M
W5253.0 M
W6612.4 M

The chart, when we finally drew it for the company-wide retro, looked like a staircase someone had kicked. Weeks one and two are flat — 4.1 M then 6.0 M — because the autovacuum problem ate the second week whole. Week three rises sharply: that was lhm finally allowed to breathe, moving 112 M rows in seven days, an average of 16 M per day, which was three times faster than our optimistic model.

Week four levels off into a steady climb. Week five is the visible scar: only 11.7 M rows added in seven days, because the index-build catastrophe stole four of them. Week six is the cutover — the bar is full because at that point we no longer needed to copy; we needed only to switch.

The lesson buried in the chart is that the worst weeks were not the ones where progress was zero. They were the ones where progress looked normal and a problem was building underneath: a vacuum falling behind, a cron job nobody had read.

The cutover was twelve seconds long. The migration was six weeks long. Almost everything we learned, we learned in the gap between those two numbers.
Hee-Won Lee, SRE on-call · Seoul · all six weekends
§04 · IndexesWhat we built, what we kept, what we threw away

Twelve indexes in total. Two of them mattered.

# Index Role Build time Size Outcome
01 accounts_new_pkey Primary key (uuid) 02 h 11 m 9.4 GB Built clean
02 accounts_new_email_lower_uniq Unique constraint 04 h 38 m 11.2 GB Built clean
03 memberships_new_pkey Composite PK 06 h 02 m 14.8 GB Built clean
04 memberships_workspace_account_role Hot read path 34 h 00 m Aborted (NOT VALID)
05 memberships_workspace_account_role Hot read path (retry) 28 h 14 m 22.1 GB Built clean (2nd try)
06 workspaces_new_pkey Primary key 01 h 04 m 3.8 GB Built clean
07 workspaces_owner_id_btree FK lookup 01 h 58 m 5.1 GB Built clean
08 accounts_created_at_brin Reporting (range) 00 h 09 m 42 MB BRIN, kept
09 memberships_role_partial Admin lookups 00 h 52 m 280 MB Partial idx (role='owner')
10 accounts_legacy_id_btree Transitional 03 h 18 m 7.2 GB Dropped W6
11 workspaces_slug_trgm Search (pg_trgm) 02 h 41 m 4.4 GB Removed — unused
12 memberships_audit_ts_btree Compliance audit 07 h 26 m 12.0 GB Built clean
§05 · ToolsWhat we used, where it surprised us

Three tools. One opinion per tool.

01 · Online schema change

lhm

We chose lhm because Idris had used it on MySQL at a previous job and the Postgres port has a friendlier configuration surface than pg_repack for table reshapes. It does the trigger-and-copy dance and exposes throttling knobs we trusted.

Surprise: its throttle is in rows, not bytes. On a wide table with three TEXT columns averaging 612 bytes, this is a wildly misleading control. We rewrote our chunker to throttle on heap-bytes and never looked back.

used in W3 — W5verdict: kept
02 · Bloat reclamation

pg_repack

Used to reclaim 18 GB of bloat from the eu partition before cutover. Worked, in the technical sense. The lock window during table swap blew through our connection pools and gave us the only customer-visible incident of the project.

Opinion: pg_repack is fine. The way most teams call it — including ours, until that night — is not fine. Drain the pool first, route reads to a replica during the swap second, repack third.

used in W3verdict: kept, with rituals
03 · MySQL exile

gh-ost

We did not use gh-ost. We considered it for one afternoon, because of fond memories from previous lives, and then we read its README again and remembered that it is MySQL-only.

This card exists so that the next engineer who asks could we have used gh-ost can read it once and stop. The Postgres equivalent we wanted was lhm plus discipline. There is no shortcut.

considered W0verdict: not applicable
§06 · The Call03:14 am CET, March 09

The phone rang. We were five minutes from a rollback.

The thing the runbook does not tell you is what your own voice sounds like at 3am when you have to decide whether to roll back a migration you have spent thirty-seven days building. It sounds, in my case, very calm and very far away. I was sitting in the kitchen because my desk chair was in the bedroom and my partner was asleep. I had a glass of water and a printout of the cutover plan with three things crossed out.

Idris was on the call from his hotel room in Lisbon — he had been at a conference, he had taken the red-eye, he was running on tea and adrenaline. Hee-Won was in Seoul, on her third coffee, sharing a Datadog window where the auth-svc p99 latency line had gone from a calm 4 ms to a 612 ms cliff. We had agreed two weeks earlier on a rule: if p99 stayed above 200 ms for more than five minutes on the auth path, we would roll back. We were at minute three.

At minute four, Hee-Won said the words that saved the migration: this looks like a planner problem, not a data problem. She had pulled an EXPLAIN ANALYZE of the slow query and seen a nested loop where, on staging, there had been a hash join. The default statistics target of 100, fine for the old narrow table, was lying to the planner about the new wide one. Anna ran the ALTER TABLE memberships_new ALTER COLUMN account_id SET STATISTICS 1000; followed by ANALYZE memberships_new;. The analyze took 142 seconds. The graph flattened in real time.

We stayed on the call until 03:48, watching the graph, then said good night the way people say good night when nothing is funny anymore. I closed the laptop. I drank the water. I sat in the kitchen for another twenty minutes and then I went to bed.

Online schema-change tooling is not magic. It is a slow, opinionated machine that wants you to drain pools, read your crons, and set your statistics. Skip any one of those and it will, eventually, find you on a Saturday night.
Idris Pak, Platform Lead · written the Monday after
§07 · LessonsWhat we would tell the team that started this in January

Seven things we will not forget twice.

Why did we estimate two days?

Because the staging cluster lied to us, and we let it. pg_basebackup restores produce a database with statistics that look right and bloat that looks normal because there is no bloat: it has never been written to in anger. Our dry-runs ran in 38 hours not because the migration was 38 hours of work but because staging was, in effect, a brand-new database. We mistook that for performance.

The fix, going forward, is simple and unfun: dry-runs must be done on a clone that has been receiving production write traffic for at least seven days. Yes, it is expensive. Yes, it is the only honest way.

Should we have used logical replication instead of lhm?

We considered it for a week. Postgres 14 logical replication into a reshaped target requires either a custom decoding plugin or a row-by-row apply via something like pgoutput plus an application-level transformer. We estimated the engineering cost at twelve days and chose lhm instead, which we estimated at three days.

If we did this again on Postgres 16, we would seriously evaluate logical replication with pglogical or the newer built-in features. Not because lhm failed — it didn't — but because logical replication gives you a built-in catch-up window that lhm has to simulate with trigger plumbing.

What was the index that did not build, really?

memberships_workspace_account_role, a three-column btree with INCLUDE on the role column. The first CREATE INDEX CONCURRENTLY ran for 34 hours and then aborted because a long-running transaction held an open snapshot. The transaction belonged to a weekly reporting job we'd been told had been deprecated in October.

The lesson is not that one cron got missed. The lesson is that we trusted a verbal answer to a question that needed a grep. After this incident, the cutover checklist gained a single new line: find every cron, in every repo, with an actual search.

What was the Sev-2 incident?

It was the planner regression on the night of March 09, described in §06. P99 latency on the auth path went from 4 ms to 612 ms for 11 minutes. We classified it as Sev-2 because the auth path is shared with checkout and we saw a 0.4% bump in 5xx responses on /checkout during the window.

No customer ticket was filed. No data was lost. The postmortem ran 9 pages.

Why did we drop the workspaces_slug_trgm index?

Because nobody used it. We added it during planning because the old tenants_v1 had a similar pg_trgm index that supported workspace search. When we audited the queries in week six, we found that the search feature itself had been moved to Elasticsearch in November 2024 and the index had been dead weight in production for fifteen months.

This is the single line item in this migration that paid for itself: −4.4 GB on disk and one fewer index for the planner to consider.

How did we know it was safe to drop tenants_v1?

We watched pg_stat_user_tables.seq_scan and idx_scan on each partition for seven days. Apac went to zero on day one. EU dropped to zero on day two. US, weirdly, kept registering 4-6 sequential scans per hour until Wednesday, which we tracked to a single cached query plan in a long-lived service that we restarted at 14:00 UTC. After that, zero across all partitions for four consecutive days, and we dropped.

If you do not have seq_scan and idx_scan on, turn them on before your next migration. They cost almost nothing and they answer the only question that matters.

If we did this again, what would change?

Three things. First: dry-run on a write-warmed clone, not a fresh restore. Second: read every cron in every repo before any CONCURRENTLY build, and write the grep into the runbook. Third: set statistics targets on the new tables to 1000 on every join column, by default, before the first cutover read.

Everything else — the tooling, the dual-write, the slow drain — we would do again, in the same order.