PG16 → 18 production migration — the Dallas↔Chicago ping-pong runbook

Refines the 12-step plan into an executable runbook. Supersedes 202606061949_PG18_UPGRADE_STRATEGY.md §7b (the single-box --copy, no-standby approach) — this version keeps a live streaming standby as the rollback and lands both boxes on ZFS + PG18 with a repointable connection layer in front.

⏱️ STATUS 2026-06-10 — W1 + Phase D + W2 (Dallas→PG18) + Chicago standby + RO VIP DONE; NEXT = W3

W2 EXECUTED 2026-06-10 — prod is now PostgreSQL 18.4 on Dallas. In-place pg_upgrade --link (the upgrade itself ~12 s) behind a kamal maintenance window (kamal app maintenance → … → kamal app live; ~25 min wall, checksums-dominated + the gotchas below). Sequence run as gated server-side scripts (/tmp/w2_*.sh): clean fast-shutdown (+pg_controldata “shut down” gate) → ZFS snapshot tank/prod-replica@pre-pg18 (the rollback) → pg_checksums --enable (32.5 M blocks, ~5 min; prod was checksums-off) → initdb PG18 (noble, glibc 2.39, matching) → pg_upgrade --check + --link → re-seed postgresql.conf + pg_hba.conf (NOT auto.conf — it held dead replication creds incl. plaintext passwords, deliberately dropped) → swap dirs (data16_pre18 kept) → kamal accessory reboot postgres to :18-noble → copy server.crt/server.key → ALTER EXTENSION … UPDATE → vacuumdb --analyze-in-stages. Verified: PG18.4, checksums on, listen=*, preload intact; row counts ≥ baseline (no loss — both heatwave + heatwave_versions/tablespace carried in one pass); FDW loopback ok; 3 ledger triggers; all 5 prod hostnames 200; no new AppSignal errors.
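
The pg_controldata “shut down” gate in the sequence above could look roughly like the sketch below. The function name is hypothetical and the actual /tmp/w2_*.sh scripts are not reproduced here; the check reads pg_controldata output on stdin (so it can be exercised without a cluster) and assumes the default, untranslated message text.

```shell
# Hypothetical gate: succeed only if pg_controldata reports a clean primary shutdown.
# "shut down in recovery" (a standby) and "in production" both fail the gate.
cluster_is_shut_down() {
  # The relevant line looks like: "Database cluster state:               shut down"
  state=$(sed -n 's/^Database cluster state:[[:space:]]*//p')
  [ "$state" = "shut down" ]
}

# Usage (data dir path taken from gotcha 3; adjust to the real mount):
#   pg_controldata /var/lib/postgresql/data | cluster_is_shut_down || exit 1
```

Requiring the exact string is deliberate: a prefix match would also accept “shut down in recovery”, which is not a state pg_upgrade should start from.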

5 gotchas hit + fixed (read before W3 / any future pg_upgrade): (1) install -o 999 rejects the numeric uid (no host user 999) → mkdir+chown 999:999. (2) --link cross-device: three separate -v bind mounts (data/data18/tbs) are three st_devs inside the container even on one ZFS dataset → mount the parent once at /var/lib/postgresql + ln -sfn …/tbs /mnt/postgresql. (3) the old cluster’s postgresql.conf hardcodes data_directory='/var/lib/postgresql/data' → the upgrade container must mount the data at exactly that path or the source server won’t start. (4) a leftover heatwave_test DB in the prod cluster still had adminpack (the pre-flight migration only cleaned heatwave/heatwave_versions) → --check fatal until dropped. (5) pg_upgrade carries only relations, not loose files → copy server.crt/server.key into the new dir or PG18 crash-loops on ssl=on.
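
Gotcha (2) can be caught before pg_upgrade runs by comparing device numbers from inside the upgrade container. This is a minimal sketch: the function name is hypothetical, the data18 path is an assumption based on the mount names above, and `stat -c` assumes GNU coreutils.

```shell
# Hypothetical pre-check: pg_upgrade --link hard-links files, so the old and new
# data dirs must report the same device (st_dev) as seen inside the container.
same_device() {
  [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

# Usage inside the upgrade container (paths assumed):
#   same_device /var/lib/postgresql/data /var/lib/postgresql/data18 \
#     || { echo "cross-device: --link will fail" >&2; exit 1; }
```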

Rollback still available (keep ~1 day): sudo zfs rollback tank/prod-replica@pre-pg18 → revert accessory image to :16-noble. Old cluster also at /data/prod-replica/data16_pre18.

Phase D DONE 2026-06-10. Chicago reset via Terraform Cloud to Ubuntu 26.04 + ZFS (tank) + BBR + Tailscale-GRO, idle; node renamed to canonical chi-latitude-heatwave-02 = 100.68.157.49 (old node + all Vultr/ash legacy purged from the tailnet). Edge firewall imported + applied (TFC workspace green, 8 resources). Public IP unchanged (186.233.186.45).

CHICAGO PG18 STANDBY DONE 2026-06-10. pg_basebackup Dallas→Chicago over the tailnet (host=100.123.47.52 sslmode=require, slot chicago_standby, 84 GB into tank/postgres @ recordsize 128K→16K) → booted as a kamal accessory (kamal accessory boot postgres -d production; deploy.production.yml repointed to fresh Chicago 100.68.157.49 + :18-noble + /data/postgres/data). Streaming, async, lag ~0 (pg_stat_replication shows chicago_standby streaming; slot active). The @pre-pg18 snapshot was dropped (the live standby is now the rollback; 84 GB reclaimed). End state so far: Dallas PG18.4 primary + Chicago PG18 standby, both ZFS, PgBouncer fronting Dallas.
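
For reference, a standby clone of this shape is typically driven by a single pg_basebackup call. The sketch below only prints the command so it can be reviewed first. The replication role name and the target directory passed in the usage line are placeholders, not values confirmed by this runbook, and it assumes the slot already exists on the primary (add `-C` to create it during the backup).

```shell
# Prints (does not run) a pg_basebackup invocation for a streaming standby.
# -S uses an existing physical slot, -X stream ships WAL during the copy,
# -R writes standby.signal plus primary_conninfo, -P shows progress.
basebackup_cmd() {
  primary_host=$1; slot=$2; target_dir=$3; repl_user=$4
  printf 'pg_basebackup -d "host=%s user=%s sslmode=require" -D %s -S %s -X stream -R -P\n' \
    "$primary_host" "$repl_user" "$target_dir" "$slot"
}

# Usage (role name "replicator" is hypothetical):
#   basebackup_cmd 100.123.47.52 chicago_standby /data/postgres/data replicator
```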

NEXT = W3 (move prod back to the fresh ZFS+PG18 Chicago 100.68.157.49; flip .kamal/prod-active-destination → production; quiesce → Chicago lag 0 → pg_promote() → PgBouncer backend → Chicago → tunnel home → rebuild Dallas as the PG18 replica).

READ-ONLY-TO-STANDBY VIP DONE 2026-06-10. heatwave-db-ro Tailscale VIP → 100.92.175.80:6432 → a pgbouncer accessory on Chicago (kamal accessory boot pgbouncer -d production, bound 100.68.157.49:6432, host files in /data/pgbouncer/) → the local PG18 standby, so direct read_only SQL users hit the standby (pg_is_in_recovery()=true, writes rejected) and offload the Dallas primary. RW stays on heatwave-db → 100.125.93.206:6432 → Dallas primary. Advertised by tailscale serve --service=svc:heatwave-db-ro --tcp=6432 tcp://100.68.157.49:6432 on Chicago (node tagged tag:heatwave-db; auto-approved by autoApprovers). The VIP was created via the admin API; a config-driven import {} block in infra/terraform/tailscale/main.tf reconciles it into TFC state on the next master apply (then delete the block). At W3 the RO VIP follows the standby — re-run the same serve --service on the new standby box, tailscale serve clear svc:heatwave-db-ro on the old. See 202606101230_PG18_STANDBY_SESSION_HANDOFF.md Task 9 for the exact mechanism + the land step.

Post-W2 cleanup DONE 2026-06-10: dropped heatwave_test; pg_repack→1.5.3 (DROP+CREATE); removed 2 W1-era exited containers + data16_pre18 + 2 redundant snapshots (tank/postgres@pre-pg18 staging, tank/prod-replica@pre-chicago-reset) + 2 bookworm PG16 images; swept 13 docs/skills to PG18.

Snapshot tank/prod-replica@pre-pg18 (now ~84 GB — it became the sole holder of the pre-upgrade cluster once data16_pre18 was deleted): KEEP as the W2 rollback until the Chicago PG18 standby is streaming, then sudo zfs destroy it. ZFS snapshots add no query-perf cost (they only hold space via COW + grow as the live cluster diverges).

/mnt/postgresql tablespace: DECIDED to LEAVE 2026-06-10. It can’t be dropped online — the 175 GB versions data is explicitly in pg_default while the DB default is the near-empty heatwave_versions tablespace (8.8 MB of catalogs), so ALTER DATABASE … SET TABLESPACE pg_default refuses (“some relations are already in pg_default”).

Removal recipe (future planned op): pg_dump -Fc heatwave_versions → quiesce PaperTrail/audit writes (or dual-write) → DROP DATABASE heatwave_versions → CREATE DATABASE heatwave_versions (no TABLESPACE clause → pg_default) → DROP TABLESPACE heatwave_versions → pg_restore → drop the /mnt/postgresql mount from the accessory config (~hours, 175 GB).

Still open: rotate the replication creds that were cleartext in the dropped auto.conf.

PG17/18 leverage backlog: doc/tasks/202606101100_PG18_LEVERAGE_OPPORTUNITIES.md.

(superseded) original Phase-D plan

Where everything actually is right now:

Production: LIVE on Dallas (dal-latitude-heatwave-01, 100.123.47.52) — PG16 promoted primary on /data/prod-replica/data (ZFS tank), fronted by PgBouncer. The team deploys normally: bin/deploy production is an alias to the live destination (.kamal/prod-active-destination = production-dallas); prod is on the latest master.
Staging coexists on the same Dallas box as service heatwave-staging.
Call-records MIGRATED to Dallas — heatwave-sftp + /data/callrecords run on Dallas; the PBX uploads there now. (The old Chicago heatwave-sftp is a dead leftover.)
Reverse standby: Chicago streams from Dallas over the tailnet (chicago-prod-standby :5433, slot chicago_standby) — prod is currently not solo.
Chicago holds only disposable things now: the frozen pre-cutover primary (heatwave-postgres :5432, kept as cutover insurance), the stale heatwave-sftp, pg-public-forward (socat), kamal-proxy, and the reverse standby. Nothing UNIQUE.
Chicago today: Ubuntu 24.04, ext4/md-RAID1, NO ZFS — the exact thing this ping-pong exists to fix. Dallas: Ubuntu 26.04, ZFS tank.

NEXT — Phase D: reset Chicago → Ubuntu 26.04 + ZFS (match Dallas), via Terraform.

Tear down Chicago’s disposable stack (standby, frozen accessories, stale sftp, socat, proxy, buildkit). ⚠️ This destroys the reverse standby → prod runs SOLO on Dallas for the reset + re-basebackup window (~30–60 min). Take a fresh Dallas ZFS-snapshot / pg_dump to Wasabi as deep insurance first.
Reprovision Chicago via infra/terraform/latitude — hostname=chi-latitude-heatwave-02, site=CHI, operating_system=ubuntu_26_04_x64_lts (default), setup_zfs_data=true (→ setup-zfs-data.sh builds the ZFS mirror); BBR + Tailscale-GRO already baked into the cloud-init. Chicago is hand-built / not in TF state, so tofu import latitudesh_server.host <chicago_server_id> then change OS + apply (Latitude reinstall), or reinstall via the Latitude API and let cloud-init run.
Re-establish the reverse standby on the fresh ZFS Chicago — pg_basebackup from Dallas over the tailnet (~3.5 min at ~1.2 GB/s), as a proper accessory on ZFS.
Then W2 (in-place PG16→18 on Dallas + standby rsync) and W3 (move prod back to the fresh ZFS+PG18 Chicago; flip the marker .kamal/prod-active-destination → production).

Terraform-Cloud test (this reset is the test case). The latitude module tofu validates clean (OpenTofu 1.12.1) and already targets Ubuntu 26.04 + ZFS. NOT yet wired to TFC: no cloud {} block in infra/terraform/latitude, no TFC token in op/env, no local state. To run the reprovision through Terraform Cloud (VCS-driven via GitHub) we need: the TFC org + workspace name (to add the cloud {} block), a TFC token / the VCS-connected workspace, and Chicago’s Latitude server_id to import. Latitude API token = op://IT/Latitude-API/credential; Latitude project = proj_R82A0yZxgN6mM.

Desired end state

Chicago (chi-latitude-heatwave-02) = primary, PG18, ZFS (today: md-RAID1/ext4, NO ZFS).
Dallas (dal-latitude-heatwave-01) = streaming replica, PG18, ZFS (already ZFS; today runs PG18 staging — see G-DAL).
A repointable connection layer (PgBouncer) so failovers are a backend repoint, not an app change.
Minimal downtime.

Verdict

The ping-pong technique is sound and the right call given the ZFS-on-Chicago goal: Chicago can’t be reformatted to ZFS while it’s primary, so you vacate it to Dallas, reformat, and come back. The streaming standby also gives the rollback §7b lacked. But the plan as written has one window that’s secretly hours long, a couple of steps that say “stream” where they must mean “rebuild,” and one app-breaking PgBouncer detail.

The big reframe: this is THREE short windows, not one long one

The plan reads as “maintenance on at step 6, off after step 11” — but step 9 (rebuild Chicago as a PG18 replica) is a ~250 GB re-clone that takes hours. Holding maintenance across it = hours of downtime, defeating the goal.

Fix: resume traffic on Dallas-PG18 the moment step 8 boots, run the Chicago rebuild live in the background, and take a second brief window only for the final flip. Net hard downtime = three short switchovers, each seconds-to-a-minute with the connection layer + a caught-up replica:

| Window | What’s down | Duration | What happens |
|---|---|---|---|
| W1 | writes only (brief) | ~seconds | switch primary Chicago→Dallas (plan step 3) |
| W2 | both (the upgrade) | `pg_upgrade --link` ≈ seconds + verify/boot | Dallas PG16→18 in place (steps 6–8) |
| W3 | writes only (brief) | ~seconds | switch primary Dallas→Chicago (step 11) |

Everything else (replica builds, ZFS reformat, the Chicago re-clone, PgBouncer rollout) happens live.

Critical gotchas (read before scheduling)

G1 — PgBouncer MUST run in session pooling mode, not transaction. The app uses advisory locks in 28 files (with_advisory_lock, pg_advisory*) + the Rails migration advisory lock + LISTEN/NOTIFY paths — all session-scoped. Transaction pooling hands each transaction a different backend connection → advisory locks silently break (acquired on one conn, “released”/re-checked on another), and LISTEN is lost. Session mode is safe but only multiplexes between sessions, so PgBouncer’s value here is failover indirection + connection capping, not heavy pooling. (If transaction mode is ever wanted, it’s a separate project: audit every advisory lock, set prepared_statements: false, move LISTEN/NOTIFY off PG.)
- Design: run PgBouncer as a kamal accessory next to the app (heatwave-pgbouncer), app points DATABASE_HOST at it, PgBouncer’s backend = current primary. Failover = edit PgBouncer’s backend host + RELOAD, app config untouched. Needs pools for both heatwave and heatwave_versions. Concrete sized draft (ini, auth, accessory YAML, repoint procedure) in Appendix P, sized off the consolidated 49-thread / DB_POOL=55 reality from PR #1072.
G2 — streaming replication is same-major only. A PG16 replica cannot stream from a PG18 primary. So the moment Dallas goes to PG18 (step 8), the Chicago PG16 replica is dead and step 9 is a full rebuild (pg_basebackup from PG18 Dallas), not “resume streaming.” Same for step 12 (rebuild Dallas off PG18 Chicago — pg_rewind may shortcut it if timelines allow, else basebackup). Budget a ~250 GB re-clone for steps 9 and 12.
G3 — stand up the connection layer FIRST (before W1). Put PgBouncer in front while Chicago is still the only primary, repoint the app to it once (verified, no rush). Then W1/W3 are PgBouncer backend repoints — not app redeploys mid-window. The plan has it at step 10; move it to step 0.
G4 — pre-flight is already done and replicates. The two blocker migrations (20260608004415 drop adminpack, 20260608004416 drop partition identity) were applied to prod 2026-06-08 and ride the stream to the Dallas replica, so both clusters are pre-flight-clean → pg_upgrade --check passes with no schema surgery. Still required at W2 (reuse §7b 4–7): checksums are OFF in prod → pg_checksums --enable on the stopped dir (or initdb --no-data-checksums); seed shared_preload_libraries into the new conf before pg_upgrade; re-seed postgresql.conf + pg_hba.conf (incl. listen_addresses) after.
G5 — use physical replication slots on every primary→replica link, so the primary retains WAL and a WAN hiccup doesn’t force a rebuild. Set wal_keep_size as a backstop.
G6 — connectivity: RESOLVED 2026-06-08 → native public IPs + BBR, NOT Tailscale. Tailscale’s WireGuard caps cross-DC throughput at ~4 MB/s (per-tunnel overhead) — far too slow for a 250 GB basebackup. The fix is the native Latitude public IPs (CHI 186.233.186.45 ↔ DAL 67.213.118.15, eno1), which auto-route over the Global Gateway private backbone (10 Gbps, no egress, ~21 ms) — and BBR congestion control is MANDATORY: the link is asymmetric/lossy in the CHI→DAL (primary→replica pull) direction, where cubic collapses a single stream to 4–34 MB/s; BBR restores 87–122 MB/s. Persisted in /etc/sysctl.d/99-replication-bbr.conf + tcp_bbr module on both hosts. Expose the prod primary cross-DC via a socat forwarder (pg-public-forward, public:5432 → heatwave-postgres over kamal-net) scoped to the replica by a DOCKER-USER rule (the Latitude cloud firewall is non-enforcing — don’t rely on it), with PostgreSQL TLS (sslmode=verify-ca, Chicago’s self-signed server.crt as the root). Full playbook + the throughput-debugging method: the postgres-replication skill. (Proper rebind later: a dedicated Global Gateway private VLAN drops the public exposure entirely — dashboard request, team-provisioned.)
G7 — where does the app run while Dallas is primary (W1→W3)? If web/sidekiq stay in Chicago, every query crosses the WAN (~tens of ms each) for the whole Chicago reformat+reclone span (hours) = badly degraded though “up.” Decide: (a) accept degraded for the window, or (b) also deploy web+sidekiq to Dallas for the window (kamal can) so the app sits next to its DB. (b) is strongly preferred given the span. PgBouncer-on-the-app-host means the app always talks to a local bouncer regardless.
G8 — both databases, one cluster. heatwave + heatwave_versions live in the same cluster (the FDW is loopback 127.0.0.1), so physical replication and pg_upgrade carry both together and the FDW needs no reconfig after any flip. Just give PgBouncer a pool per DB.
G9 — keep the off-box dump. The live standby is the fast rollback; a verified pg_dump -Fc of heatwave (+ optionally heatwave_versions) to Wasabi before W2 is the deep one (box-loss). Cheap insurance — keep §7b step 2.
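
Two of the gotchas above (G1 and G6) lend themselves to mechanical pre-flight checks. The sketch below is illustrative only: the function names are hypothetical, the PgBouncer check looks only at the global `pool_mode` key (it does not inspect per-database overrides), and the contents of 99-replication-bbr.conf are assumed, since this runbook names the file but does not show it.

```shell
# G1: succeed only if the ini explicitly pins session pooling.
# Commented-out lines (starting with ';' or '#') are ignored by the pattern.
pool_mode_is_session() {
  mode=$(sed -n 's/^[[:space:]]*pool_mode[[:space:]]*=[[:space:]]*\([a-z]*\).*/\1/p' "$1" | tail -n 1)
  [ "$mode" = "session" ]
}

# G6: succeed only if a sysctl.d-style file selects BBR congestion control.
conf_enables_bbr() {
  grep -Eq '^[[:space:]]*net\.ipv4\.tcp_congestion_control[[:space:]]*=[[:space:]]*bbr[[:space:]]*$' "$1"
}

# Usage (the pgbouncer.ini path is an assumption):
#   pool_mode_is_session /data/pgbouncer/pgbouncer.ini || exit 1
#   conf_enables_bbr /etc/sysctl.d/99-replication-bbr.conf || exit 1
#   [ "$(sysctl -n net.ipv4.tcp_congestion_control)" = "bbr" ] || exit 1   # live value
```

The PgBouncer check is intentionally strict: session is PgBouncer's default, but requiring the key to be written out makes the choice visible to whoever edits the file next.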

Refined sequence (DECIDED 2026-06-08)

Reuse the proven --link recipe + post-upgrade steps from §7b (checksums, preload libs, ALTER EXTENSION vector/hypopg/pg_repack UPDATE, pgvector index check, FDW + ledger verify, rotate the deploy password, regenerate structure.sql).

Key simplification (your call #4): drop the pre-upgrade Chicago PG16 replica entirely. Fail the whole stack to Dallas, upgrade Dallas, then do all the Chicago work (ZFS + a single PG18 re-clone) at leisure while Dallas serves. Chicago is rebuilt once, directly on 18 — no double re-clone across the version boundary (G2).

Phase A — PgBouncer + Dallas PG16 standby (live, no downtime)

0. PgBouncer first (session mode, G1), in front of the Chicago primary; repoint the app’s DATABASE_HOST → PgBouncer; verify. Caps server connections so the post-#1072 deploy-overlap cliff (old+new sidekiq ≈ 220 conns > 197 usable) can’t fail a deploy. Full sized draft in Appendix P. (your steps 1 + 10, pulled forward)
1. Dallas PG16 replica on ZFS — a second cluster on Dallas (own volume/port, beside the PG18 staging accessory; room confirmed), on its ZFS pool; pg_basebackup from Chicago + physical slot (G5); stream. (step 2)

Phase B — move the WHOLE prod stack to Dallas (W1, brief)

2. W1: cut over to Dallas. Pre-deploy prod web+sidekiq+accessories on Dallas (kamal). Then: quiesce writes → Dallas lag = 0 → promote Dallas PG16 → PgBouncer backend = Dallas → flip the prod Cloudflare tunnel origin → Dallas → resume. The entire prod stack now runs on Dallas (PG16), so no cross-WAN queries; Chicago is idle (stale former-primary), free to wipe. (step 3)

Phase C — upgrade Dallas to PG18 (W2, the only both-down window)

3. W2: in-place upgrade Dallas. Quiesce → ZFS snapshot (tank/postgres@pre-pg18, the instant rollback) → pg_checksums --enable → pg_upgrade --link 16→18 (≈ seconds) → re-seed conf/hba → boot Dallas PG18 → post-upgrade SQL + FDW/ledger verify → smoke → resume. Dallas is now the PG18 prod primary. (steps 6–8)

   Rollback: zfs rollback tank/postgres@pre-pg18 → boot PG16 → PgBouncer unchanged.

Phase D — Chicago → ZFS + PG18 replica (AT LEISURE — live, no downtime)

4. Reformat Chicago to ZFS — destroy md-RAID1/ext4 /data, zpool create … mirror nvme2n1 nvme3n1 on the raw NVMes (give ZFS the disks, don’t stack on md). (step 4)
5. Rebuild Chicago as a PG18 replica of Dallas — pg_basebackup from PG18 Dallas + slot; stream. Rebuilt once, directly on 18. Now Dallas=PG18 primary, Chicago=PG18 replica, both ZFS. (steps 5 + 9, collapsed)

Don’t dawdle here: prod is sharing the Dallas box with staging during this span.

Phase E — flip home to Chicago + restore Dallas replica (W3, brief, when ready)

6. W3: cut back to Chicago. Quiesce → Chicago lag = 0 → promote Chicago → PgBouncer → Chicago → move the prod stack + Cloudflare tunnel back to Chicago → resume. (step 11)
7. Rebuild Dallas as Chicago’s PG18 replica (pg_rewind if timelines allow, else basebackup) + slot; stream. Tear down the temp PG16 Dallas cluster (staging keeps its PG18 accessory). End state: Chicago PG18 primary + Dallas PG18 replica, both ZFS, PgBouncer fronting. (step 12)

Downtime = W1 + W2 (early, back-to-back, minutes total) and W3 (later, scheduled when Chicago’s ready). Everything else — both replica builds, the Chicago ZFS reformat — is live.
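
Because each step of the W2 chain in Phase C is only safe if the previous one succeeded, it helps to run the chain through a runner that stops at the first failure. The following is a sketch of that idea, not the script that was actually used; `run_gated` and the `DRY_RUN` switch are hypothetical names.

```shell
# Run each step in order and stop at the first failure, so a later step
# (e.g. the upgrade) never runs on top of a failed earlier one (e.g. the snapshot).
# DRY_RUN=1 prints the plan without executing anything.
run_gated() {
  for step in "$@"; do
    echo ">> $step"
    [ "${DRY_RUN:-0}" = "1" ] && continue
    sh -c "$step" || { echo "GATE FAILED: $step" >&2; return 1; }
  done
}

# Usage (snapshot name from Phase C; the data dir is a placeholder):
#   DRY_RUN=1 run_gated \
#     "zfs snapshot tank/postgres@pre-pg18" \
#     "pg_checksums --enable -D /path/to/old/data"
```

Running once with DRY_RUN=1 gives a reviewable transcript of the window before anything is touched.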

Decisions (resolved 2026-06-08)

App during the Dallas-primary span → move the WHOLE prod stack to Dallas (web + sidekiq + accessories). The prod Cloudflare tunnel makes the traffic cutover a tunnel-origin flip, so there are no cross-WAN queries while Dallas is primary. (resolves G7)
PgBouncer → yes, set up FIRST (Phase A step 0), session mode (G1), sized off PR #1072’s consolidated 49-thread / DB_POOL=55 numbers (Appendix P). It also caps the deploy-overlap connection storm #1072’s 5→55 bump introduced, so it earns its keep beyond the migration. (G1/G3)
Dallas has room for the temp PG16 cluster beside the PG18 staging accessory. (resolves G-DAL)
End state → full ZFS on both boxes, Chicago primary. The Chicago ZFS reformat + single PG18 re-clone happen at leisure once everything’s on Dallas (Phase D) — not in a downtime window (supersedes the earlier “defer ZFS / stay ext4 / one-window” alternative).

Resolved

Connectivity (G6): native public IPs over the Global Gateway + BBR (not Tailscale) — see G6 above. Live since 2026-06-08; the Dallas PG16 replica is streaming with ~0 lag over it (cloned at ~122 MB/s, ~33 min). Operational detail in the postgres-replication skill.

Rehearse first

Only the --link upgrade itself is proven (dev + staging). The full ping-pong (cross-site failover, ZFS reformat, cross-major replica rebuilds, PgBouncer cutover) is not yet rehearsed. Strongly recommend a dry run on two throwaway Latitude boxes (or dal-staging + a temp box) end-to-end before touching prod.

Appendix W1 — the move-to-Dallas cutover (Phase B), detailed

Status: ✅ EXECUTED 2026-06-09. Prod is LIVE on Dallas — quiesced Chicago (lag 0) → promoted heatwave-prod-replica → handed the data dir to the heatwave-postgres accessory (-d production-dallas) → booted pgbouncer/valkey/playwright → deployed the prod image aa27376c boot-only → flipped the prod Cloudflare tunnel (63430a0c) Chicago→Dallas. All 5 prod hostnames serve 200 over HTTPS; write path (app→heatwave-pgbouncer→promoted primary) confirmed (txid_current); 0 AppSignal exceptions post-cutover; Chicago idle-but-intact (DB kept for rollback). Coexists with staging via proxy.host (prod specific hosts) vs staging catch-all.

Two gotchas hit (fix for W3 / future first-deploys-to-a-host):

-d production-dallas had no .kamal/secrets.production-dallas → first app boot failed Secret 'production' not found. Fixed by copying .kamal/secrets.production (now committed). A new destination needs its own .kamal/secrets.<dest>.

Transient ghcr.io 502 on the manifest pull (+ a pre-pulled image getting pruned between attempts) — just retried; not a config issue.

Original draft notes below (kept for the W3 move-back, which is the mirror of this).

Post-W1 done 2026-06-09 — `bin/deploy production` alias + reverse standby

bin/deploy production is now an alias to LIVE prod. During the span, config/deploy.production.yml still points at idle Chicago, so a naive bin/deploy production would footgun (ship to Chicago + migrate the stale DB). Fixed: production resolves through the committed marker .kamal/prod-active-destination (= production-dallas now) so it always ships to the real live prod. At W3, flip the marker back to production. Tested live (prod → 1a79f63, healthy).
Reverse standby: Chicago is now a hot standby of the Dallas primary, streaming over the tailnet — pg_basebackup from host=100.123.47.52 sslmode=require into /data/postgres-standby (248 G in ~205 s ≈ 1.2 GB/s), manual chicago-prod-standby container on port 5433, slot chicago_standby, WAL backstop max_slot_wal_keep_size=200GB. NO socat / public 5432 / firewall — the Tailscale-at-line-rate payoff. The frozen pre-cutover primary (heatwave-postgres :5432) is KEPT as extra insurance.
- ⚠️ W2 implication: a PG16 standby CANNOT follow Dallas’s in-place pg_upgrade --link to PG18. W2 must therefore include the pg_upgrade standby rsync step (upgrade the primary, then rsync --hard-links the upgraded cluster to /data/postgres-standby), or simply re-basebackup Chicago as PG18 afterward (~3.5 min over the tailnet).

What runs where during the Dallas-primary span

After W1, the whole prod stack runs on Dallas (dal-latitude-heatwave-01, 100.123.47.52) against the promoted PG16 cluster (the ex-replica at /data/prod-replica/data); Chicago goes idle (stale former-primary, free to wipe for Phase D). Prod traffic reaches Dallas because the prod Cloudflare tunnel (token tunnel 63430a0c-…, today on Chicago) is moved to Dallas — the prod hostnames then resolve to Dallas’s kamal-proxy.

The complication: Dallas already runs staging (service: heatwave, its own tunnel a6702687-…, PG18 accessory). Prod is also service: heatwave → a second same-named deployment on the same host/kamal-proxy collides. → Decision #1 below.

Pre-stage BEFORE the window (all non-disruptive)

P-1. A “prod-on-Dallas” deploy config. Either a temporary edit of config/deploy.production.yml (hosts → 100.123.47.52; postgres accessory volume → /data/prod-replica/data; remove the Chicago-only sftp accessory unless the PBX is re-pointed) — reverted at W3 — or a separate deploy.production-dallas.yml. Build it, don’t deploy. MUST add proxy.host (prod hostnames, e.g. crm/www/api/scan/mcp.warmlyyours.com) — and add the matching proxy.host (crm/www/api/mcp.warmlyyours.ws) to deploy.staging.yml — so the two apps host-route instead of colliding on kamal-proxy’s catch-all (W1-critical lesson #1 below). Commit it (not --skip-push on a dirty tree — W1-critical lesson #3 below).
P-2. Dallas PgBouncer host files. The promoted Dallas DB already has the pgbouncer auth role + get_auth() + the deploy/replication roles (they rode the basebackup from Chicago), so no role bootstrap — just render /data/pgbouncer/{conf.d/databases.ini→heatwave-postgres, userlist.txt} on Dallas using the same prod pgbouncer password (1Password Heatwave-PgBouncer-production).
P-3. Prod tunnel on Dallas, staged-not-started. A second cloudflared systemd unit (cloudflared-prod) with the prod token, alongside staging’s. Don’t start it yet.
P-4. Pre-pull the prod app image on Dallas (docker pull ghcr.io/warmlyyours/heatwave:<sha>) so the W1 deploy is boot-only, shortening the window.

Staging rename execution (Decision #1a) — ✅ DONE 2026-06-09

EXECUTED 2026-06-09. Staging now runs as heatwave-staging on Dallas (all accessories + web + sidekiq), live app verified through heatwave-staging-pgbouncer → heatwave-staging-postgres, replica untouched (still streaming, 0 lag). Three gotchas surfaced that W1 (prod-on-Dallas) WILL also hit — see “W1-critical lessons” right below this procedure. The steps below are the corrected, as-run procedure (the original draft had two wrong assumptions, flagged inline).

Renames staging heatwave → heatwave-staging so prod can later coexist on the box. The repo edits are already made (Decision #1); this is the coordinated host cutover that makes them live. The hazard: the new heatwave-staging-* accessories bind the SAME host ports the running heatwave-* accessories hold, so the OLD ones must stop (freeing the ports) BEFORE the new ones boot. Data survives — every datastore is a /data/* host bind-mount the new container re-mounts; only captured mailpit mail (disposable) resets. heatwave-prod-replica (the live streaming replica, port 5433) is NOT a staging accessory — never stop/remove it during this. A brief staging outage (~2-3 min) is fine; staging has no SLA.

1. (old config active) Free the ports by stopping the current staging app + accessories — run with the PRE-rename checkout so kamal still resolves the heatwave-* names:
```
kamal app stop -d staging
# one accessory per call (kamal accessory takes a single NAME or "all")
for acc in postgres pgbouncer valkey playwright mailpit; do kamal accessory stop "$acc" -d staging; done
```
(stop releases the host-port bindings; the stopped containers are harmless — they have the OLD names, so they won’t collide with the new ones. Remove them in step 6.)
2. Repoint the host pgbouncer backend on Dallas — edit /data/pgbouncer/conf.d/databases.ini, both lines host=heatwave-postgres → host=heatwave-staging-postgres (userlist.txt unchanged — same auth role + password).
3. Activate the renamed config — merge the rename to master (or check out the rename branch on the deploy host).
4. Boot the new accessories, postgres FIRST (so heatwave-staging-postgres exists for the pgbouncer databases.ini DNS to resolve):

    kamal accessory boot postgres -d staging          # → heatwave-staging-postgres, same /data/postgres/pg18
    for acc in pgbouncer valkey playwright mailpit; do kamal accessory boot "$acc" -d staging; done

5. Deploy the app (user-run — bin/deploy is hard-blocked): bin/deploy staging. Builds the renamed image (mailpit address now heatwave-staging-mailpit), boots heatwave-staging-web + -sidekiq, registers with kamal-proxy (still catch-all — staging is the only app on the box until W1; proxy.host host-routing only becomes necessary when prod lands beside it, then set it on BOTH apps to the cloudflared-forwarded hostnames).
6. Verify + clean up — public /up via the staging tunnel (302 externally / 200 on the internal kamal-proxy healthcheck — app is up either way); a read+write through heatwave-staging-pgbouncer to both DBs; sidekiq up. Then drop the old stopped containers: docker rm heatwave-postgres heatwave-pgbouncer heatwave-valkey heatwave-mailpit heatwave-playwright (and the old heatwave-web-staging-*/heatwave-sidekiq-staging-*). Re-confirm heatwave-prod-replica is still streaming (it was: in_recovery=t, wal_receiver=streaming, lag≈0).
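
The port hazard in this procedure (new accessories binding ports the old ones still hold) can be checked mechanically before booting the new accessories. The helper below is a hypothetical sketch that parses `ss -ltn` output from stdin; it assumes the standard five-column layout where the fourth column is the local address and port.

```shell
# Succeeds if something is listening on the given TCP port.
# Reads `ss -ltn` output on stdin; the header row is skipped and the match is
# anchored on ":<port>" at the end of the local-address column.
port_is_listening() {
  awk -v p=":$1" 'NR > 1 && $4 ~ (p "$") { found = 1 } END { exit !found }'
}

# Usage before booting the renamed accessories:
#   for port in 5432 6432 6379 8025; do
#     ss -ltn | port_is_listening "$port" && echo "port $port still held" >&2
#   done
```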

⚠️ W1-critical lessons (prod-on-Dallas WILL hit these)

The staging rename was the first deploy of a NEW service name onto a host already running another Kamal app + the manual replica. Three things bit, all of which recur at W1 when prod (heatwave) lands beside staging (heatwave-staging):

1. kamal-proxy catch-all conflict → Error: host settings conflict with another service. Two services with NO proxy.host both claim the catch-all * route; the second to deploy is rejected AFTER its app booted healthy (kamal rolls it back). Stopping the old app container does NOT free the route — kamal-proxy keeps the registration. Two fixes, and W1 needs the second:
   - One-off (what we did): docker exec kamal-proxy kamal-proxy remove <old-service> (here heatwave-web-staging), then redeploy. kamal-proxy list shows the table.
   - For coexistence (W1): set proxy.host on BOTH apps to their cloudflared-forwarded hostnames (prod → *.warmlyyours.com, staging → *.warmlyyours.ws) so kamal-proxy routes by Host instead of fighting over the catch-all. Stage this in the prod-on-Dallas config (P-1) AND add it to deploy.staging.yml before W1. Without it, the prod deploy throws this exact error.
2. First deploy of a service/host has no role env-file → migrate aborts --env-file … no such file or directory (docker 125). Kamal writes .kamal/apps/<svc>-<dest>/env/roles/*.env only during kamal deploy (the boot), but bin/deploy runs the pre-swap migrate BEFORE that. There is no kamal env push in Kamal 2. Fixed in bin/deploy: it now detects the missing-env error and DEFERS the migration to right after the boot (env present), via --reuse. W1’s prod deploy on Dallas is a first-deploy-to-host → relies on this.
3. --skip-push is NON-reproducible on a dirty tree. Kamal’s _uncommitted_<hash> tag comes from a fresh git stash create each invocation (new timestamp → new hash), so a later --skip-push looks for a tag that was never pushed (… not found). Commit before deploying (clean tree → version = HEAD sha, stable). For W1, land the prod-on-Dallas config on a commit first; don’t iterate via --skip-push on a dirty tree.
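
The dirty-tree lesson above is easy to enforce with a guard in front of any deploy. This is a hypothetical helper, not something bin/deploy is known to contain; it relies only on `git status --porcelain`, which prints nothing when there are no staged, unstaged, or untracked changes.

```shell
# Succeeds only when the working tree at $1 has no staged, unstaged,
# or untracked changes, i.e. the deployed version will be a stable HEAD sha.
tree_is_clean() {
  [ -z "$(git -C "$1" status --porcelain)" ]
}

# Usage:
#   tree_is_clean . || { echo "commit first: --skip-push needs a clean tree" >&2; exit 1; }
```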

(Also note: .kamal/hooks/pre-deploy hardcodes --filter label=service=heatwave, so post-rename it no longer quiets the staging sidekiq — harmless, super_fetch still recovers jobs — but it’s correct for PROD. Make it service-aware if staging graceful-drain matters.)

W1 cutover sequence (downtime = steps 2→8, target ≤ ~10 min)

1. (pre) Resolve Decision #1 (staging pause/coexist) and #2 (rehearse).
2. Quiesce writes — maintenance page (kamal-proxy stop / a maint upstream) + drain Sidekiq on Chicago. Writes stop. (downtime starts)
3. Confirm lag = 0 — on Chicago pg_current_wal_lsn() == the replica’s replay_lsn in pg_stat_replication (and the slot retained ≈ 0).
4. Promote Dallas — SELECT pg_promote() on the heatwave-prod-replica container; it exits recovery onto a new timeline and becomes a read-write primary.
5. Hand the data dir to a kamal accessory — stop the manual heatwave-prod-replica container → kamal accessory boot postgres -d production on Dallas (same /data/prod-replica/data volume, kamal-net name heatwave-postgres). (mirrors the original Chicago cutover: manual standby → promote → stop → kamal accessory, same volume.)
6. Boot the rest on Dallas — kamal accessory boot pgbouncer/valkey/playwright -d production.
7. Deploy prod app on Dallas — bin/deploy production against the prod-on-Dallas config (DATABASE_HOST=heatwave-pgbouncer → local promoted PG). Sidekiq starts on Dallas.
8. Flip traffic — systemctl start cloudflared-prod on Dallas, systemctl stop cloudflared (prod) on Chicago. Prod hostnames now hit Dallas’s kamal-proxy. Drop the maintenance page. (downtime ends)
9. Verify — public /up 200 (now served by Dallas); a write succeeds on the Dallas primary; pg_stat_replication empty (no downstream yet); Chicago idle.
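
The "Confirm lag = 0" gate compares two LSNs, which Postgres prints as two hex words (HI/LO). A minimal sketch of that comparison in plain sh follows; lsn_to_bytes, lag_bytes and lag_is_zero are hypothetical helper names, and the LSN values are assumed to be pasted in from psql rather than fetched by the script.

```shell
#!/bin/sh
# Convert a pg_lsn printed as "HI/LO" (two hex words) into a byte position.
# Assumes the high word stays below 0x80000000 so the result fits a signed 64-bit integer.
lsn_to_bytes() {
  hi=${1%/*}
  lo=${1#*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Bytes the replica still has to replay: primary LSN minus replica replay LSN.
lag_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Gate: succeed only when the replica has replayed everything the primary wrote.
lag_is_zero() {
  [ "$(lag_bytes "$1" "$2")" -eq 0 ]
}
```

The first argument would be pg_current_wal_lsn() from Chicago and the second the replica's replay_lsn from pg_stat_replication. Inside Postgres the same difference is available directly as pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn), which is likely the simpler choice when a psql session is already open.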

Rollback

Because writes are quiesced for the whole window, Chicago retains every committed row; the Dallas promotion adds ~no writes. So before step 8 (tunnel still on Chicago), rollback is trivial: un-quiesce Chicago, leave its tunnel, discard the Dallas promotion (re-clone later). After step 8, roll back by flipping the tunnel back to Chicago + un-quiescing Chicago; the only loss is whatever Dallas accepted in its brief time as primary, so keep the window tight. A pg_dump to Wasabi before W1 (G9) is the deep insurance. The fast forward-path is W2 (the PG18 upgrade), which has its own ZFS-snapshot rollback.

Decisions to confirm (before scheduling W1)

#1 — staging during the Dallas-primary span → RESOLVED: coexist via a distinct service name (not a pause). Kamal runs multiple apps on one host fine — a distinct service: per app, and the shared kamal-proxy routes by proxy.host (domain) to the right app’s web containers (Strzibny). So prod runs on Dallas beside staging, no pause. Chosen: (a) rename staging → heatwave-staging so prod keeps the name heatwave everywhere (Chicago or Dallas).
- The repo edits are DONE (uncommitted): deploy.staging.yml (service: heatwave-staging, DATABASE_HOST/_VERSIONS → heatwave-staging-pgbouncer, REDIS_HOST → heatwave-staging-valkey, PLAYWRIGHT_SERVER_URL → heatwave-staging-playwright), and config/environments/staging.rb mailpit address → heatwave-staging-mailpit. The host-rendered pgbouncer databases.ini backend (host=heatwave-staging-postgres) is an execution-time edit (below) since it lives on the Dallas box, not the repo.
- ⚠️ Committing the config alone is a landmine — a renamed bin/deploy staging boots NEW heatwave-staging-* accessories that bind the SAME host ports (127.0.0.1:5432/6432/6379, …:8025) the OLD heatwave-* accessories still hold → port conflict. The rename MUST be a coordinated stop-old → repoint → boot-new sequence (see “Staging rename execution” below).
- (Rejected) (b) service: heatwave-dallas for prod-on-Dallas — temporary + isolated, no staging change, but prod would carry a different name on Dallas vs Chicago. The prod app stack itself is VALIDATED on the real Dallas box (2026-06-08) — see #2.
#2 — rehearse: app-stack DONE, cutover-mechanics still open. The prod app stack was rehearsed on Dallas 2026-06-08 via a manual run against a ZFS clone of the promoted replica (non-destructive — real replica kept streaming): same prod image’s web + sidekiq + pgbouncer + valkey + playwright, web /up 200, read+write through pgbouncer→clone, both DBs, sidekiq + scheduler up. So the app on Dallas against a promoted PG16-on-ZFS is proven. Still to rehearse (or do live with the quiesce/tunnel-flip rollback): the promote + manual-container→kamal-accessory handoff + the prod cloudflared token flip mechanics.
#3 — sftp accessory (call-records, Chicago-only, locked to the PBX IP) → RESOLVED: leave it running on Chicago. The Chicago box stays powered through the Dallas-primary span, so the atmoz/sftp accessory + its PBX→public:2222 DOCKER-USER rule keep accepting uploads; call-record files accumulate in /data/callrecords on Chicago and the importer drains them when prod moves back to Chicago (W3). No PBX/firewall re-point, no paused ingestion. Only constraint: don’t wipe Chicago (Phase D reformat) until that backlog has been imported.
#4 — keep the Dallas-primary span SHORT. Every prereq for Phase D (Chicago ZFS reformat + the single PG18 re-clone) that can be staged before W1 shortens how long prod runs solo on the shared Dallas box and how long staging is degraded.
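
The port-conflict landmine under #1 suggests a cheap gate before booting the renamed accessories: check that the old ones have actually released their host ports. The sketch below is an assumption-laden illustration; ports_in_use is a hypothetical helper, and it expects the output of ss -ltnH (listening TCP sockets, numeric, no header) on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not in the repo).
# stdin: output of `ss -ltnH`; args: ports to check.
# Prints, in argument order, each requested port that is already bound on the host.
ports_in_use() {
  awk -v want="$*" '
    BEGIN { n = split(want, w, " "); for (i = 1; i <= n; i++) check[w[i]] = 1 }
    {
      # Column 4 is Local Address:Port; the port is the text after the last colon.
      k = split($4, a, ":")
      if (a[k] in check) hit[a[k]] = 1
    }
    END { for (i = 1; i <= n; i++) if (w[i] in hit) print w[i] }
  '
}
```

On the Dallas box this might be run as `ss -ltnH | ports_in_use 5432 6432 6379 8025`; any output means an old heatwave-* accessory still holds that port and the boot of its heatwave-staging-* replacement should wait.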

Appendix P — the PgBouncer accessory (concrete draft for Phase A step 0)

Sized off PR #1072 (fix/sidekiq-consolidated-db-pool, merged to master 2026-06-08 as 7e5b4e4c95), which is what makes the pool numbers below real rather than guessed.

P.1 — Why this is load-bearing now (the deploy-overlap cliff #1072 created)

The Kamal cutover consolidated the four Sidekiq processes into one (SIDEKIQ_CONSOLIDATED=1) reserving 49 worker threads (default 16 + invoicing/online_migrations/mailbox/storage 1 each + high 9 + low 10 + campaign 10). DB_POOL/RAILS_MAX_THREADS were set nowhere, so config/database.yml sized the pool at its fallback of 5 → 49 threads vs 5 connections → a top-of-hour job burst drained it (AppSignal #5951–#5961, all 10:02–10:05 on 2026-06-08). PR #1072 fixes the immediate bug with DB_POOL: "55" on the sidekiq role. PgBouncer is the structural follow-up to the side effect of that bump:

connections to →	`heatwave`	`heatwave_versions`
sidekiq ×1 (`DB_POOL=55`)	55	55
web ×4 Puma workers (pool 5 — no `DB_POOL`/`RAILS_MAX_THREADS` on the web role; only `PUMA_MAX_THREADS=3`)	20	20
web reading-role pool (`ApplicationViewRecord` → `primary_replica`; lazy, read-only views)	≤20	—
pghero / `kamal app exec` console / monitoring	~5	~2
steady-state peak	~100	~77

Steady state ≈ 150–177 < 197 usable (max_connections=200 − superuser_reserved_connections=3, config/postgres/production.conf) → fits, which is why #1072 didn’t need PgBouncer to ship. But a rolling deploy briefly runs old + new containers together. Sidekiq alone then needs 2 × (55+55) = 220 to heatwave + heatwave_versions — over 197 → the new container boots into FATAL: sorry, too many clients already → failed deploy. The pre-deploy Sidekiq quiet (TSTP) drains the old worker, but web pools overlap regardless, and the margin is now thin. This cliff did not exist at DB_POOL=5; it appeared the moment #1072 raised it to 55. PgBouncer converts that hard Postgres rejection into brief client-side queueing.
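
The arithmetic behind the cliff can be replayed in a few lines of sh. Every number below is copied from this section (nothing is measured), and the variable names are illustrative only.

```shell
#!/bin/sh
# Numbers copied from P.1; variable names are illustrative.
max_connections=200
superuser_reserved=3
usable=$(( max_connections - superuser_reserved ))        # 197

# Consolidated Sidekiq threads: default + four single-thread queues + high + low + campaign
sidekiq_threads=$(( 16 + 4 * 1 + 9 + 10 + 10 ))           # 49

sidekiq_pool=55   # DB_POOL from PR #1072, opened against each of the two databases
databases=2

steady_sidekiq=$(( sidekiq_pool * databases ))            # 110 in steady state
overlap_sidekiq=$(( 2 * steady_sidekiq ))                 # 220 while old + new containers coexist

echo "usable=$usable threads=$sidekiq_threads overlap=$overlap_sidekiq"
if [ "$overlap_sidekiq" -gt "$usable" ]; then
  echo "deploy overlap exceeds usable connections by $(( overlap_sidekiq - usable ))"
fi
```

Rerunning it with sidekiq_pool=5 gives an overlap of 20, which is why the cliff only appeared once the pool was raised to 55.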

P.2 — Pooling mode: `session` (non-negotiable — G1)

Transaction mode is off the table: 28 files take session-scoped advisory locks (with_advisory_lock/pg_advisory*, no *_xact variants), app/models/liquid/order_drop.rb uses LISTEN/NOTIFY, the Rails migration advisory lock is session-scoped, and database.yml sets per-connection variables (statement_timeout, min_messages) via SET on connect. All of these require the client to keep the same backend for its whole session.

Consequence for sizing: session mode binds one server connection to a client for the life of its session, so PgBouncer cannot multiplex at steady state — it is a failover-indirection + connection-ceiling layer here, not a connection multiplier. default_pool_size is therefore set to cover the real demand (transparent passthrough), and max_db_connections is the hard ceiling that only bites during the deploy-overlap storm, where the surplus queues (query_wait_timeout) instead of Postgres-rejecting.

Real multiplexing is possible later as a separate project: route only the read-only reading role (ApplicationViewRecord views — no advisory locks, no LISTEN/NOTIFY) through a second, transaction-mode PgBouncer port, while the writing role stays session mode. Natural once the Dallas PG18 replica exists (end state) and reads can target it. Not in this cut.

P.3 — `config/pgbouncer/production.ini` (committed; mounted via `files:`)

No secrets in this file — auth is via auth_query (SCRAM pass-through), so it’s safe to commit. The one mutable bit (backend host) is %included from a host-rendered file so a failover edits one small file, not this one.

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_user = pgbouncer
auth_dbname = heatwave
auth_query = SELECT username, password FROM pgbouncer.get_auth($1)
auth_file = /etc/pgbouncer/userlist.txt

pool_mode = session                 ; G1 — advisory locks + LISTEN/NOTIFY + SET-on-connect
max_client_conn = 2000              ; client sockets are cheap; absorbs old+new container overlap
default_pool_size = 80              ; ≥ heatwave steady peak (~75) → transparent passthrough
min_pool_size = 10                  ; warm servers ready for the deploy handoff
reserve_pool_size = 10
reserve_pool_timeout = 3
max_db_connections = 90             ; HARD per-DB ceiling: 2 DBs × 90 = 180 < 197 usable
server_idle_timeout = 600           ; reap idle servers so steady state tracks the active set
server_lifetime = 3600
query_wait_timeout = 30             ; deploy-overlap clients wait ≤30s for a server, not error
ignore_startup_parameters = extra_float_digits,application_name

admin_users = pgbouncer
stats_users = pgbouncer

%include /etc/pgbouncer/conf.d/databases.ini
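
The ceiling comment on max_db_connections (2 DBs × 90 = 180 < 197 usable) is easy to break in a later edit, so a small lint could recheck it from the file itself. This is a sketch under assumptions: ini_number and ceiling_fits are hypothetical helpers, and the parser only understands the simple "key = value ; comment" lines used above.

```shell
#!/bin/sh
# Hypothetical lint helpers (not in the repo).

# Read one numeric key from a PgBouncer ini on stdin; prints the first match.
# $1 = key name, anchored at the start of the line.
ini_number() {
  sed -n "s/^$1[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Succeeds when databases * max_db_connections stays under the usable Postgres slots.
# $1 = max_db_connections, $2 = number of databases, $3 = usable connections
ceiling_fits() {
  [ $(( $1 * $2 )) -lt "$3" ]
}
```

A possible invocation is `ceiling_fits "$(ini_number max_db_connections < config/pgbouncer/production.ini)" 2 197`, with 2 and 197 taken from the sizing in P.1.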

/data/pgbouncer/conf.d/databases.ini (host-rendered; the failover repoint edits this) — no user= so PgBouncer connects to the backend as the end user via SCRAM pass-through; the loopback FDW is untouched (G8):

[databases]
heatwave          = host=heatwave-postgres port=5432 dbname=heatwave
heatwave_versions = host=heatwave-postgres port=5432 dbname=heatwave_versions

P.4 — Postgres-side auth (one-time, no plaintext anywhere)

A low-priv pgbouncer login role + a SECURITY DEFINER lookup so the app’s deploy password never lands in PgBouncer’s files (pg_shadow is cluster-global, so this works for both DBs):

-- run once on the prod cluster (against heatwave); password rendered from 1Password
CREATE ROLE pgbouncer LOGIN PASSWORD :'pgbouncer_pw';
CREATE SCHEMA IF NOT EXISTS pgbouncer AUTHORIZATION pgbouncer;
CREATE OR REPLACE FUNCTION pgbouncer.get_auth(p_usename text)
  RETURNS TABLE (username text, password text)
  LANGUAGE sql SECURITY DEFINER SET search_path = pg_catalog AS $$
    SELECT usename::text, passwd::text FROM pg_shadow WHERE usename = p_usename;
  $$;
REVOKE ALL  ON FUNCTION pgbouncer.get_auth(text) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pgbouncer.get_auth(text) TO pgbouncer;

userlist.txt then holds the pgbouncer role’s own PLAINTEXT password (host-rendered to /data/pgbouncer/userlist.txt, mounted via volumes: not files: since Kamal files: won’t interpolate a secret). NOT a SCRAM verifier — PgBouncer must authenticate as the auth-user to the backend to run auth_query, and a stored SCRAM verifier (StoredKey) is one-way, so it can’t produce a client proof for that login (a SCRAM secret is only usable for a server login in the client pass-through case, which covers deploy, not the auth-user). It’s a low-priv role (only EXECUTE on get_auth); 0644 is fine since the container’s pgbouncer user must read it. Stash the password in 1Password for the record.

# on the DB host, using the SAME plaintext you set on the role:
#   PW=$(openssl rand -hex 24); CREATE ROLE pgbouncer LOGIN PASSWORD '$PW'; …
printf '"pgbouncer" "%s"\n' "$PW" > /data/pgbouncer/userlist.txt
chmod 644 /data/pgbouncer/userlist.txt

No app/deploy secret is stored — its verifier is fetched live by auth_query. (The pg_hba already admits the kamal docker subnet under scram-sha-256, so the new pgbouncer role needs no new host rule.)

Validated on DAL staging 2026-06-08 with the committed config below: the bouncer boots healthy (PgBouncer 1.25.2), deploy authenticates through it to both DBs via auth_query/pass-through, and a single session keeps its advisory lock + LISTEN/NOTIFY + temp table (the session-mode requirement). The committed config/pgbouncer/ tree + docker/pgbouncer.Dockerfile + the deploy.staging.yml accessory are the canonical reference; the snippets here are explanatory.

P.5 — the accessory (add to `config/deploy.production.yml`; mirror in staging)

  pgbouncer:
    image: ghcr.io/warmlyyours/heatwave-pgbouncer:1.25.2   # our own build (docker/pgbouncer.Dockerfile)
    host: 100.112.243.87
    port: "127.0.0.1:6432:6432"             # host-local only; app reaches it by kamal-net DNS (heatwave-pgbouncer:6432)
    cmd: /etc/pgbouncer/pgbouncer.ini       # entrypoint is `pgbouncer`; pass the mounted ini
    files:
      - config/pgbouncer/production.ini:/etc/pgbouncer/pgbouncer.ini
    volumes:
      - /data/pgbouncer/conf.d/databases.ini:/etc/pgbouncer/conf.d/databases.ini   # host-rendered backend (failover edits this)
      - /data/pgbouncer/userlist.txt:/etc/pgbouncer/userlist.txt                   # host-rendered plaintext auth-role password

The staging equivalent (config/deploy.staging.yml) is already wired + validated; prod mirrors it with production.ini once Phase A reaches Chicago. PgBouncer 1.25.2 (built from the upstream release tarball) carries the SCRAM (CVE-2026-6665) + auth_query search_path (CVE-2025-12819) fixes that land in our auth path — edoburu still lags at 1.25.1.

App repoint (config/deploy.production.yml env.clear) — the only app-side change, done once in Phase A while Chicago is still the sole primary:

    DATABASE_HOST: heatwave-pgbouncer
    DATABASE_HOST_VERSIONS: heatwave-pgbouncer
    DATABASE_PORT: "6432"
    DATABASE_PORT_VERSIONS: "6432"

kamal accessory boot pgbouncer -d production, repoint, bin/deploy production, verify (SHOW POOLS; on the admin DB; app /up; a write + a PaperTrail version). Staging mirrors this verbatim (same 192 GB hardware, same consolidated 49-thread process, max_connections already 200 per #1072) — pools can stay identical.

P.6 — failover repoint (W1 / W3): one admin sequence, app config untouched

psql "host=127.0.0.1 port=6432 user=pgbouncer dbname=pgbouncer" <<'SQL'
PAUSE;          -- let in-flight txns finish, hold new clients
SQL
#  → promote the destination Postgres (pg_promote()); if the bouncer moved hosts with the
#    stack (G7), its local databases.ini already points at the local heatwave-postgres —
#    otherwise edit /data/pgbouncer/conf.d/databases.ini → host=<new primary>
psql "host=127.0.0.1 port=6432 user=pgbouncer dbname=pgbouncer" <<'SQL'
RELOAD;         -- re-read databases.ini
RECONNECT;      -- drop stale server conns, reconnect to the now-promoted backend
RESUME;         -- release held clients
SQL

Because PgBouncer rides along with the stack to Dallas at W1 (G7 decision), the common case is: the destination’s co-located bouncer points at its local promoted Postgres, so databases.ini doesn’t even change — PAUSE → promote → RECONNECT → RESUME is enough to flush the read-only-recovery server conns and pick up the read-write primary.
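
For rehearsals it may help to wrap the admin commands so the sequence can be printed before it is run for real. The wrapper below is a hypothetical sketch (bouncer_admin and DRY_RUN are not in the repo); it assumes psql is on the path and that the pgbouncer role's password reaches psql through PGPASSWORD or ~/.pgpass.

```shell
#!/bin/sh
# Hypothetical wrapper (not in the repo): send PgBouncer admin commands one at a time,
# stopping at the first failure. With DRY_RUN=1 it prints the psql calls instead.
BOUNCER_DSN="host=127.0.0.1 port=6432 user=pgbouncer dbname=pgbouncer"

bouncer_admin() {
  for cmd in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'psql "%s" -c "%s;"\n' "$BOUNCER_DSN" "$cmd"
    else
      psql "$BOUNCER_DSN" -v ON_ERROR_STOP=1 -c "$cmd;" || return 1
    fi
  done
}
```

The common case described above would then read `bouncer_admin PAUSE`, promote the destination Postgres, then `bouncer_admin RECONNECT RESUME`; when databases.ini has been edited, RELOAD goes first in the second call. Running each command separately means a failed PAUSE stops the sequence before anything is promoted.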