Postgres restore & recovery runbook
Three distinct restore paths. Pick by scenario — do not conflate them. (This separation is
the whole point: the cluster PITR is for DR, bin/restore is for dev, bin/recovery is for node loss.)
| Scenario | Tool | What it does |
|---|---|---|
| One DB node lost (host/disk dies), the other is healthy | bin/recovery | Rebuilds the dead node from the live peer (pg_basebackup, cross-DC ~620 MB/s) and/or flips the write VIP. HA path — no data loss, fastest. |
Whole cluster lost (both Dallas and Chicago) or a point-in-time rewind is needed (bad migration, DELETE without WHERE) | Databasus PITR restore (§2) | Restores the physical base + replays WAL from Cloudflare R2 to any chosen second. The DR backup-of-record. |
| A dev needs a recent prod DB locally | bin/restore | Downloads a recent logical pg_dump (heatwave / versions-recent) from Databasus→R2 into local Docker. Not a prod DR tool. |
1. bin/recovery — node rebuild / VIP flip (HA, not DR)
Section titled “1. bin/recovery — node rebuild / VIP flip (HA, not DR)”One node is gone, the other is serving. See bin/recovery and memory
project_haproxy_routing_layer / project_db_tier_merge_and_pitr.
bin/recovery <env> topology— show primary / standby + which node owns the write VIP.bin/recovery <env> flip-db— promote the standby and reroute the write VIP (per-node pg-health drives the VIP; no app redeploy).bin/recovery <env> rebuild-standby— rebuild a node from the live peer; itzfs snapshots the demoted dataset before wiping it (rollback net), and restarts the rebuilt node’s pgbouncer.
This path has no data loss (the peer is current) and is always preferred when a peer survives.
2. Databasus PITR restore — the DR path
Section titled “2. Databasus PITR restore — the DR path”Lost the Databasus CONTROLLER / its host (not the backup data)? The backup config (admin / workspace / R2 storage / sources / schedules / restore user) lives only in the controller’s embedded metadata DB, which a host wipe destroys — only
secret.keyis in 1Password. Rebuild it withscript/setup_databasus.sh(idempotent, one command) or restore the nightly encrypted metadata snapshot — seeconfig/databasus/README.md. The R2 backup data +secret.keyare what the data restore below needs.
When: both nodes are gone, or you must rewind the whole cluster to a point in time. Whole-cluster, PG18-locked, all-or-nothing. For recovering a few rows from recent history, restore to a throwaway at the target time and extract — do not rewind prod wholesale.
Where: the controller is a docker container (databasus) on the Chicago box
(100.68.157.49:4005, tailnet-only). v3.44 is agentless — the controller itself drives
pg_basebackup + pg_receivewal against the Dallas primary over the replication protocol (there is
no databasus-agent anymore). Backups live in R2 heatwave-postgres-backups-production (ENAM),
encrypted at rest with the AES-256-GCM key at /data/databasus-data/secret.key.
🔑 On a fresh Databasus box, restore the key FIRST.
base64 -dthe key fromop://IT/Databasus-encryption-keyback to/databasus-data/secret.keyand re-add the R2 storage, then restore — otherwise the encrypted R2 backups are unrecoverable.
Procedure — restore to a target time T:
The agentless restore is wrapped by bin/restore → ”🧱 Physical full-cluster / point-in-time
restore (PITR)”. It asks the controller for a one-time restore token, then runs
config/databasus/databasus-recovery.sh to
download the base + incrementals, pg_combinebackup them, replay WAL to the target, and boot the
result as a throwaway PG18 container on localhost:5544. That throwaway IS the verification
sandbox — never point recovery at prod.
- Pick the target second
Tas RFC3339 (UTC), e.g.2026-06-13T09:53:26Z, or take the latest. - Run
bin/restore, choose the physical/PITR option, and answer the point-in-time prompt withT(or decline for the latest). It downloads →pg_combinebackup→ WAL-replays toT→ bootsheatwave-restore-physicalonlocalhost:5544(~30s to replay and promote). Under the hood:POST /api/v1/backups/physical/database/<phys-dbid>/restore-token→databasus-recovery.sh --combine-image ghcr.io/warmlyyours/heatwave-postgres:18-noble [--target-time <T>] <token-url> <outdir>.databasus-recovery.shcan also be run directly on the Chicago box for an on-host restore (it reconstructs PGDATA under<out>/<major>/dockerfor PG≥18,<out>/datafor PG≤17). - Confirm recovery reached
T(the acceptance gate that passed 2026-06-12): connect withpsql -h localhost -p 5544 -U postgres, check a sentinel written beforeTis present and one afterTis absent, and that expected row counts / checksums match. - Promote into prod (the high-stakes step — only after §3 verifies): bring the kamal
postgresaccessory down, swap the verified restored cluster into place, bring it up as the new primary, thenbin/recovery <env> rebuild-standbyto re-seed the peer. Let the Databasus controller resume its WAL stream and take a fresh base. Treat this like the prod cutover it is.
Gotchas (don’t re-discover):
- Admin login is
admin@warmlyyours.com— v3.44 enforces a real email format, so the old literaladminnow 500s on signin (it locks you out of the UI/API if you use the old name). - Controller API signin:
POST /api/v1/users/signin {"email":"admin@warmlyyours.com","password":…}, credsop://IT/Databasus (Postgres Backup)(use item idnibgfgtpmkgb7sjeafup5jlqem— theop://path with the parenthesized title fails).GET /api/v1/databases?workspace_id=…lists sources; physical restore tokens come fromPOST /api/v1/backups/physical/database/<dbid>/restore-token. - The controller logs
failed to check database health: unsupported database type: POSTGRES_PHYSICALevery minute — harmless (the healthcheck feature doesn’t understand physical sources; backups + WAL streaming work regardless). - The prod maintenance window (
bin/maintenance) stops/starts thedatabasuscontainer now, not a systemd agent.
Physical source: “Cluster PITR (physical)” → Dallas primary, FULL_INCREMENTAL_WAL_STREAM,
DBID df66dc63-f346-4f0b-80d1-fb87ef69ee2e (also hardcoded in bin/restore as DATABASUS_PHYS_DBID).
Full proof + the tablespace-consolidation prelude: doc/tasks/202606120650_CHICAGO_PITR_HANDOFF.md;
pilot + acceptance gate: doc/tasks/202606101600_DATABASUS_PITR_PILOT.md; the agent→agentless
migration (v3.42→v3.44 fork, this model): doc/tasks/202606180300_DATABASUS_AGENTLESS_MIGRATION.md
(PR #1200).
3. bin/restore — dev logical restore (not DR)
Section titled “3. bin/restore — dev logical restore (not DR)”bin/restore (or bin/setup --restore-db) downloads a recent logical dump from Databasus→R2
(default BACKUP_SOURCE=databasus; BACKUP_SOURCE=wasabi is the legacy fallback) and restores into
the local Docker PG with the fast / deferred-table / matview path (skips ~18 GB of
visits/communications/audits on the fast path).
Logical sources (re-established 2026-06-13 — the originals were deleted when Cluster-PITR became
backup-of-record, which silently broke bin/restore; PITR covers DR, these cover dev):
- heatwave — daily 03:00 UTC logical
pg_dump, GFS D7/W5/M12, ENCRYPTED, off the Chicago standby (heatwave-postgres-replica). DBID4aa19809-a3eb-4bdc-b3c7-7f3902a6e53f. - heatwave_versions (recent partition) — daily 04:00 UTC, current partition only
(
versions_2026; older partitions excluded server-side → ~16 GB not ~105 GB), GFS D3, ENCRYPTED. DBID006ecd23-62a5-4f75-b3b4-addf54324577. Restoring it gives current-year audit history with the partition structure — which also dodges the empty-partition write-500 gotcha ofdb/versions_structure.sql(memoryproject_versions_db_partition_restore_gotcha).
Annual upkeep: when
versions_2027is created, addversions_2026to that source’sexcludeTables(tracks thebin/versions-partitionsarchive cadence) so the dev dump stays ~one partition. Seedoc/tasks/202606130916_VERSIONS_PARTITION_ARCHIVE_AND_RESTORE_STRATEGY.md.
Creds: read-only restore user op://IT/Databasus-restore-user (list + download only). Encryption is
at-rest only — Databasus serves a DECRYPTED pg_dump custom-format file on download, so the dev
never handles secret.key and it feeds straight into pg_restore.