Skip to content

Postgres restore & recovery runbook

Three distinct restore paths. Pick by scenario — do not conflate them. (This separation is the whole point: the cluster PITR is for DR, bin/restore is for dev, bin/recovery is for node loss.)

ScenarioToolWhat it does
One DB node lost (host/disk dies), the other is healthybin/recoveryRebuilds the dead node from the live peer (pg_basebackup, cross-DC ~620 MB/s) and/or flips the write VIP. HA path — no data loss, fastest.
Whole cluster lost (both Dallas and Chicago) or a point-in-time rewind is needed (bad migration, DELETE without WHERE)Databasus PITR restore (§2)Restores the physical base + replays WAL from Cloudflare R2 to any chosen second. The DR backup-of-record.
A dev needs a recent prod DB locallybin/restoreDownloads a recent logical pg_dump (heatwave / versions-recent) from Databasus→R2 into local Docker. Not a prod DR tool.

1. bin/recovery — node rebuild / VIP flip (HA, not DR)

Section titled “1. bin/recovery — node rebuild / VIP flip (HA, not DR)”

One node is gone, the other is serving. See bin/recovery and memory project_haproxy_routing_layer / project_db_tier_merge_and_pitr.

  • bin/recovery <env> topology — show primary / standby + which node owns the write VIP.
  • bin/recovery <env> flip-db — promote the standby and reroute the write VIP (per-node pg-health drives the VIP; no app redeploy).
  • bin/recovery <env> rebuild-standby — rebuild a node from the live peer; it zfs snapshots the demoted dataset before wiping it (rollback net), and restarts the rebuilt node’s pgbouncer.

This path has no data loss (the peer is current) and is always preferred when a peer survives.


Lost the Databasus CONTROLLER / its host (not the backup data)? The backup config (admin / workspace / R2 storage / sources / schedules / restore user) lives only in the controller’s embedded metadata DB, which a host wipe destroys — only secret.key is in 1Password. Rebuild it with script/setup_databasus.sh (idempotent, one command) or restore the nightly encrypted metadata snapshot — see config/databasus/README.md. The R2 backup data + secret.key are what the data restore below needs.

When: both nodes are gone, or you must rewind the whole cluster to a point in time. Whole-cluster, PG18-locked, all-or-nothing. For recovering a few rows from recent history, restore to a throwaway at the target time and extract — do not rewind prod wholesale.

Where: the controller is a docker container (databasus) on the Chicago box (100.68.157.49:4005, tailnet-only). v3.44 is agentless — the controller itself drives pg_basebackup + pg_receivewal against the Dallas primary over the replication protocol (there is no databasus-agent anymore). Backups live in R2 heatwave-postgres-backups-production (ENAM), encrypted at rest with the AES-256-GCM key at /data/databasus-data/secret.key.

🔑 On a fresh Databasus box, restore the key FIRST. base64 -d the key from op://IT/Databasus-encryption-key back to /databasus-data/secret.key and re-add the R2 storage, then restore — otherwise the encrypted R2 backups are unrecoverable.

Procedure — restore to a target time T:

The agentless restore is wrapped by bin/restore → ”🧱 Physical full-cluster / point-in-time restore (PITR)”. It asks the controller for a one-time restore token, then runs config/databasus/databasus-recovery.sh to download the base + incrementals, pg_combinebackup them, replay WAL to the target, and boot the result as a throwaway PG18 container on localhost:5544. That throwaway IS the verification sandbox — never point recovery at prod.

  1. Pick the target second T as RFC3339 (UTC), e.g. 2026-06-13T09:53:26Z, or take the latest.
  2. Run bin/restore, choose the physical/PITR option, and answer the point-in-time prompt with T (or decline for the latest). It downloads → pg_combinebackup → WAL-replays to T → boots heatwave-restore-physical on localhost:5544 (~30s to replay and promote). Under the hood: POST /api/v1/backups/physical/database/<phys-dbid>/restore-tokendatabasus-recovery.sh --combine-image ghcr.io/warmlyyours/heatwave-postgres:18-noble [--target-time <T>] <token-url> <outdir>. databasus-recovery.sh can also be run directly on the Chicago box for an on-host restore (it reconstructs PGDATA under <out>/<major>/docker for PG≥18, <out>/data for PG≤17).
  3. Confirm recovery reached T (the acceptance gate that passed 2026-06-12): connect with psql -h localhost -p 5544 -U postgres, check a sentinel written before T is present and one after T is absent, and that expected row counts / checksums match.
  4. Promote into prod (the high-stakes step — only after §3 verifies): bring the kamal postgres accessory down, swap the verified restored cluster into place, bring it up as the new primary, then bin/recovery <env> rebuild-standby to re-seed the peer. Let the Databasus controller resume its WAL stream and take a fresh base. Treat this like the prod cutover it is.

Gotchas (don’t re-discover):

  • Admin login is admin@warmlyyours.com — v3.44 enforces a real email format, so the old literal admin now 500s on signin (it locks you out of the UI/API if you use the old name).
  • Controller API signin: POST /api/v1/users/signin {"email":"admin@warmlyyours.com","password":…}, creds op://IT/Databasus (Postgres Backup) (use item id nibgfgtpmkgb7sjeafup5jlqem — the op:// path with the parenthesized title fails). GET /api/v1/databases?workspace_id=… lists sources; physical restore tokens come from POST /api/v1/backups/physical/database/<dbid>/restore-token.
  • The controller logs failed to check database health: unsupported database type: POSTGRES_PHYSICAL every minute — harmless (the healthcheck feature doesn’t understand physical sources; backups + WAL streaming work regardless).
  • The prod maintenance window (bin/maintenance) stops/starts the databasus container now, not a systemd agent.

Physical source: “Cluster PITR (physical)” → Dallas primary, FULL_INCREMENTAL_WAL_STREAM, DBID df66dc63-f346-4f0b-80d1-fb87ef69ee2e (also hardcoded in bin/restore as DATABASUS_PHYS_DBID).

Full proof + the tablespace-consolidation prelude: doc/tasks/202606120650_CHICAGO_PITR_HANDOFF.md; pilot + acceptance gate: doc/tasks/202606101600_DATABASUS_PITR_PILOT.md; the agent→agentless migration (v3.42→v3.44 fork, this model): doc/tasks/202606180300_DATABASUS_AGENTLESS_MIGRATION.md (PR #1200).


3. bin/restore — dev logical restore (not DR)

Section titled “3. bin/restore — dev logical restore (not DR)”

bin/restore (or bin/setup --restore-db) downloads a recent logical dump from Databasus→R2 (default BACKUP_SOURCE=databasus; BACKUP_SOURCE=wasabi is the legacy fallback) and restores into the local Docker PG with the fast / deferred-table / matview path (skips ~18 GB of visits/communications/audits on the fast path).

Logical sources (re-established 2026-06-13 — the originals were deleted when Cluster-PITR became backup-of-record, which silently broke bin/restore; PITR covers DR, these cover dev):

  • heatwave — daily 03:00 UTC logical pg_dump, GFS D7/W5/M12, ENCRYPTED, off the Chicago standby (heatwave-postgres-replica). DBID 4aa19809-a3eb-4bdc-b3c7-7f3902a6e53f.
  • heatwave_versions (recent partition) — daily 04:00 UTC, current partition only (versions_2026; older partitions excluded server-side → ~16 GB not ~105 GB), GFS D3, ENCRYPTED. DBID 006ecd23-62a5-4f75-b3b4-addf54324577. Restoring it gives current-year audit history with the partition structure — which also dodges the empty-partition write-500 gotcha of db/versions_structure.sql (memory project_versions_db_partition_restore_gotcha).

Annual upkeep: when versions_2027 is created, add versions_2026 to that source’s excludeTables (tracks the bin/versions-partitions archive cadence) so the dev dump stays ~one partition. See doc/tasks/202606130916_VERSIONS_PARTITION_ARCHIVE_AND_RESTORE_STRATEGY.md.

Creds: read-only restore user op://IT/Databasus-restore-user (list + download only). Encryption is at-rest only — Databasus serves a DECRYPTED pg_dump custom-format file on download, so the dev never handles secret.key and it feeds straight into pg_restore.