Databasus PITR pilot — continuous backups for the bare-metal PG18 prod

Status: PLAN (2026-06-10). Closes the backup/PITR gap flagged in 202606101230_PG18_STANDBY_SESSION_HANDOFF.md and either supersedes or validates the pgBackRest approach in 202606072000_PGBACKREST_PITR_PLAN.md (decision below).

Why

Live prod is a Dallas PG18.4 primary + a Chicago PG18 standby (Kamal accessories, ZFS). Today there is no point-in-time recovery: the ZFS snapshots + the streaming standby are HA (protect against host loss), and SimpleBackups gives nightly logical dumps to Wasabi, but none of these lets us rewind to an arbitrary second after a bad migration / app bug / DELETE without a WHERE. The standby replays the bad statement within moments, and a nightly dump can only restore to the moment it was taken. We want continuous WAL archiving + PITR.

Tool decision — Databasus (pilot), gated on a real restore

databasus/databasus — Apache-2.0, Go, self-hosted, Docker-native; PG 12–18; agent mode = physical base backup + continuous WAL archiving → PITR; built-in restore verification (spins a container, does a real restore). 7.3k★, ~55 commits/30d, last release v3.42.0 (2026-05-22), web dashboard.

vs pgBackRest (the prior plan): pgBackRest is the decade-proven standard but CLI/config-only and a bolt-on to our Docker stack. Databasus is younger (~1yr) but Docker-native (matches the Kamal accessory model) and has a dashboard, notifications, and automated restore verification, which is what actually keeps a small team’s backups honest. We pilot Databasus and only adopt it as backup-of-record if a real PITR restore passes. If it fails the gate, fall back to the pgBackRest plan.

Storage decision — Cloudflare R2 (not Latitude object storage)

Backups must not live on the same provider as the infra they protect. Latitude now offers S3-compatible object storage (with Object Lock/WORM) — but a Latitude account/region/billing incident could take out the DB and its backups at once. R2 is an independent provider (and the team’s go-to). Latitude OS’s Object Lock is noted as a possible future immutable second copy, not the primary.

New R2 bucket heatwave-postgres-backups-production, location ENAM (US-East, near Dallas/Chicago). ⚠️ R2 location is pinned on first create — fresh name + --location enam (see [[reference_r2_location_hint_first_create]]).
R2 S3-API token (access key id + secret) scoped to the bucket → stored in 1Password IT (op item create needs env -u OP_SERVICE_ACCOUNT_TOKEN, see [[reference_op_cli_it_vault_writes]]). Databasus S3 storage: custom endpoint https://<account>.r2.cloudflarestorage.com, region auto, the R2 key/secret.

Topology (pilot)

Chicago host 100.68.157.49 (the PG18 STANDBY box)
  ├─ heatwave-postgres  (Kamal accessory, standby, /data/postgres/data)   ← backup SOURCE
  ├─ databasus controller (Docker container, web UI :<port>, host-local/tailnet only)
  └─ databasus agent → reads the standby's PGDATA + streams WAL
        │  base backup taken from the STANDBY (offloads the Dallas primary)
        ▼
   Cloudflare R2  heatwave-postgres-backups-production (ENAM)   ← OFF Latitude

Backing up the standby keeps pg_basebackup load off the live primary; the standby already has every WAL segment via streaming (~0 lag).

Open technical decisions (resolve at setup, don’t pre-bake)

WAL archiving mechanism. Confirm whether Databasus archives via pg_receivewal (streaming, slot-backed — zero-gap, preferred) or archive_command. If pg_receivewal: give it a dedicated replication slot; decide source = standby (full offload, relies on cascading) vs primary (most authoritative, ~1 extra stream, negligible cost). If archive_command on the standby: needs archive_mode=always + a standby restart. Acceptance requires proving no WAL gaps.
Backup DB role. Create a dedicated least-privilege databasus role with REPLICATION (don’t reuse replication/deploy); agent reads the standby container’s PGDATA (Databasus “supports Docker containers”).
Schedule + retention (GFS). Align with SimpleBackups’ GFS (7 daily / 5 weekly / 12 monthly): daily physical base + continuous WAL; retention enforced by Databasus.
Controller placement. Pilot runs the controller on the Chicago box (simple). It’s DR-safe as long as backups + restore are reproducible from R2 plus an off-box copy of the Databasus encryption key (redeploy the controller anywhere), but for prod, consider relocating the controller off the DB host (e.g. a small separate box).
Bucket immutability. Pilot bucket = standard (Databasus retention needs delete). Object Lock (create-time only) is a later hardening decision — would need a fresh bucket; weigh against GFS rotation.
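To make the schedule + retention item concrete, here is a minimal sketch of how a 7 daily / 5 weekly / 12 monthly GFS policy would pick which daily base backups survive. It illustrates the policy only and is not Databasus's actual pruning logic. The function name `gfs_keep` and the bucketing rule (newest backup per day, per ISO week, per calendar month) are assumptions made for this example.

```python
from datetime import date, timedelta


def gfs_keep(backup_dates, daily=7, weekly=5, monthly=12):
    """Return the set of backup dates a simple GFS policy would retain.

    Assumed rule: keep the newest backup in each of the most recent
    `daily` days, `weekly` ISO weeks, and `monthly` calendar months.
    """
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set()
    tiers = [
        (daily, lambda d: d),                           # one bucket per day
        (weekly, lambda d: tuple(d.isocalendar())[:2]),  # (ISO year, ISO week)
        (monthly, lambda d: (d.year, d.month)),          # calendar month
    ]
    for limit, bucket_of in tiers:
        seen = []
        for d in newest_first:
            bucket = bucket_of(d)
            if bucket in seen:
                continue  # a newer backup already represents this bucket
            if len(seen) == limit:
                break     # tier is full; older buckets are not retained
            seen.append(bucket)
            keep.add(d)
    return keep


# 400 consecutive daily backups ending on 2026-06-10.
end = date(2026, 6, 10)
history = [end - timedelta(days=i) for i in range(400)]
kept = gfs_keep(history)
print(len(kept), "of", len(history), "backups retained")
```

A backup that qualifies for several tiers is counted once, so the retained set is smaller than 7 + 5 + 12. With continuous WAL on top, what matters is that the oldest retained base backup still has an unbroken WAL chain after it.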

Acceptance gate — a REAL PITR restore (the whole point)

Run base + WAL for ~2 days; generate known write activity with timestamps.
Pick a target second T; restore the base + replay WAL to T into a throwaway container (not prod).
Verify: cluster starts, recovery reaches T, and a sentinel row written just before T is present while one written just after T is absent. Row counts/checksums match expectation for T.
Run Databasus’s built-in restore verification on a schedule too.
Only after a clean pass → Databasus becomes backup-of-record; keep SimpleBackups (Wasabi, logical) running in parallel as belt-and-suspenders for ≥1 cycle, then decide.
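The sentinel-row check in the gate can be written down as a small comparison, sketched below. It assumes the sentinel ids and their commit times were recorded outside the database while seeding writes, and that the ids present in the restored throwaway cluster have already been queried. `check_pitr_sentinels` is a hypothetical helper, and a row committed exactly at T is treated as expected, matching PostgreSQL's default inclusive recovery target.

```python
from datetime import datetime, timezone


def check_pitr_sentinels(sentinels, restored_ids, target):
    """Compare sentinel rows found after a restore with what target T implies.

    sentinels:    {row_id: commit time (timezone-aware datetime)}
    restored_ids: ids actually present in the restored cluster
    target:       the recovery target time T
    Returns (missing, unexpected); both empty means the check passed.
    """
    expected = {rid for rid, ts in sentinels.items() if ts <= target}
    restored = set(restored_ids)
    missing = sorted(expected - restored)      # written before T, but lost
    unexpected = sorted(restored - expected)   # written after T, yet present
    return missing, unexpected


T = datetime(2026, 6, 12, 15, 0, 0, tzinfo=timezone.utc)
sentinels = {
    "before": datetime(2026, 6, 12, 14, 59, 58, tzinfo=timezone.utc),
    "after": datetime(2026, 6, 12, 15, 0, 2, tzinfo=timezone.utc),
}
print(check_pitr_sentinels(sentinels, ["before"], T))
```

A non-empty `missing` list points at lost WAL or an early stop; a non-empty `unexpected` list means recovery ran past T.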

Deployment progress — 2026-06-10

Infra stood up (headless):

✅ R2 bucket heatwave-postgres-backups-production: location ENAM, Standard. Created via the CF REST API; the R2-scoped token failed with error 9106 on /memberships, so the account_api_token from op://IT/Cloudflare API Token - Heatwave - All Zones was used instead.
✅ Databasus controller: docker run databasus/databasus:latest on Chicago 100.68.157.49, bound to 100.68.157.49:4005 (tailnet-only, not public), data in /data/databasus-data, attached to the kamal network (172.18.0.4). Returns HTTP 200. Bundles PG client tools for 12–18 (PG18 ✓). Config is a dashboard SPA. Initial probing found the /api/* paths all serving the SPA shell, so backup config was expected to be UI-only; Phase 1 later found that the undocumented /api/v1/* API is usable after admin registration.
✅ Backup role databasus (LOGIN REPLICATION + pg_read_all_data + pg_monitor) created on the Dallas primary → replicated to the standby. Creds in op://IT/Databasus backup role (postgres).
✅ Connectivity verified: databasus → heatwave-postgres:5432 accepting; the standby’s pg_hba.conf has host all all 0.0.0.0/0 scram-sha-256 ⇒ logical backup as databasus works now. Physical/WAL needs one added rule, host replication databasus 172.18.0.0/16 scram-sha-256, in the standby’s /data/postgres/data/pg_hba.conf + reload (phase 2), because the all database keyword does not match replication connections.
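As a quick sanity check on the phase-2 pg_hba rule, the sketch below formats the line and confirms that the 172.18.0.0/16 range covers the controller's kamal-network address (172.18.0.4). The helper names are hypothetical, and the auth method is spelled out as scram-sha-256, the method name pg_hba.conf accepts.

```python
import ipaddress


def hba_rule(db, user, cidr, method="scram-sha-256"):
    """Format one host line for pg_hba.conf (TYPE DATABASE USER ADDRESS METHOD)."""
    ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
    return f"host {db} {user} {cidr} {method}"


def rule_covers(cidr, client_ip):
    """True if a client address falls inside the rule's address range."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(cidr)


print(hba_rule("replication", "databasus", "172.18.0.0/16"))
print(rule_covers("172.18.0.0/16", "172.18.0.4"))  # controller's kamal-network IP
```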

Phase 1 PROVEN 2026-06-10 — full pipeline configured + a logical backup running to R2. Done end-to-end via Databasus’s (undocumented, SPA-backed) /api/v1/* API after admin registration:

R2 token: bucket-scoped object-RW token minted by script/setup_r2_postgres_backups_token.sh (the frontend-assets recipe: CLOUDFLARE_ACCOUNT_API_TOKEN from .env → POST /accounts/{acct}/tokens w/ R2 read+write perm groups scoped to the bucket → AccessKeyID = token id, Secret = sha256(token value)). Stored op://IT/R2-postgres-backups. (Newly minted R2 tokens take ~30s to propagate — first direct-test 401’d, then 200.)
Workspace Heatwave (c5ab3ebb…). Storage R2 ENAM (24730e5a…): type:"S3", s3Storage:{s3Endpoint,s3Bucket,s3Region:"auto",s3AccessKey,s3SecretKey,s3UseVirtualHostedStyle:false} → /storages/direct-test 200. Source (4c468fe8…): type:"POSTGRES", postgresql:{host:"heatwave-postgres",port:5432,username:"databasus",password,database:"heatwave",sslMode:"disable",cpuCount:1} → connection successful, PG18 auto-detected.
Backup config: format:"PG_DUMP" (logical; default), storage:{id:…} (⚠️ NOT storageId/storageIds — those silently store null → “Backup config storage ID is nil”), GFS retention 7d/4w/12m/1y. Backup COMPLETED 2026-06-11 08:46 UTC (~9 min): the 73 GB heatwave DB → a ~7.5 GB compressed logical dump + a .metadata object now in r2://heatwave-postgres-backups-production, taken off the standby (zero primary load). ✅ Full Databasus → R2 pipeline proven end-to-end.
~~Schedule cron-every-minute concern~~ was a MISREAD — the every-minute log lines were the failed-retry of the nil-storage backup, not the cron. The UI shows 0 3 * * * → a correct daily “Next run”. Encryption is a simple flag (ENCRYPTED/NONE) — Databasus manages the AES-256-GCM key internally, no passphrase, so it’s settable via API.
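The R2 credential recipe noted above (AccessKeyID = token id, Secret = sha256 of the token value) can be sketched as follows. The function names and placeholder values are illustrative only, and the hex encoding of the digest is an assumption; real values come from the token-creation response handled by script/setup_r2_postgres_backups_token.sh.

```python
import hashlib


def r2_s3_credentials(token_id: str, token_value: str) -> dict:
    """Derive S3-style credentials from a Cloudflare API token.

    Access key id = token id; secret access key = SHA-256 of the token
    value (hex digest assumed).
    """
    secret = hashlib.sha256(token_value.encode("utf-8")).hexdigest()
    return {"access_key_id": token_id, "secret_access_key": secret}


def r2_endpoint(account_id: str) -> str:
    """S3 endpoint for an R2 account, as used in the storage config."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


# Placeholder values only; never hard-code real tokens.
creds = r2_s3_credentials("example-token-id", "example-token-value")
print(creds["access_key_id"], len(creds["secret_access_key"]))
print(r2_endpoint("example-account"))
```

Given the ~30s propagation delay seen on newly minted tokens, a first 401 from a storage test is worth one retry before suspecting the derivation.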

Phase 1 FINAL state — two backup configs live + enabled + encrypted (2026-06-11):

heatwave (4c468fe8…): daily 0 3 * * * UTC, GFS D7/W5/M12, ENCRYPTED → R2. First scheduled run Fri 06-12 03:00 UTC. (One manual test dump from 06-11 is in R2 UNencrypted — ages out via the D7 tier.)
heatwave_versions (d98fb103…, 175 GB): weekly 0 4 * * 0 UTC (Sun), GFS W5/M12, ENCRYPTED → R2. First run Sun 06-14 04:00 UTC.
Both sources = the Chicago standby (databasus role), so zero load on the Dallas primary.

🔑 Encryption key custody (CRITICAL for DR). Databasus encrypts with an auto-managed AES-256-GCM key at /data/databasus-data/secret.key (no passphrase; the encryption field is just ENCRYPTED/NONE). Lose that key → the encrypted R2 backups are unrecoverable, which would defeat off-site backup exactly when it is needed (Chicago box loss). The vendor explicitly says to copy the secret key off-box. DONE 2026-06-11: secret.key backed up (base64) to op://IT/Databasus-encryption-key (+ instance_id, restore notes). Restore: on a fresh Databasus, base64 -d it back to /databasus-data/secret.key, re-add the R2 storage, restore. Fuller DR (TODO): also back up all of /data/databasus-data (its embedded pgdata metadata = storage configs + backup records + key), which makes cross-instance restore turnkey; this could be a small recurring job, possibly one that writes to R2.
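A minimal sketch of the base64 round trip for the key, with a fingerprint check added so a truncated or mangled copy is caught at restore time. Storing a SHA-256 fingerprint next to the base64 text is a suggestion of this example, not something the plan records, and the 32-byte throwaway file only stands in for secret.key (its real format is not assumed).

```python
import base64
import hashlib
import os
import tempfile


def encode_key(path):
    """Read a key file and return (base64 text, SHA-256 fingerprint)."""
    with open(path, "rb") as f:
        raw = f.read()
    return base64.b64encode(raw).decode("ascii"), hashlib.sha256(raw).hexdigest()


def restore_key(b64_text, path, expected_sha256):
    """Decode the stored base64 text to `path`; fail if the fingerprint differs."""
    raw = base64.b64decode(b64_text, validate=True)
    if hashlib.sha256(raw).hexdigest() != expected_sha256:
        raise ValueError("decoded key does not match the recorded fingerprint")
    with open(path, "wb") as f:
        f.write(raw)
    os.chmod(path, 0o600)  # keep the restored key private to its owner


# Round trip on a throwaway file standing in for secret.key.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "secret.key")
    dst = os.path.join(tmp, "restored.key")
    with open(src, "wb") as f:
        f.write(os.urandom(32))
    text, digest = encode_key(src)
    restore_key(text, dst, digest)
    with open(src, "rb") as a, open(dst, "rb") as b:
        round_trip_ok = a.read() == b.read()
print("round trip ok:", round_trip_ok)
```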

Dev restore integration (decommission SimpleBackups). bin/restore (moved from script/db_restore.sh; also reachable via bin/setup --restore-db) now defaults to BACKUP_SOURCE=databasus (Wasabi/SimpleBackups kept as a fallback: BACKUP_SOURCE=wasabi). Key finding: Databasus serves a DECRYPTED pg_dump CUSTOM-format (-Fc) on download (confirmed on an actually-encrypted file) — encryption is at-rest in R2 only, so the dev never touches secret.key, and the dump feeds straight into the existing pg_restore fast/deferred path (no gunzip). Fetch = signin → list COMPLETED → download-token → GET /file. Creds = a read-only restore user op://IT/Databasus-restore-user (system MEMBER + WORKSPACE_VIEWER; list+download only — DELETE/modify verified denied). Reaches 100.68.157.49:4005 over the tailnet. Decommission SimpleBackups once a dev confirms a restore from this path.

Unencrypted test dump: deleted 06-11 (replaced by an encrypted manual run + the encrypted versions backup already in R2). Going-forward backups are all ENCRYPTED.

Phase 2 (the actual goal — PITR):

Add the databasus replication pg_hba rule on the standby + reload.
Configure incremental (physical base + continuous WAL; formats seen: WAL, pg_basebackup); confirm gap-free WAL.
Run ~2 days with seeded, timestamped writes → PITR restore test to a chosen second → verify. Acceptance gate.
If pass: backup-of-record; decide SimpleBackups’ fate. (W3) source follows the standby.
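For the "confirm gap-free WAL" step, one way to check continuity is to parse archived segment names and look for jumps, sketched below. This assumes the archived segments can be listed under their standard 24-hex-digit PostgreSQL names and that the cluster uses the default 16 MB segment size (256 segments per log id); how Databasus actually names WAL objects in R2 is not known here, so any mapping from object keys to segment names would have to be added.

```python
HEX_DIGITS = set("0123456789ABCDEFabcdef")


def is_segment_name(name):
    """True for plain WAL segment names (skips .history, .backup, .partial)."""
    return len(name) == 24 and set(name) <= HEX_DIGITS


def parse_wal_name(name, segs_per_log=256):
    """Split a WAL segment name into (timeline, linear segment number)."""
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * segs_per_log + seg


def find_wal_gaps(names, segs_per_log=256):
    """Return (previous, next) name pairs between which segments are missing.

    Only neighbours on the same timeline are compared; a timeline switch
    is not reported as a gap and needs a separate look.
    """
    segments = {n for n in names if is_segment_name(n)}
    parsed = sorted((parse_wal_name(n, segs_per_log), n) for n in segments)
    gaps = []
    for (prev_key, prev_name), (cur_key, cur_name) in zip(parsed, parsed[1:]):
        same_timeline = prev_key[0] == cur_key[0]
        if same_timeline and cur_key[1] - prev_key[1] != 1:
            gaps.append((prev_name, cur_name))
    return gaps


archived = [
    "000000010000000A000000FE",
    "000000010000000A000000FF",
    "000000010000000B00000000",
    "000000010000000B00000002",  # ...0B00000001 is missing
    "00000001.history",          # ignored: not a segment
]
print(find_wal_gaps(archived))
```

An empty result is necessary but not sufficient: the PITR restore in the acceptance gate remains the real proof that the chain replays.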

Access: admin op://IT/Databasus (Postgres Backup); UI http://100.68.157.49:4005 (tailnet-only). API: POST /api/v1/users/signin {email,password} → JWT bearer.
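The signin call above can be sketched with the standard library as follows. Only the path and the {email,password} body come from this plan; the example builds the request without sending it, uses placeholder credentials, and deliberately does not guess the name of the response field that carries the JWT.

```python
import json
import urllib.request

BASE_URL = "http://100.68.157.49:4005"  # tailnet-only address from this plan


def build_signin_request(email, password, base_url=BASE_URL):
    """Build (but do not send) the POST /api/v1/users/signin request."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/users/signin",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def bearer_headers(jwt):
    """Headers for follow-up calls once a JWT has been obtained."""
    return {"Authorization": f"Bearer {jwt}"}


req = build_signin_request("admin@example.com", "not-a-real-password")
print(req.get_method(), req.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=10).
```

Because the API is undocumented, anything scripted against it should be re-checked after a Databasus upgrade.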

Risks

Databasus PITR maturity (~1yr) — mitigated by the restore-test gate; pgBackRest is the documented fallback.
WAL-from-standby completeness — proven by the restore test (no gaps).
One more service to operate — offset by the dashboard + auto restore-verification.