Databasus PITR pilot — continuous backups for the bare-metal PG18 prod
Status: PLAN (2026-06-10). Closes the backup/PITR gap flagged in
202606101230_PG18_STANDBY_SESSION_HANDOFF.md and supersedes-or-validates the
pgBackRest approach in 202606072000_PGBACKREST_PITR_PLAN.md (decision below).
Live prod is Dallas PG18.4 primary + Chicago PG18 standby (Kamal accessories,
ZFS). Today there is no point-in-time recovery: the ZFS snapshot + the streaming
standby are HA (protect against host loss), and SimpleBackups gives nightly logical
dumps to Wasabi — but neither lets us rewind to an arbitrary second after a bad
migration / app bug / DELETE without a WHERE. We want continuous WAL archiving + PITR.
Tool decision — Databasus (pilot), gated on a real restore
databasus/databasus: Apache-2.0, Go, self-hosted, Docker-native; PG 12–18; agent mode = physical base backup + continuous WAL archiving → PITR; built-in restore verification (spins a container, does a real restore). 7.3k★, ~55 commits/30d, last release v3.42.0 (2026-05-22), web dashboard.
- vs pgBackRest (the prior plan): pgBackRest is the decade-proven standard but CLI/config-only and a bolt-on to our Docker stack. Databasus is younger (~1yr) but Docker-native (matches the Kamal accessory model), has a dashboard + notifications + automated restore verification — which is what actually keeps a small team’s backups honest. We pilot Databasus and only adopt it as backup-of-record if a real PITR restore passes. If it fails the gate, fall back to the pgBackRest plan.
Storage decision — Cloudflare R2 (not Latitude object storage)
Backups must not live on the same provider as the infra they protect. Latitude now offers S3-compatible object storage (with Object Lock/WORM), but a Latitude account/region/billing incident could take out the DB and its backups at once. R2 is an independent provider (and the team’s go-to). Latitude OS’s Object Lock is noted as a possible future immutable second copy, not the primary.
- New R2 bucket heatwave-postgres-backups-production, location ENAM (US-East, near Dallas/Chicago). ⚠️ R2 location is pinned on first create: fresh name + --location enam (see [[reference_r2_location_hint_first_create]]).
- R2 S3-API token (access key id + secret) scoped to the bucket → stored in 1Password IT (op item create needs env -u OP_SERVICE_ACCOUNT_TOKEN, see [[reference_op_cli_it_vault_writes]]).
- Databasus S3 storage: custom endpoint https://<account>.r2.cloudflarestorage.com, region auto, the R2 key/secret.
Topology (pilot)
Chicago host 100.68.157.49 (the PG18 STANDBY box)
├─ heatwave-postgres (Kamal accessory, standby, /data/postgres/data) ← backup SOURCE
├─ databasus controller (Docker container, web UI :<port>, host-local/tailnet only)
└─ databasus agent → reads the standby's PGDATA + streams WAL
   │ base backup taken from the STANDBY (offloads the Dallas primary)
   ▼
Cloudflare R2 heatwave-postgres-backups-production (ENAM) ← OFF Latitude

Backing up the standby keeps pg_basebackup load off the live primary; the standby
already has every WAL segment via streaming (~0 lag).
Open technical decisions (resolve at setup, don’t pre-bake)
- WAL archiving mechanism. Confirm whether Databasus archives via pg_receivewal (streaming, slot-backed, zero-gap, preferred) or archive_command. If pg_receivewal: give it a dedicated replication slot; decide source = standby (full offload, relies on cascading) vs primary (most authoritative, ~1 extra stream, negligible cost). If archive_command on the standby: needs archive_mode=always + a standby restart. Acceptance requires proving no WAL gaps.
- Backup DB role. Create a dedicated least-privilege databasus role with REPLICATION (don’t reuse replication/deploy); agent reads the standby container’s PGDATA (Databasus “supports Docker containers”).
- Schedule + retention (GFS). Align with SimpleBackups’ GFS (7 daily / 5 weekly / 12 monthly): daily physical base + continuous WAL; retention enforced by Databasus.
- Controller placement. Pilot runs the controller on the Chicago box (simple). It’s DR-safe because backups + restore are reproducible from R2 alone (redeploy controller anywhere) — but for prod, consider relocating the controller off the DB host (e.g. a small separate box).
- Bucket immutability. Pilot bucket = standard (Databasus retention needs delete). Object Lock (create-time only) is a later hardening decision — would need a fresh bucket; weigh against GFS rotation.
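The "no WAL gaps" requirement above can be checked mechanically once the archived segment names are listed. The sketch below is a minimal illustration, not how Databasus itself verifies continuity: it assumes plain 24-hex-digit segment names (how Databasus names WAL objects in R2 is not confirmed here) and the default 16 MB wal_segment_size, which gives 256 segments per log file number. The function name find_wal_gaps is hypothetical.

```python
import re

# Assumption: default 16 MB wal_segment_size, i.e. 256 segments per log number.
SEGMENTS_PER_LOG = 256

# A WAL segment name is 8 hex digits of timeline, 8 of log number, 8 of segment.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def find_wal_gaps(names):
    """Return (previous, next) name pairs where adjacent segments are missing.

    Anything that is not a plain segment name (.history, .backup, .partial)
    is ignored. Continuity is checked separately for each timeline.
    """
    by_timeline = {}
    for name in set(names):
        if not WAL_NAME.match(name):
            continue
        timeline = name[:8]
        position = int(name[8:16], 16) * SEGMENTS_PER_LOG + int(name[16:24], 16)
        by_timeline.setdefault(timeline, []).append((position, name))

    gaps = []
    for timeline in sorted(by_timeline):
        ordered = sorted(by_timeline[timeline])
        for (prev_pos, prev_name), (next_pos, next_name) in zip(ordered, ordered[1:]):
            if next_pos != prev_pos + 1:
                gaps.append((prev_name, next_name))
    return gaps
```

An empty result over the full pilot window is necessary but not sufficient evidence; the restore test remains the real proof.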
Acceptance gate — a REAL PITR restore (the whole point)
- Run base + WAL for ~2 days; generate known write activity with timestamps.
- Pick a target second T; restore the base + replay WAL to T into a throwaway container (not prod).
- Verify: cluster starts, recovery reaches T, and a sentinel row written just before T is present while one written just after T is absent. Row counts/checksums match expectation for T.
- Run Databasus’s built-in restore verification on a schedule too.
- Only after a clean pass → Databasus becomes backup-of-record; keep SimpleBackups (Wasabi, logical) running in parallel as belt-and-suspenders for ≥1 cycle, then decide.
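The sentinel check in the gate reduces to a small comparison, sketched below under stated assumptions: sentinel ids and their commit timestamps are recorded outside the database while the pilot runs, and the ids found in the restored cluster are fetched separately. The function name and data shapes are hypothetical. Rows committed exactly at T are treated as expected, which matches PostgreSQL's default recovery_target_inclusive = true.

```python
from datetime import datetime, timezone


def check_pitr_sentinels(target, sentinels, restored_ids):
    """Compare recorded sentinel rows against a restore to `target`.

    sentinels: mapping of sentinel id -> commit timestamp (timezone-aware).
    restored_ids: sentinel ids actually found in the restored cluster.
    Returns a list of failure messages; an empty list means the check passed.
    """
    restored = set(restored_ids)
    failures = []
    for sentinel_id, written_at in sorted(sentinels.items()):
        # Assumes the default inclusive recovery target.
        should_exist = written_at <= target
        if should_exist and sentinel_id not in restored:
            failures.append(f"{sentinel_id}: written {written_at.isoformat()} but missing")
        elif not should_exist and sentinel_id in restored:
            failures.append(f"{sentinel_id}: written {written_at.isoformat()} but present")
    return failures
```

A sentinel pair a second or two either side of T keeps the test sensitive to off-by-one-second replay errors.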
Deployment progress — 2026-06-10
Infra stood up (headless):
- ✅ R2 bucket heatwave-postgres-backups-production: location ENAM, Standard (CF REST API; the R2-scoped token 9106’d on /memberships, so used account_api_token from op://IT/Cloudflare API Token - Heatwave - All Zones).
- ✅ Databasus controller: docker run databasus/databasus:latest on Chicago 100.68.157.49, bound 100.68.157.49:4005 (tailnet-only, not public), data /data/databasus-data, attached to the kamal network (172.18.0.4). HTTP 200. Bundles PG client tools for 12–18 (PG18 ✓). Config is a dashboard SPA (the /api/* paths all serve the SPA shell, so there is no cleanly-scriptable config API; backup config is intentionally done via the UI).
- ✅ Backup role databasus (LOGIN REPLICATION + pg_read_all_data + pg_monitor) created on the Dallas primary → replicated to the standby. Creds in op://IT/Databasus backup role (postgres).
- ✅ Connectivity verified: databasus → heatwave-postgres:5432 accepting; standby pg_hba has host all all 0.0.0.0/0 scram ⇒ logical backup as databasus works now. Physical/WAL needs one added rule host replication databasus 172.18.0.0/16 scram in the standby’s /data/postgres/data/pg_hba.conf + reload (phase 2).
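The one-line pg_hba change for phase 2 is easy to apply twice by accident, so an idempotent edit is worth sketching. The helper below is hypothetical and only transforms text; reading and writing /data/postgres/data/pg_hba.conf and the reload (for example SELECT pg_reload_conf()) are left to the operator. Note the rule spells the method as scram-sha-256, the keyword pg_hba.conf accepts; "scram" in the notes above is shorthand.

```python
# The rule from the notes above, with the auth method spelled out in full.
REPLICATION_RULE = "host replication databasus 172.18.0.0/16 scram-sha-256"


def ensure_hba_rule(conf_text, rule=REPLICATION_RULE):
    """Return (new_text, changed), appending `rule` unless it is already present.

    Lines are compared field by field after stripping comments, so differing
    whitespace does not produce a duplicate entry.
    """
    wanted = rule.split()
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields == wanted:
            return conf_text, False
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + rule + "\n", True
```

Appending is safe here because no earlier rule matches the replication pseudo-database; pg_hba.conf is first-match, so ordering would matter if one did.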
Phase 1 PROVEN 2026-06-10 — full pipeline configured + a logical backup running to R2.
Done end-to-end via Databasus’s (undocumented, SPA-backed) /api/v1/* API after admin
registration:
- R2 token: bucket-scoped object-RW token minted by script/setup_r2_postgres_backups_token.sh (the frontend-assets recipe: CLOUDFLARE_ACCOUNT_API_TOKEN from .env → POST /accounts/{acct}/tokens w/ R2 read+write perm groups scoped to the bucket → AccessKeyID = token id, Secret = sha256(token value)). Stored op://IT/R2-postgres-backups. (Newly minted R2 tokens take ~30s to propagate: first direct-test 401’d, then 200.)
- Workspace Heatwave (c5ab3ebb…). Storage R2 ENAM (24730e5a…): type:"S3", s3Storage:{s3Endpoint,s3Bucket,s3Region:"auto",s3AccessKey,s3SecretKey,s3UseVirtualHostedStyle:false} → /storages/direct-test 200. Source (4c468fe8…): type:"POSTGRES", postgresql:{host:"heatwave-postgres",port:5432,username:"databasus",password,database:"heatwave",sslMode:"disable",cpuCount:1} → connection successful, PG18 auto-detected.
- Backup config: format:"PG_DUMP" (logical; default), storage:{id:…} (⚠️ NOT storageId/storageIds; those silently store null → “Backup config storage ID is nil”), GFS retention 7d/4w/12m/1y. Backup COMPLETED 2026-06-11 08:46 UTC (~9 min): the 73 GB heatwave DB → a ~7.5 GB compressed logical dump + a .metadata object now in r2://heatwave-postgres-backups-production, taken off the standby (zero primary load). ✅ Full Databasus → R2 pipeline proven end-to-end.
- Schedule: the cron-every-minute concern was a MISREAD. The every-minute log lines were the failed-retry of the nil-storage backup, not the cron. The UI shows 0 3 * * * → a correct daily “Next run”.
- Encryption is a simple flag (ENCRYPTED/NONE). Databasus manages the AES-256-GCM key internally, no passphrase, so it’s settable via API.
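The credential derivation and the payload shapes recorded above can be captured in a few lines. This is a sketch built only from the field names in these notes; the API is undocumented, so any field not listed above (names, workspace ids, the endpoint paths for creating configs) is deliberately left out rather than guessed. The function names are hypothetical.

```python
import hashlib


def r2_s3_credentials(token_id, token_value):
    """Derive S3 credentials from a Cloudflare API token, per the recipe above:
    access key id = token id, secret = SHA-256 hex digest of the token value."""
    secret = hashlib.sha256(token_value.encode("utf-8")).hexdigest()
    return {"access_key_id": token_id, "secret_access_key": secret}


def storage_payload(account_id, bucket, creds):
    """Build the S3 storage body using only the fields seen in these notes."""
    return {
        "type": "S3",
        "s3Storage": {
            "s3Endpoint": f"https://{account_id}.r2.cloudflarestorage.com",
            "s3Bucket": bucket,
            "s3Region": "auto",
            "s3AccessKey": creds["access_key_id"],
            "s3SecretKey": creds["secret_access_key"],
            "s3UseVirtualHostedStyle": False,
        },
    }


def backup_config_payload(storage_id):
    """Reference the storage as a nested object, not storageId/storageIds,
    which were observed to be stored as null without an error."""
    return {"format": "PG_DUMP", "storage": {"id": storage_id}}
```

Keeping the nested storage reference in one helper makes the nil-storage failure harder to reintroduce in later scripts.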
Phase 1 FINAL state — two backup configs live + enabled + encrypted (2026-06-11):
- heatwave (4c468fe8…): daily 0 3 * * * UTC, GFS D7/W5/M12, ENCRYPTED → R2. First scheduled run Fri 06-12 03:00 UTC. (One manual test dump from 06-11 is in R2 UNencrypted; it ages out via the D7 tier.)
- heatwave_versions (d98fb103…, 175 GB): weekly 0 4 * * 0 UTC (Sun), GFS W5/M12, ENCRYPTED → R2. First run Sun 06-14 04:00 UTC.
- Both sources = the Chicago standby (databasus role), so zero load on the Dallas primary.
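Given the earlier schedule misread, the two cron expressions are worth sanity-checking against the expected first runs. The sketch below handles only the two shapes used here (daily at a fixed time, weekly on one weekday); it is not a general cron parser, and the reference time of 2026-06-11 09:00 UTC in the usage lines is an assumption chosen to fall after the manual backup completed.

```python
from datetime import datetime, timedelta, timezone


def next_run(after, hour, minute=0, cron_weekday=None):
    """Next UTC run strictly after `after` for a daily or weekly schedule.

    cron_weekday uses cron numbering (0 = Sunday); None means every day.
    """
    candidate = after.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= after:
        candidate += timedelta(days=1)
    if cron_weekday is not None:
        # datetime.weekday() is Monday=0..Sunday=6; cron is Sunday=0..Saturday=6.
        while (candidate.weekday() + 1) % 7 != cron_weekday:
            candidate += timedelta(days=1)
    return candidate


now = datetime(2026, 6, 11, 9, 0, tzinfo=timezone.utc)  # assumed reference time
daily = next_run(now, hour=3)                    # "0 3 * * *"
weekly = next_run(now, hour=4, cron_weekday=0)   # "0 4 * * 0"
```

With that reference time, daily and weekly land on the Friday and Sunday first runs listed above.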
🔑 Encryption key custody (CRITICAL for DR). Databasus encrypts with an auto-managed
AES-256-GCM key at /data/databasus-data/secret.key (no passphrase; the encryption field is
just ENCRYPTED/NONE). Lose that key → the encrypted R2 backups are unrecoverable — which
would defeat off-site backup exactly when needed (Chicago box loss). The vendor explicitly says
to copy the secret key off-box. DONE 2026-06-11: secret.key backed up (base64) to
op://IT/Databasus-encryption-key (+ instance_id, restore notes). Restore: on a fresh
Databasus, base64 -d it back to /databasus-data/secret.key, re-add the R2 storage, restore.
Fuller DR (TODO): also back up all of /data/databasus-data (its embedded pgdata metadata
= storage configs + backup records + key) — makes cross-instance restore turnkey; could be a
small recurring job (even → R2).
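Because a silently corrupted copy of secret.key is as bad as a lost one, the base64 round trip can be paired with a checksum. This is a sketch, not the procedure that was actually run: recording a SHA-256 digest next to the key in the vault is an added assumption, and both function names are hypothetical.

```python
import base64
import hashlib


def export_key(key_bytes):
    """Return (base64 text, sha256 hex digest) for storing in a secrets vault."""
    encoded = base64.b64encode(key_bytes).decode("ascii")
    return encoded, hashlib.sha256(key_bytes).hexdigest()


def import_key(encoded, expected_sha256):
    """Decode the stored key and refuse to return it if the digest differs.

    The default (non-strict) decode tolerates line wrapping added by tools
    such as the base64 CLI; the digest check catches real corruption.
    """
    key_bytes = base64.b64decode(encoded)
    if hashlib.sha256(key_bytes).hexdigest() != expected_sha256:
        raise ValueError("decoded key does not match the recorded checksum")
    return key_bytes
```

Running import_key against the vault copy right after the export confirms the stored key matches the one on the box.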
Dev restore integration (decommission SimpleBackups). bin/restore (moved from
script/db_restore.sh; also reachable via bin/setup --restore-db) now defaults to
BACKUP_SOURCE=databasus (Wasabi/SimpleBackups kept as a fallback: BACKUP_SOURCE=wasabi). Key
finding: Databasus serves a DECRYPTED pg_dump CUSTOM-format (-Fc) on download (confirmed on
an actually-encrypted file) — encryption is at-rest in R2 only, so the dev never touches
secret.key, and the dump feeds straight into the existing pg_restore fast/deferred path (no
gunzip). Fetch = signin → list COMPLETED → download-token → GET /file. Creds = a read-only
restore user op://IT/Databasus-restore-user (system MEMBER + WORKSPACE_VIEWER; list+download
only — DELETE/modify verified denied). Reaches 100.68.157.49:4005 over the tailnet. Decommission
SimpleBackups once a dev confirms a restore from this path.
Unencrypted test dump: deleted 06-11 (replaced by an encrypted manual run + the encrypted versions backup already in R2). Going-forward backups are all ENCRYPTED.
Phase 2 (the actual goal — PITR):
- Add the databasus replication pg_hba rule on the standby + reload.
- Configure incremental (physical base + continuous WAL; formats seen: WAL, pg_basebackup); confirm gap-free WAL.
- Run ~2 days with seeded, timestamped writes → PITR restore test to a chosen second → verify. Acceptance gate.
- If pass: backup-of-record; decide SimpleBackups’ fate. (W3) source follows the standby.
Access: admin op://IT/Databasus (Postgres Backup); UI http://100.68.157.49:4005
(tailnet-only). API: POST /api/v1/users/signin {email,password} → JWT bearer.
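For scripting against the API, the signin call above is the only endpoint whose request shape is recorded in these notes, so the sketch stops there. It builds the request without sending it. The name of the response field that carries the JWT is not recorded, so extracting the token is left to the caller, and both function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://100.68.157.49:4005"  # tailnet-only UI/API address from the notes


def signin_request(email, password, base_url=BASE_URL):
    """Build (but do not send) POST /api/v1/users/signin with a JSON body."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/users/signin",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def bearer_headers(jwt):
    """Authorization header for follow-up calls once a JWT has been obtained."""
    return {"Authorization": f"Bearer {jwt}"}
```

Sending it is a matter of passing the request to urllib.request.urlopen from a host on the tailnet, with the credentials read from 1Password rather than hard-coded.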
- Databasus PITR maturity (~1yr) — mitigated by the restore-test gate; pgBackRest is the documented fallback.
- WAL-from-standby completeness — proven by the restore test (no gaps).
- One more service to operate — offset by the dashboard + auto restore-verification.