HAProxy DB routing layer — design + staged rollout (2026-06-12)

Stage 3 of the DB-tier HA plan (202606112045_DB_TIER_HA_ARCHITECTURE.md), now detailed. PITR (Stage 2) is DONE (202606120650_CHICAGO_PITR_HANDOFF.md); this is the automatic failover-routing layer. It is a deliberate future investment: when a 3rd node + Patroni land, the app-facing layer here does NOT change — only HAProxy’s health source does (see “Patroni path”).

Scope decision (settled 2026-06-12)

NO app-level read/write split. The only replica is cross-DC (Chicago); measured Dallas→Chicago RTT = ~21 ms (identical on tailnet and the native BBR pipe — it’s physics, not bandwidth), vs ~0.02–0.5 ms local. The primary sits at ~3% load / 99.84% cache hit, so offloading reads would add a compounding per-query 21 ms to relieve pressure that doesn’t exist. Rails talks to the primary only, locally. database.yml’s primary_replica stays pointed at the primary pooler (effectively unused); ActiveRecord::Middleware::DatabaseSelector is NOT adopted.
Replica stays analyst-only — the heatwave-db-ro VIP for direct read-only SQL. Lag-and-latency tolerant ad-hoc queries; the app never depends on it.
HAProxy is purely a failover router (write → current primary), self-detecting, replacing the manual databases.ini repoint. Not a read-splitter, not query-aware.

Current path (what we’re changing)

app ──(DATABASE_HOST=heatwave-pgbouncer)──▶ heatwave-pgbouncer (session mode, local accessory)
                                              └─ databases.ini:  heatwave = host=heatwave-postgres …
                                                                  └─▶ local primary postgres

A flip today = hand-edit /data/pgbouncer-prod/conf.d/databases.ini (swap host=heatwave-postgres for the new primary) + RELOAD pgbouncer, plus repoint the Tailscale RW VIP. Manual, error-prone.

PgBouncer placement — researched (2026-06-12)

Two valid patterns exist and authoritative sources genuinely split:

In front (app → pgbouncer → HAProxy → pg, Percona): pgbouncer exposes write/read aliases pointing at separate HAProxy ports — its payoff is alias-based read/write split, which we deliberately do NOT do.
Behind, one per DB node (app → HAProxy → pgbouncer → pg): the de-facto Patroni-cluster pattern (autobase/postgresql_cluster). Each node’s pgbouncer points only at its LOCAL postgres (never reconfigured at a flip); HAProxy is the single failover-routing point.

Chosen: behind, per node. For Heatwave: (1) no R/W split ⇒ the in-front alias benefit is moot; (2) it’s the canonical Patroni topology, so the explicit Patroni-future investment lands as a drop-in; (3) cross-DC — per-node pgbouncer keeps pgbouncer→postgres always LOCAL (only the app→HAProxy→leader path takes the ~21 ms hop at a flip), whereas in-front would pool across the WAN. Heatwave already runs a pgbouncer co-located with each postgres node (Dallas heatwave-pgbouncer, Chicago heatwave-pgbouncer-replica), so this is the natural fit.

Target path

app ─(DATABASE_HOST=heatwave-haproxy)─▶ heatwave-haproxy (TCP, local app-host accessory)
        backend = { dallas, chicago } │ httpchk pg_is_in_recovery via each node's pg-health:8008
        only the LEADER node is "up"  └─▶ leader node's pgbouncer (session) ─▶ its LOCAL postgres

Each node’s pgbouncer databases.ini → its local postgres — STATIC, never edited at a flip. A flip becomes: pg_promote() the standby + re-stand-up the old primary. HAProxy’s health check follows (recovery flips → the new leader’s pgbouncer goes “up”, the old goes “down”); no databases.ini edit, no HAProxy reconfig, no app redeploy. pgbouncer stays session-mode (advisory locks); HAProxy is TCP passthrough, so session semantics + locks survive.

Topology — local sidecar, NOT a central box

HAProxy runs as a kamal accessory co-located on the app host (Dallas now; follows the app to Chicago at W3). It is part of the app’s local stack, exactly like heatwave-pgbouncer — so it is not a new central SPOF: if it dies, the blast radius is that one app host (same as the local pgbouncer dying), and the app host is already the failure unit. Multiple app hosts ⇒ one HAProxy each.

Pieces to build

1. Per-postgres-node health endpoint (mechanism B — self-detecting)

A tiny HTTP sidecar on each postgres host that runs SELECT pg_is_in_recovery() and returns 200 (primary, recovery=false) / 503 (standby, recovery=true). Built as the owned docker/pg-health (alpine + socat + psql, ~15 lines — no third-party image in the DB path). Kamal: accessories pg_health (Dallas heatwave-postgres) + pg_health_replica (Chicago heatwave-postgres-replica), publishing :8008 on the tailnet IP only. Read-only, zero write impact — safe to deploy first and independently (inert until kamal accessory boot).
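A minimal sketch of what the per-connection handler could look like, assuming socat execs a script for each incoming check. The function names, the PGHOST/PGUSER defaults, the HEALTH_SERVE switch and the response bodies are illustrative assumptions, not the shipped docker/pg-health.

```shell
#!/bin/sh
# Hypothetical health.sh sketch: map pg_is_in_recovery() to an HTTP status.

# "f" = not in recovery = primary. Anything else (standby, error, empty) is 503,
# so an unreachable postgres can never be mistaken for a leader.
status_for_recovery() {
  case "$1" in
    f) echo 200 ;;
    *) echo 503 ;;
  esac
}

# Ask the local postgres; prints "t", "f", or nothing on failure.
query_recovery() {
  PGCONNECT_TIMEOUT=2 psql -h "${PGHOST:-127.0.0.1}" -U "${PGUSER:-postgres}" \
    -Atc 'SELECT pg_is_in_recovery()' 2>/dev/null
}

# Emit a minimal HTTP/1.0 response for the given status code.
render_response() {
  case "$1" in
    200) reason="OK"; body="primary" ;;
    *)   reason="Service Unavailable"; body="standby-or-unreachable" ;;
  esac
  printf 'HTTP/1.0 %s %s\r\nContent-Type: text/plain\r\nContent-Length: %s\r\nConnection: close\r\n\r\n%s' \
    "$1" "$reason" "${#body}" "$body"
}

# Entry point when run per connection (skipped when the file is only sourced).
if [ "${HEALTH_SERVE:-0}" = "1" ]; then
  read -r -t 5 _request_line || true   # bounded read of the request line
  render_response "$(status_for_recovery "$(query_recovery)")"
fi
```

Treating every non-"f" answer as 503 is the conservative choice here: a node is only ever routed to when it positively reports that it is not in recovery.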

2. HAProxy accessory (`heatwave-haproxy`) on the app host

TCP frontend on :6433 (local); backend = both postgres nodes, each health-checked via option httpchk against its pg-health sidecar. Sketch (config/haproxy/production.cfg):

global
  maxconn 4000
defaults
  mode tcp
  timeout connect 3s
  timeout client  12h          # session-mode + session-scoped advisory locks: do NOT reap long-lived conns
  timeout server  12h
  default-server inter 2s rise 2 fall 3 on-marked-down shutdown-sessions
listen pg-primary
  bind 0.0.0.0:6433
  option httpchk GET /            # 200 = primary, 503 = standby
  http-check expect status 200
  # All nodes in one pool; the health check keeps ONLY the current leader "up". Targets are the
  # per-node PGBOUNCERS (6432); the httpchk hits each node's pg-health (8008 → its postgres).
  server dallas  100.123.47.52:6432 check port 8008   # → Dallas heatwave-pgbouncer → local pg
  server chicago 100.68.157.49:6432 check port 8008   # → Chicago heatwave-pgbouncer-replica → local pg

on-marked-down shutdown-sessions drops connections to a node the instant it stops being primary (at a flip) so the app reconnects fast onto the new primary.

⚠️ This sketch is the pre-implementation design. The SHIPPED config/haproxy/production.cfg differs in one critical way: the LOCAL node (Dallas) is addressed by KAMAL-NET DNS, not its host tailnet IP. A container cannot hairpin to a port its OWN host published on the tailnet interface — the first prod boot used 100.123.47.52:… for Dallas and came up with the primary backend DOWN (L4 timeout: container → 100.123.47.52:8008 timed out, while container → heatwave-pg-health:8008 returned 200). The real config therefore reaches Dallas via heatwave-pgbouncer / heatwave-pg-health — which forces the track’d-probe pattern (HAProxy forbids a hostname in check addr) plus a resolvers docker section — and reaches only the REMOTE node (Chicago) by its tailnet IP 100.68.157.49. Fixed in ad45d9f12b. See the Risks section.

3. Point the app at HAProxy; each pgbouncer → its LOCAL postgres

App DATABASE_HOST → heatwave-haproxy:6433. Each node’s pgbouncer databases.ini stays pointed at its LOCAL postgres (Dallas heatwave-postgres, Chicago heatwave-postgres-replica) — STATIC, never edited at a flip. ⚠️ Chicago needs a RW pgbouncer co-located with its postgres for when it is leader; heatwave-pgbouncer-replica already fronts that node — confirm it serves the RW path when promoted (it pools to postgres regardless of recovery state), or add a dedicated RW pooler.

4. Slim the failover runbook

The 202606081218 ping-pong flip step “edit databases.ini + reload + repoint VIP” collapses to just promote + re-stand-up the old primary; HAProxy re-routes off the recovery-state check. Keep the Tailscale VIP as the stable cross-DC address in front of HAProxy (or fold it in).

⚠️ The one safety invariant: single-primary

Health-based routing is only as safe as the guarantee that exactly one node reports primary. If a partition let both be writable (split-brain), both pg-health endpoints return 200 → HAProxy would balance writes across both = data divergence. Mitigation: the controlled failover discipline — demote/fence the old primary before promoting the standby (never two primaries). This is why we are NOT doing cross-DC automatic failover yet. Patroni (below) replaces this discipline with etcd consensus + fencing.
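The invariant can be checked mechanically before any promotion. The sketch below is a hypothetical pre-flight guard (the function names are assumptions, not part of the existing tooling): it counts how many pg-health endpoints answer 200 and classifies the result.

```shell
# Classify a set of pg-health HTTP status codes, one argument per node.
# Prints: ok | no-primary | split-brain
classify_primaries() {
  primaries=0
  for code in "$@"; do
    if [ "$code" = "200" ]; then
      primaries=$((primaries + 1))
    fi
  done
  case "$primaries" in
    0) echo "no-primary" ;;
    1) echo "ok" ;;
    *) echo "split-brain" ;;
  esac
}

# Probe one node's pg-health endpoint; prints the HTTP status (000 if unreachable).
probe() {
  curl -s -o /dev/null -m 3 -w '%{http_code}' "http://$1:8008/"
}

# Usage, with the node addresses from the sketch config:
#   classify_primaries "$(probe 100.123.47.52)" "$(probe 100.68.157.49)"
```

An unreachable node counts as "not primary" here, which is only safe as a routing answer, not as proof the node is fenced: a partitioned old primary can still be writable, which is why the demote-before-promote discipline stays in place.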

Patroni path (why this is a load-bearing investment)

When a same-DC HA pair + a 3rd-site witness exist and seconds-RTO is wanted:

Unchanged: the heatwave-haproxy accessory, the app → haproxy → pgbouncer path, the app config, the whole app-facing layer. The investment carries forward verbatim.
Changes: (a) swap HAProxy’s httpchk endpoint from the pg_is_in_recovery sidecar to Patroni’s REST API (GET /primary → 200 on the leader) — a one-line backend edit; (b) Patroni/etcd replaces the manual controlled-failover script (automatic leader election + fencing, kills split-brain); (c) add the 3rd node. HAProxy + Patroni is the canonical pattern, so nothing here is throwaway.

Staged rollout (live DB path is touched ONLY at Phase 3)

Phase 0. Validate off-prod. Build pg-health + the HAProxy cfg; on staging (single node) confirm the health endpoint reports primary and HAProxy marks it up. Exercise the routing logic against a throwaway 2nd PG (promote it, watch HAProxy flip). (pg-health validated on staging 2026-06-12: curl 100.123.47.52:8008 → 200 primary.)
Phase 1. Deploy pg-health sidecars on both prod postgres hosts. Read-only; verify curl 100.123.47.52:8008 → 200 (Dallas primary) and curl 100.68.157.49:8008 → 503 (Chicago standby). Zero app impact.
Phase 2. Deploy the heatwave-haproxy accessory (not yet in the app path). ✅ BUILT + STAGING-VALIDATED 2026-06-12. config/haproxy/{production,staging}.cfg + the haproxy accessory in deploy.yml / deploy.staging.yml (image haproxy:3.0-alpine, tag+index-digest pinned; TCP :6433, stats :8404). On staging: httpchk L7OK → pg-primary/staging UP, pg_isready -h heatwave-staging-haproxy -p 6433 → accepting connections; a simulated leader loss (docker stop pg-health) took the server out of rotation after fall 3 (~6 s) and rise 2 (~4 s) restored it. PROD ROLLOUT DONE 2026-06-12 (reboot pg_health + pg_health_replica → v3, then boot haproxy). ⚠️ The first boot used the IP-for-both-nodes config and came up with the primary backend DOWN (the hairpin bug: see the sketch note in §2 + Risks); after the ad45d9f12b fix (Dallas via kamal-net DNS, Chicago via IP) and kamal accessory reboot haproxy, VERIFIED on 100.123.47.52:8404/;csv: pg-primary/dallas UP (L7OK, primary) · pg-primary/chicago DOWN (503, standby), i.e. the write path is on the leader.
Phase 3. Cut the app over to HAProxy (app DATABASE_HOST + _VERSIONS → heatwave-haproxy:6433). ✅ PROD DONE 2026-06-12 (d00f89d03f) behind a kamal app maintenance window: pre-verified the deploy-role path through :6433 → both DBs on the Dallas primary, then (in maintenance) verified Account.count through HAProxy, went live, 0 DB-error log lines, real CRM traffic + active HAProxy sessions. Transparent: same Dallas primary, +sub-ms hop. (Staging cut over earlier.) Revert = both hosts back to heatwave-pgbouncer, ports 6432.
Phase 4. Failover drill (off-hours). pg_promote Chicago → confirm HAProxy reroutes the write path, app reconnects onto Chicago, writes land there; then fail back. Validates the auto-reroute end-to-end.

Staging failover rehearsal (two-node staging — built 2026-06-12)

To rehearse a real primary→standby promotion before the prod Phase-4 drill, staging now runs a same-host streaming standby + its own pg-health + pgbouncer, fronted by a two-node staging.cfg.

Topology (all on the staging box 100.123.47.52, reached by kamal-net DNS):

Primary: heatwave-staging-postgres (pg-health :8008 → 200) + heatwave-staging-pgbouncer (:6432).
Standby: heatwave-staging-postgres-replica (:5433, pg-health :8009 → 503 while following) + heatwave-staging-pgbouncer-replica (:6434). The three accessories OVERRIDE the inherited prod *_replica keys in deploy.staging.yml (so -d staging never touches Chicago); the standby reuses the primary’s tuning via the *pg_staging_cmd YAML anchor.

Bootstrap (one-time, idempotent — pg_basebackup; verified streaming replay_lag ~9 ms). Uses trust-replication scoped to the kamal subnet 172.18.0.0/16 — staging-only, internal Docker net, so no password-in-conninfo handling. On the primary: ALTER SYSTEM SET max_slot_wal_keep_size='10GB' (so a dead standby can’t fill the disk) + a staging_standby physical slot + a host replication deploy 172.18.0.0/16 trust pg_hba line (reload, no restart). Then pg_basebackup -R -S staging_standby -X stream into /data/postgres/pg18-standby, and kamal accessory boot postgres_replica pg_health_replica pgbouncer_replica -d staging. The standby pooler’s databases.ini is the primary’s with the host swapped to heatwave-staging-postgres-replica.

The drill (repeatable). ⚠️ The staging app talks DIRECTLY to heatwave-staging-pgbouncer, so cut it to heatwave-staging-haproxy:6433 first (staging Phase 3) or expect a brief app DB blip:

Baseline: curl 127.0.0.1:8404/\;csv → pg-primary/primary UP, /replica DOWN.
Controlled failover (demote BEFORE promote — single-primary invariant): stop heatwave-staging-postgres → HAProxy marks primary DOWN; then docker exec heatwave-staging-postgres-replica psql -U deploy -c "SELECT pg_promote()" → its pg-health flips to 200 → HAProxy marks replica UP. Write path moves to the replica.
Verify a write through haproxy:6433 lands on the (newly-primary) replica.
Restore: rebuild heatwave-staging-postgres as a standby of the replica (re-run the bootstrap with primary/standby swapped) then switch back — OR, if the app is on HAProxy, leave it flipped (HAProxy doesn’t care which node is primary).
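The baseline and post-flip checks in the drill read the stats CSV by eye. A small helper along these lines could reduce it to one line per server. This is a sketch with a hypothetical function name; it looks the columns up by name in the CSV header rather than assuming fixed positions.

```shell
# Print "proxy/server STATUS" for every server row of a HAProxy stats CSV read on stdin.
# Column positions come from the header row ("# pxname,svname,...,status,...").
haproxy_server_status() {
  awk -F, '
    NR == 1 {
      sub(/^# /, "", $1)                       # header starts with "# pxname"
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Skip blank lines and the FRONTEND/BACKEND summary rows; keep real servers.
    NF > 1 && $(col["svname"]) != "FRONTEND" && $(col["svname"]) != "BACKEND" {
      print $(col["pxname"]) "/" $(col["svname"]), $(col["status"])
    }
  '
}

# Usage on the staging box:
#   curl -s 'http://127.0.0.1:8404/;csv' | haproxy_server_status
```

At baseline the expected lines are pg-primary/primary UP and pg-primary/replica DOWN; after the controlled failover they should be swapped.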

Recovery toolkit — `bin/recovery`

A gum-menu operational toolkit (bin/recovery [staging|production]), extensible per scenario:

Flip database primary↔standby: the PLANNED-switchover form of the failover. Sequenced so it never routes onto the dead node (the failover transition window described under Risks): detect roles → CHECKPOINT + record the old primary’s LSN → drain the old node via the Runtime API admin socket (state maint + shutdown sessions) → stop it cleanly → wait for the standby to replay through that LSN (no data loss) → pg_promote → wait for HAProxy to mark the new primary UP → rebuild the demoted node as a fresh streaming standby (pg_basebackup) → return the old node to ready. Role detection is idempotent (it never assumes which node is primary), and the flip refuses to run on split-brain / no-primary. The drain is best-effort: if the socket is unreachable the flip degrades to health-check timing.
Show DB topology — read-only: each node’s recovery state, pg-health, HAProxy server status.
Flip app stack to another host / Valkey failover — stubs (intent documented in-script).
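The "wait for the standby to replay through that LSN" step in the flip can be sketched as below. This is an illustration, not the bin/recovery source: the function names and the STANDBY_PSQL command prefix are assumptions. A pg_lsn prints as two hex words (high/low), so comparing two of them means converting each to a single integer first.

```shell
# Convert a pg_lsn such as "16/B374D848" (hex high word / hex low word) to a decimal byte position.
lsn_to_int() {
  local hi="${1%%/*}" lo="${1##*/}"
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Succeeds when LSN $1 has reached or passed LSN $2.
lsn_reached() {
  [ "$(lsn_to_int "$1")" -ge "$(lsn_to_int "$2")" ]
}

# Poll the standby until it has replayed through the recorded LSN, or give up.
# STANDBY_PSQL is a hypothetical command prefix, e.g. "docker exec <standby> psql -U postgres".
wait_for_replay() {
  local target="$1" timeout="${2:-60}" waited=0 current
  while [ "$waited" -lt "$timeout" ]; do
    current="$($STANDBY_PSQL -Atc 'SELECT pg_last_wal_replay_lsn()')"
    if [ -n "$current" ] && lsn_reached "$current" "$target"; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}
```

The same comparison can be done server-side by casting both values to pg_lsn in SQL; the shell version is shown only because it keeps the recorded LSN and the decision in the script.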

-y / RECOVERY_YES=1 runs an action non-interactively (bin/recovery staging flip-db -y) for automation. The toolkit is env-config-driven (the two nodes’ hosts / containers / data dirs / HAProxy server names). The flip drains the old node via an in-container Runtime API admin socket (stats socket /var/lib/haproxy/admin.sock level admin, reached by docker exec … socat; it has no network port, so kamal-net peers can’t touch it) before promoting, so HAProxy never routes onto the dead node. VERIFIED under concurrent load on staging: the demoted node went straight to MAINT (never “UP while dead”) with zero dead-node errors, only clean refusals during the ~10 s no-primary window, which a retrying connection pool absorbs. (Truly zero refusals is impossible under single-primary-safe discipline; two primaries would be unsafe.)
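A sketch of the drain step, assuming socat is available inside the HAProxy container. The function names and the HAPROXY_CONTAINER variable are hypothetical; the two Runtime API commands are the ones named above, and each is sent on its own connection because the socket closes after one command in non-interactive mode.

```shell
# Build the Runtime API commands that drain one server before it is stopped.
# $1 = proxy name (e.g. pg-primary), $2 = server name (e.g. dallas)
drain_commands() {
  printf 'set server %s/%s state maint\n' "$1" "$2"
  printf 'shutdown sessions server %s/%s\n' "$1" "$2"
}

# Send them through the in-container admin socket.
# Best-effort: a failure is reported on stderr and the flip carries on.
drain_server() {
  drain_commands "$1" "$2" | while IFS= read -r cmd; do
    echo "$cmd" | docker exec -i "${HAPROXY_CONTAINER:-heatwave-haproxy}" \
      socat stdio /var/lib/haproxy/admin.sock \
      || echo "drain: '$cmd' failed; falling back to health-check timing" >&2
  done
}

# Usage: drain_server pg-primary dallas
```

Putting the server in maint first and only then shutting down its sessions means no new connection can land on the node between the two commands.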

Prod readiness (2026-06-12): admin socket applied (haproxy rebooted) + Runtime API verified; bin/recovery production topology works (it uses the superuser postgres role — prod’s pg_hba is local all all peer, so deploy fails locally but postgres works). Auto-rebuild IS now wired (AUTO_REBUILD=1): rebuild_standby is env-driven — staging rebuilds as deploy over trust, prod as the scram’d replication role, sourcing its password from 1Password (op://IT/Heatwave-Replication-prod/password, piped to the host over SSH stdin — never on a command line / in process args / the transcript) and running pg_basebackup -U replication … sslmode=require, matching the existing Chicago standby’s primary_conninfo. The rebuild mechanism (remote helper + password-on-stdin) is proven on staging (flipped both ways). Prod Phase 3 done 2026-06-12 (app on heatwave-haproxy), so a prod flip is app-safe — the app follows the reroute.

Phase 4 prod drill — EXECUTED 2026-06-13 (round-trip Dallas→Chicago→Dallas). The flip + HAProxy reroute worked first try both ways; the cross-DC basebackup ran at ~620 MB/s (the 122 MB/s figure was conservative), so a full rebuild is ~6 min, not ~17. The drill surfaced two latent bugs, both now fixed:

Stale replication password. op://IT/Heatwave-Replication-prod had drifted out of sync with the actual replication role, so the auto-rebuild’s pg_basebackup failed auth — after it had wiped the demoted node’s data dir → prod went single-node (app stayed up on the new primary). Fix: rebuild_standby now CAPTURES the password from the live standby’s primary_conninfo (authoritative — it’s actively streaming with it); the op item is a fallback only, and was corrected. Added a rebuild-standby command (re-basebackup a wiped/stale node with no flip), a per-env REPL_NET (prod cross-DC basebackup uses --network host — Chicago’s tailnet IP isn’t on kamal-net), and -c fast.
pgbouncer stale-cache after a rebuild. A node whose postgres was just wiped+rebuilt leaves its pgbouncer caching a dead DNS for the old container (server DNS lookup failed (server_login_retry)), so flipping TO that node serves errors until pgbouncer restarts (~15 s blip / 3 requests on fail-back). Fix: rebuild_standby now restarts the rebuilt node’s pgbouncer (A_PGB/B_PGB).
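The password capture in the first fix amounts to pulling one keyword out of a libpq conninfo string. A minimal sketch, assuming the string has already been read from the live standby (for example via SHOW primary_conninfo as a superuser, which is an assumption about how the script obtains it); conninfo_get is a hypothetical helper name.

```shell
# Extract one keyword's value from a libpq keyword=value conninfo string.
# Handles bare values and simple single-quoted values; values containing
# spaces or escaped quotes are NOT handled by this sketch.
conninfo_get() {
  local conninfo="$1" key="$2" pair value="" found=1
  set -f                              # disable globbing while word-splitting
  for pair in $conninfo; do
    case "$pair" in
      "$key"=*)
        value="${pair#*=}"            # drop "key="
        value="${value#\'}"           # drop optional opening quote
        value="${value%\'}"           # drop optional closing quote
        found=0
        break ;;
    esac
  done
  set +f
  [ "$found" -eq 0 ] && printf '%s\n' "$value"
  return "$found"
}

# Usage: conninfo_get "$primary_conninfo" password
```

Whatever the extraction looks like, the value should only ever travel on stdin or in a variable, in keeping with the password-on-stdin rule already used for the 1Password path.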

Robust maintenance window — bin/maintenance {up|down} <env>. Quiet-only (web 503 + sidekiq TSTP) left the app holding pool connections and pgbouncer caching its upstream; the robust window instead STOPS web + sidekiq + (prod) the databasus PITR agent on Chicago, so nothing touches the DB during a flip and it all reconnects fresh. Runbook: bin/maintenance up <env> → bin/recovery <env> flip-db → bin/maintenance down <env>. (PITR note: databasus catches up the brief WAL gap on restart, but a fresh base backup after a flip is the clean way to re-baseline.)

ZFS rollback net (double safety, 2026-06-13). rebuild_standby’s one irreversible move is wiping the demoted node’s data dir before the basebackup. It now sudo zfs snapshots that dataset RIGHT BEFORE the wipe and drops it only once the new standby is confirmed streaming — so a failed rebuild is recoverable via zfs rollback (+ pg_rewind) instead of leaving the node empty: the storage net under the logical flip safety. On failure the snapshot is KEPT with a rollback hint; if $odir isn’t on ZFS it warns and proceeds (fail-safe). sudo df is required — the 0700/uid-999 data dir isn’t statable as deploy. Validated on staging (snapshot taken → rebuild → streaming → dropped).
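The keep-on-failure, drop-on-success behaviour can be sketched as a wrapper around the rebuild. This is illustrative only: with_zfs_safety_net, the ZFS override variable and the snapshot naming are assumptions, and the wrapped command is expected to do the wipe, the basebackup and the streaming confirmation itself.

```shell
# Snapshot a dataset, run a command, then drop the snapshot on success or keep it on failure.
# $1 = ZFS dataset holding the data dir; remaining args = the rebuild command to run.
# ZFS defaults to "sudo zfs"; set ZFS=echo for a dry run.
with_zfs_safety_net() {
  local dataset="$1" snap
  shift
  snap="${dataset}@pre-rebuild-$(date +%Y%m%d%H%M%S)"

  # No snapshot means no safety net, so refuse to start the destructive step.
  ${ZFS:-sudo zfs} snapshot "$snap" || return 1

  if "$@"; then
    ${ZFS:-sudo zfs} destroy "$snap"      # standby confirmed good: drop the net
  else
    echo "rebuild failed; snapshot KEPT. Roll back with: zfs rollback $snap" >&2
    return 1
  fi
}

# Usage (dataset and command names are placeholders):
#   with_zfs_safety_net tank/pg-standby rebuild_and_confirm_streaming
```

Note that this sketch is stricter than the behaviour described above: it aborts when the snapshot cannot be taken, whereas rebuild_standby warns and proceeds when the directory is not on ZFS.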

Risks / gotchas

⚠️ Container→own-host-IP hairpin (BIT US — prod hotfix ad45d9f12b): a kamal accessory container CANNOT reach a port its OWN host published on the tailnet interface (container → 100.123.47.52:8008 = L4 timeout), even though the host itself curls it fine. So in production.cfg the LOCAL node (Dallas) is addressed by kamal-net DNS (heatwave-pgbouncer / heatwave-pg-health, via a track’d probe + resolvers docker), and only the REMOTE node (Chicago, a genuine peer) by its tailnet IP. The original IP-for-both config booted the primary backend DOWN; staging never caught it (single-node, already DNS). Applies to ANY container→own-host-published-port pattern in this fleet.
Advisory locks / session mode: TCP passthrough + timeout client/server 12h so long-lived session connections holding session-scoped advisory locks are never reaped. (28 files use them.)
Health-check flapping: conservative inter 2s rise 2 fall 3; a transient blip must not reroute.
Failover reconnect storm: expected + brief. Old-primary write conns error on demotion, the app reconnects, pgbouncer’s reserve pool absorbs the burst (already tuned, reserve_pool_timeout 1s).
⚠️ Failover transition window (FOUND in the staging drill 2026-06-12): with fall 3 inter 2s the OLD primary stays UP for ~6 s after it dies (until 3 checks fail) while the NEW primary comes UP, so HAProxy briefly has BOTH in the pool and load-balances some connections onto the dead old node → those error (server closed the connection unexpectedly). The staging drill hit exactly this: the test write during the window failed, then succeeded once primary went DOWN. For an UNPLANNED failover it’s unavoidable (the app retries; pgbouncer reserve pool absorbs it). For a PLANNED switchover this is now FIXED: both configs carry stats socket /var/lib/haproxy/admin.sock level admin, and bin/recovery’s flip issues set server pg-primary/<old> state maint + shutdown sessions BEFORE promoting, so HAProxy is never routing onto the old node when it dies (the app gets clean refusals during the brief no-primary moment, not dead-node errors). Patroni later eliminates even that via consensus + fencing.
Split-brain: see the safety invariant — controlled discipline now, Patroni consensus later.
Cross-DC transient: after a flip, until the whole app stack moves DC, app→primary is the ~21 ms cross-DC hop. Inherent to the ping-pong model; HAProxy automates routing, it doesn’t remove that.
pg-health log noise: RESOLVED (pg-health v3, 2026-06-12). The pre-fix socat EXEC relay logged a benign “connection reset by peer” on every check. The cause was socat’s relay read()ing the socket after HAProxy abortively closes a satisfied check, NOT the unread request (v2’s “drain the request” theory was wrong; the drain is kept only for tidy FINs). Fix = EXEC:…,nofork: socat execs health.sh straight onto the socket with no relay to reset; the per-request bound moved into health.sh (read -t 5 + PGCONNECT_TIMEOUT=2) since -T no longer applies. Verified on staging: 0 reset lines under HAProxy’s 2 s polling (was 15 per 30 s on v2). Prod gets it via the pg_health* reboots above.
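The ~6 s and ~4 s figures quoted for the flapping and transition-window items follow directly from the check settings. As a worked example (detection_window is a hypothetical helper; the result is an approximate upper bound that ignores check timeouts and where in the interval the failure happens):

```shell
# Seconds for HAProxy to change a server's state: consecutive checks needed x check interval.
# $1 = inter (seconds), $2 = fall or rise count
detection_window() {
  echo $(( $1 * $2 ))
}

detection_window 2 3   # fall 3 at inter 2s: prints 6 (old primary still "UP" while dead)
detection_window 2 2   # rise 2 at inter 2s: prints 4 (new primary not yet "UP")
```

Lowering fall shrinks the dead-node window for unplanned failovers but makes a transient blip more likely to reroute, which is the trade-off the conservative settings accept.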

References

202606112045_DB_TIER_HA_ARCHITECTURE.md (Stage 3), 202606081218_PG18_PROD_PINGPONG_RUNBOOK.md (flip)
github.com/virtualstaticvoid/pgsql_haproxy (mechanism B), Percona pgsql-check blog (mechanism A)
Patroni + HAProxy reference architecture (the future-state target)