HAProxy DB routing layer — design + staged rollout (2026-06-12)
Stage 3 of the DB-tier HA plan (202606112045_DB_TIER_HA_ARCHITECTURE.md), now detailed. PITR
(Stage 2) is DONE (202606120650_CHICAGO_PITR_HANDOFF.md); this is the automatic
failover-routing layer. It is a deliberate future investment: when a 3rd node + Patroni land,
the app-facing layer here does NOT change — only HAProxy’s health source does (see “Patroni path”).
Scope decision (settled 2026-06-12)
- NO app-level read/write split. The only replica is cross-DC (Chicago); measured Dallas→Chicago
RTT = ~21 ms (identical on tailnet and the native BBR pipe; it’s physics, not bandwidth), vs
~0.02–0.5 ms local. The primary sits at ~3% load / 99.84% cache hit, so offloading reads
would add a compounding per-query 21 ms to relieve pressure that doesn’t exist. Rails talks to
the primary only, locally. database.yml’s primary_replica stays pointed at the primary pooler
(effectively unused); ActiveRecord::Middleware::DatabaseSelector is NOT adopted.
- Replica stays analyst-only: the heatwave-db-ro VIP for direct read-only SQL. Lag-and-latency
tolerant ad-hoc queries; the app never depends on it.
- HAProxy is purely a failover router (write → current primary), self-detecting, replacing the
manual databases.ini repoint. Not a read-splitter, not query-aware.
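To put a number on the "compounding per-query 21 ms", here is a minimal arithmetic sketch. The 21 ms RTT and the ~0.5 ms local upper bound are the measurements quoted above; the figure of 30 queries per request is purely hypothetical, and the model assumes the queries run sequentially with one round trip each.

```python
def added_latency_ms(queries_per_request: int,
                     remote_rtt_ms: float = 21.0,
                     local_rtt_ms: float = 0.5) -> float:
    """Extra wall-clock time per request if every query took the cross-DC hop.

    Assumes sequential queries, one network round trip per query.
    """
    if queries_per_request < 0:
        raise ValueError("queries_per_request must be non-negative")
    return queries_per_request * (remote_rtt_ms - local_rtt_ms)


if __name__ == "__main__":
    # 30 queries per request is an illustrative figure, not a measurement.
    print(f"{added_latency_ms(30):.1f} ms added per request")
```

Under these assumptions a single request picks up over half a second, which is why the replica stays out of the app path.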
Current path (what we’re changing)
    app ──(DATABASE_HOST=heatwave-pgbouncer)──▶ heatwave-pgbouncer (session mode, local accessory)
                                                  └─ databases.ini: heatwave = host=heatwave-postgres …
                                                       └─▶ local primary postgres

A flip today = hand-edit /data/pgbouncer-prod/conf.d/databases.ini (swap host=heatwave-postgres
for the new primary) + RELOAD pgbouncer, plus repoint the Tailscale RW VIP. Manual, error-prone.
PgBouncer placement — researched (2026-06-12)
Two valid patterns exist and authoritative sources genuinely split:
- In front (app → pgbouncer → HAProxy → pg, Percona): pgbouncer exposes write/read aliases pointing
at separate HAProxy ports. Its payoff is alias-based read/write split, which we deliberately do NOT do.
- Behind, one per DB node (app → HAProxy → pgbouncer → pg): the de-facto Patroni-cluster pattern
(autobase/postgresql_cluster). Each node’s pgbouncer points only at its LOCAL postgres (never
reconfigured at a flip); HAProxy is the single failover-routing point.
Chosen: behind, per node. For Heatwave: (1) no R/W split ⇒ the in-front alias benefit is moot;
(2) it’s the canonical Patroni topology, so the explicit Patroni-future investment lands as a
drop-in; (3) cross-DC — per-node pgbouncer keeps pgbouncer→postgres always LOCAL (only the
app→HAProxy→leader path takes the ~21 ms hop at a flip), whereas in-front would pool across the WAN.
Heatwave already runs a pgbouncer co-located with each postgres node (Dallas heatwave-pgbouncer,
Chicago heatwave-pgbouncer-replica), so this is the natural fit.
Target path
    app ─(DATABASE_HOST=heatwave-haproxy)─▶ heatwave-haproxy (TCP, local app-host accessory)
                                              backend = { dallas, chicago }
                                              │  httpchk pg_is_in_recovery via each node's pg-health:8008
                                              │  only the LEADER node is "up"
                                              └─▶ leader node's pgbouncer (session) ─▶ its LOCAL postgres

Each node’s pgbouncer databases.ini → its local postgres: STATIC, never edited at a flip.
A flip becomes: pg_promote() the standby + re-stand-up the old primary. HAProxy’s health check
follows (recovery flips → the new leader’s pgbouncer goes “up”, the old goes “down”); no
databases.ini edit, no HAProxy reconfig, no app redeploy. pgbouncer stays session-mode (advisory
locks); HAProxy is TCP passthrough, so session semantics + locks survive.
Topology — local sidecar, NOT a central box
HAProxy runs as a kamal accessory co-located on the app host (Dallas now; follows the app to
Chicago at W3). It is part of the app’s local stack, exactly like heatwave-pgbouncer — so it is
not a new central SPOF: if it dies, the blast radius is that one app host (same as the local
pgbouncer dying), and the app host is already the failure unit. Multiple app hosts ⇒ one HAProxy each.
Pieces to build
1. Per-postgres-node health endpoint (mechanism B, self-detecting)
A tiny HTTP sidecar on each postgres host that runs SELECT pg_is_in_recovery() and returns
200 (primary, recovery=false) / 503 (standby, recovery=true). Built as the owned
docker/pg-health (alpine + socat + psql, ~15 lines — no third-party image in the DB path).
Kamal: accessories pg_health (Dallas heatwave-postgres) + pg_health_replica (Chicago
heatwave-postgres-replica), publishing :8008 on the tailnet IP only. Read-only, zero write impact —
safe to deploy first and independently (inert until kamal accessory boot).
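The mapping the sidecar implements is small enough to sketch. The version below is an illustration in Python, not the owned docker/pg-health image (which is alpine + socat + psql): `probe` is a hypothetical callable standing in for running SELECT pg_is_in_recovery(), and answering 503 when the probe itself fails is an assumption about the desired fail-closed behaviour.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable


def status_for(in_recovery: bool) -> int:
    """200 when the node is primary (recovery=false), 503 when it is a standby."""
    return 503 if in_recovery else 200


def make_handler(probe: Callable[[], bool]) -> type[BaseHTTPRequestHandler]:
    """Build a request handler around `probe`, which returns pg_is_in_recovery()."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            try:
                code = status_for(probe())
            except Exception:
                # Assumption: an unreachable postgres must never look like a primary.
                code = 503
            body = b"primary\n" if code == 200 else b"not primary\n"
            self.send_response(code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args) -> None:
            # HAProxy polls every 2 s; stay silent per check.
            pass

    return Handler
```

Serving it would be a single call along the lines of HTTPServer((bind_ip, 8008), make_handler(probe)).serve_forever(), with bind_ip being the tailnet address.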
2. HAProxy accessory (heatwave-haproxy) on the app host
TCP frontend on :6433 (local); backend = both postgres nodes, each check via option httpchk
against its pg-health sidecar. Sketch (config/haproxy/production.cfg):

    global
      maxconn 4000

    defaults
      mode tcp
      timeout connect 3s
      timeout client  12h   # session-mode + session-scoped advisory locks: do NOT reap long-lived conns
      timeout server  12h
      default-server inter 2s rise 2 fall 3 on-marked-down shutdown-sessions

    listen pg-primary
      bind 0.0.0.0:6433
      option httpchk GET /            # 200 = primary, 503 = standby
      http-check expect status 200
      # All nodes in one pool; the health check keeps ONLY the current leader "up". Targets are the
      # per-node PGBOUNCERS (6432); the httpchk hits each node's pg-health (8008 → its postgres).
      server dallas  100.123.47.52:6432 check port 8008   # → Dallas heatwave-pgbouncer → local pg
      server chicago 100.68.157.49:6432 check port 8008   # → Chicago heatwave-pgbouncer-replica → local pg

on-marked-down shutdown-sessions drops connections to a node the instant it stops being primary
(at a flip) so the app reconnects fast onto the new primary.
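The inter/rise/fall values translate into detection delays by simple multiplication. The sketch below is a rough estimate only: it ignores check timeouts and where in the polling interval the failure lands, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckTiming:
    """Approximate HAProxy state-change delays from the default-server check settings."""

    inter_s: float = 2.0  # seconds between checks (inter 2s)
    rise: int = 2         # consecutive successes before a server is marked UP
    fall: int = 3         # consecutive failures before a server is marked DOWN

    def mark_down_after_s(self) -> float:
        return self.fall * self.inter_s

    def mark_up_after_s(self) -> float:
        return self.rise * self.inter_s


if __name__ == "__main__":
    timing = CheckTiming()
    print(f"DOWN after ~{timing.mark_down_after_s():.0f} s, UP after ~{timing.mark_up_after_s():.0f} s")
```

With the values in the sketch this gives roughly 6 s to mark a dead node DOWN and 4 s to mark a new primary UP, which is the trade-off between flap resistance and failover speed.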
⚠️ This sketch is the pre-implementation design. The SHIPPED config/haproxy/production.cfg differs
in one critical way: the LOCAL node (Dallas) is addressed by KAMAL-NET DNS, not its host tailnet IP.
A container cannot hairpin to a port its OWN host published on the tailnet interface: the first prod
boot used 100.123.47.52:… for Dallas and came up with the primary backend DOWN (L4 timeout:
container → 100.123.47.52:8008 timed out, while container → heatwave-pg-health:8008 returned 200).
The real config therefore reaches Dallas via heatwave-pgbouncer/heatwave-pg-health, which forces the
track’d-probe pattern (HAProxy forbids a hostname in check addr) plus a resolvers docker section,
and reaches only the REMOTE node (Chicago) by its tailnet IP 100.68.157.49. Fixed in ad45d9f12b.
See the Risks section.
3. Point the app at HAProxy; each pgbouncer → its LOCAL postgres
App DATABASE_HOST → heatwave-haproxy:6433. Each node’s pgbouncer databases.ini stays pointed at
its LOCAL postgres (Dallas heatwave-postgres, Chicago heatwave-postgres-replica) — STATIC, never
edited at a flip. ⚠️ Chicago needs a RW pgbouncer co-located with its postgres for when it is leader;
heatwave-pgbouncer-replica already fronts that node — confirm it serves the RW path when promoted
(it pools to postgres regardless of recovery state), or add a dedicated RW pooler.
4. Slim the failover runbook
The 202606081218 ping-pong flip step “edit databases.ini + reload + repoint VIP” collapses to just
promote + re-stand-up the old primary; HAProxy re-routes off the recovery-state check. Keep the
Tailscale VIP as the stable cross-DC address in front of HAProxy (or fold it in).
⚠️ The one safety invariant: single-primary
Health-based routing is only as safe as the guarantee that exactly one node reports primary. If a
partition let both be writable (split-brain), both pg-health endpoints return 200 → HAProxy would
balance writes across both = data divergence. Mitigation: the controlled failover discipline —
demote/fence the old primary before promoting the standby (never two primaries). This is why we are
NOT doing cross-DC automatic failover yet. Patroni (below) replaces this discipline with etcd
consensus + fencing.
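The invariant can be stated as a check over the nodes' recovery states. This is a minimal sketch, not the logic of any shipped script; the enum and function names are hypothetical, and the input is assumed to be each reachable node's pg_is_in_recovery() result.

```python
from enum import Enum


class Topology(Enum):
    OK = "exactly one primary"
    NO_PRIMARY = "no primary"
    SPLIT_BRAIN = "more than one primary"


def classify(in_recovery_by_node: dict[str, bool]) -> Topology:
    """Classify a cluster from node name -> pg_is_in_recovery() result."""
    primaries = [node for node, in_recovery in in_recovery_by_node.items() if not in_recovery]
    if len(primaries) == 1:
        return Topology.OK
    return Topology.NO_PRIMARY if not primaries else Topology.SPLIT_BRAIN


if __name__ == "__main__":
    print(classify({"dallas": False, "chicago": True}).value)
    print(classify({"dallas": False, "chicago": False}).value)
```

Note the limit of such a check: it only sees nodes it can reach, so during a partition it cannot prove the invariant holds. That gap is what fencing and consensus close.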
Patroni path (why this is a load-bearing investment)
When a same-DC HA pair + a 3rd-site witness exist and seconds-RTO is wanted:
- Unchanged: the heatwave-haproxy accessory, the app → haproxy → pgbouncer path, the app config,
the whole app-facing layer. The investment carries forward verbatim.
- Changes: (a) swap HAProxy’s httpchk endpoint from the pg_is_in_recovery sidecar to Patroni’s
REST API (GET /primary → 200 on the leader), a one-line backend edit; (b) Patroni/etcd replaces
the manual controlled-failover script (automatic leader election + fencing, kills split-brain);
(c) add the 3rd node. HAProxy + Patroni is the canonical pattern, so nothing here is throwaway.
Staged rollout (live DB path is touched ONLY at Phase 3)
0. Validate off-prod. Build pg-health + the HAProxy cfg; on staging (single node) confirm the
health endpoint reports primary and HAProxy marks it up. Exercise the routing logic against a
throwaway 2nd PG (promote it, watch HAProxy flip). (pg-health validated on staging 2026-06-12:
curl 100.123.47.52:8008 → 200 primary.)
1. Deploy pg-health sidecars on both prod postgres hosts. Read-only; verify
curl 100.123.47.52:8008 → 200 (Dallas primary) and curl 100.68.157.49:8008 → 503 (Chicago standby).
Zero app impact.
2. Deploy the heatwave-haproxy accessory (not yet in the app path). ✅ BUILT + STAGING-VALIDATED
2026-06-12. config/haproxy/{production,staging}.cfg + the haproxy accessory in
deploy.yml/deploy.staging.yml (image haproxy:3.0-alpine, tag+index-digest pinned; TCP :6433,
stats :8404). On staging: httpchk L7OK → pg-primary/staging UP,
pg_isready -h heatwave-staging-haproxy -p 6433 → accepting connections; a simulated leader loss
(docker stop pg-health) took the server out of rotation after fall 3 (~6 s) and rise 2 (~4 s)
restored it. PROD ROLLOUT DONE 2026-06-12 (reboot pg_health + pg_health_replica → v3, then boot
haproxy). ⚠️ The first boot used the IP-for-both-nodes config and came up with the primary backend
DOWN (the hairpin bug; see the sketch note in §2 + Risks); after the ad45d9f12b fix (Dallas via
kamal-net DNS, Chicago via IP) and kamal accessory reboot haproxy, VERIFIED on
100.123.47.52:8404/;csv: pg-primary/dallas UP (L7OK, primary) · pg-primary/chicago DOWN (503,
standby), i.e. write path on the leader.
3. Cut the app over to HAProxy (app DATABASE_HOST + _VERSIONS → heatwave-haproxy:6433). ✅ PROD
DONE 2026-06-12 (d00f89d03f) behind a kamal app maintenance window: pre-verified the deploy-role
path through :6433 → both DBs on the Dallas primary, then (in maintenance) verified Account.count
through HAProxy, went live, 0 DB-error log lines, real CRM traffic + active HAProxy sessions.
Transparent: same Dallas primary, +sub-ms hop. (Staging cut over earlier.) Revert = both hosts
back to heatwave-pgbouncer, ports 6432.
4. Failover drill (off-hours). pg_promote Chicago → confirm HAProxy reroutes the write path, app
reconnects onto Chicago, writes land there; then fail back. Validates the auto-reroute end-to-end.
Staging failover rehearsal (two-node staging — built 2026-06-12)
To rehearse a real primary→standby promotion before the prod Phase-4 drill, staging now runs a
same-host streaming standby + its own pg-health + pgbouncer, fronted by a two-node staging.cfg.
Topology (all on the staging box 100.123.47.52, reached by kamal-net DNS):
- Primary: heatwave-staging-postgres (pg-health :8008 → 200) + heatwave-staging-pgbouncer (:6432).
- Standby: heatwave-staging-postgres-replica (:5433, pg-health :8009 → 503 while following) +
heatwave-staging-pgbouncer-replica (:6434).

The three accessories OVERRIDE the inherited prod *_replica keys in deploy.staging.yml (so
-d staging never touches Chicago); the standby reuses the primary’s tuning via the
*pg_staging_cmd YAML anchor.
Bootstrap (one-time, idempotent — pg_basebackup; verified streaming replay_lag ~9 ms). Uses
trust-replication scoped to the kamal subnet 172.18.0.0/16 — staging-only, internal Docker net,
so no password-in-conninfo handling. On the primary: ALTER SYSTEM SET max_slot_wal_keep_size='10GB'
(so a dead standby can’t fill the disk) + a staging_standby physical slot + a
host replication deploy 172.18.0.0/16 trust pg_hba line (reload, no restart). Then
pg_basebackup -R -S staging_standby -X stream into /data/postgres/pg18-standby, and
kamal accessory boot postgres_replica pg_health_replica pgbouncer_replica -d staging. The standby
pooler’s databases.ini is the primary’s with the host swapped to heatwave-staging-postgres-replica.
The drill (repeatable). ⚠️ The staging app talks DIRECTLY to heatwave-staging-pgbouncer, so
cut it to heatwave-staging-haproxy:6433 first (staging Phase 3) or expect a brief app DB blip:
- Baseline: curl 127.0.0.1:8404/\;csv → pg-primary/primary UP, /replica DOWN.
- Controlled failover (demote BEFORE promote, per the single-primary invariant): stop
heatwave-staging-postgres → HAProxy marks primary DOWN; then
docker exec heatwave-staging-postgres-replica psql -U deploy -c "SELECT pg_promote()" → its
pg-health flips to 200 → HAProxy marks replica UP. Write path moves to the replica.
- Verify a write through haproxy:6433 lands on the (newly-primary) replica.
- Restore: rebuild heatwave-staging-postgres as a standby of the replica (re-run the bootstrap with
primary/standby swapped) then switch back. OR, if the app is on HAProxy, leave it flipped (HAProxy
doesn’t care which node is primary).
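The baseline and post-flip checks read HAProxy's ;csv stats export, which can be asserted on programmatically. The sketch below parses by column name rather than position; SAMPLE is a hand-written, heavily truncated illustration (the real export has many more columns), and the function name is hypothetical.

```python
import csv
import io

# Illustrative, truncated stand-in for the real export (which has dozens of columns).
SAMPLE = (
    "# pxname,svname,status,\n"
    "pg-primary,FRONTEND,OPEN,\n"
    "pg-primary,primary,UP,\n"
    "pg-primary,replica,DOWN,\n"
    "pg-primary,BACKEND,UP,\n"
)


def server_status(stats_csv: str, proxy: str) -> dict[str, str]:
    """Map server name -> status for one proxy, skipping the FRONTEND/BACKEND summary rows."""
    text = stats_csv.lstrip()
    if text.startswith("# "):
        text = text[2:]  # the header line is prefixed with "# "
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["svname"]: row["status"]
        for row in reader
        if row["pxname"] == proxy and row["svname"] not in ("FRONTEND", "BACKEND")
    }


if __name__ == "__main__":
    print(server_status(SAMPLE, "pg-primary"))
```

A drill script could assert that exactly one server is UP before and after the flip, and that it is a different one.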
Recovery toolkit — bin/recovery
A gum-menu operational toolkit (bin/recovery [staging|production]), extensible per scenario:
- Flip database primary↔standby: the PLANNED-switchover form of the failover. Sequenced so it
never routes onto the dead node (the failover transition window described under Risks / gotchas):
detect roles → CHECKPOINT + record the old primary’s LSN → drain the old node via the Runtime API
admin socket (state maint + shutdown sessions) → stop it cleanly → wait for the standby to replay
through that LSN (no data loss) → pg_promote → wait for HAProxy to mark the new primary UP →
rebuild the demoted node as a fresh streaming standby (pg_basebackup) → return the old node to
ready. Idempotent role detection (never assumes which node is primary); refuses split-brain /
no-primary. The drain is best-effort: if the socket is unreachable the flip degrades to
health-check timing.
- Show DB topology: read-only view of each node’s recovery state, pg-health, HAProxy server status.
- Flip app stack to another host / Valkey failover: stubs (intent documented in-script).
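The "wait for the standby to replay through that LSN" step depends on comparing LSNs numerically, not as strings. A minimal sketch follows, assuming the textual pg_lsn form of two hex numbers separated by a slash (high and low 32 bits); `read_replay_lsn` is a hypothetical callable that would return the standby's pg_last_wal_replay_lsn().

```python
import time
from typing import Callable


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' to its 64-bit integer position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replayed_through(target_lsn: str, replay_lsn: str) -> bool:
    return lsn_to_int(replay_lsn) >= lsn_to_int(target_lsn)


def wait_for_replay(target_lsn: str,
                    read_replay_lsn: Callable[[], str],
                    timeout_s: float = 30.0,
                    poll_s: float = 0.5) -> bool:
    """Poll until the standby has replayed past target_lsn; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if replayed_through(target_lsn, read_replay_lsn()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A plain string comparison would order '10/0' before '9/0', which is why the conversion matters. Doing the comparison in SQL with the pg_lsn type avoids the issue altogether.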
-y / RECOVERY_YES=1 runs an action non-interactively (bin/recovery staging flip-db -y) for
automation. Env-config-driven (the two nodes’ hosts / containers / data dirs / HAProxy server names).
The flip drains the old node via an in-container Runtime API admin socket
(stats socket /var/lib/haproxy/admin.sock level admin, reached by docker exec … socat — no network
port, so kamal-net peers can’t touch it) before promoting, so HAProxy never routes onto the dead node.
VERIFIED under concurrent load on staging: the demoted node went straight to MAINT (never “UP
while dead”), zero dead-node errors — only clean refusals during the ~10 s no-primary window, which
a retrying connection pool absorbs. (Truly-zero is impossible under single-primary-safe discipline;
two-primaries would be unsafe.)
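The drain amounts to two Runtime API commands sent in order, with a third to return the node to service later. The helpers below only build the command strings and are a sketch with hypothetical function names; delivering them to the admin socket (here via docker exec … socat) is left out.

```python
def drain_commands(backend: str, server: str) -> list[str]:
    """Commands that take a server out of rotation, then close its existing sessions."""
    target = f"{backend}/{server}"
    return [
        f"set server {target} state maint",
        f"shutdown sessions server {target}",
    ]


def restore_command(backend: str, server: str) -> str:
    """Command that puts a server back into normal health-checked rotation."""
    return f"set server {backend}/{server} state ready"


if __name__ == "__main__":
    for line in drain_commands("pg-primary", "dallas"):
        print(line)
    print(restore_command("pg-primary", "dallas"))
```

Order matters: setting maint first stops new connections being routed to the node, so shutting down sessions afterwards cannot race with fresh ones landing on it.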
Prod readiness (2026-06-12): admin socket applied (haproxy rebooted) + Runtime API verified;
bin/recovery production topology works (it uses the superuser postgres role — prod’s pg_hba is
local all all peer, so deploy fails locally but postgres works). Auto-rebuild IS now wired
(AUTO_REBUILD=1): rebuild_standby is env-driven — staging rebuilds as deploy over trust, prod as
the scram’d replication role, sourcing its password from 1Password
(op://IT/Heatwave-Replication-prod/password, piped to the host over SSH stdin — never on a command
line / in process args / the transcript) and running pg_basebackup -U replication … sslmode=require,
matching the existing Chicago standby’s primary_conninfo. The rebuild mechanism (remote helper +
password-on-stdin) is proven on staging (flipped both ways). Prod Phase 3 done 2026-06-12 (app on
heatwave-haproxy), so a prod flip is app-safe — the app follows the reroute.
Phase 4 prod drill — EXECUTED 2026-06-13 (round-trip Dallas→Chicago→Dallas). The flip + HAProxy reroute worked first try both ways; the cross-DC basebackup ran at ~620 MB/s (the 122 MB/s figure was conservative), so a full rebuild is ~6 min, not ~17. The drill surfaced two latent bugs, both now fixed:
- Stale replication password. op://IT/Heatwave-Replication-prod had drifted out of sync with the
actual replication role, so the auto-rebuild’s pg_basebackup failed auth, and it did so after it
had wiped the demoted node’s data dir → prod went single-node (app stayed up on the new primary).
Fix: rebuild_standby now CAPTURES the password from the live standby’s primary_conninfo
(authoritative, since it’s actively streaming with it); the op item is a fallback only, and was
corrected. Added a rebuild-standby command (re-basebackup a wiped/stale node with no flip), a
per-env REPL_NET (prod cross-DC basebackup uses --network host, because Chicago’s tailnet IP isn’t
on kamal-net), and -c fast.
- pgbouncer stale-cache after a rebuild. A node whose postgres was just wiped+rebuilt leaves its
pgbouncer caching a dead DNS for the old container
(server DNS lookup failed (server_login_retry)), so flipping TO that node serves errors until
pgbouncer restarts (~15 s blip / 3 requests on fail-back). Fix: rebuild_standby now restarts the
rebuilt node’s pgbouncer (A_PGB/B_PGB).
Robust maintenance window — bin/maintenance {up|down} <env>. Quiet-only (web 503 + sidekiq TSTP)
left the app holding pool connections and pgbouncer caching its upstream; the robust window instead STOPS
web + sidekiq + (prod) the databasus PITR agent on Chicago, so nothing touches the DB during a flip and it
all reconnects fresh. Runbook: bin/maintenance up <env> → bin/recovery <env> flip-db →
bin/maintenance down <env>. (PITR note: databasus catches up the brief WAL gap on restart, but a fresh
base backup after a flip is the clean way to re-baseline.)
ZFS rollback net (double safety, 2026-06-13). rebuild_standby’s one irreversible move is wiping the
demoted node’s data dir before the basebackup. It now sudo zfs snapshots that dataset RIGHT BEFORE the
wipe and drops it only once the new standby is confirmed streaming — so a failed rebuild is recoverable
via zfs rollback (+ pg_rewind) instead of leaving the node empty: the storage net under the logical flip
safety. On failure the snapshot is KEPT with a rollback hint; if $odir isn’t on ZFS it warns and proceeds
(fail-safe). sudo df is required — the 0700/uid-999 data dir isn’t statable as deploy. Validated on
staging (snapshot taken → rebuild → streaming → dropped).
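The snapshot-guard ordering can be sketched as below. This illustrates the pattern, not rebuild_standby itself: the function and parameter names are hypothetical, `run` stands in for whatever executes commands on the host and raises on failure, and `rebuild` covers the wipe plus pg_basebackup.

```python
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]  # expected to raise on a non-zero exit


def rebuild_with_snapshot(dataset: str,
                          tag: str,
                          rebuild: Callable[[], None],
                          confirm_streaming: Callable[[], bool],
                          run: Runner) -> bool:
    """Snapshot, rebuild, and drop the snapshot only once streaming is confirmed."""
    snap = f"{dataset}@{tag}"
    run(["sudo", "zfs", "snapshot", snap])  # taken right before the irreversible wipe
    try:
        rebuild()
        ok = confirm_streaming()
    except Exception as exc:
        print(f"rebuild failed: {exc}")
        ok = False
    if ok:
        run(["sudo", "zfs", "destroy", snap])
        return True
    # Failure path: keep the snapshot and tell the operator how to get back.
    print(f"snapshot kept; to restore: sudo zfs rollback {snap}")
    return False
```

The key property is that the destroy is reachable only from the confirmed-streaming path, so any failure in between leaves the snapshot in place.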
Risks / gotchas
- ⚠️ Container→own-host-IP hairpin (BIT US, prod hotfix ad45d9f12b): a kamal accessory container
CANNOT reach a port its OWN host published on the tailnet interface (container →
100.123.47.52:8008 = L4 timeout), even though the host itself curls it fine. So in production.cfg
the LOCAL node (Dallas) is addressed by kamal-net DNS (heatwave-pgbouncer/heatwave-pg-health, via a
track’d probe + resolvers docker), and only the REMOTE node (Chicago, a genuine peer) by its
tailnet IP. The original IP-for-both config booted the primary backend DOWN; staging never caught
it (single-node, already DNS). Applies to ANY container→own-host-published-port pattern in this
fleet.
- Advisory locks / session mode: TCP passthrough + timeout client/server 12h so long-lived session
connections holding session-scoped advisory locks are never reaped. (28 files use them.)
- Health-check flapping: conservative inter 2s rise 2 fall 3; a transient blip must not reroute.
- Failover reconnect storm: expected + brief. Old-primary write conns error on demotion, the app
reconnects, pgbouncer’s reserve pool absorbs the burst (already tuned, reserve_pool 1s).
- ⚠️ Failover transition window (FOUND in the staging drill 2026-06-12): with fall 3 inter 2s the
OLD primary stays UP for ~6 s after it dies (until 3 checks fail) while the NEW primary comes UP,
so HAProxy briefly has BOTH in the pool and load-balances some connections onto the dead old node
→ those error (server closed the connection unexpectedly). The staging drill hit exactly this: the
test write during the window failed, then succeeded once primary went DOWN. For an UNPLANNED
failover it’s unavoidable (the app retries; pgbouncer reserve pool absorbs it). For a PLANNED
switchover this is now FIXED: both configs carry
stats socket /var/lib/haproxy/admin.sock level admin, and bin/recovery’s flip does
set server pg-primary/<old> state maint + shutdown sessions BEFORE promoting, so HAProxy is never
routing onto the old node when it dies (the app gets clean refusals during the brief no-primary
moment, not dead-node errors). Patroni later eliminates even that via consensus + fencing.
- Split-brain: see the safety invariant. Controlled discipline now, Patroni consensus later.
- Cross-DC transient: after a flip, until the whole app stack moves DC, app→primary is the ~21 ms
cross-DC hop. Inherent to the ping-pong model; HAProxy automates routing, it doesn’t remove that.
- pg-health log noise: RESOLVED (pg-health v3, 2026-06-12). The pre-fix socat EXEC relay logged a
benign “connection reset by peer” on every check. It was socat’s relay read()ing the socket after
HAProxy abortively closes a satisfied check, NOT the unread request (v2’s “drain the request”
theory was wrong, kept only for tidy FINs). Fix = EXEC:…,nofork: socat execs health.sh straight
onto the socket with no relay to reset; the per-request bound moved into health.sh (read -t 5 +
PGCONNECT_TIMEOUT=2) since -T no longer applies. Verified on staging: 0 reset lines under
HAProxy’s 2 s polling (was 15/30 s on v2). Prod gets it via the pg_health* reboots above.
References
- 202606112045_DB_TIER_HA_ARCHITECTURE.md (Stage 3), 202606081218_PG18_PROD_PINGPONG_RUNBOOK.md (flip)
- github.com/virtualstaticvoid/pgsql_haproxy (mechanism B), Percona pgsql-check blog (mechanism A)
- Patroni + HAProxy reference architecture (the future-state target)