HAProxy — DB write-VIP failover router
The single point that decides “which Postgres node takes writes right now.”
heatwave-haproxy is a Kamal accessory co-located on the app host; the app’s
DATABASE_HOST points at it, and it TCP-passthrough routes every connection to
the live primary’s PgBouncer. A pg_promote on the standby reroutes the
write path with no databases.ini edit and no app redeploy — HAProxy
follows each node’s health probe.
This is the operator-facing summary. Deep design, the staged rollout, the
staging rehearsal, and every gotcha live in
doc/tasks/202606120820_HAPROXY_ROUTING_LAYER.md.
Config: config/haproxy/production.cfg + config/haproxy/staging.cfg
(authoritative). The haproxy accessory is defined in config/deploy.yml /
config/deploy.staging.yml — image haproxy:3.0-alpine (tag + index digest
pinned). It is not a new central SPOF: it’s part of the app’s local stack
(like the co-located PgBouncer), so one HAProxy per app host; if it dies the
blast radius is that one host, which is already the failure unit.
Topology

    app (Rails) ── DATABASE_HOST=heatwave-haproxy:6433 ──┐
                                                         ▼
    ┌──────────── heatwave-haproxy (Dallas, kamal accessory) ───────────┐
    │  listen pg-primary :6433  (mode tcp, passthrough)                 │
    │  listen stats      :8404  (stats UI + /metrics for netdata)       │
    │                                                                   │
    │  backend = ONE pool of both nodes; httpchk keeps only the         │
    │  current LEADER "up" → all writes go to that node's pgbouncer     │
    └───────────────┬──────────────────────────────────┬────────────────┘
         track 200  │ pg-health                        │ pg-health  track 503
                    ▼ (LOCAL: kamal-net DNS)           ▼ (REMOTE: tailnet IP)
    ┌──── dal-01 (100.123.47.52) ────┐  ┌───── chi-02 (100.68.157.49) ─────┐
    │ heatwave-pg-health :8008       │  │ heatwave-pg-health-replica :8008 │
    │   GET / → 200 "primary"        │  │   GET / → 503 "standby"          │
    │ heatwave-pgbouncer :6432  ◀ UP │  │ heatwave-pgbouncer-replica :6432 │
    │ heatwave-postgres :5432        │  │ heatwave-postgres-replica :5432  │
    │   (PG18 PRIMARY, RW)           │  │   (PG18 STANDBY)  ◀ DOWN         │
    └────────────────────────────────┘  └──────────────────────────────────┘

Routing-decision logic
- One pool, health-gated. Both Postgres nodes sit in a single backend. A
  per-node pg-health sidecar runs SELECT pg_is_in_recovery() and answers HTTP
  200 "primary" (recovery=false) / 503 "standby" (recovery=true) on :8008.
  HAProxy option httpchk (GET /, expect 200) marks only the node answering 200
  as “up”, so exactly the current leader’s PgBouncer takes every write.
- A flip is just promote + rebuild. pg_promote on the standby flips its
  recovery state → its pg-health goes 200 → HAProxy marks it up and the old
  primary down. Each node’s PgBouncer points only at its local Postgres
  (static, never edited at a flip), so nothing downstream reconfigures.
- Session-safe passthrough. mode tcp + timeout client/server 12h, so
  session-mode pooling and session-scoped advisory locks (28 call sites) pass
  through untouched and long-lived idle connections are not reaped until they
  have been idle for 12 h.
- Flap guard. default-server inter 2s rise 2 fall 3 on-marked-down
  shutdown-sessions: a transient blip can’t reroute, and connections to a node
  are dropped as soon as HAProxy marks it down, so the app reconnects onto the
  new leader fast.
- Single-primary is the one safety invariant. Health-based routing is only
  safe while exactly one node reports primary; two writable nodes
  (split-brain) would both return 200 and HAProxy would balance writes across
  both. Mitigation is the controlled flip discipline (demote/fence the old
  primary before promoting), which is why cross-DC automatic failover is not
  enabled yet.
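
The health-gating and flap-guard rules above can be sketched as a toy state machine. This is a simplified model for reasoning about rise 2 / fall 3, not HAProxy's actual implementation: the Server class, drain() (standing in for the maint state that flip-db sets) and simulate() are hypothetical names, and the node names are taken from the topology diagram.

```python
from dataclasses import dataclass
from typing import Iterable, List


def pg_health_status(in_recovery: bool) -> int:
    """Status the pg-health sidecar answers: 200 for primary, 503 for standby."""
    return 503 if in_recovery else 200


@dataclass
class Server:
    """Toy model of one backend server under 'rise 2 fall 3' health checks."""
    name: str
    up: bool
    rise: int = 2
    fall: int = 3
    streak: int = 0  # consecutive probes that contradict the current state

    def observe(self, status_code: int) -> bool:
        """Feed one probe result (one 'inter' tick); return the up/down state."""
        healthy = status_code == 200
        if healthy == self.up:
            self.streak = 0
            return self.up
        self.streak += 1
        needed = self.fall if self.up else self.rise
        if self.streak >= needed:
            self.up = healthy
            self.streak = 0
        return self.up

    def drain(self) -> None:
        """Assumption: models 'set server ... state maint' as an immediate down."""
        self.up = False
        self.streak = 0


def writable(servers: Iterable[Server]) -> List[str]:
    """Names of the servers that would currently receive writes."""
    return [s.name for s in servers if s.up]


def simulate() -> List[List[str]]:
    """A one-probe blip on the primary, then a planned drain + promote."""
    dal = Server("dal-01", up=True)
    chi = Server("chi-02", up=False)
    history = []
    # One failed probe on the primary, then a good one: no reroute.
    for code in (503, 200):
        dal.observe(code)
        chi.observe(pg_health_status(in_recovery=True))
        history.append(writable([dal, chi]))
    # Planned flip: drain the old primary first, then promote the standby.
    dal.drain()
    history.append(writable([dal, chi]))
    for _ in range(chi.rise):
        chi.observe(pg_health_status(in_recovery=False))
        history.append(writable([dal, chi]))
    return history
```

In this model the write path is briefly empty between the drain and the second good probe on the promoted node, and at no tick are two nodes writable at once, which is the single-primary invariant the last bullet describes.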
Addressing gotcha (load-bearing)
A container cannot hairpin to a port its own host published on the
tailnet interface. So in production.cfg:
- Dallas (LOCAL) is reached by kamal-net DNS (heatwave-pgbouncer /
  heatwave-pg-health), which forces a track’d probe backend (HAProxy forbids
  a hostname in check addr) plus a resolvers docker section.
- Chicago (REMOTE), a genuine peer, is reached by its tailnet IP
  100.68.157.49.
The first prod boot used the host IP for both nodes and came up with the primary
backend DOWN (L4 timeout to its own 100.123.47.52:8008); fixed in
ad45d9f12b (“HAProxy prod — reach the LOCAL node by kamal-net DNS, not its host
IP”, direct-to-master 2026-06-12). Revisit the DNS-vs-IP split only at a DC
migration, never at a flip.
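
The first-boot failure described above surfaced as an L4 timeout, so a plain TCP connect probe run from inside the HAProxy container is one way to check the addressing before trusting a config change. The following is a minimal sketch under assumptions: probe() and TARGETS are hypothetical, and the host/port pairs are read off the topology diagram rather than taken from production.cfg.

```python
import socket
from typing import List, Tuple

# Assumed targets, taken from the topology diagram (not from production.cfg).
TARGETS: List[Tuple[str, int]] = [
    ("heatwave-pg-health", 8008),   # LOCAL node via kamal-net DNS
    ("heatwave-pgbouncer", 6432),   # LOCAL node via kamal-net DNS
    ("100.68.157.49", 8008),        # REMOTE node via tailnet IP
    ("100.68.157.49", 6432),        # REMOTE node via tailnet IP
    ("100.123.47.52", 8008),        # own host IP: the hairpin case
]


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connect attempt.

    Returns 'open', 'timeout', 'refused', 'dns', or 'error'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except socket.gaierror:
        return "dns"
    except OSError:
        return "error"


def report(targets: List[Tuple[str, int]] = TARGETS) -> List[str]:
    """One line per target, e.g. 'heatwave-pg-health:8008 open'."""
    return [f"{host}:{port} {probe(host, port)}" for host, port in targets]
```

Run inside the accessory container, report() would be expected to show a timeout only for the own-host-IP entry if the hairpin limitation described above applies; any other non-open result points at a different problem.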
Ops commands
- bin/recovery <env> topology: read-only view of each node’s recovery state,
  pg-health, and HAProxy server status. Start here.
- bin/recovery <env> flip-db: planned primary↔standby switchover. Drains the
  old node via HAProxy’s in-container Runtime API admin socket (set server …
  state maint + shutdown sessions) before promoting, so HAProxy never routes
  onto the dying node; waits for the standby to replay the old primary’s LSN
  (no data loss), then pg_promote, then rebuilds the demoted node as a fresh
  streaming standby. (-y / RECOVERY_YES=1 for non-interactive.)
- bin/recovery <env> rebuild-standby: re-basebackup a wiped/stale node with no
  flip. Runs sudo zfs snapshot on the demoted dataset right before the wipe
  (kept on failure as a zfs rollback net) and restarts the rebuilt node’s
  PgBouncer to clear a stale upstream-DNS cache.
- bin/maintenance {up,down} <env>: full maintenance window. up = proxy 503 →
  stop web + sidekiq (+ on prod the Chicago databasus controller container);
  down reverses it. Wrap a flip in it: bin/maintenance up <env> →
  bin/recovery <env> flip-db → bin/maintenance down <env>.
The Phase-4 prod drill (round-trip Dallas→Chicago→Dallas) passed — the flip + reroute worked both directions; cross-DC basebackup ran ~620 MB/s (~6 min full rebuild).
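
The "wrap a flip in a maintenance window" sequence from the command list can be sketched as a small driver. The command lines are the ones listed above; flip_plan() and run_plan() are hypothetical helpers, the leading and trailing topology calls are an optional addition, and stopping at the first failure (leaving the window up for the operator to inspect) is an assumed policy, not documented behaviour of bin/recovery or bin/maintenance.

```python
import subprocess
from typing import Callable, List, Sequence


def flip_plan(env: str) -> List[List[str]]:
    """Ordered commands for a planned flip inside a maintenance window."""
    return [
        ["bin/recovery", env, "topology"],     # read-only look before
        ["bin/maintenance", "up", env],
        ["bin/recovery", env, "flip-db"],
        ["bin/maintenance", "down", env],
        ["bin/recovery", env, "topology"],     # confirm the new leader
    ]


def run_plan(
    plan: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run each step in order; stop at the first non-zero exit code.

    Assumption: on failure the remaining steps (including 'maintenance
    down') are skipped so an operator can inspect the state first.
    """
    for number, cmd in enumerate(plan, start=1):
        print(f"[{number}/{len(plan)}] {' '.join(cmd)}")
        rc = runner(cmd)
        if rc != 0:
            print(f"step {number} exited with {rc}; stopping")
            return rc
    return 0


# Usage (from the repo root), for example:
#   run_plan(flip_plan("staging"))
```

Passing a fake runner makes the plan easy to dry-run: run_plan(flip_plan("staging"), runner=lambda cmd: 0) prints the steps without executing anything.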
Stats UI
http://100.123.47.52:8404/ (tailnet only) shows which backend is UP;
invaluable for watching the write path move during a flip. The same port serves
/metrics (Prometheus) for the netdata go.d/haproxy collector. CSV view:
http://100.123.47.52:8404/;csv.
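
The CSV view lends itself to a scripted "who takes writes right now" check. The sketch below parses HAProxy's CSV stats by column name and flags anything other than exactly one UP server. It assumes the servers appear under a proxy named pg-primary (the listen name in the diagram); the real proxy and server names live in production.cfg, and up_servers() / check_single_writer() are hypothetical helpers.

```python
import csv
import io
import urllib.request
from typing import Dict, List

# The tailnet-only CSV endpoint quoted above.
STATS_CSV_URL = "http://100.123.47.52:8404/;csv"


def parse_stats_csv(text: str) -> List[Dict[str, str]]:
    """Parse HAProxy's CSV stats dump into one dict per row."""
    cleaned = text.lstrip()
    # The header line is prefixed with "# " ("# pxname,svname,...").
    if cleaned.startswith("# "):
        cleaned = cleaned[2:]
    return list(csv.DictReader(io.StringIO(cleaned)))


def up_servers(rows: List[Dict[str, str]], proxy: str) -> List[str]:
    """Server names under `proxy` whose status starts with UP."""
    names = []
    for row in rows:
        if row.get("pxname") != proxy:
            continue
        if row.get("svname") in ("FRONTEND", "BACKEND"):
            continue  # aggregate rows, not individual servers
        # Status may be "UP", "DOWN", "MAINT", or transitional ("UP 1/3").
        if (row.get("status") or "").startswith("UP"):
            names.append(row["svname"])
    return names


def check_single_writer(text: str, proxy: str = "pg-primary") -> str:
    """Summarise the write path; anything but one UP server is an alert."""
    up = up_servers(parse_stats_csv(text), proxy)
    if len(up) == 1:
        return f"OK: writes route to {up[0]}"
    if not up:
        return "ALERT: no server is UP"
    return "ALERT: more than one server UP: " + ", ".join(up)


def fetch_stats_csv(url: str = STATS_CSV_URL, timeout: float = 5.0) -> str:
    """Fetch the CSV stats page (reachable from the tailnet only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

From a tailnet host, check_single_writer(fetch_stats_csv()) gives a one-line answer that can be polled during a flip to watch the write path move.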
See also
- doc/tasks/202606120820_HAPROXY_ROUTING_LAYER.md: full design,
  PgBouncer-placement rationale, staged rollout, staging rehearsal,
  bin/recovery / bin/maintenance internals, the Patroni future path, and
  every risk/gotcha.
- doc/tasks/202606081218_PG18_PROD_PINGPONG_RUNBOOK.md: the PG18
  primary↔standby ping-pong / failover runbook (HAProxy collapsed its manual
  “edit databases.ini + reload + repoint VIP” step to just promote + rebuild).
- INFRASTRUCTURE_INVENTORY.md: fleet, full port-exposure map, and the
  data-tier write/read path in context.
- PGBOUNCER.md: the per-node session-mode poolers HAProxy routes to.