HAProxy — DB write-VIP failover router
The single point that decides "which Postgres node takes writes right now."
heatwave-haproxy is a Kamal accessory co-located on the app host; the app's
DATABASE_HOST points at it, and it TCP-passthrough routes every connection to
the live primary's PgBouncer. A pg_promote on the standby reroutes the
write path with no databases.ini edit and no app redeploy — HAProxy
follows each node's health probe.
This is the operator-facing summary. Deep design, the staged rollout, the
staging rehearsal, and every gotcha live in
doc/tasks/202606120820_HAPROXY_ROUTING_LAYER.md.
Config: config/haproxy/production.cfg + config/haproxy/staging.cfg
(authoritative). The haproxy accessory is defined in config/deploy.yml /
config/deploy.staging.yml — image haproxy:3.0-alpine (tag + index digest
pinned). It is not a new central SPOF: it's part of the app's local stack
(like the co-located PgBouncer), so one HAProxy per app host; if it dies the
blast radius is that one host, which is already the failure unit.
Topology
app (Rails) ── DATABASE_HOST=heatwave-haproxy:6433 ──┐
                                                     ▼
┌──────────── heatwave-haproxy (Dallas, kamal accessory) ─────────────┐
│ listen pg-primary :6433 (mode tcp, passthrough)                     │
│ listen stats :8404 (stats UI + /metrics for netdata)                │
│                                                                     │
│ backend = ONE pool of both nodes; httpchk keeps only the            │
│ current LEADER "up" → all writes go to that node's pgbouncer        │
└───────────────┬──────────────────────────────┬──────────────────────┘
      track 200 │ pg-health                    │ pg-health track 503
                ▼ (LOCAL: kamal-net DNS)       ▼ (REMOTE: tailnet IP)
┌──── dal-01 (100.123.47.52) ────┐  ┌──── chi-02 (100.68.157.49) ─────┐
│ heatwave-pg-health :8008       │  │ heatwave-pg-health-replica :8008│
│ GET / → 200 "primary"          │  │ GET / → 503 "standby"           │
│ heatwave-pgbouncer :6432 ◀ UP  │  │ heatwave-pgbouncer-replica :6432│
│ heatwave-postgres :5432        │  │ heatwave-postgres-replica :5432 │
│ (PG18 PRIMARY, RW)             │  │ (PG18 STANDBY) ◀ DOWN           │
└────────────────────────────────┘  └─────────────────────────────────┘
Routing-decision logic
- One pool, health-gated. Both Postgres nodes sit in a single backend. A
per-node pg-health sidecar runs SELECT pg_is_in_recovery() and answers
HTTP 200 "primary" (recovery=false) / 503 "standby" (recovery=true) on
:8008. HAProxy option httpchk (GET /, expect 200) marks only the
node answering 200 as "up", so exactly the current leader's PgBouncer takes
every write.
- A flip is just promote + rebuild. pg_promote on the standby flips its
recovery state → its pg-health goes 200 → HAProxy marks it up and the old
primary down. Each node's PgBouncer points only at its local Postgres
(static, never edited at a flip), so nothing downstream reconfigures.
- Session-safe passthrough. mode tcp + timeout client/server 12h, so
session-mode pooling and session-scoped advisory locks (28 call sites) pass
through untouched and long-lived idle connections are never reaped.
- Flap guard. default-server inter 2s rise 2 fall 3 on-marked-down
shutdown-sessions: a transient blip can't reroute, and connections to a node
are dropped as soon as HAProxy marks it down (after 3 consecutive failed
checks, 2 s apart), so the app reconnects onto the new leader fast.
- Single-primary is the one safety invariant. Health-based routing is only
safe while exactly one node reports primary; two writable nodes (split-brain)
would both return 200 and HAProxy would balance writes across both. Mitigation
is the controlled flip discipline (demote/fence the old primary before
promoting), which is why cross-DC automatic failover is not enabled yet.
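The routing rules above can be modelled in a few lines. This is a minimal sketch, not the sidecar's or HAProxy's actual code: the function and class names are hypothetical, and the rise/fall counter is a simplified approximation of HAProxy's health-check state machine, assuming rise 2 / fall 3 as in the flap guard.

```python
from __future__ import annotations

from dataclasses import dataclass


def health_response(in_recovery: bool) -> tuple[int, str]:
    """Map the result of pg_is_in_recovery() to the sidecar's HTTP answer."""
    return (503, "standby") if in_recovery else (200, "primary")


@dataclass
class ServerState:
    """Simplified rise/fall tracker for one backend server (hypothetical)."""

    up: bool = False
    streak: int = 0  # consecutive checks that disagree with the current state

    def observe(self, status: int, rise: int = 2, fall: int = 3) -> bool:
        ok = status == 200
        if ok == self.up:
            # The check agrees with the current state: reset the counter.
            self.streak = 0
        else:
            self.streak += 1
            # Going up needs `rise` good checks, going down `fall` bad ones.
            if self.streak >= (rise if ok else fall):
                self.up, self.streak = ok, 0
        return self.up


def write_targets(states: dict[str, ServerState]) -> list[str]:
    """Names of the servers that would receive writes right now."""
    return sorted(name for name, state in states.items() if state.up)
```

In this model a single failed probe leaves the server up, and write_targets returning more than one name corresponds to the split-brain case the last bullet warns about.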
Addressing gotcha (load-bearing)
A container cannot hairpin to a port its own host published on the
tailnet interface. So in production.cfg:
- Dallas (LOCAL) is reached by kamal-net DNS (heatwave-pgbouncer /
heatwave-pg-health), which forces a track'd probe backend (HAProxy
forbids a hostname in check addr) plus a resolvers docker section.
- Chicago (REMOTE), a genuine peer, is reached by its tailnet IP
100.68.157.49.
The first prod boot used the host IP for both nodes and came up with the primary
backend DOWN (L4 timeout to its own 100.123.47.52:8008); fixed in
ad45d9f12b ("HAProxy prod — reach the LOCAL node by kamal-net DNS, not its host
IP", direct-to-master 2026-06-12). Revisit the DNS-vs-IP split only at a DC
migration, never at a flip.
Ops commands
- bin/recovery <env> topology (read-only): each node's recovery state,
pg-health, and HAProxy server status. Start here.
- bin/recovery <env> flip-db: planned primary↔standby switchover. Drains the
old node via HAProxy's in-container Runtime API admin socket
(set server … state maint + shutdown sessions) before promoting, so
HAProxy never routes onto the dying node; waits for the standby to replay the
old primary's LSN (no data loss), then pg_promote, then rebuilds the demoted
node as a fresh streaming standby. (-y / RECOVERY_YES=1 for non-interactive.)
- bin/recovery <env> rebuild-standby: re-basebackup a wiped/stale node with
no flip. Takes a sudo zfs snapshot of the demoted dataset right before the wipe
(kept on failure as a zfs rollback net), and restarts the rebuilt node's
PgBouncer to clear a stale upstream-DNS cache.
- bin/maintenance {up,down} <env>: full maintenance window. up = proxy 503 →
stop web + sidekiq (+ on prod the Chicago databasus controller container);
down reverses it. Wrap a flip in it: bin/maintenance up <env> →
bin/recovery <env> flip-db → bin/maintenance down <env>.
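Two steps of flip-db lend themselves to a short illustration: the drain commands sent to the Runtime API and the "standby has replayed the old primary's LSN" gate. The sketch below is an assumption about how such checks could look, not the internals of bin/recovery; the function names and the backend/server names in the usage note are hypothetical. The LSN arithmetic follows the standard Postgres pg_lsn text format (two hexadecimal halves separated by a slash).

```python
from __future__ import annotations


def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def standby_caught_up(primary_lsn: str, replay_lsn: str) -> bool:
    """True once the standby has replayed at least up to the primary's LSN."""
    return parse_lsn(replay_lsn) >= parse_lsn(primary_lsn)


def drain_commands(backend: str, server: str) -> list[str]:
    """Runtime API commands that stop routing to a server and drop its sessions."""
    target = f"{backend}/{server}"
    return [
        f"set server {target} state maint",
        f"shutdown sessions server {target}",
    ]
```

For example, drain_commands("pg-primary", "dal-01") yields the two strings to write to the admin socket (server name assumed), and promotion would only proceed once standby_caught_up(...) returns True for the LSN captured after the drain.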
The Phase-4 prod drill (round-trip Dallas→Chicago→Dallas) passed — the flip +
reroute worked both directions; cross-DC basebackup ran ~620 MB/s (~6 min full
rebuild).
Stats UI
http://100.123.47.52:8404/ (tailnet only) — shows which backend is UP;
invaluable for watching the write path move during a flip. The same port serves
/metrics (Prometheus) for the netdata go.d/haproxy collector. CSV view:
http://100.123.47.52:8404/;csv.
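The CSV view is easy to consume from a script when watching a flip. A minimal sketch, assuming the standard HAProxy stats CSV layout (a header line starting with "# " and pxname, svname, and status columns); the function name and the server names in the sample are hypothetical, and the sample header is trimmed to three columns for brevity.

```python
from __future__ import annotations

import csv
import io


def up_servers(stats_csv: str, proxy: str) -> list[str]:
    """Return the server names under `proxy` whose status starts with UP."""
    text = stats_csv.lstrip()
    if text.startswith("# "):
        text = text[2:]  # the header line is prefixed with "# "
    rows = csv.DictReader(io.StringIO(text))
    return [
        row["svname"]
        for row in rows
        if row["pxname"] == proxy
        # Skip the aggregate FRONTEND/BACKEND rows; keep real servers only.
        and row["svname"] not in ("FRONTEND", "BACKEND")
        and row["status"].startswith("UP")
    ]


# Trimmed sample; the real output has many more columns.
SAMPLE = (
    "# pxname,svname,status,\n"
    "pg-primary,FRONTEND,OPEN,\n"
    "pg-primary,dal-01,UP,\n"
    "pg-primary,chi-02,DOWN,\n"
    "pg-primary,BACKEND,UP,\n"
)
```

Here up_servers(SAMPLE, "pg-primary") returns only the server currently taking writes; in practice the text would be fetched from the ;csv URL above over the tailnet.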
See also
- doc/tasks/202606120820_HAPROXY_ROUTING_LAYER.md: full design,
PgBouncer-placement rationale, staged rollout, staging rehearsal,
bin/recovery / bin/maintenance internals, the Patroni future path, and every
risk/gotcha.
- doc/tasks/202606081218_PG18_PROD_PINGPONG_RUNBOOK.md: the PG18
primary↔standby ping-pong / failover runbook (HAProxy collapsed its manual
"edit databases.ini + reload + repoint VIP" step to just promote + rebuild).
- INFRASTRUCTURE_INVENTORY.md: fleet, full port-exposure map, and the
data-tier write/read path in context.
- PGBOUNCER.md: the per-node session-mode poolers HAProxy routes to.