
HAProxy — DB write-VIP failover router

The single point that decides “which Postgres node takes writes right now.” heatwave-haproxy is a Kamal accessory co-located on the app host; the app’s DATABASE_HOST points at it, and it TCP-passthrough routes every connection to the live primary’s PgBouncer. A pg_promote on the standby reroutes the write path with no databases.ini edit and no app redeploy — HAProxy follows each node’s health probe.

This is the operator-facing summary. Deep design, the staged rollout, the staging rehearsal, and every gotcha live in doc/tasks/202606120820_HAPROXY_ROUTING_LAYER.md.

Config: config/haproxy/production.cfg + config/haproxy/staging.cfg (authoritative). The haproxy accessory is defined in config/deploy.yml / config/deploy.staging.yml — image haproxy:3.0-alpine (tag + index digest pinned). It is not a new central SPOF: it’s part of the app’s local stack (like the co-located PgBouncer), so one HAProxy per app host; if it dies the blast radius is that one host, which is already the failure unit.

app (Rails) ── DATABASE_HOST=heatwave-haproxy:6433 ──┐
┌──────────── heatwave-haproxy (Dallas, kamal accessory) ───────────┐
│ listen pg-primary :6433 (mode tcp, passthrough)                   │
│ listen stats      :8404 (stats UI + /metrics for netdata)         │
│                                                                   │
│ backend = ONE pool of both nodes; httpchk keeps only the          │
│ current LEADER "up" → all writes go to that node's pgbouncer      │
└───────────────┬───────────────────────────────────┬───────────────┘
      track 200 │ pg-health                         │ pg-health track 503
                ▼ (LOCAL: kamal-net DNS)            ▼ (REMOTE: tailnet IP)
┌──── dal-01 (100.123.47.52) ────┐ ┌──── chi-02 (100.68.157.49) ─────┐
│ heatwave-pg-health :8008       │ │ heatwave-pg-health-replica :8008│
│ GET / → 200 "primary"          │ │ GET / → 503 "standby"           │
│ heatwave-pgbouncer :6432 ◀ UP  │ │ heatwave-pgbouncer-replica :6432│
│ heatwave-postgres :5432        │ │ heatwave-postgres-replica :5432 │
│ (PG18 PRIMARY, RW)             │ │ (PG18 STANDBY) ◀ DOWN           │
└────────────────────────────────┘ └─────────────────────────────────┘
  • One pool, health-gated. Both Postgres nodes sit in a single backend. A per-node pg-health sidecar runs SELECT pg_is_in_recovery() and answers HTTP 200 "primary" (recovery=false) / 503 "standby" (recovery=true) on :8008. HAProxy option httpchk (GET /, expect 200) marks only the node answering 200 as “up”, so exactly the current leader’s PgBouncer takes every write.
  • A flip is just promote + rebuild. pg_promote on the standby flips its recovery state → its pg-health goes 200 → HAProxy marks it up and the old primary down. Each node’s PgBouncer points only at its local Postgres (static, never edited at a flip), so nothing downstream reconfigures.
  • Session-safe passthrough. mode tcp plus timeout client/server 12h, so session-mode pooling and session-scoped advisory locks (28 call sites) pass through untouched and long-lived idle connections are never reaped.
  • Flap guard. default-server inter 2s rise 2 fall 3 on-marked-down shutdown-sessions — a transient blip can’t reroute, and connections to a node drop the instant it stops being primary so the app reconnects onto the new leader fast.
  • Single-primary is the one safety invariant. Health-based routing is only safe while exactly one node reports primary; two writable nodes (split-brain) would both return 200 and HAProxy would balance writes across both. Mitigation is the controlled flip discipline (demote/fence the old primary before promoting) — which is why cross-DC automatic failover is not enabled yet.
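
To make the health gating and the flap guard above concrete, here is a minimal Python sketch of the consecutive-count logic behind rise 2 fall 3. It is a simplified model, not HAProxy's actual check state machine (which has extra startup, maintenance, and transitional states). The names probe_response and ServerHealth are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple


def probe_response(in_recovery: bool) -> Tuple[int, str]:
    """What the pg-health sidecar answers for a given pg_is_in_recovery() result."""
    return (503, "standby") if in_recovery else (200, "primary")


@dataclass
class ServerHealth:
    """Simplified model of one server line under 'rise 2 fall 3' (assumption)."""
    rise: int = 2     # consecutive passing checks needed to go DOWN -> UP
    fall: int = 3     # consecutive failing checks needed to go UP -> DOWN
    up: bool = False
    streak: int = 0   # consecutive checks that disagree with the current state

    def observe(self, http_status: int) -> bool:
        """Feed one check result (expect 200) and return the new up/down state."""
        passed = http_status == 200
        if passed == self.up:
            self.streak = 0          # the check agrees with the state: reset
        else:
            self.streak += 1
            threshold = self.rise if passed else self.fall
            if self.streak >= threshold:
                self.up = passed     # enough consecutive evidence: flip state
                self.streak = 0
        return self.up


# A transient blip on the primary: two failed probes, then healthy again.
primary = ServerHealth(up=True)
blip = [primary.observe(s) for s in (503, 503, 200)]

# The standby after pg_promote: its probe now answers 200.
standby = ServerHealth(up=False)
status, _body = probe_response(in_recovery=False)
promoted = [standby.observe(status) for _ in range(2)]
```

In this model the blip never takes the primary out of rotation, while the promoted standby comes up on its second passing check. With inter 2s that suggests roughly 4 s of passing probes before a new leader takes traffic and roughly 6 s of failures before a node is dropped, ignoring check timeouts.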

A container cannot hairpin to a port its own host published on the tailnet interface. So in production.cfg:

  • Dallas (LOCAL) is reached by kamal-net DNS (heatwave-pgbouncer / heatwave-pg-health) — which forces a track’d probe backend (HAProxy forbids a hostname in check addr) plus a resolvers docker section.
  • Chicago (REMOTE), a genuine peer, is reached by its tailnet IP 100.68.157.49.

The first prod boot used the host IP for both nodes and came up with the primary backend DOWN (L4 timeout to its own 100.123.47.52:8008); fixed in ad45d9f12b (“HAProxy prod — reach the LOCAL node by kamal-net DNS, not its host IP”, direct-to-master 2026-06-12). Revisit the DNS-vs-IP split only at a DC migration, never at a flip.
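
The "L4 timeout" symptom is a plain TCP connect that never completes, so it can be reproduced without HAProxy. The sketch below is a hypothetical helper (tcp_reachable and probe_targets are not part of the repo) that tries a TCP connect to each probe address. It assumes it is run from inside the HAProxy container's network namespace, where the kamal-net name resolves.

```python
import socket
from typing import Dict, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def probe_targets(targets: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Probe every named (host, port) pair and report reachability."""
    return {name: tcp_reachable(host, port) for name, (host, port) in targets.items()}


# Addresses taken from the text above; the labels are illustrative only.
PG_HEALTH_TARGETS = {
    "local via kamal-net DNS": ("heatwave-pg-health", 8008),
    "local via host tailnet IP": ("100.123.47.52", 8008),
    "remote via tailnet IP": ("100.68.157.49", 8008),
}
```

Calling probe_targets(PG_HEALTH_TARGETS) from the Dallas HAProxy container would be expected to show the host-IP entry as unreachable while the other two connect, if the hairpin limitation described above is the cause.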

  • bin/recovery <env> topology — read-only: each node’s recovery state, pg-health, and HAProxy server status. Start here.
  • bin/recovery <env> flip-db — planned primary↔standby switchover. Drains the old node via HAProxy’s in-container Runtime API admin socket (set server … state maint + shutdown sessions) before promoting, so HAProxy never routes onto the dying node; waits for the standby to replay the old primary’s LSN (no data loss), then pg_promote, then rebuilds the demoted node as a fresh streaming standby. (-y / RECOVERY_YES=1 for non-interactive.)
  • bin/recovery <env> rebuild-standby — re-basebackup a wiped/stale node with no flip. Takes a ZFS snapshot (via sudo zfs) of the demoted dataset right before the wipe (kept on failure as a zfs rollback safety net), then restarts the rebuilt node’s PgBouncer to clear a stale upstream-DNS cache.
  • bin/maintenance {up,down} <env> — full maintenance window: up = proxy 503 → stop web + sidekiq (+ on prod the Chicago databasus controller container), down reverses it. Wrap a flip in it: bin/maintenance up <env>, then bin/recovery <env> flip-db, then bin/maintenance down <env>.
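
The wrapped flip can be sketched as an ordered command plan. This is a hypothetical Python wrapper, not something the repo ships. The topology checks before and after, and the choice to stop at the first failing step (leaving the maintenance window up for a human to inspect), are assumptions of this sketch. The environment name "staging" is also assumed from the config file names.

```python
import subprocess
from typing import Callable, List, Sequence

Command = List[str]


def flip_plan(env: str) -> List[Command]:
    """Ordered commands for a planned switchover wrapped in a maintenance window."""
    return [
        ["bin/recovery", env, "topology"],   # read-only look before touching anything
        ["bin/maintenance", "up", env],      # proxy 503, stop web + sidekiq
        ["bin/recovery", env, "flip-db"],    # drain, wait for LSN replay, promote, rebuild
        ["bin/maintenance", "down", env],    # reverse the maintenance window
        ["bin/recovery", env, "topology"],   # confirm the new leader
    ]


def run_plan(plan: Sequence[Command],
             runner: Callable[[Sequence[str]], int]) -> List[Command]:
    """Run commands in order; stop at the first non-zero exit code.

    Returns the commands that completed successfully.
    """
    completed: List[Command] = []
    for cmd in plan:
        if runner(cmd) != 0:
            break
        completed.append(list(cmd))
    return completed


def shell_runner(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode
```

A run would look like run_plan(flip_plan("staging"), shell_runner). For a non-interactive run, RECOVERY_YES=1 would need to be set in the environment first, as noted for flip-db above.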

The Phase-4 prod drill (round-trip Dallas→Chicago→Dallas) passed — the flip + reroute worked both directions; cross-DC basebackup ran ~620 MB/s (~6 min full rebuild).

http://100.123.47.52:8404/ (tailnet only) — shows which backend is UP; invaluable for watching the write path move during a flip. The same port serves /metrics (Prometheus) for the netdata go.d/haproxy collector. CSV view: http://100.123.47.52:8404/;csv.
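
For scripted checks, the CSV view can be parsed with the standard library. The sketch below is a minimal example under assumptions: the sample is trimmed and made up (real output carries many more columns), and the server names dal-01 and chi-02 are guesses based on the node names above, not confirmed names from production.cfg. It relies only on the pxname, svname, and status columns of HAProxy's stats CSV.

```python
import csv
import io
from typing import Dict, List, Tuple
from urllib.request import urlopen

# Trimmed, illustrative sample; server names are assumptions.
SAMPLE_CSV = """\
# pxname,svname,scur,status
pg-primary,FRONTEND,4,OPEN
pg-primary,dal-01,4,UP
pg-primary,chi-02,0,DOWN
pg-primary,BACKEND,4,UP
"""

ServerKey = Tuple[str, str]


def fetch_csv(url: str, timeout: float = 5.0) -> str:
    """Download the stats CSV, e.g. from http://100.123.47.52:8404/;csv."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def server_status(csv_text: str) -> Dict[ServerKey, str]:
    """Map (proxy, server) to status, skipping FRONTEND/BACKEND summary rows."""
    text = csv_text.lstrip()
    if text.startswith("#"):
        text = text[1:].lstrip()   # the header line is prefixed with "# "
    rows = csv.DictReader(io.StringIO(text))
    return {
        (row["pxname"], row["svname"]): row["status"]
        for row in rows
        if row["svname"] not in ("FRONTEND", "BACKEND")
    }


def up_servers(status: Dict[ServerKey, str]) -> List[ServerKey]:
    """Servers whose status starts with UP (includes transitional 'UP 1/3')."""
    return [key for key, state in status.items() if state.startswith("UP")]


status = server_status(SAMPLE_CSV)
writers = up_servers(status)
```

Exactly one entry in writers is the healthy case. More than one would mean two nodes are answering 200 at once, which is the split-brain condition described under the single-primary invariant.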