HAProxy — DB write-VIP failover router

The single point that decides "which Postgres node takes writes right now."
heatwave-haproxy is a Kamal accessory co-located on the app host; the app's
DATABASE_HOST points at it, and it TCP-passthrough routes every connection to
the live primary's PgBouncer. A pg_promote on the standby reroutes the
write path with no databases.ini edit and no app redeploy — HAProxy
follows each node's health probe.

This is the operator-facing summary. Deep design, the staged rollout, the
staging rehearsal, and every gotcha live in
doc/tasks/202606120820_HAPROXY_ROUTING_LAYER.md.

Config: config/haproxy/production.cfg + config/haproxy/staging.cfg
(authoritative). The haproxy accessory is defined in config/deploy.yml /
config/deploy.staging.yml — image haproxy:3.0-alpine (tag + index digest
pinned). It is not a new central SPOF: it's part of the app's local stack
(like the co-located PgBouncer), so one HAProxy per app host; if it dies the
blast radius is that one host, which is already the failure unit.

Topology

   app (Rails) ── DATABASE_HOST=heatwave-haproxy:6433 ──┐
                                                        ▼
        ┌──────────── heatwave-haproxy (Dallas, kamal accessory) ───────────┐
        │  listen pg-primary  :6433  (mode tcp, passthrough)                 │
        │  listen stats       :8404  (stats UI + /metrics for netdata)       │
        │                                                                    │
        │  backend = ONE pool of both nodes; httpchk keeps only the          │
        │  current LEADER "up" → all writes go to that node's pgbouncer      │
        └───────────────┬──────────────────────────────┬───────────────────┘
            track 200    │ pg-health                    │ pg-health   track 503
                         ▼ (LOCAL: kamal-net DNS)        ▼ (REMOTE: tailnet IP)
        ┌──── dal-01 (100.123.47.52) ────┐   ┌──── chi-02 (100.68.157.49) ────┐
        │ heatwave-pg-health  :8008      │   │ heatwave-pg-health-replica :8008│
        │   GET / → 200 "primary"        │   │   GET / → 503 "standby"         │
        │ heatwave-pgbouncer  :6432 ◀ UP │   │ heatwave-pgbouncer-replica :6432│
        │ heatwave-postgres   :5432      │   │ heatwave-postgres-replica  :5432│
        │   (PG18 PRIMARY, RW)           │   │   (PG18 STANDBY)  ◀ DOWN        │
        └────────────────────────────────┘   └─────────────────────────────────┘

Routing-decision logic

  • One pool, health-gated. Both Postgres nodes sit in a single backend. A
    per-node pg-health sidecar runs SELECT pg_is_in_recovery() and answers
    HTTP 200 "primary" (recovery=false) / 503 "standby" (recovery=true) on
    :8008. HAProxy option httpchk (GET /, expect 200) marks only the node
    answering 200 as "up", so exactly the current leader's PgBouncer takes
    every write.
  • A flip is just promote + rebuild. pg_promote on the standby flips its
    recovery state → its pg-health goes 200 → HAProxy marks it up and the old
    primary down. Each node's PgBouncer points only at its local Postgres
    (static, never edited at a flip), so nothing downstream reconfigures.
  • Session-safe passthrough. mode tcp + timeout client/server 12h so
    session-mode pooling and session-scoped advisory locks (28 call sites) pass
    through untouched and long-lived idle connections are never reaped.
  • Flap guard. default-server inter 2s rise 2 fall 3 on-marked-down
    shutdown-sessions: a single failed probe can't reroute (a node is marked
    down only after 3 consecutive failures and up only after 2 consecutive
    successes), and connections to a node are shut down the moment HAProxy
    marks it down, so the app reconnects onto the new leader fast.
  • Single-primary is the one safety invariant. Health-based routing is only
    safe while exactly one node reports primary; two writable nodes (split-brain)
    would both return 200 and HAProxy would balance writes across both. Mitigation
    is the controlled flip discipline (demote/fence the old primary before
    promoting) — which is why cross-DC automatic failover is not enabled yet.
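
The health gate and flap guard above can be modelled in a few lines of Python.
This is a simplified sketch of the consecutive-check semantics, not HAProxy's
implementation and not the sidecar's code: health_response, ServerCheck, and
the initial up/down state are hypothetical names and assumptions chosen for
illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


def health_response(in_recovery: bool) -> tuple[int, str]:
    """Map the result of SELECT pg_is_in_recovery() to the sidecar's answer."""
    return (503, "standby") if in_recovery else (200, "primary")


@dataclass
class ServerCheck:
    """Toy model of one server's check state under rise/fall thresholds."""

    rise: int = 2      # consecutive passes needed to mark a down server up
    fall: int = 3      # consecutive failures needed to mark an up server down
    up: bool = False   # assumed initial state; pass up=True for the current primary
    streak: int = 0    # consecutive results that contradict the current state

    def observe(self, status: int) -> bool:
        """Feed one probe result (HTTP status) and return the resulting state."""
        passed = status == 200
        if passed == self.up:
            # Result agrees with the current state: any pending streak is reset.
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.rise if passed else self.fall
            if self.streak >= threshold:
                self.up = passed
                self.streak = 0
        return self.up


if __name__ == "__main__":
    old_primary = ServerCheck(up=True)
    # A single failed probe between healthy ones does not reroute.
    print([old_primary.observe(s) for s in (503, 200, 200)])
```

With inter 2s, this model suggests a promoted node is marked up after about
two probe intervals and a demoted node is marked down after about three, which
is the window the flap guard trades for stability.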

Addressing gotcha (load-bearing)

A container cannot hairpin to a port its own host published on the
tailnet interface. So in production.cfg:

  • Dallas (LOCAL) is reached by kamal-net DNS (heatwave-pgbouncer /
    heatwave-pg-health) — which forces a track'd probe backend (HAProxy
    forbids a hostname in check addr) plus a resolvers docker section.
  • Chicago (REMOTE), a genuine peer, is reached by its tailnet IP
    100.68.157.49.

The first prod boot used the host IP for both nodes and came up with the primary
backend DOWN (L4 timeout to its own 100.123.47.52:8008); fixed in
ad45d9f12b ("HAProxy prod — reach the LOCAL node by kamal-net DNS, not its host
IP", direct-to-master 2026-06-12). Revisit the DNS-vs-IP split only at a DC
migration, never at a flip.

Ops commands

  • bin/recovery <env> topology — read-only: each node's recovery state,
    pg-health, and HAProxy server status. Start here.
  • bin/recovery <env> flip-db — planned primary↔standby switchover. Drains the
    old node via HAProxy's in-container Runtime API admin socket
    (set server … state maint + shutdown sessions) before promoting, so
    HAProxy never routes onto the dying node; waits for the standby to replay the
    old primary's LSN (no data loss), then pg_promote, then rebuilds the demoted
    node as a fresh streaming standby. (-y / RECOVERY_YES=1 for
    non-interactive.)
  • bin/recovery <env> rebuild-standby — re-basebackup a wiped/stale node with
    no flip. Takes a sudo zfs snapshot of the demoted dataset right before the
    wipe (kept on failure as a zfs rollback safety net) and restarts the
    rebuilt node's PgBouncer to clear a stale upstream-DNS cache.
  • bin/maintenance {up,down} <env> — full maintenance window: up = proxy 503 →
    stop web + sidekiq (+ on prod the Chicago databasus controller container), down
    reverses it. Wrap a flip in it: bin/maintenance up <env>, then
    bin/recovery <env> flip-db, then bin/maintenance down <env>.
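
The maintenance-wrapped flip can be scripted as a short sequence. The Python
sketch below is one possible wrapper, not an existing tool in this repo:
flip_plan, shell_runner, and run_flip are hypothetical names. It assumes the
three commands are run from the repo root and that stopping on the first
failure (leaving the maintenance window up for an operator to inspect) is the
desired behaviour.

```python
from __future__ import annotations

import os
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def flip_plan(env: str) -> list[list[str]]:
    """The three commands from the runbook, in order."""
    return [
        ["bin/maintenance", "up", env],
        ["bin/recovery", env, "flip-db"],
        ["bin/maintenance", "down", env],
    ]


def shell_runner(cmd: Sequence[str]) -> int:
    """Run one command with RECOVERY_YES=1 set for non-interactive use."""
    env_vars = dict(os.environ, RECOVERY_YES="1")
    return subprocess.run(list(cmd), env=env_vars).returncode


def run_flip(env: str, run: Runner = shell_runner) -> list[list[str]]:
    """Run the plan step by step; stop at the first non-zero exit code.

    Assumption: if the flip fails, the maintenance window is deliberately
    left up rather than lifted automatically.
    """
    done: list[list[str]] = []
    for cmd in flip_plan(env):
        code = run(cmd)
        if code != 0:
            raise RuntimeError(f"{' '.join(cmd)} exited with {code}")
        done.append(cmd)
    return done
```

Passing a fake runner (any callable that takes the command list and returns an
exit code) gives a dry run without touching a real environment.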

The Phase-4 prod drill (round-trip Dallas→Chicago→Dallas) passed — the flip +
reroute worked both directions; cross-DC basebackup ran ~620 MB/s (~6 min full
rebuild).

Stats UI

http://100.123.47.52:8404/ (tailnet only) — shows which backend is UP;
invaluable for watching the write path move during a flip. The same port serves
/metrics (Prometheus) for the netdata go.d/haproxy collector. CSV view:
http://100.123.47.52:8404/;csv.
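
For scripting, the CSV view can be read by column name. The Python sketch below
is illustrative only: fetch_csv, server_status, and up_servers are hypothetical
helpers, and the default proxy name pg-primary is taken from the topology
diagram and may not match the section names in the real config. It relies on
the pxname, svname, and status columns of HAProxy's CSV stats output.

```python
from __future__ import annotations

import csv
import io
from urllib.request import urlopen

STATS_CSV_URL = "http://100.123.47.52:8404/;csv"  # from this document; tailnet only


def fetch_csv(url: str = STATS_CSV_URL, timeout: float = 5.0) -> str:
    """Download the stats CSV as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def server_status(csv_text: str, proxy: str = "pg-primary") -> dict[str, str]:
    """Return {server name: status} for one proxy, skipping aggregate rows."""
    # The header line starts with "# "; strip it so DictReader sees clean names.
    text = csv_text[2:] if csv_text.startswith("# ") else csv_text
    rows = csv.DictReader(io.StringIO(text))
    return {
        row["svname"]: row["status"]
        for row in rows
        if row["pxname"] == proxy and row["svname"] not in ("FRONTEND", "BACKEND")
    }


def up_servers(csv_text: str, proxy: str = "pg-primary") -> list[str]:
    """Servers whose status starts with UP (covers transitional 'UP n/m')."""
    statuses = server_status(csv_text, proxy)
    return [name for name, status in statuses.items() if status.startswith("UP")]
```

During a flip, calling up_servers(fetch_csv()) in a loop should show the single
UP server change from one node to the other; more than one name in the result
would be worth treating as an alarm, given the single-primary invariant.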

See also