Skip to content

PgBouncer connection pooling

A session-mode PgBouncer fronts Postgres on both staging and production as a Kamal accessory (heatwave-pgbouncer). This is the architecture + decision record; for operations (rebuild/update, bootstrap, failover) see the pgbouncer skill, and for exact bootstrap commands see config/pgbouncer/README.md.

app (web ./bin/rails, consolidated sidekiq)
│ DATABASE_HOST / DATABASE_HOST_VERSIONS = heatwave-haproxy:6433
heatwave-haproxy (Kamal accessory, write-VIP failover router, :6433, mode tcp passthrough)
│ routes every connection to the CURRENT primary's pgbouncer (health-gated; see below)
heatwave-pgbouncer (Kamal accessory, kamal network, :6432, session mode)
│ pools: heatwave + heatwave_versions
heatwave-postgres (local Postgres accessory, :5432)

Since 2026-06-13 the app no longer points at the bouncer directly — it points at the heatwave-haproxy:6433 write-VIP, a TCP-passthrough failover router that forwards to whichever node is the live primary, so a pg_promote reroutes writes with no databases.ini edit and no app redeploy. PgBouncer is unchanged underneath; HAProxy just sits in front. See HAPROXY.md for the routing layer. (To revert to direct pooling, set DATABASE_HOST / DATABASE_HOST_VERSIONS back to heatwave-pgbouncer:6432.)

The pooler is co-located with the app on each host: the app always talks to a local bouncer (now via the local HAProxy), and the bouncer’s backend is the local Postgres accessory. The heatwaveheatwave_versions FDW is internal to Postgres (loopback) and is unaffected — the bouncer just needs a pool per database.

  1. Cap the post-consolidation connection demand. The Kamal cutover collapsed the four Sidekiq processes into one (SIDEKIQ_CONSOLIDATED=1) reserving 49 worker threads; PR #1072 sized its pool at DB_POOL=55 (after the default-5 pool caused ActiveRecord::ConnectionTimeoutError across workers — AppSignal #5951–#5961). Steady-state demand (~150–177 conns) fits under the 197 usable (max_connections=200superuser_reserved_connections=3), but a rolling deploy briefly runs old + new containers together — sidekiq alone then needs 2 × (55+55) = 220 to the two DBs, over the limit → the new container boots into FATAL: sorry, too many clients already → failed deploy. The pooler caps server connections (max_db_connections=90/DB) and queues the surplus instead of letting Postgres reject it. This cliff did not exist at DB_POOL=5; #1072’s bump created it, and PgBouncer is its structural fix.
  2. A repointable connection layer for the PG16→18 ping-pong. During the Dallas↔Chicago migration, failover became a backend repoint on the bouncer rather than an app redeploy. (That migration is complete — prod cut over to PG18.4 in Dallas on 2026-06-10.) See the ping-pong runbook.

The app relies on three session-scoped behaviours that transaction pooling would silently break (each transaction would land on a different backend):

  • Advisory locks — 28 files use with_advisory_lock/pg_advisory* (plus the Rails migration advisory lock).
  • LISTEN/NOTIFYapp/models/liquid/order_drop.rb.
  • Per-connection SET statement_timeoutconfig/database.yml variables:.

So pool_mode = session. The trade-off: session mode does not multiplex at steady state (one server connection is bound to a client for its whole session), so the pooler here is a failover-indirection + connection-ceiling layer, not a connection multiplier. default_pool_size is therefore set to cover real demand (transparent passthrough) and max_db_connections is the hard ceiling that only bites during the deploy-overlap storm. A future optimization could route the read-only reading role through a second, transaction-mode port once the Dallas PG18 replica exists.

paramvaluewhy
pool_modesessionadvisory locks + LISTEN/NOTIFY + SET (above)
default_pool_size80≥ heatwave peak (sidekiq DB_POOL=55 + web 4×5) → transparent
max_db_connections90 / DBhard ceiling; 2 DBs × 90 = 180 < 197 usable
max_client_conn2000absorbs old+new container overlap during a deploy
min_pool_size10warm servers for the deploy handoff
query_wait_timeout30 sdeploy-overlap clients queue rather than error
server_idle_timeout600 sreap idle servers so steady state tracks the active set

Per-DB demand, peak: sidekiq 1×55, web 4×5, plus the lazy reading-role pools (ApplicationViewRecordprimary_replica) and a little monitoring/console overhead.

auth_type = scram-sha-256 + auth_query. A low-privilege pgbouncer login role owns nothing but EXECUTE on pgbouncer.get_auth(text) — a SECURITY DEFINER function (owned by a superuser) that returns a user’s verifier from pg_shadow. PgBouncer authenticates an incoming deploy client against that verifier and relays to the backend via SCRAM pass-through, so the app deploy password is never in the pooler’s files. The only secret stored is the plaintext pgbouncer-role password in /data/pgbouncer/userlist.txt — required because PgBouncer must authenticate as the auth-user to run auth_query, and a stored SCRAM verifier (StoredKey) is one-way and can’t produce a client proof for that login. The pgbouncer role is low-privilege; its password lives in 1Password as Heatwave-PgBouncer-{staging,production}.

ghcr.io/warmlyyours/heatwave-pgbouncer:1.25.2, built from the upstream release tarball (pinned by sha256) via docker/pgbouncer.Dockerfile, Alpine multi-stage, non-root. We build our own rather than depend on a wrapper (edoburu lags upstream) so the supply chain is ours and rebuildable on the next CVE — 1.25.2 carries the SCRAM (CVE-2026-6665) + auth_query search_path (CVE-2025-12819) fixes that land in our auth path. Rebuild/bump procedure: the pgbouncer skill.

  • Staging (dal-latitude-heatwave-01) — live through the bouncer since commit 2c7401da8e; backend Postgres is PG18.
  • Production — fronted by the bouncer since the 2026-06-08 Chicago cutover (verified with real traffic: 30+ active heatwave connections, session mode, zero auth errors, maxwait=0, no recurrence of #5951–#5961). The PG16→18 ping-pong then moved prod to Dallas on PG18.4 (2026-06-10); the bouncer is the repointable layer that made that a backend repoint rather than an app redeploy.

The two hosts differ in DB bootstrap mechanics (staging trust + deploy-owned /data; prod peer auth → connect as the postgres superuser + root-owned /datasudo) — captured in the pgbouncer skill + config/pgbouncer/README.md.

  • Operations (build/update, bootstrap, failover, gotchas): pgbouncer skill
  • Exact bootstrap commands: config/pgbouncer/README.md
  • Deploys / accessories / 1Password: kamal-deploy skill
  • PG16→18 migration that motivated the repointable layer: ping-pong runbook