Heatwave Kamal Stack — Architecture & Index

The containerized deployment stack that replaced Capistrano + Passenger.
This directory is the single source of truth for the new infrastructure.

| Doc | Covers |
| --- | --- |
| README.md (this file) | Stack inventory, status, master architecture + network diagrams |
| DEPLOYING.md | The deploy guidebook — bin/deploy, the deploy lifecycle, migrations, rollback |
| MANAGING.md | Day-2 operations — accessories, DB restore, mailpit, secrets, scaling, provisioning a new box |
| TROUBLESHOOTING.md | Runbook for the failure modes we've actually hit |

Status (2026-06-14). Production and staging both run on Kamal on
Latitude bare-metal. Dallas (dal-latitude-heatwave-01, Tailscale
100.123.47.52) is the primary and hosts both environments; a cross-DC
PostgreSQL standby runs in Chicago (chi-latitude-heatwave-02,
100.68.157.49). The Capistrano + Passenger + Vultr stack was retired at the
2026-06-07 cutover. Historical record:
doc/tasks/202606022303_KAMAL_MIGRATION.md (cutover) and
doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md (two-region HA end-state).

Note: the network diagram and a few body sections below still describe the
pre-cutover Vultr topology and are being refreshed.


What changed vs. Capistrano

| Concern | Old (Capistrano + Passenger) | New (Kamal) |
| --- | --- | --- |
| Unit of deploy | git pull + bundle on the host | An OCI image built once, pushed, rolled out |
| Web server | Passenger (Apache/nginx) | Thruster → Puma in a container |
| Zero-downtime | Passenger restart | kamal-proxy rolling swap on a /up health check |
| Ingress | nginx + origin TLS | Cloudflare Tunnel (no public ports, no origin TLS) |
| Asset bridging | linked_dirs public/javascripts/webpack | Kamal asset_path host volume |
| Datastores | External Postgres / Managed Valkey | Kamal accessories (PG18 + Valkey ×3 + pgbouncer + HAProxy), co-located on the Latitude boxes in both envs |
| Secrets | config/master.key on host | 1Password resolver-only .kamal/secrets* (tracked in git) |
| Deploy command | bin/deploy | bin/deploy → kamal deploy |
| Provisioning | Hand-built hosts | Terraform/OpenTofu (infra/terraform/) + cloud-init |

Stack inventory

Every moving piece of the new stack and where it's configured.

Compute & orchestration

  • Kamal 2.x — orchestrates build → push → rolling deploy. Config:
    config/deploy.yml (base/prod), config/deploy.staging.yml (staging overrides).
  • kamal-proxy — per-host reverse proxy giving zero-downtime rolling swaps.
    Listens on host :80, health-checks /up, no TLS (ssl: false).
  • Docker — installed by cloud-init (get.docker.com). All app + accessory
    containers attach to the kamal docker network and resolve each other by name.
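
The rolling swap depends on the /up health check. As a rough illustration only (this is not kamal-proxy's own implementation), a manual probe of that endpoint could look like the Ruby sketch below. The method name healthy?, the 2-second timeout, and the rule that any 2xx response counts as healthy are assumptions made for the example.

```ruby
require "net/http"

# Hypothetical helper: true when GET <path> on host:port answers 2xx
# within the timeout. Connection errors and timeouts count as unhealthy.
def healthy?(host, port, path: "/up", timeout: 2)
  response = Net::HTTP.start(host, port, open_timeout: timeout, read_timeout: timeout) do |http|
    http.get(path)
  end
  response.is_a?(Net::HTTPSuccess)
rescue SystemCallError, SocketError, Timeout::Error, EOFError, IOError
  false
end
```

Run on the box, healthy?("localhost", 80) would exercise the same port that cloudflared forwards to.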

The application image

  • Dockerfile — multi-stage (base → build → final), base image
    ruby:4.0.5-slim. Build stage compiles gems + Yarn 4 / webpack assets; final
    stage is a slim runtime (gems + app + built assets, non-root rails user
    uid 1001). Entry: bin/docker-entrypoint; CMD bin/thrust bin/rails server.
  • Thruster — HTTP/2 + X-Sendfile front, listens :80, proxies to Puma :3000.
  • Registry — everything is on GitHub Container Registry (no Vultr CR):
    the app image is ghcr.io/warmlyyours/heatwave and the custom Postgres
    accessory image is ghcr.io/warmlyyours/heatwave-postgres:18 (the host's single
    ghcr.io login covers both).

Roles (containers Kamal runs)

  • web — Puma (4 workers × 3 threads, jemalloc), behind kamal-proxy.
  • sidekiq — a single consolidated Sidekiq process (SIDEKIQ_CONSOLIDATED=1)
    running the high/low/campaign capsules + the default set + the scheduler in one
    container. cmd: bundle exec sidekiq -C config/sidekiq.yml. Sidekiq Pro
    super_fetch makes rolling restarts safe; .kamal/hooks/pre-deploy quiets it
    (TSTP) before the swap.

Accessories (co-located on the box; staging detail below)

Both environments run their datastores as Kamal accessories on the Latitude
boxes (prod splits Postgres across Dallas + Chicago — see the note below the
list). The staging-specific accessories are:

  • postgres — custom PG18 image (ghcr.io/warmlyyours/heatwave-postgres:18,
    built from docker/postgresql.Dockerfile) with pgvector, hypopg, pg_repack,
    pg_stat_statements. Tuned down (shared_buffers=8GB) because the box is
    shared with the prod stack. Data on a host volume. Host-published
    127.0.0.1:5432 for local psql; the app reaches it as heatwave-postgres on
    the kamal network.
  • Valkey ×3 — valkey/valkey:9.1 in a 3-flavor split:
    heatwave-staging-valkey-cache (allkeys-lru), -sessions (noeviction),
    -queue (noeviction + AOF). RedisConfig routes to them per logical DB via
    REDIS_CACHE_HOST / REDIS_SESSIONS_HOST / REDIS_QUEUE_HOST (no single
    REDIS_HOST). Internal to the kamal network — not host-published. Mirrors
    the prod split (heatwave-valkey-{cache,sessions,queue}).
  • mailpit — SMTP sink + web UI. App/sidekiq deliver to heatwave-mailpit:1025;
    the UI is bound to the Tailscale interface only (http://100.123.47.52:8025),
    so captured staging mail (reset tokens etc.) is never publicly exposed.
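
RedisConfig itself is not reproduced in this document. The sketch below is a hypothetical stand-in that shows the routing idea only: one host variable per flavor, and no fallback to a single REDIS_HOST. The module name ValkeyRouting, the url_for method, port 6379, and the db parameter are assumptions for illustration.

```ruby
# Hypothetical sketch, not the real RedisConfig.
module ValkeyRouting
  # Env var that names the host for each flavor (variable names from this README).
  HOST_ENV = {
    cache:    "REDIS_CACHE_HOST",
    sessions: "REDIS_SESSIONS_HOST",
    queue:    "REDIS_QUEUE_HOST"
  }.freeze

  DEFAULT_PORT = 6379 # assumption: the stock Valkey port

  # Build a redis:// URL for one flavor. Fails loudly when the flavor is
  # unknown or its host variable is unset, rather than guessing a default.
  def self.url_for(flavor, env: ENV, db: 0)
    var = HOST_ENV.fetch(flavor) { raise ArgumentError, "unknown flavor: #{flavor.inspect}" }
    host = env[var]
    raise KeyError, "#{var} is not set" if host.nil? || host.empty?
    "redis://#{host}:#{DEFAULT_PORT}/#{db}"
  end
end
```

Failing on a missing variable keeps a misconfigured container from silently writing queue data to the evicting cache instance.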

Production runs the same accessories, just split across two Latitude boxes:
a PG18 primary in Dallas (heatwave-postgres) with a cross-DC streaming
standby in Chicago (heatwave-postgres-replica), fronted by per-node
pgbouncer and a TCP write-VIP HAProxy (heatwave-haproxy:6433, the app's
DATABASE_HOST) so a pg_promote flip reroutes with no app redeploy; the same
3-flavor Valkey split (heatwave-valkey-cache / -sessions / -queue);
and Databasus PITR → Cloudflare R2 backups off the Chicago standby. The old
Vultr Postgres (db4/db3) and Vultr Managed Valkey are gone. Full current
topology, hosts, ports, and image tags: doc/infrastructure/INFRASTRUCTURE_INVENTORY.md
and doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md.

Ingress & network

  • Cloudflare Tunnel (cloudflared, host systemd service; remotely managed, so
    ingress is configured in Cloudflare, not on the box). cloudflared dials out
    over QUIC, and that tunnel is the only path by which web traffic reaches the
    box. Routes crm/www/api/mcp.warmlyyours.ws → http://localhost:80.
  • Cloudflare Access — SSO gate (the wy-employees group) in front of every
    staging hostname.
  • Tailscale — the admin/SSH plane (and, in the HA end-state, cross-region DB
    replication). Hosts get 100.x addresses; SSH is Tailscale-only.
  • Firewall, defense-in-depth — Latitude edge firewall (SSH from the Tailscale
    CGNAT range 100.64.0.0/10 only) + host UFW (default-deny inbound, allow
    lo + tailscale0 + :22) + a DOCKER-USER iptables chain that blocks
    public :80/:443 (Docker bypasses UFW for published ports) + Cloudflare Access.
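
The 100.64.0.0/10 rule can be checked by hand with Ruby's standard ipaddr library. This is a small sketch, not part of the provisioning code; the helper name tailscale_address? is made up for the example.

```ruby
require "ipaddr"

# The CGNAT block that the edge firewall admits for SSH.
TAILSCALE_CGNAT = IPAddr.new("100.64.0.0/10")

# Hypothetical helper: true when the address falls inside the CGNAT block.
# Strings that are not valid IP addresses are treated as outside the range.
def tailscale_address?(ip)
  TAILSCALE_CGNAT.include?(IPAddr.new(ip))
rescue IPAddr::InvalidAddressError
  false
end
```

With the hosts named in this document, tailscale_address?("100.123.47.52") and tailscale_address?("100.68.157.49") both hold, while the old public DB address 45.63.79.22 falls outside the range.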

Secrets

  • .kamal/secrets-common — shared: RAILS_MASTER_KEY (= config/master.key),
    BUNDLE_GEMS__CONTRIBSYS__COM (Sidekiq Pro), KAMAL_REGISTRY_PASSWORD (GHCR).
  • .kamal/secrets.staging — staging PG password + the staging
    Heatwave::Configuration env-key.
  • .kamal/secrets — prod PG password + production env-key (op://IT/Heatwave-Postgres
    must be created before cutover).
  • All three are resolver-only (Kamal's 1Password adapter — no literal secrets)
    and therefore committed. See MANAGING.md → Secrets.
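
Because the secrets files are committed, it can help to check that no literal value slips in. The Ruby sketch below is one possible lint, not an existing tool in this repo. What counts as resolver-only is an assumption here: a value is accepted only if it is a $(...) command substitution or a $VAR / ${VAR} reference. The method name literal_secret_names is hypothetical.

```ruby
# Hypothetical lint: returns the names of variables whose value looks like a
# literal rather than a resolver. Blank lines and # comments are skipped.
# An empty value (NAME=) is reported too, since it resolves nothing.
def literal_secret_names(text)
  text.each_line.filter_map do |line|
    line = line.strip
    next if line.empty? || line.start_with?("#")

    name, value = line.split("=", 2)
    next if value.nil? # not an assignment

    value = value.strip
    resolver = value.match?(/\A\$\(.*\)\z/) ||
               value.match?(/\A\$\{?[A-Za-z_][A-Za-z0-9_]*\}?\z/)
    name.strip unless resolver
  end
end
```

For example, literal_secret_names(File.read(".kamal/secrets.staging")) should return an empty array for a clean file.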

Provisioning (Infrastructure as Code)

  • infra/terraform/latitude/ — provisions a Latitude bare-metal box: SSH keys,
    cloud-init (deploy user uid 1001, Docker, Tailscale, UFW + DOCKER-USER,
    cloudflared), RAID-1, edge firewall.
  • infra/terraform/cloudflare/ — the tunnel (remotely managed) + DNS CNAMEs +
    Access app/policy for *.warmlyyours.ws.
  • infra/terraform/ (root) — the original Vultr provisioning module (being
    retired in favour of Latitude).

Deploy tooling & lifecycle hooks

  • bin/deploy — the wrapper around kamal deploy (clean-tree gate, 1Password
    unlock, gated migrations, sourcemap upload, edge-cache purge). See DEPLOYING.md.
  • .kamal/hooks/pre-build — stamps REVISION (git SHA) into the build context
    so webpack/AppSignal report a real revision.
  • .kamal/hooks/pre-deploy — quiets Sidekiq (TSTP) before the swap.
  • .kamal/hooks/post-deploy — clears REVISION + the Sidekiq quiet marker.
  • script/db_restore_kamal.sh — fast+deferred DB restore into the staging
    Postgres accessory (see MANAGING.md → Database restore).
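
The REVISION stamp and its cleanup can be pictured as a pair of small file operations. The hooks themselves are not shown in this document, so the Ruby sketch below is only an illustration: it assumes REVISION is a plain file at the root of the build context, and the method names stamp_revision and clear_revision are hypothetical.

```ruby
require "fileutils"

REVISION_FILE = "REVISION" # assumption: a plain file at the build-context root

# Write the commit SHA where the build can pick it up (the pre-build step).
def stamp_revision(sha, dir: Dir.pwd)
  unless sha.match?(/\A\h{7,40}\z/)
    raise ArgumentError, "not a git SHA: #{sha.inspect}"
  end

  path = File.join(dir, REVISION_FILE)
  File.write(path, "#{sha}\n")
  path
end

# Remove the stamp again (the post-deploy step). A missing file is not an error.
def clear_revision(dir: Dir.pwd)
  FileUtils.rm_f(File.join(dir, REVISION_FILE))
end
```

Clearing the file after the deploy keeps the working tree clean, so the clean-tree gate in bin/deploy would not trip on a leftover stamp.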

Master architecture — staging (live)

flowchart TB
    user([User / browser])

    subgraph CF["Cloudflare edge"]
        tls["TLS termination<br/>+ WAF + cache"]
        access["Access SSO gate<br/>(wy-employees group)"]
        cft["Cloudflare Tunnel<br/>crm/www/api/mcp.warmlyyours.ws"]
    end

    subgraph BOX["Latitude bare-metal — dal-latitude-heatwave-01 (Tailscale 100.123.47.52)"]
        direction TB
        cfd["cloudflared<br/>(host systemd, outbound QUIC)"]
        proxy["kamal-proxy :80<br/>(rolling swap, /up healthcheck)"]

        subgraph NET["docker network: kamal"]
            direction TB
            web["web container<br/>Thruster :80 → Puma :3000"]
            sidekiq["sidekiq container<br/>consolidated capsules + scheduler"]
            pg[("postgres accessory<br/>PG18 · heatwave + heatwave_versions")]
            valkey[("valkey accessories ×3<br/>cache / sessions / queue")]
            mailpit["mailpit accessory<br/>SMTP :1025 / UI :8025"]
        end
    end

    admin([Operator]) -. "SSH / psql / mailpit UI<br/>over Tailscale" .-> BOX

    user -->|HTTPS| tls --> access --> cft
    cft -->|"QUIC (dialed out by cloudflared)"| cfd
    cfd -->|"http://localhost:80"| proxy --> web
    web --> pg & valkey
    web -->|SMTP| mailpit
    sidekiq --> pg & valkey

Request path: browser → Cloudflare (TLS, Access SSO) → Cloudflare Tunnel →
cloudflared on the box → http://localhost:80 (kamal-proxy) → web container
(Thruster :80 → Puma :3000). No inbound web ports are open on the host; the
tunnel is dialed outbound.


Production topology — pre-cutover snapshot (historical)

Historical. This section and the diagram below capture the pre-cutover
Vultr + Capistrano topology and the original Kamal target. Production cut over
to Kamal on Latitude on 2026-06-07 (Dallas primary + Chicago standby); see the
Status note at the top of this file and INFRASTRUCTURE_INVENTORY.md for the
current state.

flowchart TB
    user([User]) -->|HTTPS| cf["Cloudflare edge<br/>(TLS + WAF + Access on CRM)"]
    cf -->|Tunnel| cfd["cloudflared (host)"]
    cfd -->|"localhost:80"| proxy["kamal-proxy"]

    subgraph WEB1["web1 (Vultr, Ubuntu 26.04) — TODO provision"]
        proxy --> web["web container (Puma)"]
        proxy -.-> sk["sidekiq container<br/>(consolidated, co-located to start)"]
    end

    web -->|"public IP 45.63.79.22:5432<br/>(firewall allowlist + SCRAM)"| db4[("db4 — Postgres PRIMARY<br/>heatwave + heatwave_versions")]
    db4 -. "async replication" .-> db3[("db3 — replica")]
    web -->|"TLS, allowlist"| valkey[("Vultr Managed Valkey 7")]
    sk --> db4 & valkey

Cutover prerequisites (gated): create op://IT/Heatwave-Postgres, provision
web1, add its public IP to the db4 firewall group + the Valkey allowlist, then
bin/deploy production. Full sequence in
doc/tasks/202606022303_KAMAL_MIGRATION.md.


Network & security layers

flowchart LR
    subgraph internet["Public internet"]
        u([User]) ; op([Operator])
    end

    subgraph edge["Layer 1 — Cloudflare"]
        e1["TLS + WAF + rate limiting"]
        e2["Access SSO (Zero Trust)"]
    end

    subgraph latfw["Layer 2 — Latitude edge firewall"]
        l1["inbound :22 ← 100.64.0.0/10 only<br/>(Tailscale CGNAT) · default-deny"]
    end

    subgraph host["Layer 3 — host (UFW + DOCKER-USER)"]
        h1["UFW: default-deny in,<br/>allow lo + tailscale0 + :22"]
        h2["DOCKER-USER: DROP public :80/:443,<br/>RETURN on tailscale0"]
    end

    subgraph app["Layer 4 — app"]
        a1["accessories bound to 127.0.0.1<br/>or the Tailscale IP — never 0.0.0.0"]
        a2["web reachable only via the Tunnel"]
    end

    u -->|web| e1 --> e2 -->|"Tunnel (outbound)"| a2
    op -->|SSH/psql/UI| l1 --> h1 --> a1
    h2 --- a2

The web tier is reachable only through the Cloudflare Tunnel (no public port).
The operator tier (SSH, psql, mailpit UI) is reachable only over Tailscale.
DOCKER-USER exists because Docker inserts iptables rules ahead of UFW for
published ports — without it, a published :80 would be world-reachable despite
UFW's default-deny.


Key facts at a glance

| Thing | Value |
| --- | --- |
| Live staging host | dal-latitude-heatwave-01, Tailscale 100.123.47.52 (Latitude bare metal, RAID-1) |
| Staging hostnames | crm / www / api / mcp.warmlyyours.ws (TLD env = warmlyyours.ws) |
| Staging Access group | wy-employees (0de0f290-f12c-4046-ae47-b66146f1a4ac) |
| App image | ghcr.io/warmlyyours/heatwave (GHCR) |
| PG accessory image | ghcr.io/warmlyyours/heatwave-postgres:18 (GHCR) |
| Docker network | kamal (app + accessories resolve by name) |
| Web port path | Cloudflare → tunnel → kamal-proxy :80 → Thruster :80 → Puma :3000 |
| Deploy user | deploy, uid 1001 (must match container USER 1001) |
| Cloudflare account | 79b7f58cf035093b5ad11747df30369a |
| Staging zone | warmlyyours.ws (d39acaed475782c4901d4a8e5908c1cb) |
| Prod DB | PG18 primary heatwave-postgres on Dallas (100.123.47.52) + cross-DC streaming standby heatwave-postgres-replica on Chicago (100.68.157.49); app reaches it via HAProxy write-VIP heatwave-haproxy:6433 → pgbouncer. See INFRASTRUCTURE_INVENTORY.md |
| Prod cache/queue | Valkey ×3 — heatwave-valkey-cache / -sessions / -queue (3-flavor split, routed per logical DB) |
| Prod backups | Databasus PITR → Cloudflare R2 (off the Chicago standby) |
| Deploy command | `bin/deploy [staging