Heatwave Kamal Stack — Architecture & Index

The containerized deployment stack that replaced Capistrano + Passenger. This directory is the single source of truth for the new infrastructure.

Doc	Covers
README.md (this file)	Stack inventory, status, master architecture + network diagrams
DEPLOYING.md	The deploy guidebook — `bin/deploy`, the deploy lifecycle, migrations, rollback
MANAGING.md	Day-2 operations — accessories, DB restore, mailpit, secrets, scaling, provisioning a new box
TROUBLESHOOTING.md	Runbook for the failure modes we’ve actually hit

Status (2026-06-14). Production and staging both run on Kamal on Latitude bare-metal. Dallas (dal-latitude-heatwave-01, Tailscale 100.123.47.52) is the primary and hosts both environments; a cross-DC PostgreSQL standby runs in Chicago (chi-latitude-heatwave-02, 100.68.157.49). The Capistrano + Passenger + Vultr stack was retired at the 2026-06-07 cutover. Historical record: doc/tasks/202606022303_KAMAL_MIGRATION.md (cutover) and doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md (two-region HA end-state).

Note: the production topology diagram and a few body sections below still describe the pre-cutover Vultr topology and are being refreshed.

What changed vs. Capistrano

Concern	Old (Capistrano + Passenger)	New (Kamal)
Unit of deploy	`git pull` + `bundle` on the host	An OCI image built once, pushed, rolled out
Web server	Passenger (Apache/nginx)	Thruster → Puma in a container
Zero-downtime	Passenger restart	kamal-proxy rolling swap on a `/up` health check
Ingress	nginx + origin TLS	Cloudflare Tunnel (no public ports, no origin TLS)
Asset bridging	`linked_dirs public/javascripts/webpack`	Kamal `asset_path` host volume
Datastores	External Postgres / Managed Valkey	Kamal accessories (PG18 + Valkey ×3 + pgbouncer + HAProxy), co-located on the Latitude boxes in both envs
Secrets	`config/master.key` on host	1Password resolver-only `.kamal/secrets*` (tracked in git)
Deploy command	`bin/deploy`	`bin/deploy` → `kamal deploy`
Provisioning	Hand-built hosts	Terraform/OpenTofu (`infra/terraform/`) + cloud-init

Stack inventory

Every moving piece of the new stack and where it’s configured.

Compute & orchestration

Kamal 2.x — orchestrates build → push → rolling deploy. Config: config/deploy.yml (base/prod), config/deploy.staging.yml (staging overrides).
kamal-proxy — per-host reverse proxy giving zero-downtime rolling swaps. Listens on host :80, health-checks /up, no TLS (ssl: false).
Docker — installed by cloud-init (get.docker.com). All app + accessory containers attach to the kamal docker network and resolve each other by name.
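
The health-gated swap can be pictured with a small model. This is a sketch of the idea only, not kamal-proxy's implementation: traffic moves to the new container once its /up endpoint answers 200, and stays on the old one otherwise. The function names, attempt count, and delay are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer all count as unhealthy.
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 30,
                       delay: float = 1.0) -> bool:
    """Poll `probe` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def rolling_swap(current: str, candidate: str, probe: Callable[[], bool],
                 attempts: int = 30, delay: float = 1.0) -> str:
    """Return the container that should receive traffic after the deploy.

    The candidate takes over only once it is healthy; otherwise the
    current container keeps serving.
    """
    healthy = wait_until_healthy(probe, attempts=attempts, delay=delay)
    return candidate if healthy else current
```

For example, `rolling_swap("web-old", "web-new", lambda: http_probe("http://web-new/up"))` hands traffic to `web-new` only if the probe succeeds within the attempt budget. The container names and URL are placeholders.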

The application image

Dockerfile — multi-stage (base → build → final), base ruby:4.0.5-slim. Build stage compiles gems + Yarn 4 / webpack assets; final stage is a slim runtime (gems + app + built assets, non-root rails user uid 1001). Entry: bin/docker-entrypoint; CMD bin/thrust bin/rails server.
Thruster — HTTP/2 + X-Sendfile front, listens :80, proxies to Puma :3000.
Registry — everything is on GitHub Container Registry (no Vultr CR): the app image is ghcr.io/warmlyyours/heatwave and the custom Postgres accessory image is ghcr.io/warmlyyours/heatwave-postgres:18 (the host’s single ghcr.io login covers both).

Roles (containers Kamal runs)

web — Puma (4 workers × 3 threads, jemalloc), behind kamal-proxy.
sidekiq — a single consolidated Sidekiq process (SIDEKIQ_CONSOLIDATED=1) running the high/low/campaign capsules + the default set + the scheduler in one container. cmd: bundle exec sidekiq -C config/sidekiq.yml. Sidekiq Pro super_fetch makes rolling restarts safe; .kamal/hooks/pre-deploy quiets it (TSTP) before the swap.

Accessories (co-located on the box; staging detail below)

Both environments run their datastores as Kamal accessories on the Latitude boxes (prod splits Postgres across Dallas + Chicago; see the note below this list). The staging-specific accessories are:

postgres — custom PG18 image (ghcr.io/warmlyyours/heatwave-postgres:18, built from docker/postgresql.Dockerfile) with pgvector, hypopg, pg_repack, pg_stat_statements. Tuned down (shared_buffers=8GB) because the box is shared with the prod stack. Data on a host volume. Host-published 127.0.0.1:5432 for local psql; the app reaches it as heatwave-postgres on the kamal network.
Valkey ×3 — valkey/valkey:9.1 in a 3-flavor split: heatwave-staging-valkey-cache (allkeys-lru), -sessions (noeviction), -queue (noeviction + AOF). RedisConfig routes to them per logical DB via REDIS_CACHE_HOST / REDIS_SESSIONS_HOST / REDIS_QUEUE_HOST (no single REDIS_HOST). Internal to the kamal network — not host-published. Mirrors the prod split (heatwave-valkey-{cache,sessions,queue}).
mailpit — SMTP sink + web UI. App/sidekiq deliver to heatwave-mailpit:1025; the UI is bound to the Tailscale interface only (http://100.123.47.52:8025), so captured staging mail (reset tokens etc.) is never publicly exposed.

Production runs the same accessories, split across the two Latitude boxes. Postgres is a PG18 primary in Dallas (heatwave-postgres) with a cross-DC streaming standby in Chicago (heatwave-postgres-replica). Each node is fronted by its own pgbouncer, and the app connects through a TCP write-VIP HAProxy (heatwave-haproxy:6433, the app’s DATABASE_HOST), so a pg_promote flip reroutes writes with no app redeploy. Valkey uses the same 3-flavor split (heatwave-valkey-cache / -sessions / -queue). Backups are Databasus PITR → Cloudflare R2, taken off the Chicago standby. The old Vultr Postgres (db4/db3) and Vultr Managed Valkey are gone. Full current topology, hosts, ports, and image tags: doc/infrastructure/INFRASTRUCTURE_INVENTORY.md and doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md.
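
The per-flavor Valkey routing can be illustrated with a short sketch. The real RedisConfig lives in the Rails app and is not reproduced here. The function below is a hypothetical Python stand-in that shows only the lookup rule: one environment variable per flavor, with no fallback to a shared REDIS_HOST.

```python
import os
from typing import Mapping, Optional

# Environment variable per Valkey flavor, as named in this README.
FLAVOR_ENV = {
    "cache": "REDIS_CACHE_HOST",
    "sessions": "REDIS_SESSIONS_HOST",
    "queue": "REDIS_QUEUE_HOST",
}


def valkey_host(flavor: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the host for one flavor; a plain REDIS_HOST is never consulted."""
    env = os.environ if env is None else env
    try:
        var = FLAVOR_ENV[flavor]
    except KeyError:
        raise ValueError(f"unknown Valkey flavor: {flavor!r}") from None
    host = env.get(var, "").strip()
    if not host:
        # Fail loudly instead of silently sharing one instance across flavors.
        raise KeyError(f"{var} is not set")
    return host
```

Failing on a missing variable keeps a misconfigured container from quietly sending sessions or queue data to the evictable cache instance.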

Ingress & network

Cloudflare Tunnel (cloudflared, host systemd service, remotely managed — ingress configured in Cloudflare, not on the box). Outbound-only QUIC; the only inbound web path. Routes crm/www/api/mcp.warmlyyours.ws → http://localhost:80.
Cloudflare Access — SSO gate (the wy-employees group) in front of every staging hostname.
Tailscale — the admin/SSH plane (and, in the HA end-state, cross-region DB replication). Hosts get 100.x addresses; SSH is Tailscale-only.
Firewall, defense-in-depth — Latitude edge firewall (SSH from the Tailscale CGNAT range 100.64.0.0/10 only) + host UFW (default-deny inbound, allow lo + tailscale0 + :22) + a DOCKER-USER iptables chain that blocks public :80/:443 (Docker bypasses UFW for published ports) + Cloudflare Access.
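
The edge rule on SSH can be checked with a few lines of Python. This is a model of the allowlist decision only, not the Latitude firewall configuration; the function name is hypothetical.

```python
import ipaddress

# Tailscale assigns node addresses from the CGNAT range named above.
TAILSCALE_CGNAT = ipaddress.ip_network("100.64.0.0/10")


def ssh_allowed_at_edge(source_ip: str) -> bool:
    """Model of the edge rule: inbound :22 only from 100.64.0.0/10."""
    return ipaddress.ip_address(source_ip) in TAILSCALE_CGNAT


print(ssh_allowed_at_edge("100.123.47.52"))  # Dallas Tailscale address -> True
print(ssh_allowed_at_edge("100.68.157.49"))  # Chicago Tailscale address -> True
print(ssh_allowed_at_edge("203.0.113.10"))   # documentation-range public IP -> False
```

The /10 spans 100.64.0.0 through 100.127.255.255, which is why both host addresses listed in the Status note fall inside it.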

Secrets

.kamal/secrets-common — shared: RAILS_MASTER_KEY (= config/master.key), BUNDLE_GEMS__CONTRIBSYS__COM (Sidekiq Pro), KAMAL_REGISTRY_PASSWORD (GHCR).
.kamal/secrets.staging — staging PG password + the staging Heatwave::Configuration env-key.
.kamal/secrets — prod PG password + production env-key (op://IT/Heatwave-Postgres must be created before cutover).
All three are resolver-only (Kamal’s 1Password adapter — no literal secrets) and therefore committed. See MANAGING.md → Secrets.
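
Because the secrets files are committed, a guard against literal values is a natural check. The sketch below is a hypothetical lint, not something this README says exists in the repo. It assumes the files use plain KEY=value lines and treats any value that does not start with `$` (a `$(command)` substitution or a `$VARIABLE` reference) as a literal.

```python
import re

# KEY=value lines; the value must be a shell-style reference, not a literal.
_ASSIGNMENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def literal_secret_keys(text: str) -> list[str]:
    """Return keys whose value is not a $(command) or $VARIABLE reference.

    Lines that are not KEY=value at all are returned verbatim so they
    get looked at too. Blank lines and # comments are skipped.
    """
    offenders = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _ASSIGNMENT.match(line)
        if match is None:
            offenders.append(line)
            continue
        key, value = match.groups()
        if not value.startswith("$"):
            offenders.append(key)
    return offenders
```

An empty result means every entry is a reference; any returned key points at a line that would put a literal value in git. Quoted values are flagged as literals in this simple version.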

Provisioning (Infrastructure as Code)

infra/terraform/latitude/ — provisions a Latitude bare-metal box: SSH keys, cloud-init (deploy user uid 1001, Docker, Tailscale, UFW + DOCKER-USER, cloudflared), RAID-1, edge firewall.
infra/terraform/cloudflare/ — the tunnel (remotely managed) + DNS CNAMEs + Access app/policy for *.warmlyyours.ws.
infra/terraform/ (root) — the original Vultr provisioning module (being retired in favour of Latitude).

Deploy tooling & lifecycle hooks

bin/deploy — the wrapper around kamal deploy (clean-tree gate, 1Password unlock, gated migrations, sourcemap upload, edge-cache purge). See DEPLOYING.md.
.kamal/hooks/pre-build — stamps REVISION (git SHA) into the build context so webpack/AppSignal report a real revision.
.kamal/hooks/pre-deploy — quiets Sidekiq (TSTP) before the swap.
.kamal/hooks/post-deploy — clears REVISION + the Sidekiq quiet marker.
script/db_restore_kamal.sh — fast+deferred DB restore into the staging Postgres accessory (see MANAGING.md → Database restore).
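
The REVISION handling in the pre-build and post-deploy hooks amounts to a write followed by a cleanup. The actual hooks are not reproduced here; the sketch below is a hypothetical Python rendering of those two steps, with the build directory and SHA passed in explicitly.

```python
from pathlib import Path


def stamp_revision(build_dir: Path, git_sha: str) -> Path:
    """pre-build step: write the git SHA into a REVISION file in the build context."""
    revision_file = build_dir / "REVISION"
    revision_file.write_text(git_sha.strip() + "\n")
    return revision_file


def clear_revision(build_dir: Path) -> None:
    """post-deploy step: remove the REVISION file if it is present."""
    (build_dir / "REVISION").unlink(missing_ok=True)
```

Clearing the file after the deploy keeps the working tree clean, which matters because bin/deploy gates on a clean tree.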

Master architecture — staging (live)

flowchart TB
    user([User / browser])

    subgraph CF["Cloudflare edge"]
        tls["TLS termination<br/>+ WAF + cache"]
        access["Access SSO gate<br/>(wy-employees group)"]
        cft["Cloudflare Tunnel<br/>crm/www/api/mcp.warmlyyours.ws"]
    end

    subgraph BOX["Latitude bare-metal — dal-latitude-heatwave-01 (Tailscale 100.123.47.52)"]
        direction TB
        cfd["cloudflared<br/>(host systemd, outbound QUIC)"]
        proxy["kamal-proxy :80<br/>(rolling swap, /up healthcheck)"]

        subgraph NET["docker network: kamal"]
            direction TB
            web["web container<br/>Thruster :80 → Puma :3000"]
            sidekiq["sidekiq container<br/>consolidated capsules + scheduler"]
            pg[("postgres accessory<br/>PG18 · heatwave + heatwave_versions")]
            valkey[("valkey accessories ×3<br/>cache / sessions / queue")]
            mailpit["mailpit accessory<br/>SMTP :1025 / UI :8025"]
        end
    end

    admin([Operator]) -. "SSH / psql / mailpit UI<br/>over Tailscale" .-> BOX

    user -->|HTTPS| tls --> access --> cft
    cft -->|"QUIC (dialed out by cloudflared)"| cfd
    cfd -->|"http://localhost:80"| proxy --> web
    web --> pg & valkey
    web -->|SMTP| mailpit
    sidekiq --> pg & valkey
    sidekiq -->|SMTP| mailpit

Request path: browser → Cloudflare (TLS, Access SSO) → Cloudflare Tunnel → cloudflared on the box → http://localhost:80 (kamal-proxy) → web container (Thruster :80 → Puma :3000). No inbound web ports are open on the host; the tunnel is dialed outbound.

Production topology — pre-cutover snapshot (historical)

Historical. This section and the diagram below capture the pre-cutover Vultr + Capistrano topology and the original Kamal target. Production cut over to Kamal on Latitude on 2026-06-07 (Dallas primary + Chicago standby); see the Status note at the top of this file and INFRASTRUCTURE_INVENTORY.md for the current state.

flowchart TB
    user([User]) -->|HTTPS| cf["Cloudflare edge<br/>(TLS + WAF + Access on CRM)"]
    cf -->|Tunnel| cfd["cloudflared (host)"]
    cfd -->|"localhost:80"| proxy["kamal-proxy"]

    subgraph WEB1["web1 (Vultr, Ubuntu 26.04) — TODO provision"]
        proxy --> web["web container (Puma)"]
        proxy -.-> sk["sidekiq container<br/>(consolidated, co-located to start)"]
    end

    web -->|"public IP 45.63.79.22:5432<br/>(firewall allowlist + SCRAM)"| db4[("db4 — Postgres PRIMARY<br/>heatwave + heatwave_versions")]
    db4 -. "async replication" .-> db3[("db3 — replica")]
    web -->|"TLS, allowlist"| valkey[("Vultr Managed Valkey 7")]
    sk --> db4 & valkey

Cutover prerequisites (gated): create op://IT/Heatwave-Postgres, provision web1, add its public IP to the db4 firewall group + the Valkey allowlist, then bin/deploy production. Full sequence in doc/tasks/202606022303_KAMAL_MIGRATION.md.

Network & security layers

flowchart LR
    subgraph internet["Public internet"]
        u([User]) ; op([Operator])
    end

    subgraph edge["Layer 1 — Cloudflare"]
        e1["TLS + WAF + rate limiting"]
        e2["Access SSO (Zero Trust)"]
    end

    subgraph latfw["Layer 2 — Latitude edge firewall"]
        l1["inbound :22 ← 100.64.0.0/10 only<br/>(Tailscale CGNAT) · default-deny"]
    end

    subgraph host["Layer 3 — host (UFW + DOCKER-USER)"]
        h1["UFW: default-deny in,<br/>allow lo + tailscale0 + :22"]
        h2["DOCKER-USER: DROP public :80/:443,<br/>RETURN on tailscale0"]
    end

    subgraph app["Layer 4 — app"]
        a1["accessories bound to 127.0.0.1<br/>or the Tailscale IP — never 0.0.0.0"]
        a2["web reachable only via the Tunnel"]
    end

    u -->|web| e1 --> e2 -->|"Tunnel (outbound)"| a2
    op -->|SSH/psql/UI| l1 --> h1 --> a1
    h2 --- a2

The web tier is reachable only through the Cloudflare Tunnel (no public port). The operator tier (SSH, psql, mailpit UI) is reachable only over Tailscale. DOCKER-USER exists because Docker inserts iptables rules ahead of UFW for published ports — without it, a published :80 would be world-reachable despite UFW’s default-deny.
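
The DOCKER-USER ordering can be read as a two-rule decision. The sketch below is a toy model of that ordering, not the ruleset installed by cloud-init. The public interface name `eth0` is an assumption, and real iptables matching involves more detail than an interface and a port.

```python
def docker_user_verdict(in_interface: str, dest_port: int,
                        public_interface: str = "eth0") -> str:
    """Toy model of the DOCKER-USER chain order described above.

    Rule 1: traffic arriving on tailscale0 RETURNs to Docker's own rules.
    Rule 2: traffic arriving on the public interface for :80 or :443 is DROPped.
    Anything else falls through with RETURN.
    """
    if in_interface == "tailscale0":
        return "RETURN"
    if in_interface == public_interface and dest_port in (80, 443):
        return "DROP"
    return "RETURN"


print(docker_user_verdict("eth0", 80))        # public :80 -> DROP
print(docker_user_verdict("tailscale0", 80))  # Tailscale -> RETURN
print(docker_user_verdict("eth0", 22))        # not a web port -> RETURN
```

A RETURN here only hands the packet back to the remaining rules; the Latitude edge firewall and UFW still apply to it.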

Key facts at a glance

Thing	Value
Live staging host	`dal-latitude-heatwave-01`, Tailscale 100.123.47.52 (Latitude bare metal, RAID-1)
Staging hostnames	`crm` / `www` / `api` / `mcp`.warmlyyours.ws (TLD env = `warmlyyours.ws`)
Staging Access group	`wy-employees` (`0de0f290-f12c-4046-ae47-b66146f1a4ac`)
App image	`ghcr.io/warmlyyours/heatwave` (GHCR)
PG accessory image	`ghcr.io/warmlyyours/heatwave-postgres:18` (GHCR)
Docker network	`kamal` (app + accessories resolve by name)
Web port path	Cloudflare → tunnel → kamal-proxy `:80` → Thruster `:80` → Puma `:3000`
Deploy user	`deploy`, uid 1001 (must match container `USER 1001`)
Cloudflare account	`79b7f58cf035093b5ad11747df30369a`
Staging zone	`warmlyyours.ws` (`d39acaed475782c4901d4a8e5908c1cb`)
Prod DB	PG18 primary `heatwave-postgres` on Dallas (`100.123.47.52`) + cross-DC streaming standby `heatwave-postgres-replica` on Chicago (`100.68.157.49`); app reaches it via HAProxy write-VIP `heatwave-haproxy:6433` → pgbouncer. See `INFRASTRUCTURE_INVENTORY.md`
Prod cache/queue	Valkey ×3 — `heatwave-valkey-cache` / `-sessions` / `-queue` (3-flavor split, routed per logical DB)
Prod backups	Databasus PITR → Cloudflare R2 (off the Chicago standby)
Deploy command	`bin/deploy [staging