Managing the Kamal Stack — Day-2 Operations

Everything after the deploy: logs, console, accessories, database restores, secrets, mailpit, scaling, and provisioning a new box. For the deploy itself see DEPLOYING.md; for failures see TROUBLESHOOTING.md.

All kamal commands below assume the mise exec -- bundle exec kamal prefix (abbreviated kamal here). Add -d staging for the staging destination; omit it for production. SSH/psql/UI access to the box is over Tailscale only.
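If typing the full prefix gets tedious, one option is a small shell function that expands to it. This is a convenience sketch, not something the repo is known to ship; it assumes mise and bundle are on your PATH and that you run it from the project root.

```shell
# Hypothetical wrapper (not part of the repo): makes `kamal ...` expand to the
# full `mise exec -- bundle exec kamal ...` invocation this document assumes.
kamal() {
  mise exec -- bundle exec kamal "$@"
}

# Usage (append -d staging for staging, omit it for production):
#   kamal app details -d staging
```

Because the inner kamal is passed to mise as an argument, the function does not call itself recursively.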

Everyday commands

Task	Command
Tail app logs	`kamal app logs -d staging -f`
Logs for one role	`kamal app logs -d staging --roles=sidekiq -f`
Rails console	`kamal console -d staging` (alias → `app exec --interactive --reuse`)
Shell in the container	`kamal shell -d staging`
DB console	`kamal dbc -d staging`
Deployed versions	`kamal app versions -d staging`
Container status	`kamal app details -d staging`
Restart a role	`kamal app boot --roles=web -d staging`
Run a one-off task	`kamal app exec -d staging --reuse 'bin/rails runner "…"'`
Print resolved secrets	`kamal secrets print -d staging`

Direct on the box (over Tailscale): ssh deploy@100.123.47.52, then docker ps, docker logs <name>, docker stats.

Accessories (staging)

The Postgres, Valkey (three flavors), and mailpit containers are accessories: Kamal manages them, but an app deploy does not rebuild them. Lifecycle:

kamal accessory boot   postgres       -d staging   # create/start (first run + after host reboot)
kamal accessory boot   mailpit        -d staging   # one-time after first deploy
kamal accessory reboot valkey_cache   -d staging   # restart one flavor (also valkey_sessions / valkey_queue)
kamal accessory logs   postgres       -d staging -f
kamal accessory details              -d staging   # all accessories

Boot order after a host reboot. Accessories and the app come up independently, so the app may crash-loop briefly until Postgres is ready. Docker's restart: unless-stopped policy keeps retrying it, so it self-heals. If the app is still down after a reboot, check the accessories first (docker ps).
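When scripting around a reboot, it can help to block until Postgres accepts connections instead of watching the crash-loop. The following is a minimal sketch, assuming pg_isready is installed on the host and that the accessory is published on 127.0.0.1:5432 as described in the accessory config; retry and wait_for_postgres are hypothetical helper names.

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Assumption: pg_isready exists on the host; -q suppresses output and the exit
# status alone reports readiness.
wait_for_postgres() {
  retry 30 2 pg_isready -q -h 127.0.0.1 -p 5432
}
```

On the box, `wait_for_postgres && docker ps` would then list containers only once the database is reachable.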

Accessory config lives in config/deploy.staging.yml:

postgres — custom PG18 image, tuned cmd (shared_buffers=8GB, etc.), data on the pgdata host volume, published 127.0.0.1:5432.
valkey ×3 — valkey/valkey:9.1 in a 3-flavor split (parity with prod): heatwave-staging-valkey-cache (allkeys-lru), -sessions (noeviction), -queue (noeviction + AOF), each on its own conf in config/valkey/. The app routes to them per logical DB via REDIS_CACHE_HOST / REDIS_SESSIONS_HOST / REDIS_QUEUE_HOST (config/initializers/100_redis_config.rb) — there is no single REDIS_HOST. Internal to the kamal network, not host-published.
mailpit — bound to the Tailscale IP 100.123.47.52:8025 (UI) + internal :1025 (SMTP).

Database restore

Staging data is refreshed from the newest prod dump (Databasus → Cloudflare R2, the backup-of-record; BACKUP_SOURCE=wasabi is a legacy fallback) using the fast + deferred strategy. Never run a naive full pg_restore: the communications hash index alone took 82 min on 8.7M rows and blocked everything else.

# On the box (scp the script over), with the R2 bucket creds + the accessory PG password:
AWS_ACCESS_KEY_ID=… AWS_SECRET_ACCESS_KEY=… PGPASSWORD=… ./db_restore_kamal.sh

What it does (script/db_restore_kamal.sh):

flowchart TB
    a["download + decompress newest Databasus/R2 dump"] --> b
    b["build TOCs: fast (skip large tables) + deferred (only large tables)"] --> c
    c["schema-only restore → indexes build on EMPTY tables (instant)"] --> d
    d["FAST data restore (core tables, -L fast_toc)"] --> e
    e["refresh CRITICAL matviews (view_quote_bom_items) — verified, pre-swap"] --> f
    f["swap heatwave_restore → heatwave + restart app  ★ CORE DB LIVE"] --> g
    g["DEFERRED: load large tables (-L deferred_toc)"] --> h
    h["refresh analytics matviews (interruptible) + VACUUM ANALYZE"]

Key points:

SKIP_TABLES (deferred): visits, visit_events, communications, communication_recipients, communications_uploads, audit_trails, store_item_audits, data_imports, edi_communication_logs.
Critical vs. analytics matviews. view_quote_bom_items (the quote builder’s BOM source) is refreshed eagerly, pre-swap, and verified — an empty matview surfaces in the UI as “No matching controls”. The ~22 analytics matviews are refreshed after the swap (non-blocking) and self-heal via the hourly MatviewRefreshWorker cron if interrupted. (Mirrored in script/db_restore.sh for the dev/local restore.)
heatwave_versions stays schema-only on staging by default. Its partitioned tables need their annual child partitions created; otherwise the first write returns a 500 with “no partition of relation versions found”. db/versions_structure.sql now carries them (pg_party schema-dump fix, PR #1031).
Flags: NO_SWAP=1 (build but don’t go live), NO_DEFERRED=1 (core only), KEEP_DUMP=1, BACKUP_FILE=…, APP_IMAGE=….
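The fast/deferred split can be pictured as filtering the archive's table of contents. The sketch below is not the contents of script/db_restore_kamal.sh. It is a simplified illustration that assumes the standard `pg_restore -l` listing format (entries such as `3375; 0 16386 TABLE DATA public visits postgres`) and routes only TABLE DATA entries; a real restore also has to deal with other entry types, such as sequence values. split_toc, its arguments, and the file names are hypothetical.

```shell
# split_toc <full_toc> <fast_toc> <deferred_toc> "<space-separated skip tables>"
# Writes TABLE DATA entries for skipped tables to <deferred_toc> and all other
# TABLE DATA entries to <fast_toc>. Comment lines and non-data entries are dropped.
split_toc() {
  local full_toc=$1 fast_toc=$2 deferred_toc=$3 skip_tables=$4
  awk -v skip="$skip_tables" -v fast="$fast_toc" -v deferred="$deferred_toc" '
    BEGIN {
      n = split(skip, names, " ")
      for (i = 1; i <= n; i++) skipped[names[i]] = 1
      # Create (or truncate) both outputs so they exist even when empty.
      printf "" > fast
      printf "" > deferred
    }
    # Fields: "<id>;" <tableoid> <oid> TABLE DATA <schema> <table> <owner>
    $4 == "TABLE" && $5 == "DATA" {
      if ($7 in skipped) { print > deferred } else { print > fast }
    }
  ' "$full_toc"
}

# Usage sketch (dump.pgdump is a placeholder file name):
#   pg_restore -l dump.pgdump > full.toc
#   split_toc full.toc fast.toc deferred.toc "visits visit_events communications"
#   pg_restore --data-only -L fast.toc -d heatwave_restore dump.pgdump
```

Keeping the two lists as files makes the deferred phase a plain second pg_restore run against the same dump.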

Secrets

flowchart LR
    subgraph files[".kamal/secrets* (resolver-only, committed)"]
        common["secrets-common<br/>RAILS_MASTER_KEY · Sidekiq Pro · GHCR"]
        stg["secrets.staging<br/>PG password · staging env-key"]
        prod["secrets<br/>PG password · production env-key"]
    end
    adapter["kamal secrets fetch/extract<br/>(1Password adapter)"]
    op[("1Password<br/>warmlyyours.1password.com · vault IT")]
    mk["config/master.key (local)"]

    common & stg & prod --> adapter --> op
    common -. "RAILS_MASTER_KEY / env-key = cat" .-> mk

Model: the .kamal/secrets* files contain no literal secrets — only resolver expressions (kamal secrets fetch --adapter 1password … + kamal secrets extract …, and cat config/master.key). They are therefore committed to git. A fresh machine resolves everything with a signed-in 1Password (warmlyyours account, IT vault) plus a local config/master.key.

Always validate before deploying:

kamal secrets print -d staging      # and `kamal secrets print` for prod
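Reading the printed secrets by eye is error-prone, so a small filter can flag keys that resolved to nothing. This is a sketch under the assumption that `kamal secrets print` emits one KEY=VALUE pair per line; empty_secret_keys is a hypothetical helper, not part of Kamal or bin/deploy.

```shell
# Reads KEY=VALUE lines on stdin and prints the keys whose value is empty.
# Exit status is 1 if at least one empty value was found, 0 otherwise.
empty_secret_keys() {
  awk -F= '
    /^[A-Za-z_][A-Za-z0-9_]*=/ {
      # Everything after the first "=" is the value (values may contain "=").
      value = substr($0, length($1) + 2)
      if (value == "") { print $1; found = 1 }
    }
    END { exit found }
  '
}

# Usage sketch:
#   kamal secrets print -d staging | empty_secret_keys || echo "fix secrets before deploying"
```

Piping the output also keeps the resolved values off the terminal scrollback.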

Extract-key gotcha. The adapter strips a trailing /password from the map key — a …/password field extracts as IT/<Item> (no /password); other fields (/credential) keep the field name. This is why the secrets files read extract IT/Heatwave-Staging-Postgres (not …/password).
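The rule is easy to misremember, so here it is restated as a tiny function. This models the behavior described above for illustration only; it is not the adapter's code, and extract_key is a hypothetical name.

```shell
# Given a fetched reference "<vault>/<item>/<field>", print the key that
# `kamal secrets extract` expects: a trailing /password is dropped,
# any other field name is kept.
extract_key() {
  printf '%s\n' "${1%/password}"
}

# extract_key IT/Heatwave-Staging-Postgres/password  -> IT/Heatwave-Staging-Postgres
# extract_key IT/Sidekiq-Pro/credential              -> IT/Sidekiq-Pro/credential
```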

1Password items (vault IT)

Item	Used for
`Sidekiq-Pro/credential`	`BUNDLE_GEMS__CONTRIBSYS__COM` (build-time gem auth)
`GitHub-ghcr-deploy/credential`	`KAMAL_REGISTRY_PASSWORD` (GHCR push/pull)
`Heatwave-Staging-Postgres/password`	staging PG accessory + app DB password
`Heatwave-Postgres/password`	prod DB password — create before cutover
`AppSignal-account-push-key/credential`	post-deploy sourcemap upload (account-wide key; the site key 401s)
`Tailscale-Kamal/credential`	cloud-init Tailscale auth key
`Cloudflare-Account-API-Token/credential`	tunnel/DNS/Access Terraform
`Latitude-API/credential`	bare-metal host provisioning

Service-account token (headless / CI / flaky desktop app)

The 1Password desktop-app CLI integration occasionally fails with “couldn’t connect to the 1Password desktop app”. The robust path is a service-account token — no desktop app, no biometric:

# Save the token to a gitignored file scoped to deploys (NOT your interactive shell):
printf '%s' '<token>' > .kamal/.op-service-account-token   # gitignored
# bin/deploy reads it automatically; CI can export OP_SERVICE_ACCOUNT_TOKEN instead.

bin/deploy’s op_session() short-circuits to the token when present. Rotate by overwriting the file. (bin/setup can populate .env.mcp.local from op://IT/1password-heatwave-ops.)
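The short-circuit can be sketched as follows. This is not the actual op_session() from bin/deploy, only an illustration of the idea; load_op_token is a hypothetical name. OP_SERVICE_ACCOUNT_TOKEN is the environment variable the 1Password CLI reads for service accounts.

```shell
# Export OP_SERVICE_ACCOUNT_TOKEN from the token file when one is present.
# Returns 0 if a token is available afterwards, 1 if the caller should fall
# back to the desktop-app sign-in.
load_op_token() {
  local token_file=${1:-.kamal/.op-service-account-token}
  if [ -n "${OP_SERVICE_ACCOUNT_TOKEN:-}" ]; then
    return 0    # already provided, e.g. exported by CI
  fi
  if [ -s "$token_file" ]; then
    OP_SERVICE_ACCOUNT_TOKEN=$(cat "$token_file")
    export OP_SERVICE_ACCOUNT_TOKEN
    return 0
  fi
  return 1
}
```

Since the file holds a live credential, restricting it with `chmod 600` is a sensible extra precaution on shared machines.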

Email (mailpit)

Staging captures all outbound mail in mailpit instead of sending for real (reset tokens, the noisy scheduler/Sidekiq mail, campaigns):

UI: http://100.123.47.52:8025 (Tailscale only — never public).
App/sidekiq deliver to heatwave-mailpit:1025 over the kamal network (config/environments/staging.rb); config.x.mailpit_url drives the admin/campaign UI links.
One-time after the first deploy: kamal accessory boot mailpit -d staging.
To send a test through real SendGrid instead, set SEND_FOR_REAL=y; the staging mailer honours it.

Sidekiq

A single consolidated container (SIDEKIQ_CONSOLIDATED=1) runs every queue class: the high/low/campaign capsules (concurrency 9/10/10), the default set, and the scheduler. See config/initializers/sidekiq.rb and config/sidekiq.yml.

kamal app logs --roles=sidekiq -d staging -f
kamal app boot --roles=sidekiq -d staging      # restart (un-quiet) the worker

Rolling deploys quiet it (TSTP) via .kamal/hooks/pre-deploy; super_fetch recovers in-flight jobs, so no job is lost on a swap.
To split queue classes back onto separate hosts later, restore one role per config (sidekiq_high.yml, …) and drop SIDEKIQ_CONSOLIDATED (else a queue would be served by both a capsule and a dedicated process).

Bulk operations (>1000 records / jobs) follow the count-first, two-confirmation protocol in CLAUDE.md — a careless mass-enqueue against the shared :default queue is hard to undo. Surface the count before enqueuing.

Scaling & tuning

Web concurrency — PUMA_WORKERS / WEB_CONCURRENCY (4) + thread counts in config/deploy.yml env.clear. Tune for the shared box.
Add a host to a role — add its IP under servers.<role>.hosts and redeploy. kamal-proxy on each host load-balances independently behind the tunnel.
GC — RUBY_GC_* heap-tuning envs, carried over from the pre-Kamal Puma config.
Postgres — the staging accessory cmd in config/deploy.staging.yml is tuned down (shared_buffers=8GB, effective_cache_size=24GB) because the box (192 GB) is shared with the co-located prod stack; prod PG18 gets its own full-size tuning.

Provisioning a new box (Terraform / OpenTofu)

Two decoupled modules under infra/terraform/. Use OpenTofu (tofu).

flowchart TB
    subgraph cfmod["infra/terraform/cloudflare/"]
        t1["tunnel (remotely-managed)"] --> t2["DNS CNAMEs → *.cfargotunnel.com"]
        t1 --> t3["Access app + policy (wy-employees)"]
        t1 --> tok["output: tunnel_token (sensitive)"]
    end
    subgraph latmod["infra/terraform/latitude/"]
        l1["SSH keys (files/authorized_keys)"] --> l2["latitudesh_server (RAID-1)"]
        l3["cloud-init: deploy uid 1001 · Docker · Tailscale ·<br/>UFW + DOCKER-USER · cloudflared"] --> l2
        l4["edge firewall: :22 ← 100.64.0.0/10"] --> l2
    end
    tok -->|"-var cloudflared_token=…"| l3

# 1. Cloudflare side (tunnel + DNS + Access) — CLOUDFLARE_API_TOKEN via direnv:
cd infra/terraform/cloudflare && tofu init && tofu apply

# 2. Latitude box, wired to that tunnel:
cd ../latitude
export LATITUDESH_AUTH_TOKEN="$(op read op://IT/Latitude-API/credential)"
tofu init && tofu apply \
  -var project=<id> \
  -var hostname=<name> \
  -var tailscale_auth_key="$(op read op://IT/Tailscale-Kamal/credential)" \
  -var cloudflared_token="$(tofu -chdir=../cloudflare output -raw tunnel_token)"

cloud-init yields a fully-wired box (Docker + Tailscale + cloudflared + UFW + DOCKER-USER). The deploy user is pinned to uid 1001 so it matches the container’s USER 1001 and can own Kamal’s asset_path bind-mount (otherwise the post-deploy DELETE_MAPS sourcemap cleanup fails with EACCES). Then run bin/deploy -d <dest>; the server bootstrap is a near no-op.
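Before the first deploy to a new box, the uid pinning can be confirmed with a quick check. This is a minimal sketch; check_uid is a hypothetical helper and is not part of the Terraform module or cloud-init.

```shell
# check_uid <user> <expected_uid>
# Returns 0 on a match, 1 on a mismatch, 2 if the user does not exist.
check_uid() {
  local user=$1 expected=$2 actual
  actual=$(id -u "$user" 2>/dev/null) || {
    echo "user $user not found" >&2
    return 2
  }
  if [ "$actual" != "$expected" ]; then
    echo "uid mismatch for $user: got $actual, want $expected" >&2
    return 1
  fi
}

# On the box (over Tailscale):
#   check_uid deploy 1001
```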

The current staging box (dal-latitude-heatwave-01, f4-metal-medium / Ubuntu 26.04 / ZFS data plane) was provisioned via this module (infra/terraform/latitude, setup_zfs_data=true); that module is the reproducible recipe in use. The earlier hand-built Ashburn box it replaced has been decommissioned. To adopt an already-running box into state instead of rebuilding, run tofu import latitudesh_server.host <id> and install cloudflared by hand once (cloud-init won’t retroactively run).

See doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md for the current two-region HA topology (PG18 primary in Dallas + cross-DC streaming standby in Chicago, fronted by per-node pgbouncer + the HAProxy write-VIP heatwave-haproxy:6433, with pg_promote-driven failover). INFRASTRUCTURE_INVENTORY.md is the live host/port reference. (The older …202606041041_BARE_METAL_HA_STACK.md described a Chicago-primary / Ashburn-standby end-state and is superseded.)