Managing the Kamal Stack — Day-2 Operations
Everything after the deploy: logs, console, accessories, database restores, secrets, mailpit, scaling, and provisioning a new box. For the deploy itself see DEPLOYING.md; for failures see TROUBLESHOOTING.md.
All kamal commands below assume the mise exec -- bundle exec kamal prefix
(abbreviated kamal here). Add -d staging for the staging destination; omit it
for production. SSH/psql/UI access to the box is over Tailscale only.
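
As a convenience, the prefix can be wrapped in a small shell function so that the abbreviated `kamal …` form used on this page works as typed. This is a minimal sketch and an assumption, not something the repo is known to ship; it relies only on `mise` being on `PATH`.

```shell
# Hypothetical wrapper: expands `kamal <args>` to the full prefixed invocation.
# Inside the function body, the trailing `kamal` is just an argument to
# `bundle exec`, so there is no recursion.
kamal() {
  mise exec -- bundle exec kamal "$@"
}
```

With this sourced in a shell session, `kamal app logs -d staging -f` runs `mise exec -- bundle exec kamal app logs -d staging -f`.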
Everyday commands
| Task | Command |
|---|---|
| Tail app logs | kamal app logs -d staging -f |
| Logs for one role | kamal app logs -d staging --roles=sidekiq -f |
| Rails console | kamal console -d staging (alias → app exec --interactive --reuse) |
| Shell in the container | kamal shell -d staging |
| DB console | kamal dbc -d staging |
| Deployed versions | kamal app versions -d staging |
| Container status | kamal app details -d staging |
| Restart a role | kamal app boot --roles=web -d staging |
| Run a one-off task | kamal app exec -d staging --reuse 'bin/rails runner "…"' |
| Print resolved secrets | kamal secrets print -d staging |
Direct on the box (over Tailscale): ssh deploy@100.123.47.52, then
docker ps, docker logs <name>, docker stats.
Accessories (staging)
The postgres, the three valkey flavors, and mailpit containers are
accessories — Kamal manages them but they are not rebuilt on an app
deploy. Lifecycle:
```shell
kamal accessory boot postgres -d staging          # create/start (first run + after host reboot)
kamal accessory boot mailpit -d staging           # one-time after first deploy
kamal accessory reboot valkey_cache -d staging    # restart one flavor (also valkey_sessions / valkey_queue)
kamal accessory logs postgres -d staging -f
kamal accessory details -d staging                # all accessories
```

Boot order after a host reboot. Accessories and the app come up independently; the app may crash-loop briefly until Postgres is ready. `restart: unless-stopped` retries it, so it self-heals, but if the app is down after a reboot, check the accessories first (`docker ps`).
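
When scripting around a reboot, it can help to wait for Postgres explicitly instead of watching the crash loop. The helper below is a minimal sketch of a bounded retry loop; `wait_for` is a hypothetical name, and the container name in the usage comment is an assumption inferred from the valkey naming (check `docker ps` for the real one).

```shell
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_for() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0            # command succeeded, stop polling
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1                # never succeeded within the budget
}

# Usage on the box (container name is an assumption):
#   wait_for 30 2 docker exec heatwave-staging-postgres pg_isready
```

`pg_isready` exits 0 only once the server accepts connections, which makes it a natural probe for this loop.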
Accessory config lives in config/deploy.staging.yml:
- postgres: custom PG18 image, tuned `cmd` (`shared_buffers=8GB`, etc.), data on the `pgdata` host volume, published `127.0.0.1:5432`.
- valkey ×3: `valkey/valkey:9.1` in a 3-flavor split (parity with prod): `heatwave-staging-valkey-cache` (`allkeys-lru`), `-sessions` (`noeviction`), `-queue` (`noeviction` + AOF), each on its own conf in `config/valkey/`. The app routes to them per logical DB via `REDIS_CACHE_HOST` / `REDIS_SESSIONS_HOST` / `REDIS_QUEUE_HOST` (`config/initializers/100_redis_config.rb`); there is no single `REDIS_HOST`. Internal to the `kamal` network, not host-published.
- mailpit: bound to the Tailscale IP `100.123.47.52:8025` (UI) + internal `:1025` (SMTP).
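
Because there is no single `REDIS_HOST`, a missing per-flavor variable is easy to overlook. The following is a minimal pre-flight sketch that reports which of the three variables are unset; `check_redis_env` is a hypothetical helper, not part of the app or of Kamal.

```shell
# Returns 0 when all three per-flavor host variables are set and non-empty;
# otherwise lists the missing names on stderr and returns 1.
check_redis_env() {
  missing=""
  for name in REDIS_CACHE_HOST REDIS_SESSIONS_HOST REDIS_QUEUE_HOST; do
    # Indirect lookup of the variable named by $name (fixed list, so eval is safe).
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
}
```

It only checks presence in the current environment; it does not try to connect to the valkey containers.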
Database restore
Staging data is refreshed from the newest prod dump (Databasus → Cloudflare R2,
the backup-of-record; BACKUP_SOURCE=wasabi is a legacy fallback) using the
fast + deferred strategy (never a naive full pg_restore — the
communications hash index alone took 82 min on 8.7M rows and blocked everything).
```shell
# On the box (scp the script over), with the R2 bucket creds + the accessory PG password:
AWS_ACCESS_KEY_ID=… AWS_SECRET_ACCESS_KEY=… PGPASSWORD=… ./db_restore_kamal.sh
```

What it does (`script/db_restore_kamal.sh`):
```mermaid
flowchart TB
  a["download + decompress newest Databasus/R2 dump"] --> b
  b["build TOCs: fast (skip large tables) + deferred (only large tables)"] --> c
  c["schema-only restore → indexes build on EMPTY tables (instant)"] --> d
  d["FAST data restore (core tables, -L fast_toc)"] --> e
  e["refresh CRITICAL matviews (view_quote_bom_items), verified, pre-swap"] --> f
  f["swap heatwave_restore → heatwave + restart app ★ CORE DB LIVE"] --> g
  g["DEFERRED: load large tables (-L deferred_toc)"] --> h
  h["refresh analytics matviews (interruptible) + VACUUM ANALYZE"]
```

Key points:
- `SKIP_TABLES` (deferred): `visits, visit_events, communications, communication_recipients, communications_uploads, audit_trails, store_item_audits, data_imports, edi_communication_logs`.
- Critical vs. analytics matviews. `view_quote_bom_items` (the quote builder’s BOM source) is refreshed eagerly, pre-swap, and verified: an empty matview surfaces in the UI as “No matching controls”. The ~22 analytics matviews are refreshed after the swap (non-blocking) and self-heal via the hourly `MatviewRefreshWorker` cron if interrupted. (Mirrored in `script/db_restore.sh` for the dev/local restore.)
- `heatwave_versions` stays schema-only on staging by default. Note that its partitioned tables need annual child partitions created; otherwise the first write 500s with “no partition of relation versions found”. `db/versions_structure.sql` now carries them (pg_party schema-dump fix, PR #1031).
- Flags: `NO_SWAP=1` (build but don’t go live), `NO_DEFERRED=1` (core only), `KEEP_DUMP=1`, `BACKUP_FILE=…`, `APP_IMAGE=…`.
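
The fast/deferred split rests on `pg_restore -l` (write a table of contents) and `pg_restore -L` (restore only the listed entries). The sketch below shows one way to split the `TABLE DATA` entries of a TOC into the two lists. It is an illustration, not the code in `script/db_restore_kamal.sh`: `split_toc` is a hypothetical helper, it assumes the usual `pg_restore -l` line layout (`id; oid oid TABLE DATA schema table owner`), and it ignores other entry types such as sequence values.

```shell
# split_toc "table1 table2 ..." FAST_FILE DEFERRED_FILE < full.toc
# Writes TABLE DATA entries for the named tables to DEFERRED_FILE and all
# other TABLE DATA entries to FAST_FILE.
split_toc() {
  awk -v skip="$1" -v fast="$2" -v deferred="$3" '
    BEGIN {
      n = split(skip, names, " ")
      for (i = 1; i <= n; i++) large[names[i]] = 1
      printf "" > fast        # create/truncate both outputs
      printf "" > deferred
    }
    /^;/ { next }             # TOC comment lines start with ";"
    $4 == "TABLE" && $5 == "DATA" {
      # Field 7 is the table name in the assumed layout.
      if ($7 in large) { print > deferred } else { print > fast }
    }
  '
}

# Usage sketch (dump path is a placeholder):
#   pg_restore -l dump.pgdump | split_toc "visits communications" fast.toc deferred.toc
#   pg_restore --data-only -L fast.toc -d heatwave_restore dump.pgdump
```

Keeping the large tables out of the first list is what lets the core database go live before the slow loads start.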
Secrets
```mermaid
flowchart LR
  subgraph files[".kamal/secrets* (resolver-only, committed)"]
    common["secrets-common<br/>RAILS_MASTER_KEY · Sidekiq Pro · GHCR"]
    stg["secrets.staging<br/>PG password · staging env-key"]
    prod["secrets<br/>PG password · production env-key"]
  end
  adapter["kamal secrets fetch/extract<br/>(1Password adapter)"]
  op[("1Password<br/>warmlyyours.1password.com · vault IT")]
  mk["config/master.key (local)"]

  common & stg & prod --> adapter --> op
  common -. "RAILS_MASTER_KEY / env-key = cat" .-> mk
```

Model: the `.kamal/secrets*` files contain no literal secrets, only
resolver expressions (`kamal secrets fetch --adapter 1password …` + `kamal secrets extract …`, and `cat config/master.key`). They are therefore committed to git.
A fresh machine resolves everything with a signed-in 1Password (warmlyyours
account, IT vault) plus a local config/master.key.
Always validate before deploying:
```shell
kamal secrets print -d staging   # and `kamal secrets print` for prod
```

Extract-key gotcha. The adapter strips a trailing `/password` from the map key: a `…/password` field extracts as `IT/<Item>` (no `/password`); other fields (`/credential`) keep the field name. This is why the secrets files read `extract IT/Heatwave-Staging-Postgres` (not `…/password`).
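
The rule can be stated as a one-line transformation. The sketch below mirrors the behaviour described above so the expected extract key can be checked by hand; `extract_key` is a hypothetical helper, not Kamal's own code.

```shell
# Given a fetched field path, print the key to pass to `kamal secrets extract`.
extract_key() {
  # Strip only a trailing "/password"; any other field name is kept as-is.
  printf '%s\n' "${1%/password}"
}
```

For example, `extract_key IT/Heatwave-Staging-Postgres/password` prints `IT/Heatwave-Staging-Postgres`, while `extract_key IT/Sidekiq-Pro/credential` prints the path unchanged.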
1Password items (vault IT)
| Item | Used for |
|---|---|
| Sidekiq-Pro/credential | BUNDLE_GEMS__CONTRIBSYS__COM (build-time gem auth) |
| GitHub-ghcr-deploy/credential | KAMAL_REGISTRY_PASSWORD (GHCR push/pull) |
| Heatwave-Staging-Postgres/password | staging PG accessory + app DB password |
| Heatwave-Postgres/password | prod DB password; create before cutover |
| AppSignal-account-push-key/credential | post-deploy sourcemap upload (account-wide key; the site key 401s) |
| Tailscale-Kamal/credential | cloud-init Tailscale auth key |
| Cloudflare-Account-API-Token/credential | tunnel/DNS/Access Terraform |
| Latitude-API/credential | bare-metal host provisioning |
Service-account token (headless / CI / flaky desktop app)
The 1Password desktop-app CLI integration occasionally fails with “couldn’t connect to the 1Password desktop app”. The robust path is a service-account token, which needs no desktop app and no biometric prompt:
```shell
# Save the token to a gitignored file scoped to deploys (NOT your interactive shell):
printf '%s' '<token>' > .kamal/.op-service-account-token   # gitignored
# bin/deploy reads it automatically; CI can export OP_SERVICE_ACCOUNT_TOKEN instead.
```

`bin/deploy`’s `op_session()` short-circuits to the token when present. Rotate by
overwriting the file. (`bin/setup` can populate `.env.mcp.local` from
`op://IT/1password-heatwave-ops`.)
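
The lookup order can be pictured roughly as follows. This is a sketch of the idea only, not the actual `op_session()` in `bin/deploy`: the function name `resolve_op_token` and the choice to prefer the environment variable over the file are assumptions.

```shell
# Print a service-account token if one is available; return 1 otherwise.
# $1 (optional): token file path, defaulting to the gitignored deploy file.
resolve_op_token() {
  token_file="${1:-.kamal/.op-service-account-token}"
  if [ -n "${OP_SERVICE_ACCOUNT_TOKEN:-}" ]; then
    printf '%s' "$OP_SERVICE_ACCOUNT_TOKEN"   # CI path: already exported
  elif [ -s "$token_file" ]; then
    cat "$token_file"                         # local path: non-empty token file
  else
    return 1                                  # caller falls back to the desktop app
  fi
}
```

A caller could then run `OP_SERVICE_ACCOUNT_TOKEN="$(resolve_op_token)" op read …` and only fall back to the desktop-app sign-in when the function returns non-zero.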
Email (mailpit)
Staging captures all outbound mail in mailpit instead of sending for real (reset tokens, the noisy scheduler/Sidekiq mail, campaigns):
- UI: `http://100.123.47.52:8025` (Tailscale only, never public).
- App/sidekiq deliver to `heatwave-mailpit:1025` over the `kamal` network (`config/environments/staging.rb`); `config.x.mailpit_url` drives the admin/campaign UI links.
- One-time after the first deploy: `kamal accessory boot mailpit -d staging`.
- To escape to real SendGrid for a test, the staging mailer honours `SEND_FOR_REAL=y`.
Sidekiq
A single consolidated container (SIDEKIQ_CONSOLIDATED=1) runs every queue
class via capsules (high/low/campaign at concurrency 9/10/10), the default set,
and the scheduler — see config/initializers/sidekiq.rb + config/sidekiq.yml.
```shell
kamal app logs --roles=sidekiq -d staging -f
kamal app boot --roles=sidekiq -d staging   # restart (un-quiet) the worker
```

- Rolling deploys quiet it (TSTP) via `.kamal/hooks/pre-deploy`; `super_fetch` recovers in-flight jobs, so no job is lost on a swap.
- To split queue classes back onto separate hosts later, restore one role per config (`sidekiq_high.yml`, …) and drop `SIDEKIQ_CONSOLIDATED` (else a queue would be served by both a capsule and a dedicated process).
Bulk operations (>1000 records / jobs) follow the count-first, two-confirmation protocol in `CLAUDE.md`: a careless mass-enqueue against the shared `:default` queue is hard to undo. Surface the count before enqueuing.
Scaling & tuning
- Web concurrency: `PUMA_WORKERS` / `WEB_CONCURRENCY` (4) + thread counts in `config/deploy.yml` `env.clear`. Tune for the shared box.
- Add a host to a role: add its IP under `servers.<role>.hosts` and redeploy. kamal-proxy on each host load-balances independently behind the tunnel.
- GC: `RUBY_GC_*` heap-tuning envs, carried over from the pre-Kamal Puma config.
- Postgres: the staging accessory `cmd` in `config/deploy.staging.yml` is tuned down (`shared_buffers=8GB`, `effective_cache_size=24GB`) because the box (192 GB) is shared with the co-located prod stack; prod PG18 gets its own full-size tuning.
Provisioning a new box (Terraform / OpenTofu)
Two decoupled modules under infra/terraform/. Use OpenTofu (tofu).
```mermaid
flowchart TB
  subgraph cfmod["infra/terraform/cloudflare/"]
    t1["tunnel (remotely-managed)"] --> t2["DNS CNAMEs → *.cfargotunnel.com"]
    t1 --> t3["Access app + policy (wy-employees)"]
    t1 --> tok["output: tunnel_token (sensitive)"]
  end
  subgraph latmod["infra/terraform/latitude/"]
    l1["SSH keys (files/authorized_keys)"] --> l2["latitudesh_server (RAID-1)"]
    l3["cloud-init: deploy uid 1001 · Docker · Tailscale ·<br/>UFW + DOCKER-USER · cloudflared"] --> l2
    l4["edge firewall: :22 ← 100.64.0.0/10"] --> l2
  end
  tok -->|"-var cloudflared_token=…"| l3
```

```shell
# 1. Cloudflare side (tunnel + DNS + Access); CLOUDFLARE_API_TOKEN via direnv:
cd infra/terraform/cloudflare && tofu init && tofu apply

# 2. Latitude box, wired to that tunnel:
cd ../latitude
export LATITUDESH_AUTH_TOKEN="$(op read op://IT/Latitude-API/credential)"
tofu init && tofu apply \
  -var project=<id> \
  -var hostname=<name> \
  -var tailscale_auth_key="$(op read op://IT/Tailscale-Kamal/credential)" \
  -var cloudflared_token="$(tofu -chdir=../cloudflare output -raw tunnel_token)"
```

cloud-init yields a fully-wired box (Docker + Tailscale + cloudflared + UFW +
DOCKER-USER). The deploy user is pinned to uid 1001 so it matches the
container’s USER 1001 and can own Kamal’s asset_path bind-mount (otherwise the
post-deploy DELETE_MAPS sourcemap cleanup fails with EACCES). Then
bin/deploy -d <dest>; the server bootstrap is a near no-op.
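
The uid pin is easy to verify on a freshly provisioned host before the first deploy. The check below is a minimal sketch; `uid_matches` is a hypothetical helper and uses only the standard `id` utility.

```shell
# uid_matches USER EXPECTED_UID
# Returns 0 if USER exists and has EXPECTED_UID, 1 on a mismatch,
# and 2 if the user does not exist.
uid_matches() {
  actual="$(id -u "$1" 2>/dev/null)" || return 2
  [ "$actual" = "$2" ]
}

# Usage on the box:
#   uid_matches deploy 1001 || echo "deploy uid does not match container USER 1001" >&2
```

A mismatch here is the condition that would later surface as the `EACCES` failure during the sourcemap cleanup.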
The current staging box (`dal-latitude-heatwave-01`, f4-metal-medium / Ubuntu 26.04 / ZFS data plane) was provisioned via this module (`infra/terraform/latitude`, `setup_zfs_data=true`); it is the reproducible recipe in use. The earlier hand-built Ashburn box it replaced has been decommissioned. To adopt an already-running box into state instead of rebuilding, run `tofu import latitudesh_server.host <id>` and install cloudflared by hand once (cloud-init won’t retroactively run).
See doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md for the current two-region
HA topology (PG18 primary in Dallas + cross-DC streaming standby in Chicago,
fronted by per-node pgbouncer + the HAProxy write-VIP heatwave-haproxy:6433, with
pg_promote-driven failover). INFRASTRUCTURE_INVENTORY.md is the live host/port
reference. (The older …202606041041_BARE_METAL_HA_STACK.md described a
Chicago-primary / Ashburn-standby end-state and is superseded.)