
Managing the Kamal Stack — Day-2 Operations

Everything after the deploy: logs, console, accessories, database restores, secrets, mailpit, scaling, and provisioning a new box. For the deploy itself see DEPLOYING.md; for failures see TROUBLESHOOTING.md.

All kamal commands below assume the mise exec -- bundle exec kamal prefix (abbreviated kamal here). Add -d staging for the staging destination; omit it for production. SSH/psql/UI access to the box is over Tailscale only.
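One way to get the abbreviated form in your own shell is a small wrapper function. This is a minimal sketch, not something the repo ships: the function and the KAMAL_DRY_RUN variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical wrapper (not part of the repo): expands `kamal ...` into the
# full `mise exec -- bundle exec kamal ...` invocation assumed on this page.
kamal() {
  if [ -n "${KAMAL_DRY_RUN:-}" ]; then
    # Dry run: print the expanded command instead of running it.
    printf 'mise exec -- bundle exec kamal %s\n' "$*"
  else
    mise exec -- bundle exec kamal "$@"
  fi
}

# Usage (from the repo root):
#   kamal app logs -d staging -f
#   KAMAL_DRY_RUN=1 kamal app details -d staging
```

The dry-run branch is only there to make the expansion visible; drop it if you just want the shorthand.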


| Task | Command |
| --- | --- |
| Tail app logs | kamal app logs -d staging -f |
| Logs for one role | kamal app logs -d staging --roles=sidekiq -f |
| Rails console | kamal console -d staging (alias → app exec --interactive --reuse) |
| Shell in the container | kamal shell -d staging |
| DB console | kamal dbc -d staging |
| Deployed versions | kamal app versions -d staging |
| Container status | kamal app details -d staging |
| Restart a role | kamal app boot --roles=web -d staging |
| Run a one-off task | kamal app exec -d staging --reuse 'bin/rails runner "…"' |
| Print resolved secrets | kamal secrets print -d staging |

Direct on the box (over Tailscale): ssh deploy@100.123.47.52, then docker ps, docker logs <name>, docker stats.


The postgres, the three valkey flavors, and mailpit containers are accessories — Kamal manages them but they are not rebuilt on an app deploy. Lifecycle:

kamal accessory boot postgres -d staging # create/start (first run + after host reboot)
kamal accessory boot mailpit -d staging # one-time after first deploy
kamal accessory reboot valkey_cache -d staging # restart one flavor (also valkey_sessions / valkey_queue)
kamal accessory logs postgres -d staging -f
kamal accessory details -d staging # all accessories

Boot order after a host reboot. Accessories and the app come up independently; the app may crash-loop briefly until Postgres is ready. restart: unless-stopped retries it, so it self-heals — but if the app is down after a reboot, check the accessories first (docker ps).
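The "check the accessories first" step can be sketched as a small filter over the docker ps output. The helper name is hypothetical, and the container names in the usage comment are assumptions expanded from the valkey naming on this page; pass whatever names docker ps actually shows on your box (including the postgres accessory).

```shell
# Hypothetical helper: reads running container names on stdin (one per line)
# and prints every expected name that is NOT running.
missing_containers() {
  running=$(cat)
  for name in "$@"; do
    # -F fixed string, -x whole-line match, -q quiet
    printf '%s\n' "$running" | grep -Fxq -- "$name" || printf '%s\n' "$name"
  done
}

# Usage on the box (names are assumptions, adjust to your `docker ps`):
#   docker ps --format '{{.Names}}' | missing_containers \
#     heatwave-staging-valkey-cache \
#     heatwave-staging-valkey-sessions \
#     heatwave-staging-valkey-queue
```

Empty output means every expected container is up; anything printed is a candidate for kamal accessory boot.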

Accessory config lives in config/deploy.staging.yml:

  • postgres: custom PG18 image, tuned cmd (shared_buffers=8GB, etc.), data on the pgdata host volume, published on 127.0.0.1:5432.
  • valkey ×3: image valkey/valkey:9.1, split into three flavors for parity with prod. heatwave-staging-valkey-cache (allkeys-lru), -sessions (noeviction), and -queue (noeviction + AOF) each use their own conf in config/valkey/. The app routes to them per logical DB via REDIS_CACHE_HOST / REDIS_SESSIONS_HOST / REDIS_QUEUE_HOST (config/initializers/100_redis_config.rb); there is no single REDIS_HOST. They are internal to the kamal network, not host-published.
  • mailpit: bound to the Tailscale IP 100.123.47.52:8025 (UI) + internal :1025 (SMTP).
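Because the app expects three separate host variables rather than one REDIS_HOST, a quick environment check can catch a half-migrated config. The function below is a hypothetical sketch; it only inspects the variable names listed above and is meant to be pasted into a shell inside the container (for example via kamal shell -d staging).

```shell
# Hypothetical check: all three per-flavor hosts must be set, and a lone
# REDIS_HOST is flagged because the app does not use a single host.
check_redis_hosts() {
  status=0
  for var in REDIS_CACHE_HOST REDIS_SESSIONS_HOST REDIS_QUEUE_HOST; do
    eval "value=\${$var:-}"   # indirect lookup of the variable named in $var
    if [ -z "$value" ]; then
      printf 'missing: %s\n' "$var"
      status=1
    fi
  done
  if [ -n "${REDIS_HOST:-}" ]; then
    printf 'unexpected: REDIS_HOST\n'
    status=1
  fi
  return "$status"
}
```

It returns non-zero and names the offending variable when something is off, and stays silent otherwise.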

Staging data is refreshed from the newest prod dump (Databasus → Cloudflare R2, the backup-of-record; BACKUP_SOURCE=wasabi is a legacy fallback) using the fast + deferred strategy. Never run a naive full pg_restore: the communications hash index alone took 82 min on 8.7M rows and blocked everything else.

# On the box (scp the script over), with the R2 bucket creds + the accessory PG password:
AWS_ACCESS_KEY_ID= AWS_SECRET_ACCESS_KEY= PGPASSWORD= ./db_restore_kamal.sh

What it does (script/db_restore_kamal.sh):

flowchart TB
a["download + decompress newest Databasus/R2 dump"] --> b
b["build TOCs: fast (skip large tables) + deferred (only large tables)"] --> c
c["schema-only restore → indexes build on EMPTY tables (instant)"] --> d
d["FAST data restore (core tables, -L fast_toc)"] --> e
e["refresh CRITICAL matviews (view_quote_bom_items) — verified, pre-swap"] --> f
f["swap heatwave_restore → heatwave + restart app ★ CORE DB LIVE"] --> g
g["DEFERRED: load large tables (-L deferred_toc)"] --> h
h["refresh analytics matviews (interruptible) + VACUUM ANALYZE"]

Key points:

  • SKIP_TABLES (deferred): visits, visit_events, communications, communication_recipients, communications_uploads, audit_trails, store_item_audits, data_imports, edi_communication_logs.
  • Critical vs. analytics matviews. view_quote_bom_items (the quote builder’s BOM source) is refreshed eagerly, pre-swap, and verified — an empty matview surfaces in the UI as “No matching controls”. The ~22 analytics matviews are refreshed after the swap (non-blocking) and self-heal via the hourly MatviewRefreshWorker cron if interrupted. (Mirrored in script/db_restore.sh for the dev/local restore.)
  • heatwave_versions stays schema-only on staging by default. Its partitioned tables need annual child partitions; without them the first write returns a 500 with “no partition of relation versions found”. db/versions_structure.sql now carries them (pg_party schema-dump fix, PR #1031).
  • Flags: NO_SWAP=1 (build but don’t go live), NO_DEFERRED=1 (core only), KEEP_DUMP=1, BACKUP_FILE=…, APP_IMAGE=….
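The fast/deferred TOC split can be sketched with pg_restore's list files (-l writes a table of contents, -L restores only the entries in a list file). This is an illustration of the idea, not the contents of script/db_restore_kamal.sh: the function name, file names, and the assumption that only TABLE DATA entries are split are all hypothetical.

```shell
# Hypothetical sketch: split a `pg_restore -l` listing into
#   fast     = every entry except the data of the deferred tables
#   deferred = only the data entries of the deferred tables
split_toc() {
  toc=$1 fast=$2 deferred=$3
  shift 3
  # Join the remaining arguments (table names) into an alternation: a|b|c
  tables=$(printf '%s|' "$@")
  tables=${tables%|}
  # Data entries look like: "<id>; 0 <oid> TABLE DATA <schema> <table> <owner>"
  # The trailing space keeps "visits" from matching "visit_events" and so on.
  pattern="TABLE DATA [^ ]+ (${tables}) "
  grep -E  "$pattern" "$toc" > "$deferred" || true
  grep -Ev "$pattern" "$toc" > "$fast" || true
}

# Usage (file and database names are assumptions):
#   pg_restore -l dump.pgdump > full.toc
#   split_toc full.toc fast.toc deferred.toc visits visit_events communications
#   pg_restore --data-only -L fast.toc     -d heatwave_restore dump.pgdump
#   pg_restore --data-only -L deferred.toc -d heatwave         dump.pgdump
```

Splitting on the listing rather than on the dump itself means the large tables are never touched until the core database is already live.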

flowchart LR
subgraph files[".kamal/secrets* (resolver-only, committed)"]
common["secrets-common<br/>RAILS_MASTER_KEY · Sidekiq Pro · GHCR"]
stg["secrets.staging<br/>PG password · staging env-key"]
prod["secrets<br/>PG password · production env-key"]
end
adapter["kamal secrets fetch/extract<br/>(1Password adapter)"]
op[("1Password<br/>warmlyyours.1password.com · vault IT")]
mk["config/master.key (local)"]
common & stg & prod --> adapter --> op
common -. "RAILS_MASTER_KEY / env-key = cat" .-> mk

Model: the .kamal/secrets* files contain no literal secrets — only resolver expressions (kamal secrets fetch --adapter 1password … + kamal secrets extract …, and cat config/master.key). They are therefore committed to git. A fresh machine resolves everything with a signed-in 1Password (warmlyyours account, IT vault) plus a local config/master.key.

Always validate before deploying:

kamal secrets print -d staging # and `kamal secrets print` for prod
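To make that validation less of an eyeball exercise, the output can be piped through a filter that flags secrets which resolved to nothing. This sketch assumes kamal secrets print emits KEY=value lines; the helper name is hypothetical.

```shell
# Hypothetical helper: reads KEY=value lines on stdin, prints the keys whose
# value is empty, and exits non-zero if it found any.
empty_secrets() {
  awk '
    index($0, "=") > 0 {
      key   = substr($0, 1, index($0, "=") - 1)
      value = substr($0, index($0, "=") + 1)   # keeps any "=" inside the value
      if (value == "") { print key; bad = 1 }
    }
    END { exit bad }
  '
}

# Usage:
#   kamal secrets print -d staging | empty_secrets && echo "all secrets resolved"
```

Piping straight into the filter also keeps the resolved values out of your scrollback, since only the names of empty keys are printed.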

Extract-key gotcha. The adapter strips a trailing /password from the map key — a …/password field extracts as IT/<Item> (no /password); other fields (/credential) keep the field name. This is why the secrets files read extract IT/Heatwave-Staging-Postgres (not …/password).
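The rule is easier to remember as a tiny function. The sketch below only mirrors the behaviour described above so you can predict the key to pass to extract; it is not the adapter's code, and the function name is hypothetical.

```shell
# Illustration of the rule described above (not the adapter's implementation):
# a trailing /password is dropped from the map key, any other field is kept.
extract_key() {
  case $1 in
    */password) printf '%s\n' "${1%/password}" ;;
    *)          printf '%s\n' "$1" ;;
  esac
}

# extract_key IT/Heatwave-Staging-Postgres/password   prints IT/Heatwave-Staging-Postgres
# extract_key IT/Sidekiq-Pro/credential               prints IT/Sidekiq-Pro/credential
```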

| Item | Used for |
| --- | --- |
| Sidekiq-Pro/credential | BUNDLE_GEMS__CONTRIBSYS__COM (build-time gem auth) |
| GitHub-ghcr-deploy/credential | KAMAL_REGISTRY_PASSWORD (GHCR push/pull) |
| Heatwave-Staging-Postgres/password | staging PG accessory + app DB password |
| Heatwave-Postgres/password | prod DB password (create before cutover) |
| AppSignal-account-push-key/credential | post-deploy sourcemap upload (account-wide key; the site key 401s) |
| Tailscale-Kamal/credential | cloud-init Tailscale auth key |
| Cloudflare-Account-API-Token/credential | tunnel/DNS/Access Terraform |
| Latitude-API/credential | bare-metal host provisioning |

Service-account token (headless / CI / flaky desktop app)


The 1Password desktop-app CLI integration occasionally fails with “couldn’t connect to the 1Password desktop app”. The robust path is a service-account token — no desktop app, no biometric:

# Save the token to a gitignored file scoped to deploys (NOT your interactive shell):
printf '%s' '<token>' > .kamal/.op-service-account-token # gitignored
# bin/deploy reads it automatically; CI can export OP_SERVICE_ACCOUNT_TOKEN instead.

bin/deploy’s op_session() short-circuits to the token when present. Rotate by overwriting the file. (bin/setup can populate .env.mcp.local from op://IT/1password-heatwave-ops.)
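The short-circuit can be pictured roughly as follows. This is a hypothetical sketch of the idea, not the actual op_session() from bin/deploy: the function name is invented here, while OP_SERVICE_ACCOUNT_TOKEN is the variable the 1Password CLI reads for service-account auth.

```shell
# Hypothetical sketch: if the token file exists and is non-empty, export it so
# the 1Password CLI authenticates without the desktop app. Otherwise return
# non-zero so the caller can fall back to the interactive sign-in.
load_op_token() {
  token_file=${1:-.kamal/.op-service-account-token}
  [ -s "$token_file" ] || return 1
  OP_SERVICE_ACCOUNT_TOKEN=$(cat "$token_file")
  export OP_SERVICE_ACCOUNT_TOKEN
}

# Usage:
#   load_op_token || echo "no token file, using the desktop-app integration"
```

Since the file holds a live credential, it is worth restricting it to your user (chmod 600) in addition to keeping it gitignored.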


Staging captures all outbound mail in mailpit instead of sending for real (reset tokens, the noisy scheduler/Sidekiq mail, campaigns):

  • UI: http://100.123.47.52:8025 (Tailscale only — never public).
  • App/sidekiq deliver to heatwave-mailpit:1025 over the kamal network (config/environments/staging.rb); config.x.mailpit_url drives the admin/campaign UI links.
  • One-time after the first deploy: kamal accessory boot mailpit -d staging.
  • To escape to real SendGrid for a test, the staging mailer honours SEND_FOR_REAL=y.

A single consolidated container (SIDEKIQ_CONSOLIDATED=1) runs every queue class via capsules (high/low/campaign at concurrency 9/10/10), the default set, and the scheduler — see config/initializers/sidekiq.rb + config/sidekiq.yml.

kamal app logs --roles=sidekiq -d staging -f
kamal app boot --roles=sidekiq -d staging # restart (un-quiet) the worker
  • Rolling deploys quiet it (TSTP) via .kamal/hooks/pre-deploy; super_fetch recovers in-flight jobs, so no job is lost on a swap.
  • To split queue classes back onto separate hosts later, restore one role per config (sidekiq_high.yml, …) and drop SIDEKIQ_CONSOLIDATED (else a queue would be served by both a capsule and a dedicated process).

Bulk operations (>1000 records / jobs) follow the count-first, two-confirmation protocol in CLAUDE.md — a careless mass-enqueue against the shared :default queue is hard to undo. Surface the count before enqueuing.
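As a rough picture of what count-first with two confirmations might look like in a wrapper script, here is a hypothetical gate. The authoritative protocol is the one in CLAUDE.md; the function name, prompts, and the exact confirmation wording below are assumptions.

```shell
# Hypothetical gate: small batches pass straight through; anything over 1000
# must be confirmed twice (re-type the count, then type "yes").
confirm_bulk() {
  count=$1
  [ "$count" -le 1000 ] && return 0
  printf 'About to touch %s records. Re-type the count to continue: ' "$count" >&2
  read -r first || return 1
  [ "$first" = "$count" ] || return 1
  printf 'Type "yes" to confirm: ' >&2
  read -r second || return 1
  [ "$second" = "yes" ]
}

# Usage:
#   confirm_bulk "$count" && kamal app exec -d staging --reuse 'bin/rails runner "…"'
```

Re-typing the count forces the operator to actually read the number before the jobs are enqueued.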


  • Web concurrency: PUMA_WORKERS / WEB_CONCURRENCY (4) + thread counts in config/deploy.yml env.clear. Tune for the shared box.
  • Add a host to a role: add its IP under servers.<role>.hosts and redeploy. kamal-proxy on each host load-balances independently behind the tunnel.
  • GC: RUBY_GC_* heap-tuning envs, carried over from the pre-Kamal Puma config.
  • Postgres: the staging accessory cmd in config/deploy.staging.yml is tuned down (shared_buffers=8GB, effective_cache_size=24GB) because the box (192 GB) is shared with the co-located prod stack; prod PG18 gets its own full-size tuning.

Provisioning a new box (Terraform / OpenTofu)


Two decoupled modules under infra/terraform/. Use OpenTofu (tofu).

flowchart TB
subgraph cfmod["infra/terraform/cloudflare/"]
t1["tunnel (remotely-managed)"] --> t2["DNS CNAMEs → *.cfargotunnel.com"]
t1 --> t3["Access app + policy (wy-employees)"]
t1 --> tok["output: tunnel_token (sensitive)"]
end
subgraph latmod["infra/terraform/latitude/"]
l1["SSH keys (files/authorized_keys)"] --> l2["latitudesh_server (RAID-1)"]
l3["cloud-init: deploy uid 1001 · Docker · Tailscale ·<br/>UFW + DOCKER-USER · cloudflared"] --> l2
l4["edge firewall: :22 ← 100.64.0.0/10"] --> l2
end
tok -->|"-var cloudflared_token=…"| l3
Terminal window
# 1. Cloudflare side (tunnel + DNS + Access) — CLOUDFLARE_API_TOKEN via direnv:
cd infra/terraform/cloudflare && tofu init && tofu apply
# 2. Latitude box, wired to that tunnel:
cd ../latitude
export LATITUDESH_AUTH_TOKEN="$(op read op://IT/Latitude-API/credential)"
tofu init && tofu apply \
-var project=<id> \
-var hostname=<name> \
-var tailscale_auth_key="$(op read op://IT/Tailscale-Kamal/credential)" \
-var cloudflared_token="$(tofu -chdir=../cloudflare output -raw tunnel_token)"

cloud-init yields a fully-wired box (Docker + Tailscale + cloudflared + UFW + DOCKER-USER). The deploy user is pinned to uid 1001 so it matches the container’s USER 1001 and can own Kamal’s asset_path bind-mount (otherwise the post-deploy DELETE_MAPS sourcemap cleanup fails with EACCES). Then bin/deploy -d <dest>; the server bootstrap is a near no-op.

The current staging box (dal-latitude-heatwave-01, f4-metal-medium / Ubuntu 26.04 / ZFS data plane) was provisioned via this module (infra/terraform/latitude, setup_zfs_data=true) — it’s the reproducible recipe in use. The earlier hand-built Ashburn box it replaced has been decommissioned. To adopt an already-running box into state instead of rebuilding, tofu import latitudesh_server.host <id> and install cloudflared by hand once (cloud-init won’t retroactively run).

See doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md for the current two-region HA topology (PG18 primary in Dallas + cross-DC streaming standby in Chicago, fronted by per-node pgbouncer + the HAProxy write-VIP heatwave-haproxy:6433, with pg_promote-driven failover). INFRASTRUCTURE_INVENTORY.md is the live host/port reference. (The older …202606041041_BARE_METAL_HA_STACK.md described a Chicago-primary / Ashburn-standby end-state and is superseded.)