Managing the Kamal Stack — Day-2 Operations
Everything after the deploy: logs, console, accessories, database restores,
secrets, mailpit, scaling, and provisioning a new box. For the deploy itself see
DEPLOYING.md; for failures see TROUBLESHOOTING.md.
All kamal commands below assume the mise exec -- bundle exec kamal prefix
(abbreviated kamal here). Add -d staging for the staging destination; omit it
for production. SSH/psql/UI access to the box is over Tailscale only.
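
The prefix convention above can be captured in a small shell function so the abbreviated form works as typed. This is a minimal sketch, not something the repo is known to ship: the `KAMAL_DRY_RUN` variable is a hypothetical addition that prints the expanded command instead of running it.

```shell
# Sketch: expand the abbreviated `kamal ...` into the full prefixed command.
# KAMAL_DRY_RUN is a hypothetical switch (not part of Kamal or this repo):
# when set to 1, the expanded command is printed rather than executed.
kamal() {
  if [ "${KAMAL_DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "mise exec -- bundle exec kamal $*"
  else
    mise exec -- bundle exec kamal "$@"
  fi
}

# Staging: pass -d staging. Production: omit -d entirely.
KAMAL_DRY_RUN=1 kamal app logs -d staging -f
KAMAL_DRY_RUN=1 kamal app versions
```

The dry-run mode is only there to make the expansion visible; in day-to-day use the function simply forwards its arguments.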
Everyday commands
| Task | Command |
|---|---|
| Tail app logs | kamal app logs -d staging -f |
| Logs for one role | kamal app logs -d staging --roles=sidekiq -f |
| Rails console | kamal console -d staging (alias → app exec --interactive --reuse) |
| Shell in the container | kamal shell -d staging |
| DB console | kamal dbc -d staging |
| Deployed versions | kamal app versions -d staging |
| Container status | kamal app details -d staging |
| Restart a role | kamal app boot --roles=web -d staging |
| Run a one-off task | kamal app exec -d staging --reuse 'bin/rails runner "…"' |
| Print resolved secrets | kamal secrets print -d staging |
Direct on the box (over Tailscale): ssh deploy@100.123.47.52, then
docker ps, docker logs <name>, docker stats.
Accessories (staging)
The postgres, the three valkey flavors, and mailpit containers are
accessories — Kamal manages them but they are not rebuilt on an app
deploy. Lifecycle:
kamal accessory boot postgres -d staging # create/start (first run + after host reboot)
kamal accessory boot mailpit -d staging # one-time after first deploy
kamal accessory reboot valkey_cache -d staging # restart one flavor (also valkey_sessions / valkey_queue)
kamal accessory logs postgres -d staging -f
kamal accessory details -d staging # all accessories
Boot order after a host reboot. Accessories and the app come up
independently; the app may crash-loop briefly until Postgres is ready.
The restart: unless-stopped policy retries it, so it self-heals; if the app is still down
after a reboot, check the accessories first (docker ps).
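
When checking the box by hand after a reboot, a small retry loop can confirm that Postgres is accepting connections before you look at the app containers. The sketch below is illustrative only: `retry_until` is a hypothetical helper, and the commented probe assumes `pg_isready` (from the Postgres client tools) is installed on the host.

```shell
# Sketch: retry a readiness probe until it succeeds or attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  attempts=$1; delay=$2; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      echo "ready after $n attempt(s)"
      return 0
    fi
    n=$((n + 1))
    # Only sleep when another attempt is still coming.
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "still not ready after $attempts attempt(s)" >&2
  return 1
}

# On the box, the staging Postgres accessory is published on 127.0.0.1:5432,
# so a probe could look like this:
#   retry_until 30 2 pg_isready -h 127.0.0.1 -p 5432 && docker ps
```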
Accessory config lives in config/deploy.staging.yml:
- postgres: custom PG18 image, tuned cmd (shared_buffers=8GB, etc.), data on
  the pgdata host volume, published on 127.0.0.1:5432.
- valkey ×3: valkey/valkey:9.1 in a 3-flavor split (parity with prod):
  heatwave-staging-valkey-cache (allkeys-lru), -sessions (noeviction),
  -queue (noeviction + AOF), each on its own conf in config/valkey/. The app
  routes to them per logical DB via REDIS_CACHE_HOST / REDIS_SESSIONS_HOST /
  REDIS_QUEUE_HOST (config/initializers/100_redis_config.rb); there is no
  single REDIS_HOST. Internal to the kamal network, not host-published.
- mailpit: bound to the Tailscale IP 100.123.47.52:8025 (UI) + internal
  :1025 (SMTP).
Database restore
Staging data is refreshed from the newest prod dump (Databasus → Cloudflare R2,
the backup-of-record; BACKUP_SOURCE=wasabi is a legacy fallback) using the
fast + deferred strategy (never a naive full pg_restore — the
communications hash index alone took 82 min on 8.7M rows and blocked everything).
# On the box (scp the script over), with the R2 bucket creds + the accessory PG password:
AWS_ACCESS_KEY_ID=… AWS_SECRET_ACCESS_KEY=… PGPASSWORD=… ./db_restore_kamal.sh
What it does (script/db_restore_kamal.sh):
flowchart TB
a["download + decompress newest Databasus/R2 dump"] --> b
b["build TOCs: fast (skip large tables) + deferred (only large tables)"] --> c
c["schema-only restore → indexes build on EMPTY tables (instant)"] --> d
d["FAST data restore (core tables, -L fast_toc)"] --> e
e["refresh CRITICAL matviews (view_quote_bom_items) — verified, pre-swap"] --> f
f["swap heatwave_restore → heatwave + restart app ★ CORE DB LIVE"] --> g
g["DEFERRED: load large tables (-L deferred_toc)"] --> h
h["refresh analytics matviews (interruptible) + VACUUM ANALYZE"]
Key points:
- SKIP_TABLES (deferred): visits, visit_events, communications,
  communication_recipients, communications_uploads, audit_trails,
  store_item_audits, data_imports, edi_communication_logs.
- Critical vs. analytics matviews. view_quote_bom_items (the quote builder's
  BOM source) is refreshed eagerly, pre-swap, and verified; an empty matview
  surfaces in the UI as "No matching controls". The ~22 analytics matviews are
  refreshed after the swap (non-blocking) and self-heal via the hourly
  MatviewRefreshWorker cron if interrupted. (Mirrored in script/db_restore.sh
  for the dev/local restore.)
- heatwave_versions stays schema-only on staging by default. Note that its
  partitioned tables need annual child partitions created, or the first write 500s
  with "no partition of relation versions found"; db/versions_structure.sql now
  carries them (pg_party schema-dump fix, PR #1031).
- Flags: NO_SWAP=1 (build but don't go live), NO_DEFERRED=1 (core only),
  KEEP_DUMP=1, BACKUP_FILE=…, APP_IMAGE=….
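
The fast + deferred strategy hinges on two TOC lists fed to pg_restore -L. The sketch below shows one way such a split could be derived from a `pg_restore -l` listing; it is not the contents of script/db_restore_kamal.sh. The `split_toc` helper, the file names, and the `customers` table in the sample listing are all hypothetical, and it assumes table names contain no spaces.

```shell
# Sketch: split a `pg_restore -l` listing into "fast" and "deferred" TOC files.
# Usage: split_toc <toc-file> <comma-separated-big-tables> <fast-out> <deferred-out>
# Only TABLE DATA entries are routed; every other entry type is ignored here.
split_toc() {
  toc=$1; big_tables=$2; fast_out=$3; deferred_out=$4
  : > "$fast_out"; : > "$deferred_out"        # make sure both files exist
  awk -v big="$big_tables" -v fast="$fast_out" -v deferred="$deferred_out" '
    BEGIN { n = split(big, names, ","); for (i = 1; i <= n; i++) is_big[names[i]] = 1 }
    /^;/ { next }                               # skip the listing header comments
    $4 == "TABLE" && $5 == "DATA" {             # fields: id; oid oid TABLE DATA schema name owner
      if ($7 in is_big) print > deferred; else print > fast
    }
  ' "$toc"
}

# Demo on a synthetic listing (a real one would come from: pg_restore -l <dump>).
tmp=$(mktemp -d)
cat > "$tmp/dump.toc" <<'EOF'
; Selected TOC Entries:
4153; 0 16548 TABLE DATA public customers app
4154; 0 16560 TABLE DATA public visits app
4155; 0 16571 TABLE DATA public communications app
EOF
split_toc "$tmp/dump.toc" "visits,communications" "$tmp/fast.toc" "$tmp/deferred.toc"
echo "fast:";     cat "$tmp/fast.toc"
echo "deferred:"; cat "$tmp/deferred.toc"
rm -rf "$tmp"
```

Each resulting file can then be passed to pg_restore with -L, so the core tables load before the swap and the large tables load afterwards.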
Secrets
flowchart LR
subgraph files[".kamal/secrets* (resolver-only, committed)"]
common["secrets-common<br/>RAILS_MASTER_KEY · Sidekiq Pro · GHCR"]
stg["secrets.staging<br/>PG password · staging env-key"]
prod["secrets<br/>PG password · production env-key"]
end
adapter["kamal secrets fetch/extract<br/>(1Password adapter)"]
op[("1Password<br/>warmlyyours.1password.com · vault IT")]
mk["config/master.key (local)"]
common & stg & prod --> adapter --> op
common -. "RAILS_MASTER_KEY / env-key = cat" .-> mk
Model: the .kamal/secrets* files contain no literal secrets — only
resolver expressions (kamal secrets fetch --adapter 1password … + kamal secrets extract …, and cat config/master.key). They are therefore committed to git.
A fresh machine resolves everything with a signed-in 1Password (warmlyyours
account, IT vault) plus a local config/master.key.
Always validate before deploying:
kamal secrets print -d staging # and `kamal secrets print` for prod
Extract-key gotcha. The adapter strips a trailing /password from the
map key: a …/password field extracts as IT/<Item> (no /password); other
fields (/credential) keep the field name. This is why the secrets files read
extract IT/Heatwave-Staging-Postgres (not …/password).
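
The key rule can be written down as a tiny function. This is only a model of the behaviour described above, not the adapter's code; `extract_key` is a hypothetical helper name.

```shell
# Sketch: compute the map key to use with `kamal secrets extract`,
# following the rule described above.
# Usage: extract_key <vault> <item> <field>
extract_key() {
  vault=$1; item=$2; field=$3
  if [ "$field" = "password" ]; then
    printf '%s/%s\n' "$vault" "$item"             # trailing /password is stripped
  else
    printf '%s/%s/%s\n' "$vault" "$item" "$field" # other fields keep their name
  fi
}

extract_key IT Heatwave-Staging-Postgres password
extract_key IT Sidekiq-Pro credential
```

If an extract comes back empty, comparing the key in the secrets file against this rule is a quick first check.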
1Password items (vault IT)
| Item | Used for |
|---|---|
| Sidekiq-Pro/credential | BUNDLE_GEMS__CONTRIBSYS__COM (build-time gem auth) |
| GitHub-ghcr-deploy/credential | KAMAL_REGISTRY_PASSWORD (GHCR push/pull) |
| Heatwave-Staging-Postgres/password | staging PG accessory + app DB password |
| Heatwave-Postgres/password | prod DB password; create before cutover |
| AppSignal-account-push-key/credential | post-deploy sourcemap upload (account-wide key; the site key 401s) |
| Tailscale-Kamal/credential | cloud-init Tailscale auth key |
| Cloudflare-Account-API-Token/credential | tunnel/DNS/Access Terraform |
| Latitude-API/credential | bare-metal host provisioning |
Service-account token (headless / CI / flaky desktop app)
The 1Password desktop-app CLI integration occasionally fails with "couldn't
connect to the 1Password desktop app". The robust path is a service-account
token — no desktop app, no biometric:
# Save the token to a gitignored file scoped to deploys (NOT your interactive shell):
printf '%s' '<token>' > .kamal/.op-service-account-token # gitignored
# bin/deploy reads it automatically; CI can export OP_SERVICE_ACCOUNT_TOKEN instead.
bin/deploy's op_session() short-circuits to the token when present. Rotate by
overwriting the file. (bin/setup can populate .env.mcp.local from
op://IT/1password-heatwave-ops.)
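
The short-circuit can be pictured roughly as follows. This is a sketch of the idea only, not the actual op_session() from bin/deploy; `load_op_token` is a hypothetical name. It relies on the 1Password CLI reading OP_SERVICE_ACCOUNT_TOKEN from the environment.

```shell
# Sketch: prefer a service-account token when one is available.
# Usage: load_op_token [token-file]
# Returns 0 when OP_SERVICE_ACCOUNT_TOKEN is set afterwards, 1 otherwise.
load_op_token() {
  token_file=${1:-.kamal/.op-service-account-token}
  if [ -n "${OP_SERVICE_ACCOUNT_TOKEN:-}" ]; then
    return 0                                   # e.g. CI already exported it
  fi
  if [ -s "$token_file" ]; then
    OP_SERVICE_ACCOUNT_TOKEN=$(cat "$token_file")
    export OP_SERVICE_ACCOUNT_TOKEN
    return 0
  fi
  return 1                                     # caller falls back to the desktop-app sign-in
}

if load_op_token; then
  echo "using service-account token"
else
  echo "no token found; falling back to interactive 1Password sign-in"
fi
```

Because the file holds a credential, restricting it with chmod 600 is a sensible extra step.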
Email (mailpit)
Staging captures all outbound mail in mailpit instead of sending for real
(reset tokens, the noisy scheduler/Sidekiq mail, campaigns):
- UI: http://100.123.47.52:8025 (Tailscale only, never public).
- App/sidekiq deliver to heatwave-mailpit:1025 over the kamal network
  (config/environments/staging.rb); config.x.mailpit_url drives the admin/campaign
  UI links.
- One-time after the first deploy: kamal accessory boot mailpit -d staging.
- To escape to real SendGrid for a test, the staging mailer honours
  SEND_FOR_REAL=y.
Sidekiq
A single consolidated container (SIDEKIQ_CONSOLIDATED=1) runs every queue
class via capsules (high/low/campaign at concurrency 9/10/10), the default set,
and the scheduler — see config/initializers/sidekiq.rb + config/sidekiq.yml.
kamal app logs --roles=sidekiq -d staging -f
kamal app boot --roles=sidekiq -d staging # restart (un-quiet) the worker
- Rolling deploys quiet it (TSTP) via .kamal/hooks/pre-deploy; super_fetch
  recovers in-flight jobs, so no job is lost on a swap.
- To split queue classes back onto separate hosts later, restore one role per
  config (sidekiq_high.yml, …) and drop SIDEKIQ_CONSOLIDATED (else a queue
  would be served by both a capsule and a dedicated process).

Bulk operations (>1000 records / jobs) follow the count-first, two-confirmation
protocol in CLAUDE.md; a careless mass-enqueue against the shared :default
queue is hard to undo. Surface the count before enqueuing.
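
As a rough illustration of a count-first gate, the sketch below prints the count and then asks for two confirmations before allowing a bulk operation. The exact protocol is defined in CLAUDE.md and is not reproduced here: the form of the two prompts (typing the count back, then "yes") and the names `confirm_bulk` and `BULK_THRESHOLD` are assumptions made for this example.

```shell
# Sketch: count-first gate with two confirmations for bulk operations.
# Usage: confirm_bulk <count>
# Reads confirmations from stdin; returns 0 when it is OK to proceed.
BULK_THRESHOLD=1000   # gate applies to operations above this many records/jobs

confirm_bulk() {
  count=$1
  echo "This operation affects $count records/jobs."
  if [ "$count" -le "$BULK_THRESHOLD" ]; then
    return 0                                    # at or below the threshold: no gate
  fi
  printf 'Type the count to confirm: '
  read -r typed || return 1
  [ "$typed" = "$count" ] || return 1           # first confirmation (assumed form)
  printf 'Type "yes" to proceed: '
  read -r answer || return 1
  [ "$answer" = "yes" ]                         # second confirmation (assumed form)
}

# Example (interactive):
#   confirm_bulk 8700 && echo "proceeding with enqueue"
```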
Scaling & tuning
- Web concurrency: PUMA_WORKERS/WEB_CONCURRENCY (4) + thread counts in
  config/deploy.yml env.clear. Tune for the shared box.
- Add a host to a role: add its IP under servers.<role>.hosts and redeploy.
  kamal-proxy on each host load-balances independently behind the tunnel.
- GC: RUBY_GC_* heap-tuning envs, carried over from the pre-Kamal Puma config.
- Postgres: the staging accessory cmd in config/deploy.staging.yml is
  tuned down (shared_buffers=8GB, effective_cache_size=24GB) because the box
  (192 GB) is shared with the co-located prod stack; prod PG18 gets its own
  full-size tuning.
Provisioning a new box (Terraform / OpenTofu)
Two decoupled modules under infra/terraform/. Use OpenTofu (tofu).
flowchart TB
subgraph cfmod["infra/terraform/cloudflare/"]
t1["tunnel (remotely-managed)"] --> t2["DNS CNAMEs → *.cfargotunnel.com"]
t1 --> t3["Access app + policy (wy-employees)"]
t1 --> tok["output: tunnel_token (sensitive)"]
end
subgraph latmod["infra/terraform/latitude/"]
l1["SSH keys (files/authorized_keys)"] --> l2["latitudesh_server (RAID-1)"]
l3["cloud-init: deploy uid 1001 · Docker · Tailscale ·<br/>UFW + DOCKER-USER · cloudflared"] --> l2
l4["edge firewall: :22 ← 100.64.0.0/10"] --> l2
end
tok -->|"-var cloudflared_token=…"| l3
# 1. Cloudflare side (tunnel + DNS + Access) — CLOUDFLARE_API_TOKEN via direnv:
cd infra/terraform/cloudflare && tofu init && tofu apply
# 2. Latitude box, wired to that tunnel:
cd ../latitude
export LATITUDESH_AUTH_TOKEN="$(op read op://IT/Latitude-API/credential)"
tofu init && tofu apply \
-var project=<id> \
-var hostname=<name> \
-var tailscale_auth_key="$(op read op://IT/Tailscale-Kamal/credential)" \
-var cloudflared_token="$(tofu -chdir=../cloudflare output -raw tunnel_token)"
cloud-init yields a fully-wired box (Docker + Tailscale + cloudflared + UFW +
DOCKER-USER). The deploy user is pinned to uid 1001 so it matches the
container's USER 1001 and can own Kamal's asset_path bind-mount (otherwise the
post-deploy DELETE_MAPS sourcemap cleanup fails with EACCES). Then
bin/deploy -d <dest>; the server bootstrap is a near no-op.
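
Before the first deploy to a new or adopted box, a quick preflight can confirm the uid pinning. This is a minimal sketch to run on the host; `uid_matches` is a hypothetical helper, not part of the Terraform module or bin/deploy.

```shell
# Sketch: preflight check that a host user has the uid the container expects.
# Usage: uid_matches <user> <expected-uid>
uid_matches() {
  actual=$(id -u "$1" 2>/dev/null) || return 1   # unknown user
  [ "$actual" = "$2" ]
}

# On the box: the deploy user should be uid 1001 to match the container's USER 1001.
if uid_matches deploy 1001; then
  echo "deploy uid OK"
else
  echo "deploy uid is not 1001 (or the user does not exist)"
fi
```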
The current staging box (dal-latitude-heatwave-01, f4-metal-medium / Ubuntu 26.04 /
ZFS data plane) was provisioned via this module (infra/terraform/latitude,
setup_zfs_data=true); it's the reproducible recipe in use. The earlier hand-built
Ashburn box it replaced has been decommissioned. To adopt an already-running box into
state instead of rebuilding, run tofu import latitudesh_server.host <id> and install
cloudflared by hand once (cloud-init won't retroactively run).
See doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md for the current two-region
HA topology (PG18 primary in Dallas + cross-DC streaming standby in Chicago,
fronted by per-node pgbouncer + the HAProxy write-VIP heatwave-haproxy:6433, with
pg_promote-driven failover). INFRASTRUCTURE_INVENTORY.md is the live host/port
reference. (The older …202606041041_BARE_METAL_HA_STACK.md described a
Chicago-primary / Ashburn-standby end-state and is superseded.)