Managing the Kamal Stack — Day-2 Operations

Everything after the deploy: logs, console, accessories, database restores,
secrets, mailpit, scaling, and provisioning a new box. For the deploy itself see
DEPLOYING.md; for failures see TROUBLESHOOTING.md.

All kamal commands below assume the mise exec -- bundle exec kamal prefix
(abbreviated kamal here). Add -d staging for the staging destination; omit it
for production. SSH/psql/UI access to the box is over Tailscale only.
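
To avoid retyping the prefix, a small shell function can wrap it. This is only a sketch: the function name k and the KAMAL_DRY_RUN switch are hypothetical and not part of the repo.

```shell
# Hypothetical wrapper around the documented prefix.
# `k` and KAMAL_DRY_RUN are illustrative names, not something the repo ships.
k() {
  if [ -n "${KAMAL_DRY_RUN:-}" ]; then
    # Dry run: print the full command instead of executing it.
    printf '%s\n' "mise exec -- bundle exec kamal $*"
  else
    mise exec -- bundle exec kamal "$@"
  fi
}
```

With this in place, k app logs -d staging -f runs the same command as the first row of the table below.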


Everyday commands

Task                    Command
Tail app logs           kamal app logs -d staging -f
Logs for one role       kamal app logs -d staging --roles=sidekiq -f
Rails console           kamal console -d staging (alias → app exec --interactive --reuse)
Shell in the container  kamal shell -d staging
DB console              kamal dbc -d staging
Deployed versions       kamal app versions -d staging
Container status        kamal app details -d staging
Restart a role          kamal app boot --roles=web -d staging
Run a one-off task      kamal app exec -d staging --reuse 'bin/rails runner "…"'
Print resolved secrets  kamal secrets print -d staging

Direct on the box (over Tailscale): ssh deploy@100.123.47.52, then
docker ps, docker logs <name>, docker stats.


Accessories (staging)

The postgres, the three valkey flavors, and mailpit containers are
accessories — Kamal manages them but they are not rebuilt on an app
deploy. Lifecycle:

kamal accessory boot   postgres       -d staging   # create/start (first run + after host reboot)
kamal accessory boot   mailpit        -d staging   # one-time after first deploy
kamal accessory reboot valkey_cache   -d staging   # restart one flavor (also valkey_sessions / valkey_queue)
kamal accessory logs   postgres       -d staging -f
kamal accessory details              -d staging   # all accessories

Boot order after a host reboot. Accessories and the app come up
independently; the app may crash-loop briefly until Postgres is ready.
restart: unless-stopped retries it, so it self-heals — but if the app is down
after a reboot, check the accessories first (docker ps).
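
If the app has not recovered on its own, one option is to wait for Postgres explicitly before restarting the app. The helper below is a generic retry loop, offered as a sketch; the name retry_until is hypothetical, and the usage line assumes pg_isready is installed on the host.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_until() {
  ru_attempts=$1
  ru_delay=$2
  shift 2
  ru_i=1
  while [ "$ru_i" -le "$ru_attempts" ]; do
    if "$@"; then
      return 0
    fi
    ru_i=$((ru_i + 1))
    # Sleep only when another attempt follows.
    if [ "$ru_i" -le "$ru_attempts" ]; then
      sleep "$ru_delay"
    fi
  done
  return 1
}

# Usage on the box (assumes the pg_isready client is present on the host;
# Postgres is published on 127.0.0.1:5432):
#   retry_until 30 2 pg_isready -h 127.0.0.1 -p 5432
```

Once it returns 0, kamal app boot -d staging restarts the app against a ready database.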

Accessory config lives in config/deploy.staging.yml:

  • postgres — custom PG18 image, tuned cmd (shared_buffers=8GB, etc.), data on
    the pgdata host volume, published 127.0.0.1:5432.
  • valkey ×3 — valkey/valkey:9.1 in a 3-flavor split (parity with prod):
    heatwave-staging-valkey-cache (allkeys-lru), -sessions (noeviction),
    -queue (noeviction + AOF), each on its own conf in config/valkey/. The app
    routes to them per logical DB via REDIS_CACHE_HOST / REDIS_SESSIONS_HOST /
    REDIS_QUEUE_HOST (config/initializers/100_redis_config.rb) — there is no
    single REDIS_HOST. Internal to the kamal network, not host-published.
  • mailpit — bound to the Tailscale IP 100.123.47.52:8025 (UI) + internal
    :1025 (SMTP).
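
Because there is no single REDIS_HOST, a missing per-flavor variable is easy to overlook. The check below is a sketch that could be pasted into a kamal shell session; the function name check_redis_env is hypothetical.

```shell
# Reports which of the three per-flavor host variables are unset or empty.
# Returns 0 when all three are present, 1 otherwise.
check_redis_env() {
  cre_missing=0
  for cre_name in REDIS_CACHE_HOST REDIS_SESSIONS_HOST REDIS_QUEUE_HOST; do
    # Indirect lookup of the variable named in $cre_name (fixed names only).
    eval "cre_value=\${$cre_name:-}"
    if [ -z "$cre_value" ]; then
      echo "missing: $cre_name"
      cre_missing=1
    fi
  done
  return "$cre_missing"
}
```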

Database restore

Staging data is refreshed from the newest prod dump (Databasus → Cloudflare R2,
the backup-of-record; BACKUP_SOURCE=wasabi is a legacy fallback) using the
fast + deferred strategy (never a naive full pg_restore — the
communications hash index alone took 82 min on 8.7M rows and blocked everything).

# On the box (scp the script over), with the R2 bucket creds + the accessory PG password:
AWS_ACCESS_KEY_ID=… AWS_SECRET_ACCESS_KEY=… PGPASSWORD=… ./db_restore_kamal.sh

What it does (script/db_restore_kamal.sh):

flowchart TB
    a["download + decompress newest Databasus/R2 dump"] --> b
    b["build TOCs: fast (skip large tables) + deferred (only large tables)"] --> c
    c["schema-only restore → indexes build on EMPTY tables (instant)"] --> d
    d["FAST data restore (core tables, -L fast_toc)"] --> e
    e["refresh CRITICAL matviews (view_quote_bom_items) — verified, pre-swap"] --> f
    f["swap heatwave_restore → heatwave + restart app  ★ CORE DB LIVE"] --> g
    g["DEFERRED: load large tables (-L deferred_toc)"] --> h
    h["refresh analytics matviews (interruptible) + VACUUM ANALYZE"]

Key points:

  • SKIP_TABLES (deferred): visits, visit_events, communications, communication_recipients, communications_uploads, audit_trails, store_item_audits, data_imports, edi_communication_logs.
  • Critical vs. analytics matviews. view_quote_bom_items (the quote builder's
    BOM source) is refreshed eagerly, pre-swap, and verified — an empty matview
    surfaces in the UI as "No matching controls". The ~22 analytics matviews are
    refreshed after the swap (non-blocking) and self-heal via the hourly
    MatviewRefreshWorker cron if interrupted. (Mirrored in script/db_restore.sh
    for the dev/local restore.)
  • heatwave_versions stays schema-only on staging by default. Its partitioned
    tables need their annual child partitions created ahead of time; otherwise the
    first write fails with a 500 and "no partition of relation versions found".
    db/versions_structure.sql now carries them (pg_party schema-dump fix, PR #1031).
  • Flags: NO_SWAP=1 (build but don't go live), NO_DEFERRED=1 (core only),
    KEEP_DUMP=1, BACKUP_FILE=…, APP_IMAGE=….
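
The fast/deferred split can be pictured as filtering a pg_restore -l listing into two list files for pg_restore -L. The following is a minimal sketch of that idea, not the code in script/db_restore_kamal.sh; the function name split_toc and the file names are hypothetical, and it only handles TABLE DATA entries (the schema is assumed to be restored separately).

```shell
# Tables whose data is loaded after the swap (the SKIP_TABLES list above).
SKIP_TABLES="visits visit_events communications communication_recipients communications_uploads audit_trails store_item_audits data_imports edi_communication_logs"

# split_toc TOC_FILE FAST_OUT DEFERRED_OUT
# TOC_FILE is the output of `pg_restore -l <dump>`; TABLE DATA entries look like
#   4021; 0 16390 TABLE DATA public visits owner
split_toc() {
  st_pattern=$(printf '%s' "$SKIP_TABLES" | tr ' ' '|')
  st_data='^[0-9]+; [0-9]+ [0-9]+ TABLE DATA '
  # Surrounding spaces force a whole-name match (visits does not match visitors).
  st_skip="${st_data}[^ ]+ ($st_pattern) "
  grep -E "$st_data" "$1" | grep -Ev "$st_skip" > "$2" || true
  grep -E "$st_skip" "$1" > "$3" || true
}

# Usage (file names are placeholders):
#   pg_restore -l dump.pgdump > toc.list
#   split_toc toc.list fast.list deferred.list
#   pg_restore --data-only -L fast.list -d heatwave_restore dump.pgdump
```

Splitting on TABLE DATA only keeps index creation out of the data phases, which matches the flow above where indexes are built on empty tables first.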

Secrets

flowchart LR
    subgraph files[".kamal/secrets* (resolver-only, committed)"]
        common["secrets-common<br/>RAILS_MASTER_KEY · Sidekiq Pro · GHCR"]
        stg["secrets.staging<br/>PG password · staging env-key"]
        prod["secrets<br/>PG password · production env-key"]
    end
    adapter["kamal secrets fetch/extract<br/>(1Password adapter)"]
    op[("1Password<br/>warmlyyours.1password.com · vault IT")]
    mk["config/master.key (local)"]

    common & stg & prod --> adapter --> op
    common -. "RAILS_MASTER_KEY / env-key = cat" .-> mk

Model: the .kamal/secrets* files contain no literal secrets — only
resolver expressions (kamal secrets fetch --adapter 1password … + kamal secrets extract …, and cat config/master.key). They are therefore committed to git.
A fresh machine resolves everything with a signed-in 1Password (warmlyyours
account, IT vault) plus a local config/master.key.

Always validate before deploying:

kamal secrets print -d staging      # and `kamal secrets print` for prod
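
Since the files are committed, it can also help to check that nobody pasted a literal value into them. The lint below is a sketch under the assumption that every legitimate value starts with "$" (a command substitution or variable reference); find_literal_secrets is a hypothetical name.

```shell
# Prints "file:line: KEY" for every KEY=value line whose value does not
# start with "$". Only the key is printed, never the value.
find_literal_secrets() {
  awk '
    /^[A-Za-z_][A-Za-z0-9_]*=/ {
      eq = index($0, "=")
      key = substr($0, 1, eq - 1)
      value = substr($0, eq + 1)
      # Empty values are flagged too, since they are not resolver expressions.
      if (value !~ /^\$/) print FILENAME ":" FNR ": " key
    }
  ' "$@"
}

# Usage: find_literal_secrets .kamal/secrets*
```

Empty output means every assignment looks like a resolver expression.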

Extract-key gotcha. The adapter strips a trailing /password from the
map key — a …/password field extracts as IT/<Item> (no /password); other
fields (/credential) keep the field name. This is why the secrets files read
extract IT/Heatwave-Staging-Postgres (not …/password).
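
The rule can be expressed as a one-line helper for working out which key to pass to extract. This only models the behaviour described above; extract_key is a hypothetical name, not an adapter function.

```shell
# Given a 1Password reference of the form Vault/Item/field, print the map key
# to use with `kamal secrets extract`, per the rule described above.
extract_key() {
  # ${1%/password} removes one trailing "/password" if present.
  printf '%s\n' "${1%/password}"
}

# extract_key IT/Heatwave-Staging-Postgres/password  prints IT/Heatwave-Staging-Postgres
# extract_key IT/Sidekiq-Pro/credential              prints IT/Sidekiq-Pro/credential
```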

1Password items (vault IT)

Item                                     Used for
Sidekiq-Pro/credential                   BUNDLE_GEMS__CONTRIBSYS__COM (build-time gem auth)
GitHub-ghcr-deploy/credential            KAMAL_REGISTRY_PASSWORD (GHCR push/pull)
Heatwave-Staging-Postgres/password       staging PG accessory + app DB password
Heatwave-Postgres/password               prod DB password — create before cutover
AppSignal-account-push-key/credential    post-deploy sourcemap upload (account-wide key; the site key 401s)
Tailscale-Kamal/credential               cloud-init Tailscale auth key
Cloudflare-Account-API-Token/credential  tunnel/DNS/Access Terraform
Latitude-API/credential                  bare-metal host provisioning

Service-account token (headless / CI / flaky desktop app)

The 1Password desktop-app CLI integration occasionally fails with "couldn't
connect to the 1Password desktop app". The robust path is a service-account
token — no desktop app, no biometric:

# Save the token to a gitignored file scoped to deploys (NOT your interactive shell):
printf '%s' '<token>' > .kamal/.op-service-account-token   # gitignored
# bin/deploy reads it automatically; CI can export OP_SERVICE_ACCOUNT_TOKEN instead.

bin/deploy's op_session() short-circuits to the token when present. Rotate by
overwriting the file. (bin/setup can populate .env.mcp.local from
op://IT/1password-heatwave-ops.)
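
For orientation, the short-circuit could look roughly like the sketch below. This is not the actual op_session() from bin/deploy, whose body is not shown here; load_op_token is a hypothetical name, and only the file path and the OP_SERVICE_ACCOUNT_TOKEN variable come from the text above.

```shell
# load_op_token [TOKEN_FILE]
# Exports OP_SERVICE_ACCOUNT_TOKEN from the token file unless it is already set.
# Returns 1 when no token is available, so the caller can fall back to the
# desktop-app sign-in.
load_op_token() {
  lot_file=${1:-.kamal/.op-service-account-token}
  # A token already exported by the environment (e.g. CI) takes precedence.
  if [ -n "${OP_SERVICE_ACCOUNT_TOKEN:-}" ]; then
    return 0
  fi
  # -s: the file exists and is not empty.
  [ -s "$lot_file" ] || return 1
  OP_SERVICE_ACCOUNT_TOKEN=$(cat "$lot_file")
  export OP_SERVICE_ACCOUNT_TOKEN
}
```

Restricting the file with chmod 600 is a sensible extra precaution, since it holds a live credential.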


Email (mailpit)

Staging captures all outbound mail in mailpit instead of sending for real
(reset tokens, the noisy scheduler/Sidekiq mail, campaigns):

  • UI: http://100.123.47.52:8025 (Tailscale only — never public).
  • App/sidekiq deliver to heatwave-mailpit:1025 over the kamal network
    (config/environments/staging.rb); config.x.mailpit_url drives the admin/campaign
    UI links.
  • One-time after the first deploy: kamal accessory boot mailpit -d staging.
  • To escape to real SendGrid for a test, the staging mailer honours SEND_FOR_REAL=y.

Sidekiq

A single consolidated container (SIDEKIQ_CONSOLIDATED=1) runs every queue
class via capsules (high/low/campaign at concurrency 9/10/10), the default set,
and the scheduler — see config/initializers/sidekiq.rb + config/sidekiq.yml.

kamal app logs --roles=sidekiq -d staging -f
kamal app boot --roles=sidekiq -d staging      # restart (un-quiet) the worker

  • Rolling deploys quiet it (TSTP) via .kamal/hooks/pre-deploy; super_fetch
    recovers in-flight jobs, so no job is lost on a swap.
  • To split queue classes back onto separate hosts later, restore one role per
    config (sidekiq_high.yml, …) and drop SIDEKIQ_CONSOLIDATED (else a queue
    would be served by both a capsule and a dedicated process).

Bulk operations (>1000 records / jobs) follow the count-first, two-confirmation
protocol in CLAUDE.md — a careless mass-enqueue against the shared :default
queue is hard to undo. Surface the count before enqueuing.
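
As an illustration of the count-first idea, a wrapper script might gate a bulk run like this. The exact protocol lives in CLAUDE.md and is not reproduced here, so the prompts and the function name confirm_bulk are assumptions; only the 1000-record threshold and the two confirmations come from the text above.

```shell
# confirm_bulk COUNT
# Returns 0 if the operation may proceed. Above 1000 records it asks twice:
# first to retype the count, then for an explicit "yes".
confirm_bulk() {
  cb_count=$1
  if [ "$cb_count" -le 1000 ]; then
    return 0
  fi
  printf 'About to touch %s records. Retype the count to confirm: ' "$cb_count" >&2
  read -r cb_first || return 1
  [ "$cb_first" = "$cb_count" ] || return 1
  printf 'Type "yes" to proceed: ' >&2
  read -r cb_second || return 1
  [ "$cb_second" = "yes" ]
}
```

Retyping the count forces the operator to actually read it before anything is enqueued.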


Scaling & tuning

  • Web concurrency — PUMA_WORKERS / WEB_CONCURRENCY (4) + thread counts in
    config/deploy.yml env.clear. Tune for the shared box.
  • Add a host to a role — add its IP under servers.<role>.hosts and redeploy.
    kamal-proxy on each host load-balances independently behind the tunnel.
  • GC — RUBY_GC_* heap-tuning envs, carried over from the pre-Kamal Puma config.
  • Postgres — the staging accessory cmd in config/deploy.staging.yml is
    tuned down (shared_buffers=8GB, effective_cache_size=24GB) because the box
    (192 GB) is shared with the co-located prod stack; prod PG18 gets its own
    full-size tuning.

Provisioning a new box (Terraform / OpenTofu)

Two decoupled modules under infra/terraform/. Use OpenTofu (tofu).

flowchart TB
    subgraph cfmod["infra/terraform/cloudflare/"]
        t1["tunnel (remotely-managed)"] --> t2["DNS CNAMEs → *.cfargotunnel.com"]
        t1 --> t3["Access app + policy (wy-employees)"]
        t1 --> tok["output: tunnel_token (sensitive)"]
    end
    subgraph latmod["infra/terraform/latitude/"]
        l1["SSH keys (files/authorized_keys)"] --> l2["latitudesh_server (RAID-1)"]
        l3["cloud-init: deploy uid 1001 · Docker · Tailscale ·<br/>UFW + DOCKER-USER · cloudflared"] --> l2
        l4["edge firewall: :22 ← 100.64.0.0/10"] --> l2
    end
    tok -->|"-var cloudflared_token=…"| l3

# 1. Cloudflare side (tunnel + DNS + Access) — CLOUDFLARE_API_TOKEN via direnv:
cd infra/terraform/cloudflare && tofu init && tofu apply

# 2. Latitude box, wired to that tunnel:
cd ../latitude
export LATITUDESH_AUTH_TOKEN="$(op read op://IT/Latitude-API/credential)"
tofu init && tofu apply \
  -var project=<id> \
  -var hostname=<name> \
  -var tailscale_auth_key="$(op read op://IT/Tailscale-Kamal/credential)" \
  -var cloudflared_token="$(tofu -chdir=../cloudflare output -raw tunnel_token)"

cloud-init yields a fully-wired box (Docker + Tailscale + cloudflared + UFW +
DOCKER-USER). The deploy user is pinned to uid 1001 so it matches the
container's USER 1001 and can own Kamal's asset_path bind-mount (otherwise the
post-deploy DELETE_MAPS sourcemap cleanup fails with EACCES). Then
bin/deploy -d <dest>; the server bootstrap is a near no-op.
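
On a hand-built or imported box, where cloud-init did not create the user, the uid is worth verifying before the first deploy. A minimal check, with check_uid as a hypothetical helper name:

```shell
# check_uid USER EXPECTED_UID
# Succeeds only if USER exists and has exactly EXPECTED_UID.
check_uid() {
  cu_actual=$(id -u "$1" 2>/dev/null) || { echo "no such user: $1"; return 1; }
  if [ "$cu_actual" != "$2" ]; then
    echo "uid mismatch for $1: got $cu_actual, want $2"
    return 1
  fi
}

# Usage on the box: check_uid deploy 1001
```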

The current staging box (dal-latitude-heatwave-01, f4-metal-medium / Ubuntu 26.04 /
ZFS data plane) was provisioned via this module (infra/terraform/latitude,
setup_zfs_data=true) — it's the reproducible recipe in use. The earlier hand-built
Ashburn box it replaced has been decommissioned. To adopt an already-running box into
state instead of rebuilding, tofu import latitudesh_server.host <id> and install
cloudflared by hand once (cloud-init won't retroactively run).

See doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md for the current two-region
HA topology (PG18 primary in Dallas + cross-DC streaming standby in Chicago,
fronted by per-node pgbouncer + the HAProxy write-VIP heatwave-haproxy:6433, with
pg_promote-driven failover). INFRASTRUCTURE_INVENTORY.md is the live host/port
reference. (The older …202606041041_BARE_METAL_HA_STACK.md described a
Chicago-primary / Ashburn-standby end-state and is superseded.)