Kamal Stack — Troubleshooting Runbook

Symptom → cause → fix for the failure modes we've actually hit. Commands are written
as kamal, short for mise exec -- bundle exec kamal; pass -d staging to target staging.
Box access is over Tailscale (ssh deploy@100.123.47.52).


Logs & troubleshooting — developer quick reference

Two prerequisites for every command below:

  1. Be on Tailscale. Every box is reachable only over the tailnet — no public SSH
    or ports. tailscale status should list dal-latitude-heatwave-01 (Dallas — prod + staging,
    100.123.47.52) and chi-latitude-heatwave-02 (Chicago — standby, 100.68.157.49).
  2. Prefix kamal with the toolchain + pick the env:
    mise exec -- bundle exec kamal <cmd> -d staging (or -d production).
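
To avoid retyping the prefix, the full command line can be assembled by a small helper. This is a minimal sketch: kamal_cmd is a hypothetical name (nothing in the repo defines it), and it only prints the command so it can be checked before being run.

```shell
# kamal_cmd is a hypothetical helper, not something shipped in this repo.
# Usage: kamal_cmd <destination> <kamal args...>
# Prints the full toolchain-prefixed command line instead of running it.
kamal_cmd() {
  _dest="$1"
  shift
  printf 'mise exec -- bundle exec kamal %s -d %s\n' "$*" "$_dest"
}

# Example:
#   kamal_cmd staging app logs -f
#   kamal_cmd production app details
```

Printing rather than executing keeps the sketch side-effect free; arguments containing spaces would need extra quoting before the output is reused.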

Looking at logs — there are four layers

| Layer | Command | What you see |
|---|---|---|
| App (web + sidekiq) | kamal app logs -d staging -f | Rails/Puma + Sidekiq stdout (live). --roles=web / --roles=sidekiq to split; -n 200 for backlog; --grep ERROR to filter |
| Accessories (postgres / valkey ×3 / mailpit) | kamal accessory logs postgres -d staging -f (or valkey_cache / valkey_sessions / valkey_queue) | DB / cache / queue / mail-sink container logs |
| Edge connector (cloudflared, host systemd unit) | ssh deploy@100.123.47.52 'journalctl -u cloudflared -n 50 -f' | Tunnel up/down, ingress routing |
| Errors / APM | AppSignal apps Heatwave/staging & Heatwave/production | Exceptions, traces, slow requests (richer than tailing) |

Raw fallback on the box (when you need everything Docker sees):
ssh deploy@100.123.47.52, then docker ps, docker logs <container> --tail 100 -f,
docker stats. (Dallas/app 100.123.47.52, Chicago/standby 100.68.157.49.)

Troubleshooting flow

  1. App or edge? Curl the proxy on the box, bypassing Cloudflare:
    ssh deploy@100.123.47.52 'curl -s -o /dev/null -w "%{http_code}\n" \
      -H "Host: www.warmlyyours.ws" -H "X-Forwarded-Proto: https" http://localhost/up'
    
    200 → the app is fine; the problem is at the edge (cloudflared / Cloudflare Access /
    tunnel). Anything else → app / proxy (next steps).
  2. What's running? kamal app details -d staging (+ docker ps on the box) — healthy?
    crash-looping? which image version (kamal app versions -d staging)?
  3. Get inside: kamal console -d staging (Rails console), kamal shell -d staging
    (bash), kamal dbc -d staging (psql), or one-offs:
    kamal app exec -d staging --reuse 'bin/rails runner "…"'.
  4. Specific symptom? Jump to the matching section below (healthcheck timeouts,
    versions partition 500s, matviews empty after a restore, OAuth redirect, uid-1001
    asset EACCES, …).

The rest of this doc is the symptom-by-symptom runbook; the everyday-command table and
day-2 ops live in MANAGING.md, the deploy flow in
DEPLOYING.md, the architecture in README.md.


First-look triage

flowchart TD
    s{"What's broken?"}
    s -->|"Site won't load at all"| edge{"curl the box's kamal-proxy<br/>directly — does /up 200?"}
    s -->|"Deploy fails"| dep{"Where did it stop?"}
    s -->|"App up but errors / 500s"| app{"What kind of error?"}
    s -->|"Can't reach over Tailscale"| net["Check tailnet: tailscale status,<br/>use the 100.x IP, edge firewall :22"]

    edge -->|"yes (app fine)"| cf["Edge problem → check cloudflared<br/>+ Cloudflare Access / tunnel"]
    edge -->|"no"| ct["App/proxy down → kamal app details,<br/>docker ps, accessory boot order"]

    dep -->|"before build"| sec["1Password / clean-tree gate"]
    dep -->|"build"| bld["build secrets / .dockerignore / deps"]
    dep -->|"boot / healthcheck"| up["/up + force_ssl + config.hosts"]

    app -->|"DB error"| db["partitions / matviews / restore gaps"]
    app -->|"login blocked"| oauth["OAuth redirect URI"]
    app -->|"EACCES on assets"| uid["uid 1001 mismatch"]

A fast way to split "app vs edge": curl the proxy on the box, bypassing Cloudflare:

ssh deploy@100.123.47.52 \
  'curl -sS -o /dev/null -w "%{http_code}\n" -H "Host: www.warmlyyours.ws" \
     -H "X-Forwarded-Proto: https" http://localhost:80/up'

200 = app healthy, problem is at the edge. Anything else = app/proxy problem.
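
The split above can be captured in a tiny classifier so the result of the curl is unambiguous in scripts. A minimal sketch; classify_up is a hypothetical helper name, and the ssh wiring is left commented out because it needs tailnet access.

```shell
# classify_up is a hypothetical helper: it maps the /up status code from the
# direct-proxy curl to the layer to investigate next.
classify_up() {
  case "$1" in
    200) echo "edge" ;;          # app healthy: check cloudflared / Access / tunnel
    *)   echo "app-or-proxy" ;;  # anything else: kamal app details, docker ps
  esac
}

# Example wiring (requires tailnet access, so it is not run here):
#   code=$(ssh deploy@100.123.47.52 \
#     'curl -sS -o /dev/null -w "%{http_code}" -H "Host: www.warmlyyours.ws" \
#        -H "X-Forwarded-Proto: https" http://localhost:80/up')
#   classify_up "$code"
```

When curl gets no HTTP response at all it reports 000, which this sketch also treats as an app/proxy problem.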


Edge / ingress

"Site can't be reached" / connection refused

  • :3000 (or any non-standard port) in the URL. :3000 is Puma's internal
    container port; Cloudflare only proxies standard web ports. The public URL has
    no port: https://www.warmlyyours.ws/en-US. The app never emits :3000
    links itself. Fix: drop the port / clear the cached bookmark.
  • Tunnel down. cloudflared runs as a host systemd service, not a container:
    ssh deploy@100.123.47.52 'systemctl status cloudflared; journalctl -u cloudflared -n 50'
    
    Healthy = "active (running)" with QUIC connectors registered. Restart with
    sudo systemctl restart cloudflared.
  • Edge is up but app is down. If the direct-proxy curl above is non-200, it's
    not the edge — go to Deploy / boot.
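
The "active (running)" check for the tunnel can be scripted. A minimal sketch, assuming a hypothetical helper named tunnel_state that reads systemctl status output on stdin; it only looks for the service state and does not inspect the QUIC connector registrations.

```shell
# tunnel_state is a hypothetical helper: it reads `systemctl status cloudflared`
# output on stdin and prints "up" when the unit is "active (running)",
# otherwise "down".
tunnel_state() {
  if grep -q 'active (running)'; then
    echo "up"
  else
    echo "down"
  fi
}

# Example (requires tailnet access, so it is not run here):
#   ssh deploy@100.123.47.52 'systemctl status cloudflared' | tunnel_state
```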

Stuck at the Cloudflare Access login

A 302 to warmlyyours.cloudflareaccess.com is expected — staging is gated by
the wy-employees Access group. Log in with a @warmlyyours.com identity. If a
legitimate user is denied, check the Access policy/group in the Cloudflare Zero
Trust dashboard (or infra/terraform/cloudflare/).

Google login blocked (redirect_uri_mismatch)

New staging hostnames aren't in the Google OAuth client's authorized redirect URIs.
The CRM login uses Devise's google_oauth2 provider; the callback is
/accounts/auth/google_oauth2/callback. Staging has its own OAuth client
(114261933316-b7694…, project "WY API Project") — not prod's. Add:

https://crm.warmlyyours.ws/accounts/auth/google_oauth2/callback

to that client's Authorized redirect URIs in Google Cloud Console. (No JS origins
needed — it's a server-side code flow.) If the consent screen itself blocks, the
registrable domain warmlyyours.ws may need adding to the consent screen's
Authorized domains (External apps) — Internal/Workspace apps are exempt.
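
When several staging hostnames need registering, the redirect URIs can be generated rather than typed. A minimal sketch; oauth_callback_url is a hypothetical helper name, and the only assumption beyond the text above is that every hostname uses the same callback path.

```shell
# oauth_callback_url is a hypothetical helper: it prints the Devise
# google_oauth2 callback URL for one hostname.
oauth_callback_url() {
  printf 'https://%s/accounts/auth/google_oauth2/callback\n' "$1"
}

# Example: print one URI per hostname, ready to paste into the OAuth client.
#   for host in crm.warmlyyours.ws; do oauth_callback_url "$host"; done
```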


Deploy / boot

1Password: "couldn't connect to the 1Password desktop app"

The desktop-app CLI integration is flaky. Fixes, in increasing order of reliability:

  1. ⌘Q the 1Password app fully (not just close the window), reopen, unlock.
    This clears a stuck CLI-integration helper and is the usual fix. Verify:
    op vault list. Confirm Settings → Developer → "Integrate with 1Password CLI" is ON.
  2. Manual session in the deploy terminal: eval "$(op signin --account warmlyyours.1password.com)", then re-run.
  3. Service-account token (most reliable — skips the desktop app entirely): save
    it to .kamal/.op-service-account-token (gitignored). See
    MANAGING.md → Secrets.

bin/deploy hard-gates on a real op read before the build, so a secret
failure surfaces early with the exact op error.
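
Choosing between the desktop app and the service-account token can be sketched as below. This is an illustration only: op_auth_mode is a hypothetical helper, and bin/deploy may pick the auth path differently. The one external fact relied on is that the op CLI reads a service-account token from the OP_SERVICE_ACCOUNT_TOKEN environment variable.

```shell
# op_auth_mode is a hypothetical helper; bin/deploy may decide differently.
# Prints "service-account" when a non-empty token file exists, else "desktop-app".
op_auth_mode() {
  _token_file="${1:-.kamal/.op-service-account-token}"
  if [ -s "$_token_file" ]; then
    echo "service-account"   # skips the desktop app entirely
  else
    echo "desktop-app"       # relies on the CLI integration or a manual op signin
  fi
}

# Example (not run here):
#   if [ "$(op_auth_mode)" = "service-account" ]; then
#     export OP_SERVICE_ACCOUNT_TOKEN="$(cat .kamal/.op-service-account-token)"
#   fi
#   op vault list   # verify the chosen path works
```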

"Working tree is not clean / out of sync"

The clean-tree gate refuses to ship code that matches no pushed commit (Kamal builds
the working tree). Commit + push, or use --allow-dirty (throwaway staging test) /
--in-worktree (clean master). See DEPLOYING.md.
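
To check locally what the gate is likely to complain about, the two conditions can be tested with plain git. A minimal sketch with hypothetical helper names; the real gate lives in bin/deploy and may apply different or stricter rules.

```shell
# tree_is_clean and head_is_pushed are hypothetical helpers, not bin/deploy's code.

# Succeeds only when the working tree at $1 (default: .) has no uncommitted
# or untracked changes.
tree_is_clean() {
  _out=$(git -C "${1:-.}" status --porcelain) || return 1
  [ -z "$_out" ]
}

# Succeeds when at least one remote-tracking branch contains HEAD.
head_is_pushed() {
  [ -n "$(git -C "${1:-.}" branch -r --contains HEAD 2>/dev/null)" ]
}

# Example:
#   tree_is_clean && head_is_pushed && echo "safe to deploy"
```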

Build failures

| Error | Cause → fix |
|---|---|
| sidekiq-pro 401 during bundle install | BUNDLE_GEMS__CONTRIBSYS__COM didn't resolve. Validate with kamal secrets print -d staging; check op://IT/Sidekiq-Pro/credential. |
| Registry push 401 | GHCR auth: op://IT/GitHub-ghcr-deploy/credential (needs write:packages). The token was refreshed 2026-06; re-check if it expired. |
| db/structure.sql doesn't exist | .dockerignore stripped a needed file; un-ignore it. |
| vips: not found / a CLI missing at runtime | Add the package to the final stage's apt list in the Dockerfile (e.g. libvips-tools). |
| webpack rebuilds from scratch every deploy | Known cache-invalidation issue; see doc/tasks/202606051345_WEBPACK_ASSET_BUILD_SPEEDUP.md (cache mount + revision decouple; likely folded into the Vite migration). |

New container never goes healthy (/up times out, proxy gives up)

  • /up returns 403. config.hosts allow-listing or force_ssl is rejecting
    the internal health probe. /up must be exempt from both — confirm the
    exemption in config/environments/*.rb. (This bit us on the first staging deploy.)
  • Boot is just slow. Puma preload is ~20s; deploy_timeout is 90s. If a heavy
    boot legitimately needs more, raise deploy_timeout in config/deploy.yml.
  • DB unreachable at boot. The app can't boot without its DB. In staging, check
    the Postgres accessory is up (docker ps). In prod the app connects through the
    HAProxy write-VIP (DATABASE_HOST=heatwave-haproxy:6433) → the live primary's
    pgbouncer → Postgres, so check that chain: is HAProxy showing a healthy backend
    (http://100.123.47.52:8404/), is the primary's pg-health returning 200, and
    is the Postgres accessory up (Dallas primary, or Chicago standby after a flip)?
    See INFRASTRUCTURE_INVENTORY.md for the routing layout.
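
To tell a slow boot from a dead one, it helps to poll /up on the box against the same 90s budget. A minimal sketch; wait_until is a hypothetical helper that illustrates a healthcheck with a deadline and is not how kamal-proxy is implemented. It counts polling intervals rather than wall-clock time, and the interval must be greater than zero.

```shell
# wait_until is a hypothetical helper, not kamal-proxy's implementation.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command...>
# Re-runs the command until it succeeds, giving up once the accumulated
# intervals reach the timeout.
wait_until() {
  _timeout="$1"
  _interval="$2"
  shift 2
  _elapsed=0
  while ! "$@"; do
    _elapsed=$((_elapsed + _interval))
    if [ "$_elapsed" -ge "$_timeout" ]; then
      return 1
    fi
    sleep "$_interval"
  done
  return 0
}

# Example, run on the box (90 mirrors deploy_timeout; -f makes curl fail on 4xx/5xx):
#   wait_until 90 5 curl -fsS -o /dev/null -H "Host: www.warmlyyours.ws" \
#     -H "X-Forwarded-Proto: https" http://localhost/up
```

If this succeeds well inside the budget but the deploy still times out, the probe itself is being rejected, which points back at the config.hosts / force_ssl bullet.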

App crash-loops after a host reboot

Accessories and the app start independently, so the app may crash-loop until
Postgres is ready. restart: unless-stopped retries it, so it usually self-heals.
If not, boot the accessories first:

ssh deploy@100.123.47.52 docker ps                  # are postgres + the valkey trio up?
kamal accessory boot postgres        -d staging      # database first
kamal accessory boot valkey_cache    -d staging      # then the valkey trio
kamal accessory boot valkey_sessions -d staging
kamal accessory boot valkey_queue    -d staging
kamal app boot -d staging                            # app last, once its dependencies are up

Sidekiq stuck quieted after a failed deploy

pre-deploy quiets Sidekiq (TSTP); on a failed deploy bin/deploy un-quiets it
automatically. If you ran bare kamal or the auto-resume failed:

kamal app boot --roles=sidekiq -d staging

Database

Quote builder returns "No matching controls" (but prod is fine)

The view_quote_bom_items matview is empty. After a restore, matviews get
refreshed during the schema-only phase (on empty base tables) and, if not
re-refreshed against loaded data, stay empty. An empty BOM matview surfaces as
"No matching controls" (the elements error is overwritten by the controls one in
heating_system_items.rb). Fix:

kamal accessory exec postgres -d staging \
  "psql -U deploy -d heatwave -c 'REFRESH MATERIALIZED VIEW public.view_quote_bom_items;'"

The restore scripts now refresh this critical matview eagerly and verify it before
the swap, so a fresh restore should no longer reproduce this (see
MANAGING.md → Database restore).
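
Before and after a refresh it is worth confirming whether the matview actually has rows. A minimal sketch; matview_count_cmd is a hypothetical helper that prints a psql command using the same user and database as the refresh above (-At makes psql print a bare number).

```shell
# matview_count_cmd is a hypothetical helper: it prints the psql command that
# counts the rows of one matview in the public schema.
matview_count_cmd() {
  printf "psql -U deploy -d heatwave -At -c 'SELECT count(*) FROM public.%s;'\n" "$1"
}

# Example (not run here); a result of 0 means the matview is empty:
#   kamal accessory exec postgres -d staging "$(matview_count_cmd view_quote_bom_items)"
```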

versions write 500s: "no partition of relation versions found"

PaperTrail's versions table (separate heatwave_versions DB, FDW-backed) is
range-partitioned by year. db/versions_structure.sql historically shipped with
no child partitions, so the first write after a fresh load 500s. Fix: create the
annual partitions plus versions_default; the structure dump then carries them (via
the pg_party schema_exclude_partitions = false fix, config/initializers/355_pg_party.rb,
PR #1031). On a box that's already broken:

-- one-off, per missing year, in heatwave_versions:
CREATE TABLE versions_2026 PARTITION OF versions
  FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
CREATE TABLE versions_default PARTITION OF versions DEFAULT;
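
When more than one year is missing, the per-year statement can be generated instead of hand-edited. A minimal sketch; versions_partition_ddl is a hypothetical helper that prints the same DDL as above for a given year (on a single line), with the upper bound computed as the following January 1st.

```shell
# versions_partition_ddl is a hypothetical helper: it prints the CREATE TABLE
# statement for the yearly partition covering [year-01-01, (year+1)-01-01).
versions_partition_ddl() {
  _year="$1"
  _next=$((_year + 1))
  printf "CREATE TABLE versions_%s PARTITION OF versions FOR VALUES FROM ('%s-01-01') TO ('%s-01-01');\n" \
    "$_year" "$_year" "$_next"
}

# Example: print DDL for two years, review it, then run it in heatwave_versions.
#   for y in 2026 2027; do versions_partition_ddl "$y"; done
```

Postgres rejects a new partition if versions_default already holds rows inside its range, so on a long-broken box those rows may need to be moved first.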

Analytics dashboards empty after a restore

The ~22 analytics matviews (view_sales_facts, view_opportunities_facts,
view_visits_*, …) are refreshed after the swap (deferred, non-blocking) and
self-heal via the hourly MatviewRefreshWorker cron. If staging's scheduler is
running they fill in on their own; to force it:

kamal app exec -d staging --reuse \
  'bin/rails runner "MatviewRefreshWorker.new.perform"'

Postgres collation mismatch on accessory boot

The collation version recorded at initdb (e.g. 2.41 vs 2.36) differs between the stock and custom images.
The pgdata volume must be initialized under the custom image
(ghcr.io/warmlyyours/heatwave-postgres:18, built from docker/postgresql.Dockerfile).
If you see a collation-version warning, re-initdb the volume under that image — or,
if the data is fine, ALTER DATABASE … REFRESH COLLATION VERSION. Relocating the
image between registries does not change collation (the bytes are identical); a
mismatch only appears if the pgvector/pgvector:pg18 base was rebuilt with a newer
glibc, so prefer re-tagging an existing image over rebuilding when you just need to
move registries.
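
To see which databases are affected before choosing between re-initdb and REFRESH COLLATION VERSION, the recorded and actual versions can be compared. A minimal sketch; collation_check_sql is a hypothetical helper that prints a query using datcollversion and pg_database_collation_actual_version, both available in PostgreSQL 15 and later.

```shell
# collation_check_sql is a hypothetical helper: it prints a query listing the
# databases whose recorded collation version differs from the version the
# running image's C library reports.
collation_check_sql() {
  cat <<'SQL'
SELECT datname,
       datcollversion AS recorded,
       pg_database_collation_actual_version(oid) AS actual
FROM pg_database
WHERE datcollversion IS DISTINCT FROM pg_database_collation_actual_version(oid);
SQL
}

# Example (not run here; <postgres-container> is a placeholder):
#   collation_check_sql | ssh deploy@100.123.47.52 \
#     'docker exec -i <postgres-container> psql -U deploy -d heatwave'
```

An empty result means no database reports a mismatch.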


App runtime

Sourcemap upload fails with EACCES (DELETE_MAPS)

The asset_path host volume is owned by the host deploy user (uid 1001); a
container running as a different uid can't rewrite/clean the bridged assets. The
container app user must be uid 1001 (Dockerfile USER 1001, and cloud-init
pins the deploy user to uid 1001). If you see EACCES on the post-deploy sourcemap
cleanup, the uids have drifted — check both.
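
Checking both sides comes down to comparing two numbers. A minimal sketch; uid_status is a hypothetical helper, and the commented commands show one way to collect the inputs (<container> is a placeholder for the running app container).

```shell
# uid_status is a hypothetical helper: it compares the host deploy uid ($1)
# with the container app uid ($2) and reports whether they have drifted.
uid_status() {
  if [ "$1" = "$2" ]; then
    echo "ok"
  else
    echo "drifted (host=$1 container=$2)"
  fi
}

# Example (not run here):
#   host_uid=$(ssh deploy@100.123.47.52 'id -u deploy')
#   container_uid=$(ssh deploy@100.123.47.52 'docker exec <container> id -u')
#   uid_status "$host_uid" "$container_uid"
```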

MissingKeyError: ENV['staging'] (or ['production'])

Heatwave::Configuration is keyed on ENV['<environment>'], so each destination needs
a secret named after its environment (staging or production) whose value is the
contents of config/master.key (present in .kamal/secrets.staging / .kamal/secrets).
Validate with kamal secrets print -d staging.

"Sunny is broken" / AI 400s

Unrelated to Kamal — that's the AppSignal #3808 body-less Gemini 400 class. See the
Sunny memory/notes, not this runbook.


Network / access

  • Can't SSH to the box. SSH is Tailscale-only (Latitude edge firewall allows
    :22 from 100.64.0.0/10 only). Confirm you're on the tailnet (tailscale status)
    and using the Tailscale IP 100.123.47.52.
  • psql / mailpit UI unreachable. Same cause: Postgres binds to 127.0.0.1 and the
    mailpit UI (:8025) to the Tailscale IP, so neither is ever publicly exposed.
  • A published container port is unexpectedly world-reachable. Docker bypasses UFW
    for published ports; the DOCKER-USER chain (cloud-init) is what blocks public
    :80/:443. Verify it loaded: sudo iptables -L DOCKER-USER -n.
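
A quick sanity check when SSH is refused is whether the address in use falls inside the range the firewall allows. A minimal sketch; in_cgnat_range is a hypothetical helper that tests membership in 100.64.0.0/10, which works out to a first octet of 100 and a second octet between 64 and 127.

```shell
# in_cgnat_range is a hypothetical helper: it succeeds when an IPv4 address
# is inside 100.64.0.0/10 (first octet 100, second octet 64..127).
in_cgnat_range() {
  _o1=${1%%.*}        # first octet
  _rest=${1#*.}       # everything after the first dot
  _o2=${_rest%%.*}    # second octet
  [ "$_o1" = "100" ] && [ "$_o2" -ge 64 ] 2>/dev/null && [ "$_o2" -le 127 ]
}

# Example:
#   in_cgnat_range 100.123.47.52 && echo "tailnet-range address"
```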

Escalation references

  • Migration status + cutover gates — doc/tasks/202606022303_KAMAL_MIGRATION.md
  • DB-tier HA topology (Dallas primary / Chicago standby + HAProxy/pgbouncer) — doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md; live host/port reference — doc/infrastructure/INFRASTRUCTURE_INVENTORY.md
  • Webpack build speedup — doc/tasks/202606051345_WEBPACK_ASSET_BUILD_SPEEDUP.md
  • Errors / APM — the appsignal skill (AppSignal incidents, traces, logs)
  • Cloudflare zones/tokens/workers — doc/infrastructure/CLOUDFLARE.md