
Kamal Stack — Troubleshooting Runbook

Symptom → cause → fix for the failure modes we’ve actually hit. Commands assume mise exec -- bundle exec kamal (abbreviated kamal), -d staging for staging. Box access is over Tailscale (ssh deploy@100.123.47.52).


Logs & troubleshooting — developer quick reference


Two prerequisites for every command below:

  1. Be on Tailscale. Every box is reachable only over the tailnet — no public SSH or ports. tailscale status should list dal-latitude-heatwave-01 (Dallas — prod + staging, 100.123.47.52) and chi-latitude-heatwave-02 (Chicago — standby, 100.68.157.49).
  2. Prefix kamal with the toolchain + pick the env: mise exec -- bundle exec kamal <cmd> -d staging (or -d production).

Layer | Command | What you see
App (web + sidekiq) | kamal app logs -d staging -f | Rails/Puma + Sidekiq stdout (live). --roles=web / --roles=sidekiq to split; -n 200 for backlog; --grep ERROR to filter
Accessories (postgres / valkey ×3 / mailpit) | kamal accessory logs postgres -d staging -f (or valkey_cache / valkey_sessions / valkey_queue) | DB / cache / queue / mail-sink container logs
Edge connector (cloudflared, host systemd unit) | ssh deploy@100.123.47.52 'journalctl -u cloudflared -n 50 -f' | Tunnel up/down, ingress routing
Errors / APM | AppSignal apps Heatwave/staging & Heatwave/production | Exceptions, traces, slow requests; richer than tailing

Raw fallback on the box (when you need everything Docker sees): ssh deploy@100.123.47.52, then docker ps, docker logs <container> --tail 100 -f, docker stats. (Dallas/app 100.123.47.52, Chicago/standby 100.68.157.49.)

  1. App or edge? Curl the proxy on the box, bypassing Cloudflare:
    ssh deploy@100.123.47.52 'curl -s -o /dev/null -w "%{http_code}\n" \
    -H "Host: www.warmlyyours.ws" -H "X-Forwarded-Proto: https" http://localhost/up'
    200 → the app is fine; the problem is at the edge (cloudflared / Cloudflare Access / tunnel). Anything else → app / proxy (next steps).
  2. What’s running? kamal app details -d staging (+ docker ps on the box) — healthy? crash-looping? which image version (kamal app versions -d staging)?
  3. Get inside: kamal console -d staging (Rails console), kamal shell -d staging (bash), kamal dbc -d staging (psql), or one-offs: kamal app exec -d staging --reuse 'bin/rails runner "…"'.
  4. Specific symptom? Jump to the matching section below (healthcheck timeouts, versions partition 500s, matviews empty after a restore, OAuth redirect, uid-1001 asset EACCES, …).

The rest of this doc is the symptom-by-symptom runbook; the everyday-command table and day-2 ops live in MANAGING.md, the deploy flow in DEPLOYING.md, the architecture in README.md.


flowchart TD
s{"What's broken?"}
s -->|"Site won't load at all"| edge{"curl the box's kamal-proxy<br/>directly — does /up 200?"}
s -->|"Deploy fails"| dep{"Where did it stop?"}
s -->|"App up but errors / 500s"| app{"What kind of error?"}
s -->|"Can't reach over Tailscale"| net["Check tailnet: tailscale status,<br/>use the 100.x IP, edge firewall :22"]
edge -->|"yes (app fine)"| cf["Edge problem → check cloudflared<br/>+ Cloudflare Access / tunnel"]
edge -->|"no"| ct["App/proxy down → kamal app details,<br/>docker ps, accessory boot order"]
dep -->|"before build"| sec["1Password / clean-tree gate"]
dep -->|"build"| bld["build secrets / .dockerignore / deps"]
dep -->|"boot / healthcheck"| up["/up + force_ssl + config.hosts"]
app -->|"DB error"| db["partitions / matviews / restore gaps"]
app -->|"login blocked"| oauth["OAuth redirect URI"]
app -->|"EACCES on assets"| uid["uid 1001 mismatch"]

A fast way to split “app vs edge”: curl the proxy on the box, bypassing Cloudflare:

ssh deploy@100.123.47.52 \
'curl -sS -o /dev/null -w "%{http_code}\n" -H "Host: www.warmlyyours.ws" \
-H "X-Forwarded-Proto: https" http://localhost:80/up'

200 = app healthy, problem is at the edge. Anything else = app/proxy problem.
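
The app-vs-edge split can be scripted. This is a minimal sketch, not part of bin/deploy: the function names classify_up_status and probe_up are hypothetical, and the probe is the same curl shown above.

```shell
# Hypothetical helper: interpret the status code returned by the direct-proxy /up probe.
classify_up_status() {
  case "$1" in
    200) echo "edge" ;;  # app healthy: look at cloudflared / Cloudflare Access / tunnel
    *)   echo "app" ;;   # anything else: app or kamal-proxy problem
  esac
}

# Runs the curl on the box over Tailscale and prints only the HTTP status code.
probe_up() {
  ssh deploy@100.123.47.52 \
    'curl -sS -o /dev/null -w "%{http_code}\n" -H "Host: www.warmlyyours.ws" -H "X-Forwarded-Proto: https" http://localhost:80/up'
}

# Usage (needs tailnet access):
#   classify_up_status "$(probe_up)"
```

A failed connection makes curl report 000, which falls into the "app" branch along with every other non-200 code.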


“Site can’t be reached” / connection refused

  • :3000 (or any non-standard port) in the URL. :3000 is Puma’s internal container port; Cloudflare only proxies standard web ports. The public URL has no port: https://www.warmlyyours.ws/en-US. The app never emits :3000 links itself. Fix: drop the port / clear the cached bookmark.
  • Tunnel down. cloudflared runs as a host systemd service, not a container:
    ssh deploy@100.123.47.52 'systemctl status cloudflared; journalctl -u cloudflared -n 50'
    Healthy = “active (running)” with QUIC connectors registered. Restart with sudo systemctl restart cloudflared.
  • Edge is up but app is down. If the direct-proxy curl above is non-200, it’s not the edge — go to Deploy / boot.

A 302 to warmlyyours.cloudflareaccess.com is expected — staging is gated by the wy-employees Access group. Log in with a @warmlyyours.com identity. If a legitimate user is denied, check the Access policy/group in the Cloudflare Zero Trust dashboard (or infra/terraform/cloudflare/).

Google login blocked (redirect_uri_mismatch)


New staging hostnames aren’t in the Google OAuth client’s authorized redirect URIs. The CRM login uses Devise’s google_oauth2 provider; the callback is /accounts/auth/google_oauth2/callback. Staging has its own OAuth client (114261933316-b7694…, project “WY API Project”) — not prod’s. Add:

https://crm.warmlyyours.ws/accounts/auth/google_oauth2/callback

to that client’s Authorized redirect URIs in Google Cloud Console. (No JS origins needed — it’s a server-side code flow.) If the consent screen itself blocks, the registrable domain warmlyyours.ws may need adding to the consent screen’s Authorized domains (External apps) — Internal/Workspace apps are exempt.
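
When another staging hostname is added, the URI to register is that hostname plus the fixed callback path named above. A small sketch for generating it; oauth_callback_url is a hypothetical helper name, not something in the repo.

```shell
# Hypothetical helper: build the redirect URI to register for a given hostname.
# The path is the Devise google_oauth2 callback named in this section.
oauth_callback_url() {
  printf 'https://%s/accounts/auth/google_oauth2/callback\n' "$1"
}

# Usage:
#   oauth_callback_url crm.warmlyyours.ws
```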


1Password: “couldn’t connect to the 1Password desktop app”


The desktop-app CLI integration is flaky. In order of reliability:

  1. ⌘Q the 1Password app fully (not just close the window), reopen, unlock. This clears a stuck CLI-integration helper and is the usual fix. Verify: op vault list. Confirm Settings → Developer → “Integrate with 1Password CLI” is ON.
  2. Manual session in the deploy terminal: eval "$(op signin --account warmlyyours.1password.com)", then re-run.
  3. Service-account token (most reliable — skips the desktop app entirely): save it to .kamal/.op-service-account-token (gitignored). See MANAGING.md → Secrets.

bin/deploy hard-gates on a real op read before the build, so a secret failure surfaces early with the exact op error.
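
To check which auth path a terminal will use before deploying, something like the sketch below may help. The function names are hypothetical. Exporting the token file's contents as OP_SERVICE_ACCOUNT_TOKEN (the variable the op CLI reads for service accounts) is an assumption about how the file is meant to be used; bin/deploy may wire it up differently.

```shell
# Hypothetical helper: report which 1Password auth path is available.
# A non-empty token file means the service-account path can be used.
op_auth_mode() {
  local token_file="${1:-.kamal/.op-service-account-token}"
  if [ -s "$token_file" ]; then
    echo "service-account"
  else
    echo "desktop-app"
  fi
}

# Preflight: prefer the service-account token when present, then prove that a
# real op call succeeds. Requires the op CLI on PATH.
op_preflight() {
  local token_file="${1:-.kamal/.op-service-account-token}"
  if [ "$(op_auth_mode "$token_file")" = "service-account" ]; then
    # Assumption: the file holds the raw token and nothing else.
    OP_SERVICE_ACCOUNT_TOKEN="$(cat "$token_file")"
    export OP_SERVICE_ACCOUNT_TOKEN
  fi
  op vault list >/dev/null
}
```

If op_preflight fails in desktop-app mode, fall back to step 1 or 2 above.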

“Working tree is not clean / out of sync”


The clean-tree gate refuses to ship code that matches no pushed commit (Kamal builds the working tree). Commit + push, or use --allow-dirty (throwaway staging test) / --in-worktree (clean master). See DEPLOYING.md.
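
To see ahead of time whether the gate is likely to trip, the two conditions can be checked by hand. This sketch is an approximation written for illustration; classify_tree and check_tree are hypothetical names and bin/deploy's actual checks may differ.

```shell
# Hypothetical helper: classify a tree from the output of two git commands.
#   $1 = output of `git status --porcelain`         (empty means no local changes)
#   $2 = output of `git branch -r --contains HEAD`  (empty means HEAD is on no remote branch)
classify_tree() {
  local porcelain="$1" remote_refs="$2"
  if [ -n "$porcelain" ]; then
    echo "dirty"
  elif [ -z "$remote_refs" ]; then
    echo "unpushed"
  else
    echo "clean"
  fi
}

# Run inside the repo; prints dirty, unpushed, or clean.
check_tree() {
  classify_tree "$(git status --porcelain)" "$(git branch -r --contains HEAD)"
}
```

Run git fetch first so the remote-tracking branches are current.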

Error | Cause → fix
sidekiq-pro 401 during bundle install | BUNDLE_GEMS__CONTRIBSYS__COM didn’t resolve. Validate kamal secrets print -d staging; check op://IT/Sidekiq-Pro/credential.
Registry push 401 | GHCR auth: op://IT/GitHub-ghcr-deploy/credential (needs write:packages). The token was refreshed 2026-06; re-check if it expired.
db/structure.sql doesn't exist | .dockerignore stripped a needed file; un-ignore it.
vips: not found / a CLI missing at runtime | Add the package to the final stage’s apt list in the Dockerfile (e.g. libvips-tools).
webpack rebuilds from scratch every deploy | Known cache-invalidation issue; see doc/tasks/202606051345_WEBPACK_ASSET_BUILD_SPEEDUP.md (cache mount + revision decouple; likely folded into the Vite migration).
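
For the secret-resolution failures, it can help to list which secrets resolved to nothing without echoing any values to the terminal. A minimal sketch, assuming kamal secrets print emits one KEY=value line per secret; empty_secret_keys is a hypothetical helper.

```shell
# Hypothetical helper: read KEY=value lines on stdin and print only the names
# of keys whose value is empty. Values are never printed.
empty_secret_keys() {
  awk '{
    i = index($0, "=")                       # position of the first "="
    if (i > 1 && substr($0, i + 1) == "")    # a key exists and nothing follows "="
      print substr($0, 1, i - 1)
  }'
}

# Usage:
#   kamal secrets print -d staging | empty_secret_keys
```

Splitting on the first "=" only means values that themselves contain "=" are not misreported as empty.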

New container never goes healthy (/up times out, proxy gives up)

  • /up returns 403. config.hosts allow-listing or force_ssl is rejecting the internal health probe. /up must be exempt from both — confirm the exemption in config/environments/*.rb. (This bit us on the first staging deploy.)
  • Boot is just slow. Puma preload is ~20s; deploy_timeout is 90s. If a heavy boot legitimately needs more, raise deploy_timeout in config/deploy.yml.
  • DB unreachable at boot. The app can’t boot without its DB. In staging, check the Postgres accessory is up (docker ps). In prod the app connects through the HAProxy write-VIP (DATABASE_HOST=heatwave-haproxy:6433) → the live primary’s pgbouncer → Postgres, so check that chain: is HAProxy showing a healthy backend (http://100.123.47.52:8404/), is the primary’s pg-health returning 200, and is the Postgres accessory up (Dallas primary, or Chicago standby after a flip)? See INFRASTRUCTURE_INVENTORY.md for the routing layout.

Accessories and the app start independently, so the app may crash-loop until Postgres is ready. restart: unless-stopped retries it, so it usually self-heals. If not, boot the accessories first:

ssh deploy@100.123.47.52 docker ps # are postgres + the valkey trio up?
kamal accessory boot postgres -d staging
kamal accessory boot valkey_cache -d staging # + valkey_sessions, valkey_queue
kamal accessory boot valkey_sessions -d staging
kamal accessory boot valkey_queue -d staging
kamal app boot -d staging
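
To avoid booting the app into a crash loop, the app boot can wait until Postgres accepts connections. A sketch under assumptions: wait_for is a hypothetical retry helper, pg_isready is assumed to be present in the Postgres accessory image, and kamal is assumed to be a wrapper script on PATH (a shell alias will not expand here).

```shell
# Hypothetical helper: run a command up to N times, sleeping between failed
# attempts. Returns 0 on the first success, 1 if every attempt fails.
#   wait_for <attempts> <delay-seconds> <command...>
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: wait up to about a minute for Postgres, then boot the app.
#   wait_for 30 2 kamal accessory exec postgres -d staging "pg_isready -U deploy" \
#     && kamal app boot -d staging
```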

Sidekiq stuck quieted after a failed deploy


pre-deploy quiets Sidekiq (TSTP); on a failed deploy bin/deploy un-quiets it automatically. If you ran bare kamal or the auto-resume failed:

kamal app boot --roles=sidekiq -d staging

Quote builder returns “No matching controls” (but prod is fine)


The view_quote_bom_items matview is empty. After a restore, matviews get refreshed during the schema-only phase (on empty base tables) and, if not re-refreshed against loaded data, stay empty. An empty BOM matview surfaces as “No matching controls” (the elements error is overwritten by the controls one in heating_system_items.rb). Fix:

kamal accessory exec postgres -d staging \
"psql -U deploy -d heatwave -c 'REFRESH MATERIALIZED VIEW public.view_quote_bom_items;'"

The restore scripts now refresh this critical matview eagerly and verify it before the swap, so a fresh restore can’t reproduce the problem (see MANAGING.md → Database restore).

versions write 500s: “no partition of relation versions found”


PaperTrail’s versions table (separate heatwave_versions DB, FDW-backed) is range-partitioned by year. db/versions_structure.sql historically shipped with no child partitions, so the first write after a fresh load 500s. Fix: create the annual partitions + versions_default, then the structure dump carries them (the pg_party schema_exclude_partitions = false fix, config/initializers/355_pg_party.rb, PR #1031). On a box that’s already broken:

-- one-off, per missing year, in heatwave_versions:
CREATE TABLE versions_2026 PARTITION OF versions
FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
CREATE TABLE versions_default PARTITION OF versions DEFAULT;
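
Because the one-off is the same statement per missing year, it can be generated rather than hand-edited. A sketch: partition_sql is a hypothetical helper, and IF NOT EXISTS is added here (it is not in the statement above) so the output can be re-run safely.

```shell
# Hypothetical helper: print the CREATE statement for one annual partition of
# versions, covering [YEAR-01-01, (YEAR+1)-01-01).
partition_sql() {
  local year="$1"
  printf "CREATE TABLE IF NOT EXISTS versions_%s PARTITION OF versions FOR VALUES FROM ('%s-01-01') TO ('%s-01-01');\n" \
    "$year" "$year" "$((year + 1))"
}

# Usage: emit statements for a range of years, then run them in heatwave_versions.
#   for y in 2025 2026 2027; do partition_sql "$y"; done
```

Note that PostgreSQL rejects a new partition if versions_default already holds rows that fall in its range; those rows would have to be moved first.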

Analytics dashboards empty after a restore


The ~22 analytics matviews (view_sales_facts, view_opportunities_facts, view_visits_*, …) are refreshed after the swap (deferred, non-blocking) and self-heal via the hourly MatviewRefreshWorker cron. If staging’s scheduler is running they fill in on their own; to force it:

kamal app exec -d staging --reuse \
'bin/rails runner "MatviewRefreshWorker.new.perform"'

Postgres collation mismatch on accessory boot


The glibc collation version recorded at initdb (e.g. 2.41 vs 2.36) differs between the stock and custom images. The pgdata volume must be initialized under the custom image (ghcr.io/warmlyyours/heatwave-postgres:18, built from docker/postgresql.Dockerfile). If you see a collation-version warning, re-initdb the volume under that image or, if the data is fine, run ALTER DATABASE … REFRESH COLLATION VERSION. Relocating the image between registries does not change collation (the bytes are identical); a mismatch only appears if the pgvector/pgvector:pg18 base was rebuilt with a newer glibc, so prefer re-tagging an existing image over rebuilding when you just need to move registries.
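
To confirm a mismatch before acting on a collation-version warning, the recorded and actual versions can be read from the catalog. A sketch: the helper names are hypothetical, the query relies on pg_database.datcollversion and pg_database_collation_actual_version (both available in PostgreSQL 15 and later), and the deploy user and heatwave database are taken from the matview command earlier in this runbook.

```shell
# Hypothetical helper: SQL that shows the collation version recorded for the
# current database next to the version the running image's libc provides.
collation_query() {
  echo "SELECT datname, datcollversion AS recorded, pg_database_collation_actual_version(oid) AS actual FROM pg_database WHERE datname = current_database();"
}

# Hypothetical helper: compare the two versions read from that query's output.
collation_state() {
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Usage (read the two columns from the psql output by eye):
#   kamal accessory exec postgres -d staging \
#     "psql -U deploy -d heatwave -c '$(collation_query)'"
#   collation_state 2.36 2.41
```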


Sourcemap upload fails with EACCES (DELETE_MAPS)


The asset_path host volume is owned by the host deploy user (uid 1001); a container running as a different uid can’t rewrite/clean the bridged assets. The container app user must be uid 1001 (Dockerfile USER 1001, and cloud-init pins the deploy user to uid 1001). If you see EACCES on the post-deploy sourcemap cleanup, the uids have drifted — check both.
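
Checking both sides for drift takes two id -u calls. A minimal sketch; uid_report is a hypothetical helper and the container name is whatever docker ps shows on the box.

```shell
# The uid that both the host deploy user and the container app user must share.
EXPECTED_UID=1001

# Hypothetical helper: report the first side whose uid has drifted.
#   $1 = host deploy user's uid, $2 = uid the container process runs as
uid_report() {
  local host_uid="$1" container_uid="$2"
  if [ "$host_uid" != "$EXPECTED_UID" ]; then
    echo "host deploy uid is $host_uid, expected $EXPECTED_UID"
  elif [ "$container_uid" != "$EXPECTED_UID" ]; then
    echo "container uid is $container_uid, expected $EXPECTED_UID"
  else
    echo "ok"
  fi
}

# Usage on the box:
#   uid_report "$(id -u deploy)" "$(docker exec <container> id -u)"
```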

MissingKeyError: ENV['staging'] (or ['production'])


Heatwave’s Heatwave::Configuration is keyed on ENV['<environment>'], so each destination needs a secret named staging (or production) whose value is the contents of config/master.key. The secret is defined in .kamal/secrets.staging / .kamal/secrets. Validate with kamal secrets print -d staging.

Unrelated to Kamal — that’s the AppSignal #3808 body-less Gemini 400 class. See the Sunny memory/notes, not this runbook.


  • Can’t SSH to the box. SSH is Tailscale-only (Latitude edge firewall allows :22 from 100.64.0.0/10 only). Confirm you’re on the tailnet (tailscale status) and using the Tailscale IP 100.123.47.52.
  • psql / mailpit UI unreachable. Same — those bind to 127.0.0.1 (psql) or the Tailscale IP (mailpit :8025). They are never publicly exposed.
  • A published container port is unexpectedly world-reachable. Docker bypasses UFW for published ports; the DOCKER-USER chain (cloud-init) is what blocks public :80/:443. Verify it loaded: sudo iptables -L DOCKER-USER -n.

  • Migration status + cutover gates — doc/tasks/202606022303_KAMAL_MIGRATION.md
  • DB-tier HA topology (Dallas primary / Chicago standby + HAProxy/pgbouncer) — doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md; live host/port reference — doc/infrastructure/INFRASTRUCTURE_INVENTORY.md
  • Webpack build speedup — doc/tasks/202606051345_WEBPACK_ASSET_BUILD_SPEEDUP.md
  • Errors / APM — the appsignal skill (AppSignal incidents, traces, logs)
  • Cloudflare zones/tokens/workers — doc/infrastructure/CLOUDFLARE.md