Kamal Stack — Troubleshooting Runbook

Symptom → cause → fix for the failure modes we’ve actually hit. Commands are written as kamal, short for mise exec -- bundle exec kamal; add -d staging for staging. Box access is over Tailscale (ssh deploy@100.123.47.52).

Logs & troubleshooting — developer quick reference

Two prerequisites for every command below:

Be on Tailscale. Every box is reachable only over the tailnet — no public SSH or ports. tailscale status should list dal-latitude-heatwave-01 (Dallas — prod + staging, 100.123.47.52) and chi-latitude-heatwave-02 (Chicago — standby, 100.68.157.49).
Prefix kamal with the toolchain + pick the env: mise exec -- bundle exec kamal <cmd> -d staging (or -d production).
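
Because every command needs the same prefix and destination flag, a small shell wrapper can save typing. This is a minimal sketch, not something the repo ships: the function name k and the KAMAL_DEST / KAMAL_DRY_RUN variables are hypothetical, and it assumes kamal accepts -d after the subcommand arguments.

```shell
# Hypothetical wrapper around the documented prefix.
# KAMAL_DEST (assumed name) picks the destination and defaults to staging.
# KAMAL_DRY_RUN=1 (assumed name) prints the command instead of running it.
k() {
  if [ "${KAMAL_DRY_RUN:-0}" = "1" ]; then
    echo "mise exec -- bundle exec kamal $* -d ${KAMAL_DEST:-staging}"
  else
    mise exec -- bundle exec kamal "$@" -d "${KAMAL_DEST:-staging}"
  fi
}
```

With this in your shell profile, k app details runs against staging and KAMAL_DEST=production k app details runs against production.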

Looking at logs — there are four layers

| Layer | Command | What you see |
| --- | --- | --- |
| App (web + sidekiq) | `kamal app logs -d staging -f` | Rails/Puma + Sidekiq stdout (live). `--roles=web` / `--roles=sidekiq` to split; `-n 200` for backlog; `--grep ERROR` to filter |
| Accessories (postgres / valkey ×3 / mailpit) | `kamal accessory logs postgres -d staging -f` (or `valkey_cache` / `valkey_sessions` / `valkey_queue`) | DB / cache / queue / mail-sink container logs |
| Edge connector (cloudflared, host systemd unit) | `ssh deploy@100.123.47.52 'journalctl -u cloudflared -n 50 -f'` | Tunnel up/down, ingress routing |
| Errors / APM | AppSignal apps `Heatwave/staging` & `Heatwave/production` | Exceptions, traces, slow requests (richer than tailing) |

Raw fallback on the box (when you need everything Docker sees): ssh deploy@100.123.47.52, then docker ps, docker logs <container> --tail 100 -f, docker stats. (Dallas/app 100.123.47.52, Chicago/standby 100.68.157.49.)

Troubleshooting flow

App or edge? Curl the proxy on the box, bypassing Cloudflare:
```
ssh deploy@100.123.47.52 'curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Host: www.warmlyyours.ws" -H "X-Forwarded-Proto: https" http://localhost/up'
```
200 → the app is fine; the problem is at the edge (cloudflared / Cloudflare Access / tunnel). Anything else → app / proxy (next steps).
What’s running? kamal app details -d staging (+ docker ps on the box) — healthy? crash-looping? which image version (kamal app versions -d staging)?
Get inside: kamal console -d staging (Rails console), kamal shell -d staging (bash), kamal dbc -d staging (psql), or one-offs: kamal app exec -d staging --reuse 'bin/rails runner "…"'.
Specific symptom? Jump to the matching section below (healthcheck timeouts, versions partition 500s, matviews empty after a restore, OAuth redirect, uid-1001 asset EACCES, …).

The rest of this doc is the symptom-by-symptom runbook; the everyday-command table and day-2 ops live in MANAGING.md, the deploy flow in DEPLOYING.md, the architecture in README.md.

First-look triage

flowchart TD
    s{"What's broken?"}
    s -->|"Site won't load at all"| edge{"curl the box's kamal-proxy<br/>directly — does /up 200?"}
    s -->|"Deploy fails"| dep{"Where did it stop?"}
    s -->|"App up but errors / 500s"| app{"What kind of error?"}
    s -->|"Can't reach over Tailscale"| net["Check tailnet: tailscale status,<br/>use the 100.x IP, edge firewall :22"]

    edge -->|"yes (app fine)"| cf["Edge problem → check cloudflared<br/>+ Cloudflare Access / tunnel"]
    edge -->|"no"| ct["App/proxy down → kamal app details,<br/>docker ps, accessory boot order"]

    dep -->|"before build"| sec["1Password / clean-tree gate"]
    dep -->|"build"| bld["build secrets / .dockerignore / deps"]
    dep -->|"boot / healthcheck"| up["/up + force_ssl + config.hosts"]

    app -->|"DB error"| db["partitions / matviews / restore gaps"]
    app -->|"login blocked"| oauth["OAuth redirect URI"]
    app -->|"EACCES on assets"| uid["uid 1001 mismatch"]

A fast way to split “app vs edge”: curl the proxy on the box, bypassing Cloudflare:

ssh deploy@100.123.47.52 \
  'curl -sS -o /dev/null -w "%{http_code}\n" -H "Host: www.warmlyyours.ws" \
     -H "X-Forwarded-Proto: https" http://localhost:80/up'

200 = app healthy, problem is at the edge. Anything else = app/proxy problem.
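
The 200 / non-200 split above can be written as a tiny helper for scripts or shell history. A minimal sketch; the function name classify_up_status is hypothetical, and the only rule it encodes is the one stated above.

```shell
# Maps the HTTP status printed by the direct-proxy curl to where to look next.
# Hypothetical helper name; the 200 vs. anything-else rule comes from this runbook.
classify_up_status() {
  case "$1" in
    200) echo "edge" ;;  # app healthy: check cloudflared / Cloudflare Access / tunnel
    *)   echo "app"  ;;  # app or proxy problem: kamal app details, docker ps
  esac
}
```

Pass it the output of the curl command above. Note that curl prints 000 for %{http_code} when it cannot connect at all, which this sketch also treats as an app/proxy problem.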

Edge / ingress

“Site can’t be reached” / connection refused

:3000 (or any non-standard port) in the URL. :3000 is Puma’s internal container port; Cloudflare only proxies standard web ports. The public URL has no port — https://www.warmlyyours.ws/en-US. The app never emits :3000 links itself. Fix: drop the port / clear the cached bookmark.
Tunnel down. cloudflared runs as a host systemd service, not a container:
```
ssh deploy@100.123.47.52 'systemctl status cloudflared; journalctl -u cloudflared -n 50'
```
Healthy = “active (running)” with QUIC connectors registered. Restart with sudo systemctl restart cloudflared.
Edge is up but app is down. If the direct-proxy curl above is non-200, it’s not the edge — go to Deploy / boot.

A 302 to warmlyyours.cloudflareaccess.com is expected — staging is gated by the wy-employees Access group. Log in with a @warmlyyours.com identity. If a legitimate user is denied, check the Access policy/group in the Cloudflare Zero Trust dashboard (or infra/terraform/cloudflare/).

Google login blocked (`redirect_uri_mismatch`)

New staging hostnames aren’t in the Google OAuth client’s authorized redirect URIs. The CRM login uses Devise’s google_oauth2 provider; the callback is /accounts/auth/google_oauth2/callback. Staging has its own OAuth client (114261933316-b7694…, project “WY API Project”) — not prod’s. Add:

https://crm.warmlyyours.ws/accounts/auth/google_oauth2/callback

to that client’s Authorized redirect URIs in Google Cloud Console. (No JS origins needed — it’s a server-side code flow.) If the consent screen itself blocks, the registrable domain warmlyyours.ws may need adding to the consent screen’s Authorized domains (External apps) — Internal/Workspace apps are exempt.

Deploy / boot

1Password: “couldn’t connect to the 1Password desktop app”

The desktop-app CLI integration is flaky. In order of reliability:

⌘Q the 1Password app fully (not just close the window), reopen, unlock. This clears a stuck CLI-integration helper and is the usual fix. Verify: op vault list. Confirm Settings → Developer → “Integrate with 1Password CLI” is ON.
Manual session in the deploy terminal: eval "$(op signin --account warmlyyours.1password.com)", then re-run.
Service-account token (most reliable — skips the desktop app entirely): save it to .kamal/.op-service-account-token (gitignored). See MANAGING.md → Secrets.

bin/deploy hard-gates on a real op read before the build, so a secret failure surfaces early with the exact op error.

“Working tree is not clean / out of sync”

The clean-tree gate refuses to ship code that matches no pushed commit (Kamal builds the working tree). Commit + push, or use --allow-dirty (throwaway staging test) / --in-worktree (clean master). See DEPLOYING.md.

Build failures

| Error | Cause → fix |
| --- | --- |
| `sidekiq-pro` 401 during `bundle install` | `BUNDLE_GEMS__CONTRIBSYS__COM` didn’t resolve. Validate `kamal secrets print -d staging`; check `op://IT/Sidekiq-Pro/credential`. |
| Registry push 401 | GHCR auth: `op://IT/GitHub-ghcr-deploy/credential` (needs `write:packages`). The token was refreshed 2026-06; re-check if it expired. |
| `db/structure.sql doesn't exist` | `.dockerignore` stripped a needed file; un-ignore it. |
| `vips: not found` / a CLI missing at runtime | Add the package to the final stage’s apt list in the `Dockerfile` (e.g. `libvips-tools`). |
| webpack rebuilds from scratch every deploy | Known cache-invalidation issue; see `doc/tasks/202606051345_WEBPACK_ASSET_BUILD_SPEEDUP.md` (cache mount + revision decouple; likely folded into the Vite migration). |

New container never goes healthy (`/up` times out, proxy gives up)

/up returns 403. config.hosts allow-listing or force_ssl is rejecting the internal health probe. /up must be exempt from both — confirm the exemption in config/environments/*.rb. (This bit us on the first staging deploy.)
Boot is just slow. Puma preload is ~20s; deploy_timeout is 90s. If a heavy boot legitimately needs more, raise deploy_timeout in config/deploy.yml.
DB unreachable at boot. The app can’t boot without its DB. In staging, check the Postgres accessory is up (docker ps). In prod the app connects through the HAProxy write-VIP (DATABASE_HOST=heatwave-haproxy:6433) → the live primary’s pgbouncer → Postgres, so check that chain: is HAProxy showing a healthy backend (http://100.123.47.52:8404/), is the primary’s pg-health returning 200, and is the Postgres accessory up (Dallas primary, or Chicago standby after a flip)? See INFRASTRUCTURE_INVENTORY.md for the routing layout.
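
To tell a slow boot from a broken one, it can help to time how long /up takes to return 200 and compare that with deploy_timeout. The following is a rough sketch under stated assumptions: wait_for_up is a hypothetical helper, and the probe is whatever command you pass in that prints an HTTP status (for example the direct-proxy curl from First-look triage).

```shell
# Polls a probe command once per second until it prints 200 or the limit is hit.
# $1 = limit in seconds; remaining arguments = command that prints an HTTP status.
# Hypothetical helper; elapsed counts sleep seconds only, so it is a lower bound.
wait_for_up() {
  limit_s="$1"
  shift
  elapsed=0
  while [ "$elapsed" -lt "$limit_s" ]; do
    if [ "$("$@" 2>/dev/null)" = "200" ]; then
      echo "healthy after ${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "not healthy within ${limit_s}s"
  return 1
}
```

If the reported time sits close to 90s, raising deploy_timeout is probably the right fix; if /up never turns 200, look at the 403 and DB causes instead.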

App crash-loops after a host reboot

Accessories and the app start independently, so the app may crash-loop until Postgres is ready. restart: unless-stopped retries it, so it usually self-heals. If not, boot the accessories first:

ssh deploy@100.123.47.52 docker ps                  # are postgres + the valkey trio up?
kamal accessory boot postgres        -d staging      # database first
kamal accessory boot valkey_cache    -d staging      # then each of the three valkey instances
kamal accessory boot valkey_sessions -d staging
kamal accessory boot valkey_queue    -d staging
kamal app boot -d staging                            # app last, once its dependencies are up
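
The same boot order can be wrapped in a function so it is harder to get wrong under pressure. A minimal sketch: boot_stack and the KAMAL override variable are hypothetical names, and the accessory list is just the four names used in this runbook.

```shell
# Boots the accessories in the order used above, then the app.
# KAMAL (assumed name) overrides the command, e.g.
#   KAMAL="mise exec -- bundle exec kamal"   for a real run
#   KAMAL="echo kamal"                       for a dry run
# It is expanded unquoted on purpose so a multi-word command splits into words.
boot_stack() {
  dest="${1:-staging}"
  for acc in postgres valkey_cache valkey_sessions valkey_queue; do
    ${KAMAL:-kamal} accessory boot "$acc" -d "$dest"
  done
  # The function's exit status is that of the final app boot.
  ${KAMAL:-kamal} app boot -d "$dest"
}
```

Running KAMAL="echo kamal" boot_stack staging prints the five commands without touching the box. The sketch does not stop when an accessory boot fails, so read the output before trusting the result.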

Sidekiq stuck quieted after a failed deploy

The pre-deploy step quiets Sidekiq (TSTP); on a failed deploy, bin/deploy un-quiets it automatically. If you ran bare kamal or the auto-resume failed, reboot the Sidekiq role:

kamal app boot --roles=sidekiq -d staging

Database

Quote builder returns “No matching controls” (but prod is fine)

The view_quote_bom_items matview is empty. After a restore, matviews get refreshed during the schema-only phase (on empty base tables) and, if not re-refreshed against loaded data, stay empty. An empty BOM matview surfaces as “No matching controls” (the elements error is overwritten by the controls one in heating_system_items.rb). Fix:

kamal accessory exec postgres -d staging \
  "psql -U deploy -d heatwave -c 'REFRESH MATERIALIZED VIEW public.view_quote_bom_items;'"

The restore scripts now refresh this critical matview eagerly + verified pre-swap, so a fresh restore can’t reproduce it (see MANAGING.md → Database restore).

`versions` write 500s: “no partition of relation versions found”

PaperTrail’s versions table (separate heatwave_versions DB, FDW-backed) is range-partitioned by year. db/versions_structure.sql historically shipped with no child partitions, so the first write after a fresh load 500s. Fix: create the annual partitions + versions_default, then the structure dump carries them (the pg_party schema_exclude_partitions = false fix, config/initializers/355_pg_party.rb, PR #1031). On a box that’s already broken:

-- one-off, in heatwave_versions: one partition per missing year...
CREATE TABLE versions_2026 PARTITION OF versions
  FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
-- ...plus the default partition (once, if it does not exist yet)
CREATE TABLE versions_default PARTITION OF versions DEFAULT;
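
When several years are missing, the statements can be generated instead of typed. A minimal sketch, assuming the versions_YYYY naming shown above; versions_partition_sql is a hypothetical helper that only prints SQL and runs nothing.

```shell
# Prints the one-off DDL for every year from $1 to $2 inclusive,
# followed by the default partition. Hypothetical helper; review before running.
versions_partition_sql() {
  year="$1"
  last="$2"
  while [ "$year" -le "$last" ]; do
    printf "CREATE TABLE versions_%s PARTITION OF versions\n  FOR VALUES FROM ('%s-01-01') TO ('%s-01-01');\n" \
      "$year" "$year" "$((year + 1))"
    year=$((year + 1))
  done
  echo "CREATE TABLE versions_default PARTITION OF versions DEFAULT;"
}
```

For example, versions_partition_sql 2024 2026 prints three yearly partitions and the default. Drop the last line if versions_default already exists, and note that Postgres rejects a new yearly partition if the default partition already holds rows from that year.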

Analytics dashboards empty after a restore

The ~22 analytics matviews (view_sales_facts, view_opportunities_facts, view_visits_*, …) are refreshed after the swap (deferred, non-blocking) and self-heal via the hourly MatviewRefreshWorker cron. If staging’s scheduler is running they fill in on their own; to force it:

kamal app exec -d staging --reuse \
  'bin/rails runner "MatviewRefreshWorker.new.perform"'

Postgres collation mismatch on accessory boot

The glibc collation version recorded at initdb (e.g. 2.41 vs 2.36) differs between the stock and custom images, so the pgdata volume must be initialized under the custom image (ghcr.io/warmlyyours/heatwave-postgres:18, built from docker/postgresql.Dockerfile). If you see a collation-version warning, either re-initdb the volume under that image or, if the data is fine, run ALTER DATABASE … REFRESH COLLATION VERSION.

Relocating the image between registries does not change collation (the bytes are identical). A mismatch only appears if the pgvector/pgvector:pg18 base was rebuilt with a newer glibc, so when you just need to move registries, re-tag an existing image rather than rebuilding.

App runtime

Sourcemap upload fails with EACCES (`DELETE_MAPS`)

The asset_path host volume is owned by the host deploy user (uid 1001); a container running as a different uid can’t rewrite/clean the bridged assets. The container app user must be uid 1001 (Dockerfile USER 1001, and cloud-init pins the deploy user to uid 1001). If you see EACCES on the post-deploy sourcemap cleanup, the uids have drifted — check both.
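
Checking both sides comes down to comparing two numbers. A minimal sketch; check_uid_drift is a hypothetical helper, and how you collect the two uids is left to you (for instance id -u over SSH on the host, and id -u inside kamal shell -d staging for the container).

```shell
# Compares the host deploy user's uid with the container app user's uid.
# $1 = host uid, $2 = container uid. Hypothetical helper name.
check_uid_drift() {
  host_uid="$1"
  container_uid="$2"
  if [ "$host_uid" = "$container_uid" ]; then
    echo "ok: both uid $host_uid"
  else
    echo "drift: host uid $host_uid, container uid $container_uid"
    return 1
  fi
}
```

Per the paragraph above, both values are expected to be 1001; anything else on either side explains the EACCES.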

`MissingKeyError: ENV['staging']` (or `['production']`)

Heatwave::Configuration is keyed on ENV['<environment>'], so each destination needs a secret named after its environment (staging or production) whose value is the contents of config/master.key (present in .kamal/secrets.staging / .kamal/secrets). Validate with kamal secrets print -d staging.

“Sunny is broken” / AI 400s

Unrelated to Kamal — that’s the AppSignal #3808 body-less Gemini 400 class. See the Sunny memory/notes, not this runbook.

Network / access

Can’t SSH to the box. SSH is Tailscale-only (Latitude edge firewall allows :22 from 100.64.0.0/10 only). Confirm you’re on the tailnet (tailscale status) and using the Tailscale IP 100.123.47.52.
psql / mailpit UI unreachable. Same cause: Postgres binds to 127.0.0.1 and the mailpit UI to the Tailscale IP (:8025), so neither is ever publicly exposed.
A published container port is unexpectedly world-reachable. Docker bypasses UFW for published ports; the DOCKER-USER chain (cloud-init) is what blocks public :80/:443. Verify it loaded: sudo iptables -L DOCKER-USER -n.
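
The DOCKER-USER check can be scripted by piping the iptables listing into a small filter. This is only a sketch: docker_user_blocks is a hypothetical helper, and it assumes the cloud-init rules use DROP or REJECT targets directly in that chain, which this runbook does not spell out.

```shell
# Reads `iptables -L DOCKER-USER -n` output on stdin and reports whether the
# chain contains any DROP or REJECT rule. Assumption: blocking rules use those
# targets directly rather than jumping to a custom chain.
docker_user_blocks() {
  if grep -Eq '^(DROP|REJECT)[[:space:]]'; then
    echo "DOCKER-USER has blocking rules"
  else
    echo "DOCKER-USER has no DROP/REJECT rules"
    return 1
  fi
}
```

On the box: sudo iptables -L DOCKER-USER -n | docker_user_blocks. A missing chain produces empty output on stdout, so it is reported the same way as a chain with no blocking rules.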

Escalation references

Migration status + cutover gates — doc/tasks/202606022303_KAMAL_MIGRATION.md
DB-tier HA topology (Dallas primary / Chicago standby + HAProxy/pgbouncer) — doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md; live host/port reference — doc/infrastructure/INFRASTRUCTURE_INVENTORY.md
Webpack build speedup — doc/tasks/202606051345_WEBPACK_ASSET_BUILD_SPEEDUP.md
Errors / APM — the appsignal skill (AppSignal incidents, traces, logs)
Cloudflare zones/tokens/workers — doc/infrastructure/CLOUDFLARE.md