Kamal Stack — Troubleshooting Runbook
Symptom → cause → fix for the failure modes we’ve actually hit. Commands assume
mise exec -- bundle exec kamal (abbreviated kamal), -d staging for staging.
Box access is over Tailscale (ssh deploy@100.123.47.52).
Logs & troubleshooting — developer quick reference
Two prerequisites for every command below:
- Be on Tailscale. Every box is reachable only over the tailnet; there is no public SSH
  and there are no public ports. tailscale status should list dal-latitude-heatwave-01
  (Dallas, prod + staging, 100.123.47.52) and chi-latitude-heatwave-02 (Chicago,
  standby, 100.68.157.49).
- Prefix kamal with the toolchain and pick the env:
  mise exec -- bundle exec kamal <cmd> -d staging (or -d production).
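The prefix-plus-destination convention above can be wrapped in a small shell helper. This is only a sketch: the function names kamal_cmd and k and the KAMAL_ENV variable are hypothetical, not part of the repo, and it assumes kamal accepts -d after the subcommand arguments.

```bash
# Print (without running) the full command that the "kamal" shorthand stands for.
# KAMAL_ENV is a hypothetical variable; it defaults to staging.
kamal_cmd() {
  local env="${KAMAL_ENV:-staging}"
  printf 'mise exec -- bundle exec kamal %s -d %s\n' "$*" "$env"
}

# Run the same command for real (hypothetical helper name).
k() {
  mise exec -- bundle exec kamal "$@" -d "${KAMAL_ENV:-staging}"
}
```

For example, kamal_cmd app logs -f shows what k app logs -f would execute, and KAMAL_ENV=production k app details targets production.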
Looking at logs — there are four layers
| Layer | Command | What you see |
|---|---|---|
| App (web + sidekiq) | kamal app logs -d staging -f | Rails/Puma + Sidekiq stdout (live). --roles=web / --roles=sidekiq to split; -n 200 for backlog; --grep ERROR to filter |
| Accessories (postgres / valkey ×3 / mailpit) | kamal accessory logs postgres -d staging -f (or valkey_cache / valkey_sessions / valkey_queue) | DB / cache / queue / mail-sink container logs |
| Edge connector (cloudflared, host systemd unit) | ssh deploy@100.123.47.52 'journalctl -u cloudflared -n 50 -f' | Tunnel up/down, ingress routing |
| Errors / APM | AppSignal apps Heatwave/staging & Heatwave/production | Exceptions, traces, slow requests — richer than tailing |
Raw fallback on the box (when you need everything Docker sees):
ssh deploy@100.123.47.52, then docker ps, docker logs <container> --tail 100 -f,
docker stats. (Dallas/app 100.123.47.52, Chicago/standby 100.68.157.49.)
Troubleshooting flow
- App or edge? Curl the proxy on the box, bypassing Cloudflare:

      ssh deploy@100.123.47.52 'curl -s -o /dev/null -w "%{http_code}\n" \
        -H "Host: www.warmlyyours.ws" -H "X-Forwarded-Proto: https" http://localhost/up'

  200 → the app is fine; the problem is at the edge (cloudflared / Cloudflare Access /
  tunnel). Anything else → app / proxy (next steps).
- What’s running? kamal app details -d staging (+ docker ps on the box): healthy?
  crash-looping? which image version (kamal app versions -d staging)?
- Get inside: kamal console -d staging (Rails console), kamal shell -d staging (bash),
  kamal dbc -d staging (psql), or one-offs:
  kamal app exec -d staging --reuse 'bin/rails runner "…"'.
- Specific symptom? Jump to the matching section below (healthcheck timeouts,
  versions partition 500s, matviews empty after a restore, OAuth redirect, uid-1001
  asset EACCES, …).
The rest of this doc is the symptom-by-symptom runbook; the everyday-command table and
day-2 ops live in MANAGING.md, the deploy flow in
DEPLOYING.md, the architecture in README.md.
First-look triage
    flowchart TD
      s{"What's broken?"}
      s -->|"Site won't load at all"| edge{"curl the box's kamal-proxy<br/>directly: does /up 200?"}
      s -->|"Deploy fails"| dep{"Where did it stop?"}
      s -->|"App up but errors / 500s"| app{"What kind of error?"}
      s -->|"Can't reach over Tailscale"| net["Check tailnet: tailscale status,<br/>use the 100.x IP, edge firewall :22"]

      edge -->|"yes (app fine)"| cf["Edge problem → check cloudflared<br/>+ Cloudflare Access / tunnel"]
      edge -->|"no"| ct["App/proxy down → kamal app details,<br/>docker ps, accessory boot order"]

      dep -->|"before build"| sec["1Password / clean-tree gate"]
      dep -->|"build"| bld["build secrets / .dockerignore / deps"]
      dep -->|"boot / healthcheck"| up["/up + force_ssl + config.hosts"]

      app -->|"DB error"| db["partitions / matviews / restore gaps"]
      app -->|"login blocked"| oauth["OAuth redirect URI"]
      app -->|"EACCES on assets"| uid["uid 1001 mismatch"]

A fast way to split “app vs edge”: curl the proxy on the box, bypassing Cloudflare:

    ssh deploy@100.123.47.52 \
      'curl -sS -o /dev/null -w "%{http_code}\n" -H "Host: www.warmlyyours.ws" \
      -H "X-Forwarded-Proto: https" http://localhost:80/up'

200 = app healthy, problem is at the edge. Anything else = app/proxy problem.
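The 200-versus-anything-else rule can be turned into a tiny classifier for scripts or on-call notes. A minimal sketch; the function name classify_up_status is hypothetical, and it expects the status code printed by the on-box curl.

```bash
# Map the /up status code from the on-box curl to the next place to look.
classify_up_status() {
  case "$1" in
    200) echo "edge" ;;  # app healthy: look at cloudflared / Cloudflare Access / tunnel
    *)   echo "app"  ;;  # anything else (including curl's 000): app or kamal-proxy
  esac
}
```

Feed it the output of the curl, for instance classify_up_status "$code" where code holds the captured status. Note that curl prints 000 when it receives no HTTP response at all, which this sketch treats as an app/proxy problem.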
Edge / ingress
“Site can’t be reached” / connection refused
- :3000 (or any non-standard port) in the URL. :3000 is Puma’s internal container
  port; Cloudflare only proxies standard web ports. The public URL has no port:
  https://www.warmlyyours.ws/en-US. The app never emits :3000 links itself. Fix: drop
  the port / clear the cached bookmark.
- Tunnel down. cloudflared runs as a host systemd service, not a container:

      ssh deploy@100.123.47.52 'systemctl status cloudflared; journalctl -u cloudflared -n 50'

  Healthy = “active (running)” with QUIC connectors registered. Restart with
  sudo systemctl restart cloudflared.
- Edge is up but app is down. If the direct-proxy curl above is non-200, it’s not the
  edge; go to Deploy / boot.
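For a scripted version of the tunnel check, systemctl is-active gives a one-word state that is easier to branch on than the full status output. A sketch only: the function name tunnel_state is hypothetical, and an "active" unit does not by itself prove that QUIC connectors registered, so the journal is still worth reading.

```bash
# Interpret the one-word output of `systemctl is-active cloudflared`.
tunnel_state() {
  case "$1" in
    active) echo "tunnel service running" ;;
    *)      echo "tunnel service not running ($1): restart cloudflared" ;;
  esac
}
```

Typical use would be tunnel_state "$(ssh deploy@100.123.47.52 'systemctl is-active cloudflared')".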
Stuck at the Cloudflare Access login
A 302 to warmlyyours.cloudflareaccess.com is expected: staging is gated by
the wy-employees Access group. Log in with a @warmlyyours.com identity. If a
legitimate user is denied, check the Access policy/group in the Cloudflare Zero
Trust dashboard (or infra/terraform/cloudflare/).
Google login blocked (redirect_uri_mismatch)
New staging hostnames aren’t in the Google OAuth client’s authorized redirect URIs.
The CRM login uses Devise’s google_oauth2 provider; the callback is
/accounts/auth/google_oauth2/callback. Staging has its own OAuth client
(114261933316-b7694…, project “WY API Project”), not prod’s. Add:

    https://crm.warmlyyours.ws/accounts/auth/google_oauth2/callback

to that client’s Authorized redirect URIs in Google Cloud Console. (No JS origins
needed; it’s a server-side code flow.) If the consent screen itself blocks, the
registrable domain warmlyyours.ws may need adding to the consent screen’s
Authorized domains (External apps); Internal/Workspace apps are exempt.
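When a new staging hostname is added, the redirect URI to register follows directly from the callback path above. A minimal sketch; the function name oauth_callback_url is hypothetical, and it assumes every CRM hostname uses HTTPS and the same Devise callback path.

```bash
# Build the Google OAuth redirect URI to register for a given hostname.
oauth_callback_url() {
  printf 'https://%s/accounts/auth/google_oauth2/callback\n' "$1"
}
```

oauth_callback_url crm.warmlyyours.ws prints the exact string to paste into the client’s Authorized redirect URIs.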
Deploy / boot
1Password: “couldn’t connect to the 1Password desktop app”
The desktop-app CLI integration is flaky. Fixes to try, in order (the last is the
most reliable):
- ⌘Q the 1Password app fully (not just close the window), reopen, unlock.
  This clears a stuck CLI-integration helper and is the usual fix. Verify:
  op vault list. Confirm Settings → Developer → “Integrate with 1Password CLI” is ON.
- Manual session in the deploy terminal:
  eval "$(op signin --account warmlyyours.1password.com)", then re-run.
- Service-account token (skips the desktop app entirely): save it to
  .kamal/.op-service-account-token (gitignored). See MANAGING.md → Secrets.
bin/deploy hard-gates on a real op read before the build, so a secret
failure surfaces early with the exact op error.
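To try the service-account route by hand in a terminal, the token can be exported from the file before calling op. This is a sketch under assumptions: the function name use_op_service_account is hypothetical, and whether bin/deploy loads the file this way is not confirmed here. OP_SERVICE_ACCOUNT_TOKEN is the variable the 1Password CLI reads for service accounts.

```bash
# Export the service-account token from its file so `op` skips the desktop app.
# Default path matches the runbook; pass another path as $1 to override.
use_op_service_account() {
  local token_file="${1:-.kamal/.op-service-account-token}"
  if [ -s "$token_file" ]; then
    OP_SERVICE_ACCOUNT_TOKEN="$(cat "$token_file")"
    export OP_SERVICE_ACCOUNT_TOKEN
    echo "using service-account token from $token_file"
  else
    echo "no token file at $token_file; fall back to the desktop app or op signin"
    return 1
  fi
}
```

After use_op_service_account succeeds, op vault list is a quick way to confirm the CLI authenticates.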
“Working tree is not clean / out of sync”
The clean-tree gate refuses to ship code that matches no pushed commit (Kamal builds
the working tree). Commit + push, or use --allow-dirty (throwaway staging test) /
--in-worktree (clean master). See DEPLOYING.md.
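To see ahead of time whether the gate would object, the two conditions (nothing uncommitted, HEAD present on a remote branch) can be checked from plain git output. This is a sketch of the idea, not the gate’s actual implementation in bin/deploy; the function name tree_gate is hypothetical.

```bash
# Decide whether the tree looks shippable from two pieces of git output:
#   $1 = output of `git status --porcelain`         (empty means clean)
#   $2 = output of `git branch -r --contains HEAD`  (non-empty means pushed)
tree_gate() {
  if [ -n "$1" ]; then echo "dirty"; return 1; fi
  if [ -z "$2" ]; then echo "unpushed"; return 1; fi
  echo "ok"
}
```

Run it as tree_gate "$(git status --porcelain)" "$(git branch -r --contains HEAD)" from the repo root; fetch first so the remote-tracking branches are current.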
Build failures
| Error | Cause → fix |
|---|---|
| sidekiq-pro 401 during bundle install | BUNDLE_GEMS__CONTRIBSYS__COM didn’t resolve. Validate with kamal secrets print -d staging; check op://IT/Sidekiq-Pro/credential. |
| Registry push 401 | GHCR auth: op://IT/GitHub-ghcr-deploy/credential (needs write:packages). The token was refreshed 2026-06; re-check if it expired. |
| db/structure.sql doesn't exist | .dockerignore stripped a needed file; un-ignore it. |
| vips: not found / a CLI missing at runtime | Add the package to the final stage’s apt list in the Dockerfile (e.g. libvips-tools). |
| webpack rebuilds from scratch every deploy | Known cache-invalidation issue; see doc/tasks/202606051345_WEBPACK_ASSET_BUILD_SPEEDUP.md (cache mount + revision decouple; likely folded into the Vite migration). |
New container never goes healthy (/up times out, proxy gives up)
- /up returns 403. config.hosts allow-listing or force_ssl is rejecting the internal
  health probe. /up must be exempt from both; confirm the exemption in
  config/environments/*.rb. (This bit us on the first staging deploy.)
- Boot is just slow. Puma preload is ~20s; deploy_timeout is 90s. If a heavy boot
  legitimately needs more, raise deploy_timeout in config/deploy.yml.
- DB unreachable at boot. The app can’t boot without its DB. In staging, check
  the Postgres accessory is up (docker ps). In prod the app connects through the
  HAProxy write-VIP (DATABASE_HOST=heatwave-haproxy:6433) → the live primary’s
  pgbouncer → Postgres, so check that chain: is HAProxy showing a healthy backend
  (http://100.123.47.52:8404/), is the primary’s pg-health returning 200, and is the
  Postgres accessory up (Dallas primary, or Chicago standby after a flip)? See
  INFRASTRUCTURE_INVENTORY.md for the routing layout.
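To tell a 403 from a slow boot, it can help to poll /up yourself for the same 90 seconds the proxy allows. A sketch meant to run on the box: the function names probe_up and wait_for_up are hypothetical, the 5-second interval is an arbitrary choice, and the default probe reuses the Host header from the triage curl.

```bash
# Default probe: print the HTTP status of /up via the local proxy.
probe_up() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H 'Host: www.warmlyyours.ws' -H 'X-Forwarded-Proto: https' \
    http://localhost/up
}

# Poll until 200, bail out early on 403, give up after the timeout.
# Args: timeout seconds (default 90), interval seconds (default 5), probe command.
wait_for_up() {
  local timeout="${1:-90}" interval="${2:-5}" probe="${3:-probe_up}"
  local waited=0 code=""
  while [ "$waited" -lt "$timeout" ]; do
    code="$("$probe")" || true
    case "$code" in
      200) echo "healthy after ${waited}s"; return 0 ;;
      403) echo "403: check the config.hosts / force_ssl exemption for /up"; return 2 ;;
    esac
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "no 200 within ${timeout}s (last code: ${code})"
  return 1
}
```

A return code of 2 points at the exemption bullet, while a timeout with code 000 or a 5xx suggests the slow-boot or DB-unreachable bullets.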
App crash-loops after a host reboot
Accessories and the app start independently, so the app may crash-loop until
Postgres is ready. restart: unless-stopped retries it, so it usually self-heals.
If not, boot the accessories first:
    ssh deploy@100.123.47.52 docker ps   # are postgres + the valkey trio up?
    kamal accessory boot postgres -d staging
    kamal accessory boot valkey_cache -d staging
    kamal accessory boot valkey_sessions -d staging
    kamal accessory boot valkey_queue -d staging
    kamal app boot -d staging

Sidekiq stuck quieted after a failed deploy
pre-deploy quiets Sidekiq (TSTP); on a failed deploy bin/deploy un-quiets it
automatically. If you ran bare kamal or the auto-resume failed:
    kamal app boot --roles=sidekiq -d staging

Database
Quote builder returns “No matching controls” (but prod is fine)
The view_quote_bom_items matview is empty. After a restore, matviews get
refreshed during the schema-only phase (on empty base tables) and, if not
re-refreshed against loaded data, stay empty. An empty BOM matview surfaces as
“No matching controls” (the elements error is overwritten by the controls one in
heating_system_items.rb). Fix:
    kamal accessory exec postgres -d staging \
      "psql -U deploy -d heatwave -c 'REFRESH MATERIALIZED VIEW public.view_quote_bom_items;'"

The restore scripts now refresh this critical matview eagerly and verify it before the swap, so a fresh restore can’t reproduce it (see MANAGING.md → Database restore).
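Before refreshing, it is worth confirming the matview really is empty. The sketch below only builds the SQL; the function name matview_sql is hypothetical, and it assumes the matview lives in the public schema as in the refresh command above.

```bash
# Build the SQL to count rows in, or refresh, one matview in the public schema.
matview_sql() {
  local action="$1" view="$2"
  case "$action" in
    count)   printf 'SELECT count(*) FROM public.%s;' "$view" ;;
    refresh) printf 'REFRESH MATERIALIZED VIEW public.%s;' "$view" ;;
    *)       echo "usage: matview_sql count|refresh <view>" >&2; return 1 ;;
  esac
}
```

The output can be passed to the same accessory command, for example kamal accessory exec postgres -d staging "psql -U deploy -d heatwave -Atc '$(matview_sql count view_quote_bom_items)'". A result of 0 matches the symptom described here.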
versions write 500s: “no partition of relation versions found”
PaperTrail’s versions table (separate heatwave_versions DB, FDW-backed) is
range-partitioned by year. db/versions_structure.sql historically shipped with
no child partitions, so the first write after a fresh load 500s. Fix: create the
annual partitions + versions_default, then the structure dump carries them (the
pg_party schema_exclude_partitions = false fix, config/initializers/355_pg_party.rb,
PR #1031). On a box that’s already broken:
    -- one-off, per missing year, in heatwave_versions:
    CREATE TABLE versions_2026 PARTITION OF versions
      FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
    CREATE TABLE versions_default PARTITION OF versions DEFAULT;

Analytics dashboards empty after a restore
The ~22 analytics matviews (view_sales_facts, view_opportunities_facts,
view_visits_*, …) are refreshed after the swap (deferred, non-blocking) and
self-heal via the hourly MatviewRefreshWorker cron. If staging’s scheduler is
running they fill in on their own; to force it:
    kamal app exec -d staging --reuse \
      'bin/rails runner "MatviewRefreshWorker.new.perform"'

Postgres collation mismatch on accessory boot
The glibc collation version recorded at initdb (e.g. 2.41 vs 2.36) differs between the stock and custom images.
The pgdata volume must be initialized under the custom image
(ghcr.io/warmlyyours/heatwave-postgres:18, built from docker/postgresql.Dockerfile).
If you see a collation-version warning, re-initdb the volume under that image — or,
if the data is fine, ALTER DATABASE … REFRESH COLLATION VERSION. Relocating the
image between registries does not change collation (the bytes are identical); a
mismatch only appears if the pgvector/pgvector:pg18 base was rebuilt with a newer
glibc, so prefer re-tagging an existing image over rebuilding when you just need to
move registries.
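To confirm a mismatch before choosing between re-initdb and a refresh, compare the version stored in the catalog with the one the running server reports. A sketch: COLLATION_SQL and collation_drift are hypothetical names; the query relies on pg_database.datcollversion and pg_database_collation_actual_version(), both available in PostgreSQL 15 and later.

```bash
# Recorded vs. actual collation version for the current database (PostgreSQL 15+).
COLLATION_SQL="SELECT datcollversion, pg_database_collation_actual_version(oid) FROM pg_database WHERE datname = current_database();"

# Compare the two values, e.g. the two fields of `psql -At` output split on '|'.
collation_drift() {
  local recorded="$1" actual="$2"
  if [ "$recorded" = "$actual" ]; then
    echo "collation versions match ($actual)"
  else
    echo "collation mismatch: recorded $recorded, actual $actual"
    return 1
  fi
}
```

With psql -Atc "$COLLATION_SQL" returning something like 2.36|2.41, calling collation_drift 2.36 2.41 reports the drift. If you go the refresh route, the PostgreSQL documentation recommends rebuilding objects that depend on the collation (for example with REINDEX) rather than refreshing the version alone.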
App runtime
Sourcemap upload fails with EACCES (DELETE_MAPS)
The asset_path host volume is owned by the host deploy user (uid 1001); a
container running as a different uid can’t rewrite/clean the bridged assets. The
container app user must be uid 1001 (Dockerfile USER 1001, and cloud-init
pins the deploy user to uid 1001). If you see EACCES on the post-deploy sourcemap
cleanup, the uids have drifted — check both.
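Checking both uids comes down to two id -u calls and a comparison. A minimal sketch; the function name uids_match is hypothetical, and the container name is left for you to fill in from docker ps.

```bash
# Succeed only if both the host deploy uid and the container uid equal the
# expected value (1001 unless a third argument overrides it).
uids_match() {
  local host_uid="$1" container_uid="$2" expected="${3:-1001}"
  [ "$host_uid" = "$expected" ] && [ "$container_uid" = "$expected" ]
}
```

On the box, id -u as deploy gives the first value and docker exec <container> id -u gives the second; uids_match "$host_uid" "$container_uid" || echo "uid drift" flags the mismatch.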
MissingKeyError: ENV['staging'] (or ['production'])
Heatwave’s Heatwave::Configuration is keyed on ENV['<environment>'], so each
destination needs a secret named after its environment (staging / production)
whose value is the output of cat config/master.key (present in
.kamal/secrets.staging / .kamal/secrets). Validate with
kamal secrets print -d staging.
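Inside a running container (for example via kamal shell -d staging), a quick check that the environment-named variable actually arrived can save a round trip. A sketch using bash indirect expansion; the function name require_env_key is hypothetical, and it deliberately never prints the key itself.

```bash
# Fail if the variable whose name is given in $1 is unset or empty.
require_env_key() {
  local name="$1"
  if [ -z "${!name:-}" ]; then
    echo "missing: ENV['$name'] is not set" >&2
    return 1
  fi
  echo "ok: ENV['$name'] is set"
}
```

require_env_key staging succeeds only when a non-empty variable literally named staging is present in the environment.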
“Sunny is broken” / AI 400s
Unrelated to Kamal: that’s the AppSignal #3808 body-less Gemini 400 class. See the Sunny memory/notes, not this runbook.
Network / access
- Can’t SSH to the box. SSH is Tailscale-only (Latitude edge firewall allows :22 from
  100.64.0.0/10 only). Confirm you’re on the tailnet (tailscale status) and using the
  Tailscale IP 100.123.47.52.
- psql / mailpit UI unreachable. Same cause: those bind to 127.0.0.1 (psql) or the
  Tailscale IP (mailpit :8025). They are never publicly exposed.
- A published container port is unexpectedly world-reachable. Docker bypasses UFW
  for published ports; the DOCKER-USER chain (cloud-init) is what blocks public
  :80/:443. Verify it loaded: sudo iptables -L DOCKER-USER -n.
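A common cause of the SSH failure is using an address outside the range the firewall allows. The check below is a sketch (the function name in_tailscale_range is hypothetical) that tests whether an IPv4 address falls inside 100.64.0.0/10, which spans 100.64.0.0 through 100.127.255.255.

```bash
# Succeed when the IPv4 address in $1 is inside 100.64.0.0/10.
# A /10 fixes the first octet (100) and the top two bits of the second,
# so the second octet must be between 64 and 127.
in_tailscale_range() {
  local o1 o2 rest
  IFS=. read -r o1 o2 rest <<<"$1"
  [ "$o1" = "100" ] && [ "$o2" -ge 64 ] 2>/dev/null && [ "$o2" -le 127 ]
}
```

Both box addresses in this runbook pass: in_tailscale_range 100.123.47.52 and in_tailscale_range 100.68.157.49 succeed, while a LAN or public address fails.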
Escalation references
- Migration status + cutover gates: doc/tasks/202606022303_KAMAL_MIGRATION.md
- DB-tier HA topology (Dallas primary / Chicago standby + HAProxy/pgbouncer):
  doc/tasks/202606112045_DB_TIER_HA_ARCHITECTURE.md; live host/port reference:
  doc/infrastructure/INFRASTRUCTURE_INVENTORY.md
- Webpack build speedup: doc/tasks/202606051345_WEBPACK_ASSET_BUILD_SPEEDUP.md
- Errors / APM: the appsignal skill (AppSignal incidents, traces, logs)
- Cloudflare zones/tokens/workers: doc/infrastructure/CLOUDFLARE.md