Infrastructure & Monitoring Inventory
The single-page answer to “what runs where, on which IP and port, what protects
it, and what URL do I use to reach it.” This is a hand-maintained reference;
the source of truth is the config it summarizes (config/deploy.yml,
config/deploy.staging.yml, config/netdata/, config/haproxy/,
infra/terraform/). Keep it in sync when those change.
Last verified: 2026-06-15 (PG18, prod-on-Kamal, Dallas-primary / Chicago-standby).
Fleet — physical hosts (where)
Two Latitude.sh bare-metal boxes (f4.metal.medium: AMD EPYC 4564P 16c/32t,
192 GB RAM, 2×480 GB NVMe RAID1 + 2×2 TB NVMe, 2×10 GbE). The Kamal config
addresses them exclusively by tailnet IP; the public IPs are physical-host
facts (used for SSH-over-tailnet only — there is no public SSH).
| Host | DC | Public IP | Tailnet IP | Role today |
|---|---|---|---|---|
| dal-latitude-heatwave-01 | Dallas | 67.213.118.15 | 100.123.47.52 | Prod app + DB primary, and the entire staging stack (co-located) |
| chi-latitude-heatwave-02 | Chicago | 186.233.186.45 | 100.68.157.49 | Prod DB standby + Databasus PITR backups (no app today) |
At a planned Chicago cutover (“W3”) the app role flips to chi-02 via
pg_promote + HAProxy reroute — see Failover & maintenance tooling below.
Today, all production web/Sidekiq traffic is served from Dallas.

                Internet
                    │ (HTTPS)
            ┌───────▼────────┐
            │   Cloudflare   │  edge TLS, www-edge Worker, WAF
            └───────┬────────┘
                    │ outbound-only QUIC tunnel (cloudflared, host systemd)
    ┌───────────▼────────────── dal-01 (Dallas, 100.123.47.52) ───────────────┐
    │ kamal-proxy :80 ─▶ web / sidekiq (Puma, Thruster)                       │
    │                      │                                                  │
    │                      ▼ DATABASE_HOST=heatwave-haproxy:6433              │
    │ HAProxy :6433 (TCP write-VIP) ─▶ pgbouncer :6432 ─▶ Postgres :5432      │
    │    ▲ httpchk pg-health :8008                        (PG18 PRIMARY)      │
    │ Valkey ×3 (cache / sessions / queue) · Playwright · SFTPGo · Netdata    │
    └───────────────────────────────┬─────────────────────────────────────────┘
                                    │ streaming replication (slot chicago_standby)
    ┌───────────────────────────────▼──── chi-02 (Chicago, 100.68.157.49) ────┐
    │ Postgres :5432 (PG18 STANDBY) ◀ pgbouncer :6432 (RO VIP heatwave-db-ro) │
    │ pg-health :8008 · Netdata · Databasus agent (PITR → R2) + UI :4005      │
    └─────────────────────────────────────────────────────────────────────────┘

Administrative services — URLs to use (how to access)
Get on the tailnet first (tailscale up; the boxes only accept SSH and
admin UIs from 100.64.0.0/10). Admin logins live in 1Password vault IT.
| Service | URL | Reach / gating | What it shows |
|---|---|---|---|
| Netdata — prod (Dallas) | http://100.123.47.52:19999 | tailnet only | Per-second host + every container + Postgres (primary) + Valkey ×3 + pgbouncer + HAProxy + systemd units |
| Netdata — standby (Chicago) | http://100.68.157.49:19999 | tailnet only | Standby host + Postgres replica (recovery state / replication lag) + RO pgbouncer |
| Netdata — staging | ssh -L 19999:localhost:19999 deploy@100.123.47.52 → http://localhost:19999 | localhost (SSH-forward) | Staging stack (prod owns the tailnet :19999 on the shared box) |
| PgHero | https://crm.warmlyyours.com/pghero | Rails admin login (crm.* host, is_admin?) | Slow queries (pg_stat_statements), table/index bloat, live queries, vacuum, index suggestions (hypopg) |
| Sidekiq Web UI | https://crm.warmlyyours.com/sidekiq | Rails admin login | Background-job queues, retries, scheduled set |
| Rails Event Store browser | https://crm.warmlyyours.com/res | Rails admin login | The RES domain-event stream |
| HAProxy stats | http://100.123.47.52:8404/ | tailnet only | DB write-VIP backend health (which node is “up”) |
| SFTPGo admin | http://100.123.47.52:8080 | tailnet only | SFTP users/sessions (call-records + PBX backups → R2) |
| Databasus (PITR controller) | http://100.68.157.49:4005 | tailnet only | Configure/monitor backup sources; trigger PITR restores. Login op://IT/Databasus (Postgres Backup) |
| Mailpit — staging | http://100.123.47.52:8025 | tailnet only | Captured staging outbound mail |
| Mailpit — dev | http://localhost:8025 | local docker-compose | Captured dev outbound mail |
| AppSignal | https://appsignal.com/warmlyyours | SaaS login | APM, exceptions, traces (apps Heatwave/production + Heatwave/staging) |
| HyperDX | — | planned, not deployed | Future app traces/logs/errors (ClickHouse; will replace AppSignal) |
Service topology (what runs where) — production
All ports below are bound to the tailnet IP unless noted. App→service wiring
goes over the internal kamal Docker network by DNS name, not these host ports
(host ports are for operators). Image tags here are indicative —
config/deploy.yml is authoritative for exact tags/digests (e.g. the haproxy
@sha256 pin).
Dallas — dal-latitude-heatwave-01 (100.123.47.52)
| Service (container) | Port (host→ctr) | Image | Purpose |
|---|---|---|---|
| kamal-proxy | (CF tunnel → :80) | kamal-proxy | Rolling-deploy reverse proxy; health /up; routes crm/www/api/scan/mcp.warmlyyours.com |
| web / sidekiq | none published | ghcr.io/warmlyyours/heatwave | Rails (Puma via Thruster :80→:3000); one consolidated Sidekiq |
| heatwave-postgres (primary) | 5432 | ghcr.io/warmlyyours/heatwave-postgres:18-noble | PG18.4 RW primary; data on ZFS tank/prod-replica |
| heatwave-pgbouncer (RW) | 6432 | …/heatwave-pgbouncer:1.25.2 | Session-mode pooler in front of the primary |
| heatwave-haproxy (write VIP) | 6433 (+ 8404 stats) | haproxy:3.0-alpine | TCP failover router (the app’s DATABASE_HOST); httpchk on each node’s pg-health keeps only the live primary “up” |
| heatwave-pg-health | 8008 | …/heatwave-pg-health:v3 | HTTP leader probe (200 primary / 503 standby) for HAProxy |
| heatwave-valkey-cache | none (kamal-net) | valkey/valkey:9.1 | Logical DBs 1,2,4,5 · allkeys-lru · no persistence |
| heatwave-valkey-sessions | none | valkey/valkey:9.1 | Logical DB 0 · noeviction · no persistence |
| heatwave-valkey-queue | none | valkey/valkey:9.1 | Logical DB 3 (Sidekiq) · noeviction + AOF/RDB (durable) |
| heatwave-sftp (SFTPGo) | 2222→2022 (public, PBX-only); 8080 UI (tailnet) | drakkan/sftpgo:v2.6.6 | Switchvox call-records + PBX-backup drop → Cloudflare R2 |
| heatwave-playwright | none (:3000) | mcr.microsoft.com/playwright:v1.60.0-noble | Headless browser for server-side PDF/email/upload flows |
| heatwave-netdata | 19999 | netdata/netdata:v2.10.3 | Per-second observability (host + all of the above) |
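
The logical-DB split across the three Valkey containers in the table above can be written down as a lookup. This is a minimal sketch, not the actual RedisConfig code: the function name, the `redis://` URL shape, and the choice to keep the logical DB number in the URL path are assumptions made for illustration. Only the container names, DB numbers, and port 6379 come from this page.

```python
# Logical DB -> Valkey container, as listed in the Dallas service table.
# Names below (VALKEY_BY_DB, valkey_url) are hypothetical, not the RedisConfig API.
VALKEY_BY_DB = {
    0: "heatwave-valkey-sessions",
    1: "heatwave-valkey-cache",
    2: "heatwave-valkey-cache",
    3: "heatwave-valkey-queue",
    4: "heatwave-valkey-cache",
    5: "heatwave-valkey-cache",
}
VALKEY_PORT = 6379  # kamal-net only, not host-published


def valkey_url(logical_db: int) -> str:
    """Return a redis:// URL for the container that owns this logical DB."""
    try:
        host = VALKEY_BY_DB[logical_db]
    except KeyError:
        raise ValueError(f"no Valkey flavor owns logical DB {logical_db}") from None
    # Assumption: the logical DB number is kept on the target container.
    return f"redis://{host}:{VALKEY_PORT}/{logical_db}"
```

Failing loudly on an unmapped DB number is a deliberate choice in this sketch: a silent default would send data to a container with the wrong eviction or persistence policy.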
Chicago — chi-latitude-heatwave-02 (100.68.157.49)
| Service (container) | Port | Image | Purpose |
|---|---|---|---|
| heatwave-postgres-replica (standby) | 5432 | …/heatwave-postgres:18-noble | PG18 streaming standby (slot chicago_standby); read-offload target; promoted at flip |
| heatwave-pgbouncer-replica (RO) | 6432 | …/heatwave-pgbouncer:1.25.2 | RO pooler behind the heatwave-db-ro tailnet VIP |
| heatwave-pg-health-replica | 8008 | …/heatwave-pg-health:v3 | Standby recovery-state probe for HAProxy |
| heatwave-netdata-replica | 19999 | netdata/netdata:v2.10.3 | Standby host + replica lag |
| Databasus controller | 4005 (UI) | databasus/databasus:v3.46.0 (official) | Agentless PITR backups (controller-driven pg_basebackup + pg_receivewal) → R2 |
Container images come from GHCR (ghcr.io/warmlyyours/…); registry auth is
per-developer via the gh CLI (no shared PAT). Secrets resolve through Kamal’s
1Password adapter (vault IT) — only names are referenced in config, never
values.
Port-exposure map (the crux)
| Port | Bind | Service | Host | Notes |
|---|---|---|---|---|
| 80/443 | CF tunnel only | web (kamal-proxy) | Dallas | No inbound web port open; DOCKER-USER drops public 80/443 |
| 2222 | public → PBX IP only | SFTPGo SSH | Dallas | Firewalled to 144.202.57.170 (Switchvox) via edge fw + DOCKER-USER --ctorigdstport |
| 22 | tailnet | SSH | both | Latitude edge fw + UFW allow only 100.64.0.0/10 |
| 5432 | tailnet | Postgres | both | primary (Dallas) / standby (Chicago) |
| 6432 | tailnet | pgbouncer | both | RW (Dallas) / RO (Chicago) |
| 6433 | tailnet | HAProxy write-VIP | Dallas | app DB target |
| 8008 | tailnet | pg-health | both | HAProxy httpchk |
| 8404 | tailnet | HAProxy stats / /metrics | Dallas | stats UI + netdata scrape |
| 8080 | tailnet | SFTPGo admin UI | Dallas | |
| 19999 | tailnet | Netdata | both | |
| 4005 | tailnet | Databasus UI | Chicago | |
| 8025 | tailnet | Mailpit UI (staging) | Dallas | SMTP 1025 stays kamal-net-internal |
| 6379 / 3000 | kamal-net only | Valkey ×3 / Playwright | Dallas | not host-published |
Data tier (DB read/write path)
- Write path: app DATABASE_HOST=heatwave-haproxy, DATABASE_PORT=6433 → HAProxy (TCP passthrough) → the live primary’s pgbouncer :6432 (session-mode, because the app uses session advisory locks) → Postgres :5432. Both heatwave and heatwave_versions share one FDW-linked cluster.
- Routing decision: HAProxy httpchks each node’s pg-health :8008; only the node answering 200 "primary" is “up”, so a pg_promote re-routes the write path with no app redeploy.
- Read offload: the heatwave-db-ro Tailscale VIP follows the standby’s pgbouncer for read-only consumers.
- Cache/queue: RedisConfig (config/initializers/100_redis_config.rb) routes by logical DB to heatwave-valkey-{cache,sessions,queue}:6379. See Valkey 3-Flavor Split and DB Tier HA Architecture.
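
The routing decision above reduces to "exactly one node answers 200". The sketch below models that rule in Python so the before/after of a flip is easy to see. It is a simplified illustration, not HAProxy's own logic: the function name and node labels are hypothetical, and returning no target when two nodes both claim to be primary is a choice made for this model only.

```python
from typing import Mapping, Optional


def pick_write_backend(probes: Mapping[str, int]) -> Optional[str]:
    """Pick the node that would receive writes in this simplified model.

    `probes` maps a node name to the HTTP status its pg-health probe
    returned (200 = primary, 503 = standby, per the service tables).
    Returns None when zero or several nodes report primary, because
    neither case has a single safe write target in this model.
    """
    primaries = [node for node, status in probes.items() if status == 200]
    return primaries[0] if len(primaries) == 1 else None


# Before a flip Dallas answers 200; after pg_promote Chicago does.
before = {"dal-01": 200, "chi-02": 503}
after = {"dal-01": 503, "chi-02": 200}
print(pick_write_backend(before))  # dal-01
print(pick_write_backend(after))   # chi-02
```

The app keeps pointing at the same host and port throughout; only the probe answers change, which is why no redeploy is needed.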
Failover & maintenance tooling
- bin/recovery <env> {flip-db,rebuild-standby,topology}: promote the standby + reroute the write VIP (cross-DC), rebuild a wiped node from a fresh basebackup, or print the current topology. Snapshots the demoted dataset (zfs snapshot) before wiping, as a rollback net.
- bin/maintenance {up,down} <env>: full maintenance window (proxy 503 → stop web+sidekiq → on prod also stop the Chicago databasus controller container), and the reverse.

See the PG18 Failover Runbook and HAProxy Routing Layer.
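
The ordering of a maintenance window, as described for bin/maintenance, can be sketched as a small function. This only models the sequence stated on this page. The step labels, the function name, and the environment name "production" are assumptions; the script's real internals are not shown here.

```python
def maintenance_steps(direction: str, env: str) -> list[str]:
    """Ordered steps for a maintenance window, per the description above.

    Step labels are illustrative placeholders, not the script's commands.
    """
    down = ["proxy returns 503", "stop web + sidekiq"]
    if env == "production":  # assumed name for "prod"
        down.append("stop Chicago databasus controller")
    if direction == "down":
        return down
    if direction == "up":
        # "up" is described as the reverse of "down".
        return [f"undo: {step}" for step in reversed(down)]
    raise ValueError(f"direction must be 'up' or 'down', got {direction!r}")
```

For example, `maintenance_steps("up", "production")` starts by undoing the databasus stop and ends by lifting the proxy 503, so traffic returns only after the app is running again.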
Backups & disaster recovery
- Databasus PITR (backup-of-record): the agentless databasus controller container on Chicago streams an encrypted physical base + WAL to Cloudflare R2 bucket heatwave-postgres-backups-production (ENAM region, off-Latitude for DR). Restore to any chosen second via the controller UI (:4005) or bin/restore’s physical/PITR option (wraps config/databasus/databasus-recovery.sh). AES-256-GCM key at /data/databasus-data/secret.key.
- Databasus is itself disaster-recoverable (config-as-code): the controller keeps its entire config (admin / workspace / R2 storage / sources / schedules / restore user) only in an embedded Postgres metadata DB on the host, so a host wipe loses it (it did, 2026-06-15) and only secret.key is in 1Password. script/setup_databasus.sh rebuilds the whole thing idempotently in one command; a nightly encrypted metadata snapshot (databasus-metadata-backup.timer → R2 databasus-metadata/) is the turnkey alternative. See config/databasus/README.md.
- heatwave_versions archive: bin/versions-partitions ships completed annual partitions (>5 yr) to R2 heatwave-versions-archive-production (cold).
- bin/restore: pulls a recent logical pg_dump from Databasus→R2 to seed a dev database (this is the dev path, not DR).

See the DR restore runbook and Databasus PITR.
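
The ">5 yr" rule for archiving heatwave_versions partitions can be illustrated with a short date calculation. This is a sketch under stated assumptions, not what bin/versions-partitions does: it assumes the partition for year Y is complete on 1 January of Y+1 and becomes eligible once that date is more than five years in the past. The function name and signature are hypothetical.

```python
from datetime import date


def archivable_years(partition_years: list[int], today: date, min_age_years: int = 5) -> list[int]:
    """Years whose annual partition was completed more than `min_age_years` ago.

    Assumption: the partition for year Y is complete on Jan 1 of Y+1, so it
    is eligible once Jan 1 of (Y + 1 + min_age_years) is strictly in the past.
    """
    return [
        year
        for year in sorted(partition_years)
        if date(year + 1 + min_age_years, 1, 1) < today
    ]
```

With `today` set to 2026-06-15, the 2020 partition qualifies (completed 2021-01-01) while the 2021 partition does not.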
Cloudflare R2 buckets (all off-Latitude; bucket-scoped S3 tokens in 1Password vault IT)
| Bucket | Region | Purpose | Producer |
|---|---|---|---|
| heatwave-postgres-backups-production | ENAM | Databasus PITR (physical base + WAL). databasus-metadata/ prefix = the encrypted controller-config snapshots | Databasus agent + backup-metadata.sh |
| heatwave-versions-archive-production | ENAM | Cold annual heatwave_versions partitions (>5 yr) | bin/versions-partitions |
| heatwave-frontend-assets-production | ENAM | Content-hashed webpack assets, served same-origin by the www-edge Worker (survives Kamal deploys) | webpack build / deploy |
| heatwave-call-recordings-production | — | Switchvox call recordings (WarmlyYours/ prefix); Sidekiq imports from here | SFTPGo pbx user |
| heatwave-pbx-backups-production | ENAM | Switchvox PBX system backups (bucket root) | SFTPGo pbx-backup user |
R2 bucket location is pinned on first creation of a name: delete+recreate reuses the original location (--location enam only applies to a fresh name). Tokens are minted via script/setup_r2_* and stored in 1Password.
Edge & network protection (what protects it)
- Ingress: Cloudflare → outbound-only cloudflared QUIC tunnel (host systemd) → kamal-proxy :80. No inbound web port is open on any host. TLS terminates at Cloudflare (proxy.ssl: false); the www-edge Worker fronts www/apex for locale redirects, R2 webpack-asset serving, and cache rules.
- Firewall, three layers:
  - Latitude edge firewall: allow SSH 22 from 100.64.0.0/10 (Tailscale) and SFTP 2222 from 144.202.57.170 (PBX); deny the rest.
  - Host UFW: default deny incoming; allow lo, tailscale0, 22/tcp, and 2222 from the PBX IP.
  - DOCKER-USER iptables (/usr/local/sbin/docker-user-fw.sh): drops public 80/443, allows the tailnet, and handles the SFTP DNAT gotcha (match conntrack --ctorigdstport 2222, not --dport, because Docker DNATs 2222→2022 before the rule sees it).
- Cloudflare Access: staging only. crm/www/api/mcp.warmlyyours.ws are gated to the wy-employees group (24 h session). Production has no CF Access by design (public site; Rails handles its own auth). The planned docs.warmlyyours.dev portal will be gated to @warmlyyours.com.
- Tailnet: all operator surfaces (SSH, Netdata, HAProxy stats, SFTPGo UI, Databasus, Mailpit, psql) are Tailscale-only. Stable VIPs heatwave-db (→primary) and heatwave-db-ro (→standby) survive CHI↔DAL flips. The Tailscale ACL is Terraform-managed (infra/terraform/tailscale/).
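
The outermost layer's allow rules listed above are simple enough to model directly. The sketch below is an illustration of those rules using Python's standard ipaddress module, not the firewall's actual configuration. The function and constant names are hypothetical; the CIDR, PBX address, and ports come from this page.

```python
import ipaddress

TAILNET = ipaddress.ip_network("100.64.0.0/10")  # Tailscale address range
PBX = ipaddress.ip_address("144.202.57.170")     # Switchvox


def edge_allows(src: str, dst_port: int) -> bool:
    """Model of the Latitude edge firewall rules listed above."""
    ip = ipaddress.ip_address(src)
    if dst_port == 22:
        return ip in TAILNET
    if dst_port == 2222:
        return ip == PBX
    return False  # deny the rest


print(edge_allows("100.123.47.52", 22))      # True: tailnet address
print(edge_allows("67.213.118.15", 22))      # False: public address
print(edge_allows("144.202.57.170", 2222))   # True: PBX
```

A table-driven check like this can be handy when reviewing a firewall change, since each documented rule maps to one branch.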
Infrastructure as code (Terraform Cloud)
All infra is OpenTofu/Terraform under infra/terraform/, applied by HCP Terraform
(org warmlyyours), VCS-driven from this repo (auto-apply off — plans are reviewed,
applies are manual).
| Workspace | Module | Manages |
|---|---|---|
| heatwave-latitude-production | latitude/ | The Latitude bare-metal server + cloud-init + per-host edge firewall |
| heatwave-host-config | host-config/ | Re-runs the idempotent provision-host.sh over SSH (postfix relay, pg-maintenance/logwatch timers, ZED); no reinstall |
| heatwave-tailscale | tailscale/ | Tailnet ACL + the heatwave-db / heatwave-db-ro VIP services |
| heatwave-cloudflare-zone-{production,staging} | cloudflare-zone-*/ | CF zone rulesets (WAF, cache, transforms); this is the IaC source of truth, and the dashboard is read-only |

- TFC agent: a hashicorp/tfc-agent container on Chicago (--network host, pool heatwave-tailnet) runs the agent-mode workspaces’ SSH provisioners, the one thing a hosted runner can’t do to a tailnet-only host. Dials out to app.terraform.io only. It is re-established by re-running the container (token op://IT/TFC-agent-token (heatwave-tailnet)).
- ⚠️ Latitude reinstall landmine (now guarded): a user_data change on latitudesh_server.host triggers a full server reinstall (all host data lost). On 2026-06-15 an approved plan whose only effective diff was a provision-host.sh edit reinstalled the live Chicago standby, wiping the PG standby + the Databasus agent and killing the in-flight TFC agent (it runs on Chicago). latitude/main.tf now carries lifecycle { ignore_changes = [user_data, billing] }, so editing cloud-init / provision-host.sh is inert w.r.t. a running box (day-2 host config flows through host-config over SSH). A genuine rebuild is now explicit: tofu apply -replace=latitudesh_server.host. Full record: doc/tasks/202606151240_CHICAGO_REINSTALL_DR_RECOVERY.md.
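
As a complement to the lifecycle guard, a plan can also be checked mechanically before approval. The sketch below is a hypothetical review helper, not part of this repo: it reads the JSON plan representation (as produced by `tofu show -json <planfile>`) and lists any latitudesh_server resource whose planned actions include a delete, which covers both a destroy and a replace. The function name is an assumption.

```python
import json


def servers_at_risk(plan_json: str) -> list[str]:
    """Addresses of latitudesh_server resources the plan would destroy or replace.

    Expects the JSON plan representation, which carries a top-level
    `resource_changes` list with `address`, `type`, and `change.actions`.
    """
    plan = json.loads(plan_json)
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replace shows up as both "delete" and "create" in the actions.
        if rc.get("type") == "latitudesh_server" and "delete" in actions:
            hits.append(rc["address"])
    return hits
```

An empty result means the plan leaves every bare-metal server in place; any address in the list deserves a second look before the apply is confirmed.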
Application observability
| Tool | Status | Notes |
|---|---|---|
| AppSignal | Live | Current APM / exception / host-metric sink (Heatwave/production + /staging) |
| Netdata | Live | Per-second infra metrics; per-host agents (not a parent/child stream) |
| PgHero | Live | Rails-mounted PG performance dashboard |
| HyperDX | Planned | ClickHouse-backed traces/logs; intended to replace AppSignal; not in deploy.yml yet |
ZFS pool DEGRADED/FAULTED alerting is handled by ZED (ZFS Event Daemon) over the postfix→SendGrid relay, not Netdata (the container→host firewall blocks zpool collection).
Staging (summary)
Staging co-locates on the Dallas box under the heatwave-staging- service
prefix. It mirrors the full prod topology (primary + same-host standby for
HAProxy-failover rehearsal, Valkey ×3, pgbouncer, HAProxy), but every port binds
127.0.0.1 except Mailpit, which binds the tailnet. Staging Netdata stays on
localhost (reached by SSH-forward) because prod owns the tailnet :19999.
Staging hostnames crm/www/api/mcp.warmlyyours.ws are behind
Cloudflare Access (wy-employees). Staging sends no real mail — it’s
captured by Mailpit (see the admin-services table above).
Keeping this current
When config/deploy.yml / deploy.staging.yml, config/netdata/,
config/haproxy/, or infra/terraform/ change, update this page. The facts
here were extracted from those files on the date above; the files are the source
of truth, and this page is the index.