Infrastructure & Monitoring Inventory
The single-page answer to "what runs where, on which IP and port, what protects
it, and what URL do I use to reach it." This is a hand-maintained reference;
the source of truth is the config it summarizes (config/deploy.yml,
config/deploy.staging.yml, config/netdata/, config/haproxy/,
infra/terraform/). Keep it in sync when those change.
Last verified: 2026-06-15 (PG18, prod-on-Kamal, Dallas-primary / Chicago-standby).
:::note[How to read the "reach" column]
- tailnet = reachable only over Tailscale (the box's 100.x IP); not public.
- localhost = bound to 127.0.0.1 on the box; reach via SSH port-forward.
- public = bound to 0.0.0.0; the only such port in the fleet is SFTP 2222, and it's firewalled to the PBX source IP.
- CF tunnel = arrives via an outbound-only Cloudflare Tunnel → kamal-proxy; no inbound web port is open on any host.
:::
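The reach labels in the note above can be expressed as a small classifier over a service's bind address. This is a minimal sketch, not something that exists in the repo: the function name `classify_bind` and the fallback label `"other"` are hypothetical, and "CF tunnel" is left out because it describes an ingress path rather than a bind address.

```python
import ipaddress

# 100.64.0.0/10 is the range this page treats as "the tailnet".
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def classify_bind(addr: str) -> str:
    """Map a listen address to the reach labels used on this page."""
    ip = ipaddress.ip_address(addr)
    if ip.is_unspecified:  # 0.0.0.0 or ::
        return "public"
    if ip.is_loopback:  # 127.0.0.0/8 or ::1
        return "localhost"
    if ip.version == 4 and ip in TAILNET:
        return "tailnet"
    return "other"  # hypothetical label for anything the note does not cover
```

For example, `classify_bind("100.123.47.52")` yields `"tailnet"`, while a host's physical public IP falls through to `"other"` because nothing is meant to bind there.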
Fleet — physical hosts (where)
Two Latitude.sh bare-metal boxes (f4.metal.medium: AMD EPYC 4564P 16c/32t,
192 GB RAM, 2×480 GB NVMe RAID1 + 2×2 TB NVMe, 2×10 GbE). The Kamal config
addresses them exclusively by tailnet IP; the public IPs are physical-host
facts (used for SSH-over-tailnet only — there is no public SSH).
| Host | DC | Public IP | Tailnet IP | Role today |
|---|---|---|---|---|
| dal-latitude-heatwave-01 | Dallas | 67.213.118.15 | 100.123.47.52 | Prod app + DB primary, and the entire staging stack (co-located) |
| chi-latitude-heatwave-02 | Chicago | 186.233.186.45 | 100.68.157.49 | Prod DB standby + Databasus PITR backups (no app today) |
At a planned Chicago cutover ("W3") the app role flips to chi-02 via
pg_promote + HAProxy reroute — see Failover & maintenance tooling below.
Today, all production web/Sidekiq traffic is served from Dallas.
        Internet
            │ (HTTPS)
    ┌───────▼────────┐
    │   Cloudflare   │  edge TLS, www-edge Worker, WAF
    └───────┬────────┘
            │ outbound-only QUIC tunnel (cloudflared, host systemd)
┌───────────▼───────────── dal-01 (Dallas, 100.123.47.52) ──────────────┐
│ kamal-proxy :80 ─▶ web / sidekiq (Puma, Thruster) │
│ │ │
│ ▼ DATABASE_HOST=heatwave-haproxy:6433 │
│ HAProxy :6433 (TCP write-VIP) ─▶ pgbouncer :6432 ─▶ Postgres :5432 │
│ ▲ httpchk pg-health :8008 (PG18 PRIMARY) │
│ Valkey ×3 (cache / sessions / queue) · Playwright · SFTPGo · Netdata │
└───────────────────────────────┬───────────────────────────────────────┘
                                │ streaming replication (slot chicago_standby)
┌────────────────────────────────▼──── chi-02 (Chicago, 100.68.157.49) ──┐
│ Postgres :5432 (PG18 STANDBY) ◀ pgbouncer :6432 (RO VIP heatwave-db-ro)│
│ pg-health :8008 · Netdata · Databasus agent (PITR → R2) + UI :4005 │
└─────────────────────────────────────────────────────────────────────────┘
Administrative services — URLs to use (how to access)
Get on the tailnet first (tailscale up; the boxes only accept SSH and
admin UIs from 100.64.0.0/10). Admin logins live in 1Password vault IT.
| Service | URL | Reach / gating | What it shows |
|---|---|---|---|
| Netdata — prod (Dallas) | http://100.123.47.52:19999 | tailnet only | Per-second host + every container + Postgres (primary) + Valkey ×3 + pgbouncer + HAProxy + systemd units |
| Netdata — standby (Chicago) | http://100.68.157.49:19999 | tailnet only | Standby host + Postgres replica (recovery state / replication lag) + RO pgbouncer |
| Netdata — staging | ssh -L 19999:localhost:19999 deploy@100.123.47.52 → http://localhost:19999 | localhost (SSH-forward) | Staging stack (prod owns the tailnet :19999 on the shared box) |
| PgHero | https://crm.warmlyyours.com/pghero | Rails admin login (crm.* host, is_admin?) | Slow queries (pg_stat_statements), table/index bloat, live queries, vacuum, index suggestions (hypopg) |
| Sidekiq Web UI | https://crm.warmlyyours.com/sidekiq | Rails admin login | Background-job queues, retries, scheduled set |
| Rails Event Store browser | https://crm.warmlyyours.com/res | Rails admin login | The RES domain-event stream |
| HAProxy stats | http://100.123.47.52:8404/ | tailnet only | DB write-VIP backend health (which node is "up") |
| SFTPGo admin | http://100.123.47.52:8080 | tailnet only | SFTP users/sessions (call-records + PBX backups → R2) |
| Databasus (PITR controller) | http://100.68.157.49:4005 | tailnet only | Configure/monitor backup sources; trigger PITR restores. Login op://IT/Databasus (Postgres Backup) |
| Mailpit — staging | http://100.123.47.52:8025 | tailnet only | Captured staging outbound mail |
| Mailpit — dev | http://localhost:8025 | local docker-compose | Captured dev outbound mail |
| AppSignal | https://appsignal.com/warmlyyours | SaaS login | APM, exceptions, traces (apps Heatwave/production + Heatwave/staging) |
| HyperDX | — | planned, not deployed | Future app traces/logs/errors (ClickHouse; will replace AppSignal) |
Service topology (what runs where) — production
All ports below are bound to the tailnet IP unless noted. App→service wiring
goes over the internal kamal Docker network by DNS name, not these host ports
(host ports are for operators). Image tags here are indicative —
config/deploy.yml is authoritative for exact tags/digests (e.g. the haproxy
@sha256 pin).
Dallas — dal-latitude-heatwave-01 (100.123.47.52)
| Service (container) | Port (host→ctr) | Image | Purpose |
|---|---|---|---|
| kamal-proxy | (CF tunnel → :80) | kamal-proxy | Rolling-deploy reverse proxy; health /up; routes crm/www/api/scan/mcp.warmlyyours.com |
| web / sidekiq | none published | ghcr.io/warmlyyours/heatwave | Rails (Puma via Thruster :80→:3000); one consolidated Sidekiq |
| heatwave-postgres (primary) | 5432 | ghcr.io/warmlyyours/heatwave-postgres:18-noble | PG18.4 RW primary; data on ZFS tank/prod-replica |
| heatwave-pgbouncer (RW) | 6432 | …/heatwave-pgbouncer:1.25.2 | Session-mode pooler in front of the primary |
| heatwave-haproxy (write VIP) | 6433 (+ 8404 stats) | haproxy:3.0-alpine | TCP failover router — the app's DATABASE_HOST; httpchk on each node's pg-health keeps only the live primary "up" |
| heatwave-pg-health | 8008 | …/heatwave-pg-health:v3 | HTTP leader probe (200 primary / 503 standby) for HAProxy |
| heatwave-valkey-cache | none (kamal-net) | valkey/valkey:9.1 | Logical DBs 1,2,4,5 · allkeys-lru · no persistence |
| heatwave-valkey-sessions | none | valkey/valkey:9.1 | Logical DB 0 · noeviction · no persistence |
| heatwave-valkey-queue | none | valkey/valkey:9.1 | Logical DB 3 (Sidekiq) · noeviction + AOF/RDB (durable) |
| heatwave-sftp (SFTPGo) | 2222→2022 (public, PBX-only); 8080 UI (tailnet) | drakkan/sftpgo:v2.6.6 | Switchvox call-records + PBX-backup drop → Cloudflare R2 |
| heatwave-playwright | none (:3000) | mcr.microsoft.com/playwright:v1.60.0-noble | Headless browser for server-side PDF/email/upload flows |
| heatwave-netdata | 19999 | netdata/netdata:v2.10.3 | Per-second observability (host + all of the above) |
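The three Valkey rows above split the logical DB numbers across instances. As an illustration of that mapping only (the real routing lives in the Ruby `RedisConfig` initializer, which this page does not reproduce), here is a minimal Python sketch. The names `VALKEY_BY_DB` and `valkey_url` are hypothetical, and the `redis://host:port/db` URL form is an assumption about how a client would address the instances.

```python
# Logical DB -> Valkey container, copied from the three Valkey rows in the table.
VALKEY_BY_DB = {
    0: "heatwave-valkey-sessions",
    1: "heatwave-valkey-cache",
    2: "heatwave-valkey-cache",
    3: "heatwave-valkey-queue",  # Sidekiq
    4: "heatwave-valkey-cache",
    5: "heatwave-valkey-cache",
}


def valkey_url(db: int, port: int = 6379) -> str:
    """Build a connection URL for a logical DB (URL scheme is an assumption)."""
    try:
        host = VALKEY_BY_DB[db]
    except KeyError:
        raise ValueError(f"logical DB {db} is not assigned to a Valkey instance") from None
    return f"redis://{host}:{port}/{db}"
```

Raising on an unknown DB number, rather than defaulting to the cache instance, keeps a mistyped index from silently landing session or queue data on an instance with `allkeys-lru` eviction.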
Chicago — chi-latitude-heatwave-02 (100.68.157.49)
| Service (container) | Port | Image | Purpose |
|---|---|---|---|
| heatwave-postgres-replica (standby) | 5432 | …/heatwave-postgres:18-noble | PG18 streaming standby (slot chicago_standby); read-offload target; promoted at flip |
| heatwave-pgbouncer-replica (RO) | 6432 | …/heatwave-pgbouncer:1.25.2 | RO pooler behind the heatwave-db-ro tailnet VIP |
| heatwave-pg-health-replica | 8008 | …/heatwave-pg-health:v3 | Standby recovery-state probe for HAProxy |
| heatwave-netdata-replica | 19999 | netdata/netdata:v2.10.3 | Standby host + replica lag |
| Databasus controller | 4005 (UI) | databasus/databasus:v3.46.0 (official) | Agentless PITR backups (controller-driven pg_basebackup + pg_receivewal) → R2 |
Container images come from GHCR (ghcr.io/warmlyyours/…); registry auth is
per-developer via the gh CLI (no shared PAT). Secrets resolve through Kamal's
1Password adapter (vault IT) — only names are referenced in config, never
values.
Port-exposure map (the crux)
| Port | Bind | Service | Host | Notes |
|---|---|---|---|---|
| 80/443 | CF tunnel only | web (kamal-proxy) | Dallas | No inbound web port open; DOCKER-USER drops public 80/443 |
| 2222 | public → PBX IP only | SFTPGo SSH | Dallas | Firewalled to 144.202.57.170 (Switchvox) via edge fw + DOCKER-USER --ctorigdstport |
| 22 | tailnet | SSH | both | Latitude edge fw + UFW allow only 100.64.0.0/10 |
| 5432 | tailnet | Postgres | both | primary (Dallas) / standby (Chicago) |
| 6432 | tailnet | pgbouncer | both | RW (Dallas) / RO (Chicago) |
| 6433 | tailnet | HAProxy write-VIP | Dallas | app DB target |
| 8008 | tailnet | pg-health | both | HAProxy httpchk |
| 8404 | tailnet | HAProxy stats / /metrics | Dallas | stats UI + netdata scrape |
| 8080 | tailnet | SFTPGo admin UI | Dallas | |
| 19999 | tailnet | Netdata | both | |
| 4005 | tailnet | Databasus UI | Chicago | |
| 8025 | tailnet | Mailpit UI (staging) | Dallas | SMTP 1025 stays kamal-net-internal |
| 6379 / 3000 | kamal-net only | Valkey ×3 / Playwright | Dallas | not host-published |
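One way to spot-check the table against a live box is to look for wildcard listeners that the table does not allow. The sketch below assumes the output of `ss -Hltn` (listening TCP sockets, numeric, no header) is passed in as text. The function name and the allow-set are hypothetical, and it only inspects bind addresses: it cannot see what the edge firewall, UFW, or DOCKER-USER rules permit or block.

```python
# Per the table, SFTP 2222 is the only port expected on a wildcard address.
ALLOWED_PUBLIC_PORTS = {2222}
WILDCARD_ADDRS = {"0.0.0.0", "::", "*"}


def unexpected_public_binds(ss_output: str) -> list[tuple[str, int]]:
    """Return (address, port) for wildcard listeners not in the allow-set."""
    findings = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Columns: State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        addr, _, port = fields[3].rpartition(":")
        addr = addr.strip("[]")  # IPv6 addresses are printed in brackets
        if not port.isdigit():
            continue
        if addr in WILDCARD_ADDRS and int(port) not in ALLOWED_PUBLIC_PORTS:
            findings.append((addr, int(port)))
    return findings
```

An empty result means every wildcard bind is accounted for. A service that binds a wildcard address but is restricted by a firewall layer would need to be added to the allow-set by hand.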
Data tier (DB read/write path)
- Write path: app DATABASE_HOST=heatwave-haproxy, DATABASE_PORT=6433 →
  HAProxy (TCP passthrough) → the live primary's pgbouncer:6432
  (session-mode — the app uses session advisory locks) → Postgres:5432.
  Both heatwave and heatwave_versions share one FDW-linked cluster.
- Routing decision: HAProxy httpchks each node's pg-health:8008; only
  the node answering 200 "primary" is "up", so a pg_promote re-routes the
  write path with no app redeploy.
- Read offload: the heatwave-db-ro Tailscale VIP follows the standby's
  pgbouncer for read-only consumers.
- Cache/queue: RedisConfig (config/initializers/100_redis_config.rb)
  routes by logical DB to heatwave-valkey-{cache,sessions,queue}:6379 — see
  Valkey 3-Flavor Split and DB Tier HA Architecture.
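The routing decision in the list above (only the node whose pg-health probe answers 200 receives writes) can be sketched from the operator's side as follows. This is an illustration, not HAProxy's implementation: the function names are hypothetical, the probe path `/` is an assumption, and returning `None` when zero or several nodes claim to be primary is a choice made for this sketch.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def probe_status(host: str, port: int = 8008, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status of a node's pg-health probe, or None if unreachable."""
    try:
        # The "/" path is an assumption; the page only gives the port.
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # a standby's 503 arrives as an exception
        return exc.code
    except OSError:  # connection refused, timeout, DNS failure
        return None


def pick_write_target(
    nodes: Sequence[str],
    probe: Callable[[str], Optional[int]] = probe_status,
) -> Optional[str]:
    """Return the single node answering 200, or None if there is not exactly one."""
    up = [node for node in nodes if probe(node) == 200]
    return up[0] if len(up) == 1 else None
```

Passing the probe in as a parameter lets the selection logic be exercised without touching the network, for instance with a stub that returns 200 for one tailnet IP and 503 for the other.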
Failover & maintenance tooling
- bin/recovery <env> {flip-db,rebuild-standby,topology} — promote the
  standby + reroute the write VIP (cross-DC), rebuild a wiped node from a fresh
  basebackup, or print the current topology. Snapshots the demoted dataset
  (zfs snapshot) before wiping, as a rollback net.
- bin/maintenance {up,down} <env> — full maintenance window (proxy 503 →
  stop web+sidekiq → on prod also stop the Chicago databasus controller container),
  and the reverse.

See the PG18 Failover Runbook and HAProxy Routing Layer.
Backups & disaster recovery
- Databasus PITR (backup-of-record) — the agentless databasus controller
  container on Chicago streams an encrypted physical base + WAL to Cloudflare R2
  bucket heatwave-postgres-backups-production (ENAM region, off-Latitude for DR).
  Restore to any chosen second via the controller UI (:4005) or bin/restore's
  physical/PITR option (wraps config/databasus/databasus-recovery.sh).
  AES-256-GCM key at /data/databasus-data/secret.key.
- Databasus is itself disaster-recoverable (config-as-code) — the controller
  keeps its entire config (admin / workspace / R2 storage / sources / schedules /
  restore user) only in an embedded Postgres metadata DB on the host, so a host wipe
  loses it (it did, 2026-06-15) and only secret.key is in 1Password.
  script/setup_databasus.sh rebuilds the whole thing idempotently in one command;
  a nightly encrypted metadata snapshot (databasus-metadata-backup.timer → R2
  databasus-metadata/) is the turnkey alternative. See config/databasus/README.md.
- heatwave_versions archive — bin/versions-partitions ships completed
  annual partitions (>5 yr) to R2 heatwave-versions-archive-production (cold).
- bin/restore — pulls a recent logical pg_dump from Databasus→R2 to
  seed a dev database (this is the dev path, not DR).

See the DR restore runbook and Databasus PITR.
Cloudflare R2 buckets (all off-Latitude; bucket-scoped S3 tokens in 1Password vault IT)
| Bucket | Region | Purpose | Producer |
|---|---|---|---|
| heatwave-postgres-backups-production | ENAM | Databasus PITR (physical base + WAL). databasus-metadata/ prefix = the encrypted controller-config snapshots | Databasus agent + backup-metadata.sh |
| heatwave-versions-archive-production | ENAM | Cold annual heatwave_versions partitions (>5 yr) | bin/versions-partitions |
| heatwave-frontend-assets-production | ENAM | Content-hashed webpack assets, served same-origin by the www-edge Worker (survives Kamal deploys) | webpack build / deploy |
| heatwave-call-recordings-production | — | Switchvox call recordings (WarmlyYours/ prefix); Sidekiq imports from here | SFTPGo pbx user |
| heatwave-pbx-backups-production | ENAM | Switchvox PBX system backups (bucket root) | SFTPGo pbx-backup user |
R2 bucket location is pinned on first creation of a name: delete+recreate reuses the
original location (--location enam only applies to a fresh name). Tokens are minted via
script/setup_r2_* and stored in 1Password.
Edge & network protection (what protects it)
- Ingress: Cloudflare → outbound-only cloudflared QUIC tunnel (host
  systemd) → kamal-proxy :80. No inbound web port is open on any host. TLS
  terminates at Cloudflare (proxy.ssl: false); the www-edge Worker fronts
  www/apex for locale redirects, R2 webpack-asset serving, and cache rules.
- Firewall — three layers:
  - Latitude edge firewall — allow SSH 22 from 100.64.0.0/10 (Tailscale)
    and SFTP 2222 from 144.202.57.170 (PBX); deny the rest.
  - Host UFW — default deny incoming; allow lo, tailscale0, 22/tcp,
    and 2222 from the PBX IP.
  - DOCKER-USER iptables (/usr/local/sbin/docker-user-fw.sh) — drops
    public 80/443, allows the tailnet, and handles the SFTP DNAT gotcha
    (match conntrack --ctorigdstport 2222, not --dport, because Docker
    DNATs 2222→2022 before the rule sees it).
- Cloudflare Access: staging only — crm/www/api/mcp.warmlyyours.ws are
  gated to the wy-employees group (24 h session). Production has no CF
  Access by design (public site; Rails handles its own auth). The planned
  docs.warmlyyours.dev portal will be gated to @warmlyyours.com.
- Tailnet: all operator surfaces (SSH, Netdata, HAProxy stats, SFTPGo UI,
  Databasus, Mailpit, psql) are Tailscale-only. Stable VIPs heatwave-db
  (→primary) and heatwave-db-ro (→standby) survive CHI↔DAL flips. The
  Tailscale ACL is Terraform-managed (infra/terraform/tailscale/).
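The SFTP DNAT gotcha in the firewall list above is easier to see with a toy model. The sketch below is not iptables and none of these names exist in the repo. It only models the ordering the bullet describes: Docker's DNAT rewrites the destination port from 2222 to 2022 before the DOCKER-USER chain evaluates the packet, while conntrack still remembers the original port.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    dport: int          # destination port currently in the header
    ctorigdstport: int  # original destination port remembered by conntrack


def docker_dnat(pkt: Packet, published: int = 2222, container: int = 2022) -> Packet:
    """Rewrite the header port as the DNAT step does; conntrack keeps the original."""
    if pkt.dport == published:
        return replace(pkt, dport=container)
    return pkt


def matches_dport(pkt: Packet, port: int) -> bool:
    """Toy equivalent of matching on --dport."""
    return pkt.dport == port


def matches_ctorigdstport(pkt: Packet, port: int) -> bool:
    """Toy equivalent of matching on conntrack --ctorigdstport."""
    return pkt.ctorigdstport == port


# What the DOCKER-USER chain sees for a connection aimed at public port 2222.
seen = docker_dnat(Packet(dport=2222, ctorigdstport=2222))
```

Here `matches_dport(seen, 2222)` is false, so a rule written against `--dport 2222` never fires, whereas `matches_ctorigdstport(seen, 2222)` is true.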
Infrastructure as code (Terraform Cloud)
All infra is OpenTofu/Terraform under infra/terraform/, applied by HCP Terraform
(org warmlyyours), VCS-driven from this repo (auto-apply off — plans are reviewed,
applies are manual).
| Workspace | Module | Manages |
|---|---|---|
| heatwave-latitude-production | latitude/ | The Latitude bare-metal server + cloud-init + per-host edge firewall |
| heatwave-host-config | host-config/ | Re-runs the idempotent provision-host.sh over SSH (postfix relay, pg-maintenance/logwatch timers, ZED) — no reinstall |
| heatwave-tailscale | tailscale/ | Tailnet ACL + the heatwave-db / heatwave-db-ro VIP services |
| heatwave-cloudflare-zone-{production,staging} | cloudflare-zone-*/ | CF zone rulesets (WAF, cache, transforms) — the IaC source of truth; the dashboard is read-only |
- TFC agent: a hashicorp/tfc-agent container on Chicago (--network host, pool
  heatwave-tailnet) runs the agent-mode workspaces' SSH provisioners — the one thing a
  hosted runner can't do to a tailnet-only host. Dials out to app.terraform.io only. It
  is re-established by re-running the container (token
  op://IT/TFC-agent-token (heatwave-tailnet)).
- ⚠️ Latitude reinstall landmine (now guarded): a user_data change on
  latitudesh_server.host triggers a full server reinstall (all host data lost). On
  2026-06-15 an approved plan whose only effective diff was a provision-host.sh edit
  reinstalled the live Chicago standby — wiping the PG standby + the Databasus agent and
  killing the in-flight TFC agent (it runs on Chicago). latitude/main.tf now carries
  lifecycle { ignore_changes = [user_data, billing] }, so editing cloud-init /
  provision-host.sh is inert w.r.t. a running box (day-2 host config flows through
  host-config over SSH). A genuine rebuild is now explicit:
  tofu apply -replace=latitudesh_server.host. Full record:
  doc/tasks/202606151240_CHICAGO_REINSTALL_DR_RECOVERY.md.
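As a complement to the `ignore_changes` guard described above, a plan reviewer could also flag any resource the plan intends to delete or replace. The sketch below is hypothetical and not part of the repo. It assumes the plan is available in the JSON form produced by `tofu show -json` on a saved plan file, and it would only catch a reinstall that the provider reports as a delete or replace action. This page does not say how the Latitude provider reported the `user_data` change.

```python
import json


def destructive_changes(plan_json: str) -> list[str]:
    """List resources whose planned actions include a delete (plain delete or replace)."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replace appears as ["delete", "create"] or ["create", "delete"].
        if "delete" in actions:
            flagged.append(f"{rc['address']}: {'+'.join(actions)}")
    return flagged
```

A non-empty result is a reason to stop and read the diff line by line before approving, since auto-apply is off and the apply is a manual step.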
Application observability
| Tool | Status | Notes |
|---|---|---|
| AppSignal | Live | Current APM / exception / host-metric sink (Heatwave/production + /staging) |
| Netdata | Live | Per-second infra metrics; per-host agents (not a parent/child stream) |
| PgHero | Live | Rails-mounted PG performance dashboard |
| HyperDX | Planned | ClickHouse-backed traces/logs; intended to replace AppSignal; not in deploy.yml yet |
ZFS pool DEGRADED/FAULTED alerting is handled by ZED (ZFS Event Daemon)
over the postfix→SendGrid relay, not Netdata (the container→host firewall
blocks zpool collection).
Staging (summary)
Staging co-locates on the Dallas box under the heatwave-staging- service
prefix. It mirrors the full prod topology (primary + same-host standby for
HAProxy-failover rehearsal, Valkey ×3, pgbouncer, HAProxy), but every port binds
127.0.0.1. That includes staging Netdata, since prod owns the tailnet :19999;
the one exception is the Mailpit UI on tailnet :8025. Staging hostnames
crm/www/api/mcp.warmlyyours.ws are behind
Cloudflare Access (wy-employees). Staging sends no real mail — it's
captured by Mailpit (see the admin-services table above).
Keeping this current
When config/deploy.yml / deploy.staging.yml, config/netdata/,
config/haproxy/, or infra/terraform/ change, update this page. The facts
here were extracted from those files on the date above. Those files are the
source of truth; this page is the index.