Infrastructure & Monitoring Inventory

The single-page answer to "what runs where, on which IP and port, what protects
it, and what URL do I use to reach it." This is a hand-maintained reference;
the source of truth is the config it summarizes (config/deploy.yml,
config/deploy.staging.yml, config/netdata/, config/haproxy/,
infra/terraform/). Keep it in sync when those change.

Last verified: 2026-06-15 (PG18, prod-on-Kamal, Dallas-primary / Chicago-standby).

:::note[How to read the "reach" column]

tailnet = reachable only over Tailscale (the box's 100.x IP); not public.
localhost = bound to 127.0.0.1 on the box; reach via SSH port-forward.
public = bound to 0.0.0.0; the only such port in the fleet is SFTP 2222, and it's firewalled to the PBX source IP.
CF tunnel = arrives via an outbound-only Cloudflare Tunnel → kamal-proxy; no inbound web port is open on any host.
:::
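
As a rough illustration of the vocabulary above, the three bind-based categories can be derived from a listen address. This is only a sketch: the function name `classify_bind` and the fallback label "other" are assumptions made for the example, and "CF tunnel" is not a bind address, so it is not covered.

```python
import ipaddress

# Tailscale hands out node addresses from the CGNAT range 100.64.0.0/10.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def classify_bind(addr: str) -> str:
    """Map a bind address to the 'reach' terms used on this page (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:
        return "localhost"   # 127.0.0.1: reach via SSH port-forward
    if ip.is_unspecified:
        return "public"      # 0.0.0.0 listens on every interface
    if ip in TAILNET:
        return "tailnet"     # the box's 100.x address
    return "other"           # e.g. a host's public IP; label is an assumption


for addr in ("100.123.47.52", "100.68.157.49", "127.0.0.1", "0.0.0.0"):
    print(addr, "->", classify_bind(addr))
```

A check like this could be run against the bind addresses in the deploy config to catch a port that has drifted from its documented reach.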

Fleet — physical hosts (where)

Two Latitude.sh bare-metal boxes (f4.metal.medium: AMD EPYC 4564P 16c/32t,
192 GB RAM, 2×480 GB NVMe RAID1 + 2×2 TB NVMe, 2×10 GbE). The Kamal config
addresses them exclusively by tailnet IP; the public IPs are listed only as
physical-host facts (SSH runs over the tailnet; there is no public SSH).

Host	DC	Public IP	Tailnet IP	Role today
`dal-latitude-heatwave-01`	Dallas	`67.213.118.15`	`100.123.47.52`	Prod app + DB primary, and the entire staging stack (co-located)
`chi-latitude-heatwave-02`	Chicago	`186.233.186.45`	`100.68.157.49`	Prod DB standby + Databasus PITR backups (no app today)

At a planned Chicago cutover ("W3") the app role flips to chi-02 via
pg_promote + HAProxy reroute — see Failover & maintenance tooling below.
Today, all production web/Sidekiq traffic is served from Dallas.

                 Internet
                    │  (HTTPS)
            ┌───────▼────────┐
            │   Cloudflare    │  edge TLS, www-edge Worker, WAF
            └───────┬────────┘
                    │  outbound-only QUIC tunnel (cloudflared, host systemd)
        ┌───────────▼───────────── dal-01 (Dallas, 100.123.47.52) ──────────────┐
        │  kamal-proxy :80 ─▶ web / sidekiq (Puma, Thruster)                     │
        │        │                                                              │
        │        ▼  DATABASE_HOST=heatwave-haproxy:6433                          │
        │  HAProxy :6433 (TCP write-VIP) ─▶ pgbouncer :6432 ─▶ Postgres :5432    │
        │        ▲ httpchk pg-health :8008                    (PG18 PRIMARY)     │
        │  Valkey ×3 (cache / sessions / queue) · Playwright · SFTPGo · Netdata  │
        └───────────────────────────────┬───────────────────────────────────────┘
                                         │ streaming replication (slot chicago_standby)
        ┌────────────────────────────────▼──── chi-02 (Chicago, 100.68.157.49) ──┐
        │  Postgres :5432 (PG18 STANDBY) ◀ pgbouncer :6432 (RO VIP heatwave-db-ro)│
        │  pg-health :8008 · Netdata · Databasus controller (PITR → R2) + UI :4005│
        └─────────────────────────────────────────────────────────────────────────┘

Administrative services — URLs to use (how to access)

Get on the tailnet first (tailscale up; the boxes only accept SSH and
admin UIs from 100.64.0.0/10). Admin logins live in 1Password vault IT.

Service	URL	Reach / gating	What it shows
Netdata — prod (Dallas)	`http://100.123.47.52:19999`	tailnet only	Per-second host + every container + Postgres (primary) + Valkey ×3 + pgbouncer + HAProxy + systemd units
Netdata — standby (Chicago)	`http://100.68.157.49:19999`	tailnet only	Standby host + Postgres replica (recovery state / replication lag) + RO pgbouncer
Netdata — staging	`ssh -L 19999:localhost:19999 deploy@100.123.47.52` → `http://localhost:19999`	localhost (SSH-forward)	Staging stack (prod owns the tailnet `:19999` on the shared box)
PgHero	`https://crm.warmlyyours.com/pghero`	Rails admin login (`crm.*` host, `is_admin?`)	Slow queries (pg_stat_statements), table/index bloat, live queries, vacuum, index suggestions (hypopg)
Sidekiq Web UI	`https://crm.warmlyyours.com/sidekiq`	Rails admin login	Background-job queues, retries, scheduled set
Rails Event Store browser	`https://crm.warmlyyours.com/res`	Rails admin login	The RES domain-event stream
HAProxy stats	`http://100.123.47.52:8404/`	tailnet only	DB write-VIP backend health (which node is "up")
SFTPGo admin	`http://100.123.47.52:8080`	tailnet only	SFTP users/sessions (call-records + PBX backups → R2)
Databasus (PITR controller)	`http://100.68.157.49:4005`	tailnet only	Configure/monitor backup sources; trigger PITR restores. Login `op://IT/Databasus (Postgres Backup)`
Mailpit — staging	`http://100.123.47.52:8025`	tailnet only	Captured staging outbound mail
Mailpit — dev	`http://localhost:8025`	local docker-compose	Captured dev outbound mail
AppSignal	`https://appsignal.com/warmlyyours`	SaaS login	APM, exceptions, traces (apps `Heatwave/production` + `Heatwave/staging`)
HyperDX	—	planned, not deployed	Future app traces/logs/errors (ClickHouse; will replace AppSignal)

Service topology (what runs where) — production

All ports below are bound to the tailnet IP unless noted. App→service wiring
goes over the internal kamal Docker network by DNS name, not these host ports
(host ports are for operators). Image tags here are indicative —
config/deploy.yml is authoritative for exact tags/digests (e.g. the haproxy
@sha256 pin).

Dallas — `dal-latitude-heatwave-01` (`100.123.47.52`)

Service (container)	Port (host→ctr)	Image	Purpose
`kamal-proxy`	(CF tunnel → `:80`)	kamal-proxy	Rolling-deploy reverse proxy; health `/up`; routes `crm/www/api/scan/mcp.warmlyyours.com`
web / sidekiq	none published	`ghcr.io/warmlyyours/heatwave`	Rails (Puma via Thruster :80→:3000); one consolidated Sidekiq
`heatwave-postgres` (primary)	`5432`	`ghcr.io/warmlyyours/heatwave-postgres:18-noble`	PG18.4 RW primary; data on ZFS `tank/prod-replica`
`heatwave-pgbouncer` (RW)	`6432`	`…/heatwave-pgbouncer:1.25.2`	Session-mode pooler in front of the primary
`heatwave-haproxy` (write VIP)	`6433` (+ `8404` stats)	`haproxy:3.0-alpine`	TCP failover router — the app's `DATABASE_HOST`; httpchk on each node's pg-health keeps only the live primary "up"
`heatwave-pg-health`	`8008`	`…/heatwave-pg-health:v3`	HTTP leader probe (`200 primary` / `503 standby`) for HAProxy
`heatwave-valkey-cache`	none (kamal-net)	`valkey/valkey:9.1`	Logical DBs 1,2,4,5 · `allkeys-lru` · no persistence
`heatwave-valkey-sessions`	none	`valkey/valkey:9.1`	Logical DB 0 · `noeviction` · no persistence
`heatwave-valkey-queue`	none	`valkey/valkey:9.1`	Logical DB 3 (Sidekiq) · `noeviction` + AOF/RDB (durable)
`heatwave-sftp` (SFTPGo)	`2222`→2022 (public, PBX-only); `8080` UI (tailnet)	`drakkan/sftpgo:v2.6.6`	Switchvox call-records + PBX-backup drop → Cloudflare R2
`heatwave-playwright`	none (`:3000`)	`mcr.microsoft.com/playwright:v1.60.0-noble`	Headless browser for server-side PDF/email/upload flows
`heatwave-netdata`	`19999`	`netdata/netdata:v2.10.3`	Per-second observability (host + all of the above)
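
The logical-DB split across the three Valkey containers in the table above can be pictured as a small lookup. The real routing lives in the Rails RedisConfig initializer; the Python below is only an illustrative sketch, and the function name `valkey_url` and the `redis://host:port/db` URL shape are assumptions.

```python
# Logical DB -> Valkey flavor, as listed in the Dallas service table.
DB_TO_FLAVOR = {
    0: "sessions",
    1: "cache",
    2: "cache",
    3: "queue",      # Sidekiq
    4: "cache",
    5: "cache",
}


def valkey_url(db: int) -> str:
    """Return a redis:// URL for a logical DB (hypothetical helper)."""
    if db not in DB_TO_FLAVOR:
        raise ValueError(f"logical DB {db} is not assigned to any Valkey flavor")
    return f"redis://heatwave-valkey-{DB_TO_FLAVOR[db]}:6379/{db}"


print(valkey_url(3))
```

Rejecting unassigned DB numbers, rather than defaulting to one flavor, keeps a typo from silently landing durable queue data on a non-persistent instance.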

Chicago — `chi-latitude-heatwave-02` (`100.68.157.49`)

Service (container)	Port	Image	Purpose
`heatwave-postgres-replica` (standby)	`5432`	`…/heatwave-postgres:18-noble`	PG18 streaming standby (slot `chicago_standby`); read-offload target; promoted at flip
`heatwave-pgbouncer-replica` (RO)	`6432`	`…/heatwave-pgbouncer:1.25.2`	RO pooler behind the `heatwave-db-ro` tailnet VIP
`heatwave-pg-health-replica`	`8008`	`…/heatwave-pg-health:v3`	Standby recovery-state probe for HAProxy
`heatwave-netdata-replica`	`19999`	`netdata/netdata:v2.10.3`	Standby host + replica lag
Databasus controller	`4005` (UI)	`databasus/databasus:v3.46.0` (official)	Agentless PITR backups (controller-driven `pg_basebackup` + `pg_receivewal`) → R2

Container images come from GHCR (ghcr.io/warmlyyours/…); registry auth is
per-developer via the gh CLI (no shared PAT). Secrets resolve through Kamal's
1Password adapter (vault IT) — only names are referenced in config, never
values.

Port-exposure map (the crux)

Port	Bind	Service	Host	Notes
80/443	CF tunnel only	web (kamal-proxy)	Dallas	No inbound web port open; DOCKER-USER drops public 80/443
2222	public → PBX IP only	SFTPGo SSH	Dallas	Firewalled to `144.202.57.170` (Switchvox) via edge fw + DOCKER-USER `--ctorigdstport`
22	tailnet	SSH	both	Latitude edge fw + UFW allow only `100.64.0.0/10`
5432	tailnet	Postgres	both	primary (Dallas) / standby (Chicago)
6432	tailnet	pgbouncer	both	RW (Dallas) / RO (Chicago)
6433	tailnet	HAProxy write-VIP	Dallas	app DB target
8008	tailnet	pg-health	both	HAProxy httpchk
8404	tailnet	HAProxy stats / `/metrics`	Dallas	stats UI + netdata scrape
8080	tailnet	SFTPGo admin UI	Dallas
19999	tailnet	Netdata	both
4005	tailnet	Databasus UI	Chicago
8025	tailnet	Mailpit UI (staging)	Dallas	SMTP 1025 stays kamal-net-internal
6379 / 3000	kamal-net only	Valkey ×3 / Playwright	Dallas	not host-published

Data tier (DB read/write path)

Write path: app DATABASE_HOST=heatwave-haproxy, DATABASE_PORT=6433 →
HAProxy (TCP passthrough) → the live primary's pgbouncer :6432
(session-mode — the app uses session advisory locks) → Postgres :5432.
Both heatwave and heatwave_versions share one FDW-linked cluster.
Routing decision: HAProxy httpchks each node's pg-health :8008; only
the node answering 200 "primary" is "up", so a pg_promote re-routes the
write path with no app redeploy.
Read offload: the heatwave-db-ro Tailscale VIP follows the standby's
pgbouncer for read-only consumers.
Cache/queue: RedisConfig (config/initializers/100_redis_config.rb)
routes by logical DB to heatwave-valkey-{cache,sessions,queue}:6379 — see
Valkey 3-Flavor Split and
DB Tier HA Architecture.
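
The routing decision described above can be approximated from an operator's laptop. The sketch below mimics what the HAProxy health check does: ask each node's pg-health probe and treat only a 200 as "up". It is not HAProxy's implementation; the probe path `/` and the helper names are assumptions.

```python
from typing import Callable, Iterable, List, Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(host: str, port: int = 8008, timeout: float = 2.0) -> Optional[int]:
    """Return the pg-health HTTP status, or None if the node is unreachable."""
    try:
        with urlopen(f"http://{host}:{port}/", timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # urllib raises on non-2xx responses, so a standby's 503 lands here.
        return err.code
    except (URLError, OSError):
        return None


def live_primaries(
    hosts: Iterable[str],
    probe: Callable[[str], Optional[int]] = http_status,
) -> List[str]:
    """Hosts whose probe answers 200, i.e. the nodes the write VIP would route to."""
    return [h for h in hosts if probe(h) == 200]
```

Called as `live_primaries(["100.123.47.52", "100.68.157.49"])` from a machine on the tailnet, a healthy cluster should yield exactly one host; an empty list or two hosts would be worth investigating before trusting the write path.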

Failover & maintenance tooling

bin/recovery <env> {flip-db,rebuild-standby,topology} — promote the
standby + reroute the write VIP (cross-DC), rebuild a wiped node from a fresh
basebackup, or print the current topology. Snapshots the demoted dataset
(zfs snapshot) before wiping, as a rollback net.
bin/maintenance {up,down} <env> — full maintenance window (proxy 503 →
stop web+sidekiq → on prod also stop the Chicago databasus controller container),
and the reverse. See the PG18 Failover Runbook and
HAProxy Routing Layer.

Backups & disaster recovery

Databasus PITR (backup-of-record) — the agentless databasus controller
container on Chicago streams an encrypted physical base + WAL to Cloudflare R2
bucket heatwave-postgres-backups-production (ENAM region, off-Latitude for DR).
Restore to any chosen second via the controller UI (:4005) or bin/restore's
physical/PITR option (wraps config/databasus/databasus-recovery.sh).
AES-256-GCM key at /data/databasus-data/secret.key.
Databasus is itself disaster-recoverable (config-as-code). The controller
keeps its entire config (admin / workspace / R2 storage / sources / schedules /
restore user) only in an embedded Postgres metadata DB on the host, and only
secret.key is in 1Password, so a host wipe loses that config (it did, on
2026-06-15). script/setup_databasus.sh rebuilds the whole thing idempotently
in one command; a nightly encrypted metadata snapshot
(databasus-metadata-backup.timer → R2 databasus-metadata/) is the
turnkey alternative. See config/databasus/README.md.
heatwave_versions archive — bin/versions-partitions ships completed
annual partitions (>5 yr) to R2 heatwave-versions-archive-production (cold).
bin/restore — pulls a recent logical pg_dump from Databasus→R2 to
seed a dev database (this is the dev path, not DR). See the
DR restore runbook and
Databasus PITR.

Cloudflare R2 buckets (all off-Latitude; bucket-scoped S3 tokens in 1Password vault IT)

Bucket	Region	Purpose	Producer
`heatwave-postgres-backups-production`	ENAM	Databasus PITR (physical base + WAL). `databasus-metadata/` prefix = the encrypted controller-config snapshots	Databasus controller + `backup-metadata.sh`
`heatwave-versions-archive-production`	ENAM	Cold annual `heatwave_versions` partitions (>5 yr)	`bin/versions-partitions`
`heatwave-frontend-assets-production`	ENAM	Content-hashed webpack assets, served same-origin by the www-edge Worker (survives Kamal deploys)	webpack build / deploy
`heatwave-call-recordings-production`	—	Switchvox call recordings (`WarmlyYours/` prefix); Sidekiq imports from here	SFTPGo `pbx` user
`heatwave-pbx-backups-production`	ENAM	Switchvox PBX system backups (bucket root)	SFTPGo `pbx-backup` user

R2 bucket location is pinned on first creation of a name — delete+recreate reuses the
original location (--location enam only applies to a fresh name). Tokens are minted via
script/setup_r2_* and stored in 1Password.

Edge & network protection (what protects it)

Ingress: Cloudflare → outbound-only cloudflared QUIC tunnel (host
systemd) → kamal-proxy :80. No inbound web port is open on any host. TLS
terminates at Cloudflare (proxy.ssl: false); the www-edge Worker fronts
www/apex for locale redirects, R2 webpack-asset serving, and cache rules.
Firewall — three layers:
1. Latitude edge firewall — allow SSH 22 from 100.64.0.0/10 (Tailscale)
  and SFTP 2222 from 144.202.57.170 (PBX); deny the rest.
2. Host UFW — default deny incoming; allow lo, tailscale0, 22/tcp,
  and 2222 from the PBX IP.
3. DOCKER-USER iptables (/usr/local/sbin/docker-user-fw.sh) — drops
  public 80/443, allows the tailnet, and handles the SFTP DNAT gotcha
  (match conntrack --ctorigdstport 2222, not --dport, because Docker
  DNATs 2222→2022 before the rule sees it).
Cloudflare Access: staging only — crm/www/api/mcp.warmlyyours.ws are
gated to the wy-employees group (24 h session). Production has no CF
Access by design (public site; Rails handles its own auth). The planned
docs.warmlyyours.dev portal will be gated to @warmlyyours.com.
Tailnet: all operator surfaces (SSH, Netdata, HAProxy stats, SFTPGo UI,
Databasus, Mailpit, psql) are Tailscale-only. Stable VIPs heatwave-db
(→primary) and heatwave-db-ro (→standby) survive CHI↔DAL flips. The
Tailscale ACL is Terraform-managed (infra/terraform/tailscale/).
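
The SFTP DNAT gotcha in firewall layer 3 is easier to see with a toy model. The sketch below is not the real docker-user-fw.sh; it only simulates the order of operations (Docker's DNAT rewrites the destination port before DOCKER-USER evaluates the packet), and every name in it is made up for illustration.

```python
from dataclasses import dataclass

PBX_IP = "144.202.57.170"


@dataclass(frozen=True)
class Packet:
    src: str
    dport: int        # destination port as a filter rule sees it
    orig_dport: int   # original destination port remembered by conntrack


def docker_dnat(pkt: Packet, host_port: int = 2222, ctr_port: int = 2022) -> Packet:
    """Rewrite the destination port; conntrack keeps the original value."""
    if pkt.dport == host_port:
        return Packet(pkt.src, ctr_port, pkt.orig_dport)
    return pkt


def allow_by_dport(pkt: Packet) -> bool:
    """A rule matching --dport 2222: never fires after DNAT."""
    return pkt.src == PBX_IP and pkt.dport == 2222


def allow_by_ctorigdstport(pkt: Packet) -> bool:
    """A rule matching conntrack --ctorigdstport 2222: sees the pre-DNAT port."""
    return pkt.src == PBX_IP and pkt.orig_dport == 2222


seen_by_rule = docker_dnat(Packet(PBX_IP, 2222, 2222))
print(seen_by_rule.dport, allow_by_dport(seen_by_rule), allow_by_ctorigdstport(seen_by_rule))
```

In this model the PBX packet reaches the rule with destination port 2022, so only the conntrack-based match recognizes it as SFTP traffic.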

Infrastructure as code (Terraform Cloud)

All infra is OpenTofu/Terraform under infra/terraform/, applied by HCP Terraform
(org warmlyyours), VCS-driven from this repo (auto-apply off — plans are reviewed,
applies are manual).

Workspace	Module	Manages
`heatwave-latitude-production`	`latitude/`	The Latitude bare-metal server + cloud-init + per-host edge firewall
`heatwave-host-config`	`host-config/`	Re-runs the idempotent `provision-host.sh` over SSH (postfix relay, pg-maintenance/logwatch timers, ZED) — no reinstall
`heatwave-tailscale`	`tailscale/`	Tailnet ACL + the `heatwave-db` / `heatwave-db-ro` VIP services
`heatwave-cloudflare-zone-{production,staging}`	`cloudflare-zone-*/`	CF zone rulesets (WAF, cache, transforms) — the IaC source of truth; the dashboard is read-only

TFC agent: a hashicorp/tfc-agent container on Chicago (--network host, pool
heatwave-tailnet) runs the agent-mode workspaces' SSH provisioners — the one thing a
hosted runner can't do to a tailnet-only host. Dials out to app.terraform.io only. It
is re-established by re-running the container (token op://IT/TFC-agent-token (heatwave-tailnet)).
⚠️ Latitude reinstall landmine (now guarded): a user_data change on
latitudesh_server.host triggers a full server reinstall (all host data lost). On
2026-06-15 an approved plan whose only effective diff was a provision-host.sh edit
reinstalled the live Chicago standby, wiping the PG standby + the Databasus controller and
killing the in-flight TFC agent (it runs on Chicago). latitude/main.tf now carries
lifecycle { ignore_changes = [user_data, billing] }, so editing cloud-init /
provision-host.sh is inert w.r.t. a running box (day-2 host config flows through
host-config over SSH). A genuine rebuild is now explicit:
tofu apply -replace=latitudesh_server.host. Full record:
doc/tasks/202606151240_CHICAGO_REINSTALL_DR_RECOVERY.md.
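
As an extra review aid alongside the lifecycle guard, a plan can be scanned for replacements before anyone approves it. The sketch below reads the JSON form of a plan (as produced by `tofu show -json <planfile>`) and lists resources whose planned actions include both delete and create. The helper name and the sample data are hypothetical; `null_resource.provision` is an invented address, not a resource from this repo.

```python
import json
from typing import List


def replaced_resources(plan: dict) -> List[str]:
    """Addresses whose planned actions include both 'delete' and 'create'."""
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "create" in actions and "delete" in actions:
            hits.append(rc["address"])
    return hits


# Hand-written sample in the shape of a JSON plan (not real plan output).
SAMPLE = json.loads("""
{"resource_changes": [
  {"address": "latitudesh_server.host",
   "change": {"actions": ["delete", "create"]}},
  {"address": "null_resource.provision",
   "change": {"actions": ["update"]}}
]}
""")

print(replaced_resources(SAMPLE))
```

Any hit on `latitudesh_server.host` means the apply would reinstall the box, which should only happen through the explicit `-replace` path.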

Application observability

Tool	Status	Notes
AppSignal	Live	Current APM / exception / host-metric sink (`Heatwave/production` + `/staging`)
Netdata	Live	Per-second infra metrics; per-host agents (not a parent/child stream)
PgHero	Live	Rails-mounted PG performance dashboard
HyperDX	Planned	ClickHouse-backed traces/logs; intended to replace AppSignal; not in `deploy.yml` yet

ZFS pool DEGRADED/FAULTED alerting is handled by ZED (ZFS Event Daemon)
over the postfix→SendGrid relay, not Netdata (the container→host firewall
blocks zpool collection).

Staging (summary)

Staging co-locates on the Dallas box under the heatwave-staging- service
prefix. It mirrors the full prod topology (primary + same-host standby for
HAProxy-failover rehearsal, Valkey ×3, pgbouncer, HAProxy), but every port binds
127.0.0.1, Netdata included (prod owns the tailnet :19999); the one exception
is Mailpit, which binds the tailnet :8025. Staging hostnames crm/www/api/mcp.warmlyyours.ws are behind
Cloudflare Access (wy-employees). Staging sends no real mail — it's
captured by Mailpit (see the admin-services table above).

Keeping this current

When config/deploy.yml / deploy.staging.yml, config/netdata/,
config/haproxy/, or infra/terraform/ change, update this page. The facts
here were extracted from those files on the "Last verified" date above; the
files are the source of truth, and this page is only the index.