Infrastructure & Monitoring Inventory

The single-page answer to “what runs where, on which IP and port, what protects it, and what URL do I use to reach it.” This is a hand-maintained reference; the source of truth is the config it summarizes (config/deploy.yml, config/deploy.staging.yml, config/netdata/, config/haproxy/, infra/terraform/). Keep it in sync when those change.

Last verified: 2026-06-15 (PG18, prod-on-Kamal, Dallas-primary / Chicago-standby).

Two Latitude.sh bare-metal boxes (f4.metal.medium: AMD EPYC 4564P 16c/32t, 192 GB RAM, 2×480 GB NVMe RAID1 + 2×2 TB NVMe, 2×10 GbE). The Kamal config addresses them exclusively by tailnet IP; the public IPs are physical-host facts (used for SSH-over-tailnet only — there is no public SSH).

| Host | DC | Public IP | Tailnet IP | Role today |
|---|---|---|---|---|
| dal-latitude-heatwave-01 | Dallas | 67.213.118.15 | 100.123.47.52 | Prod app + DB primary, and the entire staging stack (co-located) |
| chi-latitude-heatwave-02 | Chicago | 186.233.186.45 | 100.68.157.49 | Prod DB standby + Databasus PITR backups (no app today) |

At a planned Chicago cutover (“W3”) the app role flips to chi-02 via pg_promote + HAProxy reroute — see Failover & maintenance tooling below. Today, all production web/Sidekiq traffic is served from Dallas.

        Internet
            │ (HTTPS)
    ┌───────▼────────┐
    │   Cloudflare   │  edge TLS, www-edge Worker, WAF
    └───────┬────────┘
            │ outbound-only QUIC tunnel (cloudflared, host systemd)
┌───────────▼───────────── dal-01 (Dallas, 100.123.47.52) ────────────────┐
│ kamal-proxy :80 ─▶ web / sidekiq (Puma, Thruster)                       │
│                     │                                                   │
│                     ▼ DATABASE_HOST=heatwave-haproxy:6433               │
│ HAProxy :6433 (TCP write-VIP) ─▶ pgbouncer :6432 ─▶ Postgres :5432      │
│    ▲ httpchk pg-health :8008 (PG18 PRIMARY)                             │
│ Valkey ×3 (cache / sessions / queue) · Playwright · SFTPGo · Netdata    │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │ streaming replication (slot chicago_standby)
┌───────────────────────────────▼──── chi-02 (Chicago, 100.68.157.49) ────┐
│ Postgres :5432 (PG18 STANDBY) ◀ pgbouncer :6432 (RO VIP heatwave-db-ro) │
│ pg-health :8008 · Netdata · Databasus agent (PITR → R2) + UI :4005      │
└─────────────────────────────────────────────────────────────────────────┘

Administrative services — URLs to use (how to access)

Get on the tailnet first (tailscale up; the boxes only accept SSH and admin UIs from 100.64.0.0/10). Admin logins live in 1Password vault IT.
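
To make the gating rule concrete, the range 100.64.0.0/10 can be checked with Python's standard `ipaddress` module. This is a minimal sketch, not something taken from the repo; the helper name `is_tailnet_address` is hypothetical.

```python
import ipaddress

# 100.64.0.0/10 is the range the hosts accept SSH and admin UIs from,
# per the paragraph above.
TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_address(ip: str) -> bool:
    """Return True if `ip` falls inside the tailnet range (hypothetical helper)."""
    return ipaddress.ip_address(ip) in TAILNET_RANGE


if __name__ == "__main__":
    # The two tailnet IPs and one public IP from the host table.
    for ip in ("100.123.47.52", "100.68.157.49", "67.213.118.15"):
        print(ip, is_tailnet_address(ip))
```

Both tailnet IPs from the host table fall inside the range, while the Dallas public IP does not, which is why the admin URLs below only resolve once `tailscale up` has run.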

| Service | URL | Reach / gating | What it shows |
|---|---|---|---|
| Netdata, prod (Dallas) | http://100.123.47.52:19999 | tailnet only | Per-second host + every container + Postgres (primary) + Valkey ×3 + pgbouncer + HAProxy + systemd units |
| Netdata, standby (Chicago) | http://100.68.157.49:19999 | tailnet only | Standby host + Postgres replica (recovery state / replication lag) + RO pgbouncer |
| Netdata, staging | ssh -L 19999:localhost:19999 deploy@100.123.47.52, then http://localhost:19999 | localhost (SSH-forward) | Staging stack (prod owns the tailnet :19999 on the shared box) |
| PgHero | https://crm.warmlyyours.com/pghero | Rails admin login (crm.* host, is_admin?) | Slow queries (pg_stat_statements), table/index bloat, live queries, vacuum, index suggestions (hypopg) |
| Sidekiq Web UI | https://crm.warmlyyours.com/sidekiq | Rails admin login | Background-job queues, retries, scheduled set |
| Rails Event Store browser | https://crm.warmlyyours.com/res | Rails admin login | The RES domain-event stream |
| HAProxy stats | http://100.123.47.52:8404/ | tailnet only | DB write-VIP backend health (which node is “up”) |
| SFTPGo admin | http://100.123.47.52:8080 | tailnet only | SFTP users/sessions (call-records + PBX backups → R2) |
| Databasus (PITR controller) | http://100.68.157.49:4005 | tailnet only | Configure/monitor backup sources; trigger PITR restores. Login op://IT/Databasus (Postgres Backup) |
| Mailpit, staging | http://100.123.47.52:8025 | tailnet only | Captured staging outbound mail |
| Mailpit, dev | http://localhost:8025 | local docker-compose | Captured dev outbound mail |
| AppSignal | https://appsignal.com/warmlyyours | SaaS login | APM, exceptions, traces (apps Heatwave/production + Heatwave/staging) |
| HyperDX | planned, not deployed | | Future app traces/logs/errors (ClickHouse; will replace AppSignal) |

Service topology (what runs where) — production

All ports below are bound to the tailnet IP unless noted. App→service wiring goes over the internal kamal Docker network by DNS name, not these host ports (host ports are for operators). Image tags here are indicative — config/deploy.yml is authoritative for exact tags/digests (e.g. the haproxy @sha256 pin).

Dallas — dal-latitude-heatwave-01 (100.123.47.52)

| Service (container) | Port (host→ctr) | Image | Purpose |
|---|---|---|---|
| kamal-proxy | (CF tunnel → :80) | kamal-proxy | Rolling-deploy reverse proxy; health /up; routes crm/www/api/scan/mcp.warmlyyours.com |
| web / sidekiq | none published | ghcr.io/warmlyyours/heatwave | Rails (Puma via Thruster :80→:3000); one consolidated Sidekiq |
| heatwave-postgres (primary) | 5432 | ghcr.io/warmlyyours/heatwave-postgres:18-noble | PG18.4 RW primary; data on ZFS tank/prod-replica |
| heatwave-pgbouncer (RW) | 6432 | …/heatwave-pgbouncer:1.25.2 | Session-mode pooler in front of the primary |
| heatwave-haproxy (write VIP) | 6433 (+ 8404 stats) | haproxy:3.0-alpine | TCP failover router (the app’s DATABASE_HOST); httpchk on each node’s pg-health keeps only the live primary “up” |
| heatwave-pg-health | 8008 | …/heatwave-pg-health:v3 | HTTP leader probe (200 primary / 503 standby) for HAProxy |
| heatwave-valkey-cache | none (kamal-net) | valkey/valkey:9.1 | Logical DBs 1,2,4,5 · allkeys-lru · no persistence |
| heatwave-valkey-sessions | none | valkey/valkey:9.1 | Logical DB 0 · noeviction · no persistence |
| heatwave-valkey-queue | none | valkey/valkey:9.1 | Logical DB 3 (Sidekiq) · noeviction + AOF/RDB (durable) |
| heatwave-sftp (SFTPGo) | 2222→2022 (public, PBX-only); 8080 UI (tailnet) | drakkan/sftpgo:v2.6.6 | Switchvox call-records + PBX-backup drop → Cloudflare R2 |
| heatwave-playwright | none (:3000) | mcr.microsoft.com/playwright:v1.60.0-noble | Headless browser for server-side PDF/email/upload flows |
| heatwave-netdata | 19999 | netdata/netdata:v2.10.3 | Per-second observability (host + all of the above) |
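
The logical-DB split across the three Valkey containers listed above can be sketched as a lookup table. This is only an illustration in Python; the real routing lives in the Ruby RedisConfig initializer, and the `valkey_url` helper and its URL shape are assumptions made here for the example.

```python
# Logical-DB assignments copied from the three Valkey rows above.
VALKEY_BY_LOGICAL_DB = {
    0: "heatwave-valkey-sessions",
    1: "heatwave-valkey-cache",
    2: "heatwave-valkey-cache",
    3: "heatwave-valkey-queue",
    4: "heatwave-valkey-cache",
    5: "heatwave-valkey-cache",
}
VALKEY_PORT = 6379  # reachable on the kamal Docker network only


def valkey_url(logical_db: int) -> str:
    """Build a redis:// URL for a logical DB (hypothetical helper)."""
    try:
        host = VALKEY_BY_LOGICAL_DB[logical_db]
    except KeyError:
        raise ValueError(f"no Valkey instance is assigned logical DB {logical_db}") from None
    return f"redis://{host}:{VALKEY_PORT}/{logical_db}"


if __name__ == "__main__":
    for db in sorted(VALKEY_BY_LOGICAL_DB):
        print(db, valkey_url(db))
```

Rejecting an unassigned logical DB outright, rather than falling back to a default instance, keeps a typo from silently landing durable queue data on the non-persistent cache.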

Chicago — chi-latitude-heatwave-02 (100.68.157.49)

| Service (container) | Port | Image | Purpose |
|---|---|---|---|
| heatwave-postgres-replica (standby) | 5432 | …/heatwave-postgres:18-noble | PG18 streaming standby (slot chicago_standby); read-offload target; promoted at flip |
| heatwave-pgbouncer-replica (RO) | 6432 | …/heatwave-pgbouncer:1.25.2 | RO pooler behind the heatwave-db-ro tailnet VIP |
| heatwave-pg-health-replica | 8008 | …/heatwave-pg-health:v3 | Standby recovery-state probe for HAProxy |
| heatwave-netdata-replica | 19999 | netdata/netdata:v2.10.3 | Standby host + replica lag |
| Databasus controller | 4005 (UI) | databasus/databasus:v3.46.0 (official) | Agentless PITR backups (controller-driven pg_basebackup + pg_receivewal) → R2 |

Container images come from GHCR (ghcr.io/warmlyyours/…); registry auth is per-developer via the gh CLI (no shared PAT). Secrets resolve through Kamal’s 1Password adapter (vault IT) — only names are referenced in config, never values.

| Port | Bind | Service | Host | Notes |
|---|---|---|---|---|
| 80/443 | CF tunnel only | web (kamal-proxy) | Dallas | No inbound web port open; DOCKER-USER drops public 80/443 |
| 2222 | public → PBX IP only | SFTPGo SSH | Dallas | Firewalled to 144.202.57.170 (Switchvox) via edge fw + DOCKER-USER --ctorigdstport |
| 22 | tailnet | SSH | both | Latitude edge fw + UFW allow only 100.64.0.0/10 |
| 5432 | tailnet | Postgres | both | primary (Dallas) / standby (Chicago) |
| 6432 | tailnet | pgbouncer | both | RW (Dallas) / RO (Chicago) |
| 6433 | tailnet | HAProxy write-VIP | Dallas | app DB target |
| 8008 | tailnet | pg-health | both | HAProxy httpchk |
| 8404 | tailnet | HAProxy stats / /metrics | Dallas | stats UI + netdata scrape |
| 8080 | tailnet | SFTPGo admin UI | Dallas | |
| 19999 | tailnet | Netdata | both | |
| 4005 | tailnet | Databasus UI | Chicago | |
| 8025 | tailnet | Mailpit UI (staging) | Dallas | SMTP 1025 stays kamal-net-internal |
| 6379 / 3000 | kamal-net only | Valkey ×3 / Playwright | Dallas | not host-published |

  • Write path: app DATABASE_HOST=heatwave-haproxy, DATABASE_PORT=6433 → HAProxy (TCP passthrough) → the live primary’s pgbouncer :6432 (session-mode — the app uses session advisory locks) → Postgres :5432. Both heatwave and heatwave_versions share one FDW-linked cluster.
  • Routing decision: HAProxy httpchks each node’s pg-health :8008; only the node answering 200 "primary" is “up”, so a pg_promote re-routes the write path with no app redeploy.
  • Read offload: the heatwave-db-ro Tailscale VIP follows the standby’s pgbouncer for read-only consumers.
  • Cache/queue: RedisConfig (config/initializers/100_redis_config.rb) routes by logical DB to heatwave-valkey-{cache,sessions,queue}:6379 — see Valkey 3-Flavor Split and DB Tier HA Architecture.
  • bin/recovery <env> {flip-db,rebuild-standby,topology} — promote the standby + reroute the write VIP (cross-DC), rebuild a wiped node from a fresh basebackup, or print the current topology. Snapshots the demoted dataset (zfs snapshot) before wiping, as a rollback net.
  • bin/maintenance {up,down} <env> — full maintenance window (proxy 503 → stop web+sidekiq → on prod also stop the Chicago databasus controller container), and the reverse. See the PG18 Failover Runbook and HAProxy Routing Layer.
  • Databasus PITR (backup-of-record) — the agentless databasus controller container on Chicago streams an encrypted physical base + WAL to Cloudflare R2 bucket heatwave-postgres-backups-production (ENAM region, off-Latitude for DR). Restore to any chosen second via the controller UI (:4005) or bin/restore’s physical/PITR option (wraps config/databasus/databasus-recovery.sh). AES-256-GCM key at /data/databasus-data/secret.key.
  • Databasus is itself disaster-recoverable (config-as-code) — the controller keeps its entire config (admin / workspace / R2 storage / sources / schedules / restore user) only in an embedded Postgres metadata DB on the host, so a host wipe loses it (it did, 2026-06-15) and only secret.key is in 1Password. script/setup_databasus.sh rebuilds the whole thing idempotently in one command; a nightly encrypted metadata snapshot (databasus-metadata-backup.timer → R2 databasus-metadata/) is the turnkey alternative. See config/databasus/README.md.
  • heatwave_versions archive: bin/versions-partitions ships completed annual partitions (>5 yr) to R2 heatwave-versions-archive-production (cold).
  • bin/restore — pulls a recent logical pg_dump from Databasus→R2 to seed a dev database (this is the dev path, not DR). See the DR restore runbook and Databasus PITR.
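
The routing decision in the list above (only the node whose pg-health answers 200 is “up”) can be modelled in a few lines. This is a sketch under assumptions: the function names are hypothetical, the 503 body text is a guess, and returning no backend when more than one node claims to be primary is a conservative choice made here, not documented HAProxy behaviour.

```python
from typing import Dict, Optional, Tuple


def probe_response(in_recovery: bool) -> Tuple[int, str]:
    """Map Postgres recovery state to the probe's HTTP answer.

    `in_recovery` stands for the result of SELECT pg_is_in_recovery().
    """
    return (503, "standby") if in_recovery else (200, "primary")


def pick_write_backend(status_by_node: Dict[str, int]) -> Optional[str]:
    """Return the single node answering 200, or None if zero or several do."""
    up = [node for node, status in status_by_node.items() if status == 200]
    return up[0] if len(up) == 1 else None


if __name__ == "__main__":
    # Before a flip Dallas is primary; after pg_promote on Chicago the answers swap.
    before = {"dal-01": probe_response(False)[0], "chi-02": probe_response(True)[0]}
    after = {"dal-01": probe_response(True)[0], "chi-02": probe_response(False)[0]}
    print(pick_write_backend(before), pick_write_backend(after))
```

Because the decision depends only on the probe answers, the app keeps pointing at heatwave-haproxy:6433 across a flip, which is what makes a redeploy unnecessary.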

Cloudflare R2 buckets (all off-Latitude; bucket-scoped S3 tokens in 1Password vault IT)

| Bucket | Region | Purpose | Producer |
|---|---|---|---|
| heatwave-postgres-backups-production | ENAM | Databasus PITR (physical base + WAL). databasus-metadata/ prefix = the encrypted controller-config snapshots | Databasus agent + backup-metadata.sh |
| heatwave-versions-archive-production | ENAM | Cold annual heatwave_versions partitions (>5 yr) | bin/versions-partitions |
| heatwave-frontend-assets-production | ENAM | Content-hashed webpack assets, served same-origin by the www-edge Worker (survives Kamal deploys) | webpack build / deploy |
| heatwave-call-recordings-production | | Switchvox call recordings (WarmlyYours/ prefix); Sidekiq imports from here | SFTPGo pbx user |
| heatwave-pbx-backups-production | ENAM | Switchvox PBX system backups (bucket root) | SFTPGo pbx-backup user |

R2 bucket location is pinned on first creation of a name — delete+recreate reuses the original location (--location enam only applies to a fresh name). Tokens are minted via script/setup_r2_* and stored in 1Password.

Edge & network protection (what protects it)

  • Ingress: Cloudflare → outbound-only cloudflared QUIC tunnel (host systemd) → kamal-proxy :80. No inbound web port is open on any host. TLS terminates at Cloudflare (proxy.ssl: false); the www-edge Worker fronts www/apex for locale redirects, R2 webpack-asset serving, and cache rules.
  • Firewall — three layers:
    1. Latitude edge firewall — allow SSH 22 from 100.64.0.0/10 (Tailscale) and SFTP 2222 from 144.202.57.170 (PBX); deny the rest.
    2. Host UFW: default deny incoming; allow lo, tailscale0, 22/tcp, and 2222 from the PBX IP.
    3. DOCKER-USER iptables (/usr/local/sbin/docker-user-fw.sh) — drops public 80/443, allows the tailnet, and handles the SFTP DNAT gotcha (match conntrack --ctorigdstport 2222, not --dport, because Docker DNATs 2222→2022 before the rule sees it).
  • Cloudflare Access: staging only. crm/www/api/mcp.warmlyyours.ws are gated to the wy-employees group (24 h session). Production has no CF Access by design (public site; Rails handles its own auth). The planned docs.warmlyyours.dev portal will be gated to @warmlyyours.com.
  • Tailnet: all operator surfaces (SSH, Netdata, HAProxy stats, SFTPGo UI, Databasus, Mailpit, psql) are Tailscale-only. Stable VIPs heatwave-db (→primary) and heatwave-db-ro (→standby) survive CHI↔DAL flips. The Tailscale ACL is Terraform-managed (infra/terraform/tailscale/).
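
The SFTP DNAT gotcha from the firewall layers above is easier to see with a toy packet model. The sketch below is not the contents of docker-user-fw.sh; the `Packet` class and helper names are hypothetical and only mimic the order of events (DNAT first, DOCKER-USER filtering second).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    dport: int       # destination port as the current hook sees it
    orig_dport: int  # original destination port remembered by conntrack


def dnat(packet: Packet, published: int, container: int) -> Packet:
    """Rewrite the destination port the way a published-port DNAT would."""
    if packet.dport == published:
        return replace(packet, dport=container)
    return packet


def matches_dport(packet: Packet, port: int) -> bool:
    """Model of an iptables --dport match (sees the rewritten port)."""
    return packet.dport == port


def matches_ctorigdstport(packet: Packet, port: int) -> bool:
    """Model of a conntrack --ctorigdstport match (sees the original port)."""
    return packet.orig_dport == port


if __name__ == "__main__":
    incoming = Packet(dport=2222, orig_dport=2222)
    seen_by_docker_user = dnat(incoming, published=2222, container=2022)
    print("--dport 2222 matches:", matches_dport(seen_by_docker_user, 2222))
    print("--ctorigdstport 2222 matches:", matches_ctorigdstport(seen_by_docker_user, 2222))
```

By the time the packet reaches DOCKER-USER its destination port is already 2022, so a rule keyed on `--dport 2222` never fires, while the conntrack match still sees the port the PBX actually dialed.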

All infra is OpenTofu/Terraform under infra/terraform/, applied by HCP Terraform (org warmlyyours), VCS-driven from this repo (auto-apply off — plans are reviewed, applies are manual).

| Workspace | Module | Manages |
|---|---|---|
| heatwave-latitude-production | latitude/ | The Latitude bare-metal server + cloud-init + per-host edge firewall |
| heatwave-host-config | host-config/ | Re-runs the idempotent provision-host.sh over SSH (postfix relay, pg-maintenance/logwatch timers, ZED); no reinstall |
| heatwave-tailscale | tailscale/ | Tailnet ACL + the heatwave-db / heatwave-db-ro VIP services |
| heatwave-cloudflare-zone-{production,staging} | cloudflare-zone-*/ | CF zone rulesets (WAF, cache, transforms). The IaC source of truth; the dashboard is read-only |

  • TFC agent: a hashicorp/tfc-agent container on Chicago (--network host, pool heatwave-tailnet) runs the agent-mode workspaces’ SSH provisioners — the one thing a hosted runner can’t do to a tailnet-only host. Dials out to app.terraform.io only. It is re-established by re-running the container (token op://IT/TFC-agent-token (heatwave-tailnet)).
  • ⚠️ Latitude reinstall landmine (now guarded): a user_data change on latitudesh_server.host triggers a full server reinstall (all host data lost). On 2026-06-15 an approved plan whose only effective diff was a provision-host.sh edit reinstalled the live Chicago standby — wiping the PG standby + the Databasus agent and killing the in-flight TFC agent (it runs on Chicago). latitude/main.tf now carries lifecycle { ignore_changes = [user_data, billing] }, so editing cloud-init / provision-host.sh is inert w.r.t. a running box (day-2 host config flows through host-config over SSH). A genuine rebuild is now explicit: tofu apply -replace=latitudesh_server.host. Full record: doc/tasks/202606151240_CHICAGO_REINSTALL_DR_RECOVERY.md.
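
Why the `ignore_changes` guard makes a user_data edit inert can be shown with a toy model of the plan decision. This is a deliberate simplification written for this page, not Terraform's actual planner; `plan_action` is a hypothetical name, and the `hostname` attribute is made up for the example (user_data and billing come from the bullet above).

```python
from typing import Dict, Set


def plan_action(old: Dict[str, str], new: Dict[str, str],
                force_new: Set[str], ignore_changes: Set[str]) -> str:
    """Classify a diff as 'replace', 'update' or 'no-op' (toy model)."""
    changed = {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}
    effective = changed - ignore_changes  # ignored attributes drop out of the diff
    if effective & force_new:
        return "replace"  # for this resource, a full server reinstall
    return "update" if effective else "no-op"


if __name__ == "__main__":
    old = {"user_data": "cloud-init v1", "hostname": "chi-02"}
    new = {"user_data": "cloud-init v2", "hostname": "chi-02"}
    print(plan_action(old, new, force_new={"user_data"}, ignore_changes=set()))
    print(plan_action(old, new, force_new={"user_data"},
                      ignore_changes={"user_data", "billing"}))
```

With the guard in place the same edit produces no planned action, so a rebuild only happens when someone asks for it explicitly with `-replace`.
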

| Tool | Status | Notes |
|---|---|---|
| AppSignal | Live | Current APM / exception / host-metric sink (Heatwave/production + /staging) |
| Netdata | Live | Per-second infra metrics; per-host agents (not a parent/child stream) |
| PgHero | Live | Rails-mounted PG performance dashboard |
| HyperDX | Planned | ClickHouse-backed traces/logs; intended to replace AppSignal; not in deploy.yml yet |

ZFS pool DEGRADED/FAULTED alerting is handled by ZED (ZFS Event Daemon) over the postfix→SendGrid relay, not Netdata (the container→host firewall blocks zpool collection).

Staging co-locates on the Dallas box under the heatwave-staging- service prefix. It mirrors the full prod topology (primary + same-host standby for HAProxy-failover rehearsal, Valkey ×3, pgbouncer, HAProxy), but every port binds 127.0.0.1. The one exception is Mailpit, which binds the tailnet; staging Netdata stays on localhost because prod owns the tailnet :19999. Staging hostnames crm/www/api/mcp.warmlyyours.ws are behind Cloudflare Access (wy-employees). Staging sends no real mail; it is captured by Mailpit (see the admin-services table above).

When config/deploy.yml / deploy.staging.yml, config/netdata/, config/haproxy/, or infra/terraform/ change, update this page. The facts here were extracted from those files on the date above; the files are the source of truth, and this page is the index.