Infrastructure & Monitoring Inventory

The single-page answer to "what runs where, on which IP and port, what protects
it, and what URL do I use to reach it."
This is a hand-maintained reference;
the source of truth is the config it summarizes (config/deploy.yml,
config/deploy.staging.yml, config/netdata/, config/haproxy/,
infra/terraform/). Keep it in sync when those change.

Last verified: 2026-06-15 (PG18, prod-on-Kamal, Dallas-primary / Chicago-standby).

:::note[How to read the "reach" column]

  • tailnet = reachable only over Tailscale (the box's 100.x IP); not public.
  • localhost = bound to 127.0.0.1 on the box; reach via SSH port-forward.
  • public = bound to 0.0.0.0; the only such port in the fleet is SFTP 2222, and it's firewalled to the PBX source IP.
  • CF tunnel = arrives via an outbound-only Cloudflare Tunnel → kamal-proxy; no inbound web port is open on any host.
    :::
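
As a rough illustration of the first three labels, the sketch below classifies a bind address with Python's standard `ipaddress` module. The function name `classify_reach` and the "other" fallback are made up for this example and do not exist in the repo. The "CF tunnel" label describes an ingress path rather than a bind address, so it is not covered.

```python
import ipaddress

# Tailscale hands out node addresses from the CGNAT range 100.64.0.0/10.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def classify_reach(bind_ip: str) -> str:
    """Map a bind address to the reach label used on this page."""
    addr = ipaddress.ip_address(bind_ip)
    if addr.is_loopback:
        return "localhost"
    if addr in TAILNET:
        return "tailnet"
    if addr.is_unspecified:  # 0.0.0.0 means "all interfaces"
        return "public"
    return "other"  # hypothetical fallback, not a label this page uses


if __name__ == "__main__":
    for ip in ("100.123.47.52", "127.0.0.1", "0.0.0.0"):
        print(ip, "->", classify_reach(ip))
```

A bind of 0.0.0.0 only says the socket listens on every interface; whether it is actually reachable from the internet still depends on the firewall layers described later on this page.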

Fleet — physical hosts (where)

Two Latitude.sh bare-metal boxes (f4.metal.medium: AMD EPYC 4564P 16c/32t,
192 GB RAM, 2×480 GB NVMe RAID1 + 2×2 TB NVMe, 2×10 GbE). The Kamal config
addresses them exclusively by tailnet IP; the public IPs are physical-host
facts (used for SSH-over-tailnet only — there is no public SSH).

| Host | DC | Public IP | Tailnet IP | Role today |
| --- | --- | --- | --- | --- |
| dal-latitude-heatwave-01 | Dallas | 67.213.118.15 | 100.123.47.52 | Prod app + DB primary, and the entire staging stack (co-located) |
| chi-latitude-heatwave-02 | Chicago | 186.233.186.45 | 100.68.157.49 | Prod DB standby + Databasus PITR backups (no app today) |

At a planned Chicago cutover ("W3") the app role flips to chi-02 via
pg_promote + HAProxy reroute — see Failover & maintenance tooling below.
Today, all production web/Sidekiq traffic is served from Dallas.

                 Internet
                    │  (HTTPS)
            ┌───────▼────────┐
            │   Cloudflare    │  edge TLS, www-edge Worker, WAF
            └───────┬────────┘
                    │  outbound-only QUIC tunnel (cloudflared, host systemd)
        ┌───────────▼───────────── dal-01 (Dallas, 100.123.47.52) ──────────────┐
        │  kamal-proxy :80 ─▶ web / sidekiq (Puma, Thruster)                     │
        │        │                                                              │
        │        ▼  DATABASE_HOST=heatwave-haproxy:6433                          │
        │  HAProxy :6433 (TCP write-VIP) ─▶ pgbouncer :6432 ─▶ Postgres :5432    │
        │        ▲ httpchk pg-health :8008                    (PG18 PRIMARY)     │
        │  Valkey ×3 (cache / sessions / queue) · Playwright · SFTPGo · Netdata  │
        └───────────────────────────────┬───────────────────────────────────────┘
                                         │ streaming replication (slot chicago_standby)
        ┌────────────────────────────────▼──── chi-02 (Chicago, 100.68.157.49) ──┐
        │  Postgres :5432 (PG18 STANDBY) ◀ pgbouncer :6432 (RO VIP heatwave-db-ro)│
        │  pg-health :8008 · Netdata · Databasus agent (PITR → R2) + UI :4005     │
        └─────────────────────────────────────────────────────────────────────────┘

Administrative services — URLs to use (how to access)

Get on the tailnet first (tailscale up; the boxes only accept SSH and
admin UIs from 100.64.0.0/10). Admin logins live in 1Password vault IT.

| Service | URL | Reach / gating | What it shows |
| --- | --- | --- | --- |
| Netdata (prod, Dallas) | http://100.123.47.52:19999 | tailnet only | Per-second host + every container + Postgres (primary) + Valkey ×3 + pgbouncer + HAProxy + systemd units |
| Netdata (standby, Chicago) | http://100.68.157.49:19999 | tailnet only | Standby host + Postgres replica (recovery state / replication lag) + RO pgbouncer |
| Netdata (staging) | ssh -L 19999:localhost:19999 deploy@100.123.47.52, then http://localhost:19999 | localhost (SSH-forward) | Staging stack (prod owns the tailnet :19999 on the shared box) |
| PgHero | https://crm.warmlyyours.com/pghero | Rails admin login (crm.* host, is_admin?) | Slow queries (pg_stat_statements), table/index bloat, live queries, vacuum, index suggestions (hypopg) |
| Sidekiq Web UI | https://crm.warmlyyours.com/sidekiq | Rails admin login | Background-job queues, retries, scheduled set |
| Rails Event Store browser | https://crm.warmlyyours.com/res | Rails admin login | The RES domain-event stream |
| HAProxy stats | http://100.123.47.52:8404/ | tailnet only | DB write-VIP backend health (which node is "up") |
| SFTPGo admin | http://100.123.47.52:8080 | tailnet only | SFTP users/sessions (call-records + PBX backups → R2) |
| Databasus (PITR controller) | http://100.68.157.49:4005 | tailnet only | Configure/monitor backup sources; trigger PITR restores. Login: op://IT/Databasus (Postgres Backup) |
| Mailpit (staging) | http://100.123.47.52:8025 | tailnet only | Captured staging outbound mail |
| Mailpit (dev) | http://localhost:8025 | local docker-compose | Captured dev outbound mail |
| AppSignal | https://appsignal.com/warmlyyours | SaaS login | APM, exceptions, traces (apps Heatwave/production + Heatwave/staging) |
| HyperDX | | planned, not deployed | Future app traces/logs/errors (ClickHouse; will replace AppSignal) |

Service topology (what runs where) — production

All ports below are bound to the tailnet IP unless noted. App→service wiring
goes over the internal kamal Docker network by DNS name, not these host ports
(host ports are for operators). Image tags here are indicative —
config/deploy.yml is authoritative for exact tags/digests (e.g. the haproxy
@sha256 pin).

Dallas — dal-latitude-heatwave-01 (100.123.47.52)

| Service (container) | Port (host→ctr) | Image | Purpose |
| --- | --- | --- | --- |
| kamal-proxy | (CF tunnel → :80) | kamal-proxy | Rolling-deploy reverse proxy; health /up; routes crm/www/api/scan/mcp.warmlyyours.com |
| web / sidekiq | none published | ghcr.io/warmlyyours/heatwave | Rails (Puma via Thruster :80→:3000); one consolidated Sidekiq |
| heatwave-postgres (primary) | 5432 | ghcr.io/warmlyyours/heatwave-postgres:18-noble | PG18.4 RW primary; data on ZFS tank/prod-replica |
| heatwave-pgbouncer (RW) | 6432 | …/heatwave-pgbouncer:1.25.2 | Session-mode pooler in front of the primary |
| heatwave-haproxy (write VIP) | 6433 (+ 8404 stats) | haproxy:3.0-alpine | TCP failover router (the app's DATABASE_HOST); httpchk on each node's pg-health keeps only the live primary "up" |
| heatwave-pg-health | 8008 | …/heatwave-pg-health:v3 | HTTP leader probe (200 primary / 503 standby) for HAProxy |
| heatwave-valkey-cache | none (kamal-net) | valkey/valkey:9.1 | Logical DBs 1,2,4,5 · allkeys-lru · no persistence |
| heatwave-valkey-sessions | none | valkey/valkey:9.1 | Logical DB 0 · noeviction · no persistence |
| heatwave-valkey-queue | none | valkey/valkey:9.1 | Logical DB 3 (Sidekiq) · noeviction + AOF/RDB (durable) |
| heatwave-sftp (SFTPGo) | 2222→2022 (public, PBX-only); 8080 UI (tailnet) | drakkan/sftpgo:v2.6.6 | Switchvox call-records + PBX-backup drop → Cloudflare R2 |
| heatwave-playwright | none (:3000) | mcr.microsoft.com/playwright:v1.60.0-noble | Headless browser for server-side PDF/email/upload flows |
| heatwave-netdata | 19999 | netdata/netdata:v2.10.3 | Per-second observability (host + all of the above) |
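
The three Valkey rows split six logical DBs across containers. The lookup below is a minimal Python sketch of that split, transcribed from the table. It is an illustration only: the real routing lives in the Rails RedisConfig initializer, and the names `VALKEY_BY_DB` and `valkey_host` are hypothetical.

```python
# Logical DB -> Valkey container, copied from the three Valkey rows above.
VALKEY_BY_DB = {
    0: "heatwave-valkey-sessions",
    1: "heatwave-valkey-cache",
    2: "heatwave-valkey-cache",
    3: "heatwave-valkey-queue",  # Sidekiq
    4: "heatwave-valkey-cache",
    5: "heatwave-valkey-cache",
}
VALKEY_PORT = 6379  # kamal-net only; not host-published


def valkey_host(db: int) -> str:
    """Return the container that owns a logical DB number."""
    try:
        return VALKEY_BY_DB[db]
    except KeyError:
        raise ValueError(f"logical DB {db} is not assigned to any Valkey flavor") from None


if __name__ == "__main__":
    for db in sorted(VALKEY_BY_DB):
        print(db, "->", f"{valkey_host(db)}:{VALKEY_PORT}")
```

Raising on an unassigned DB number, rather than defaulting to the cache instance, is a choice made for this sketch so that a mis-typed DB index fails loudly.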

Chicago — chi-latitude-heatwave-02 (100.68.157.49)

| Service (container) | Port | Image | Purpose |
| --- | --- | --- | --- |
| heatwave-postgres-replica (standby) | 5432 | …/heatwave-postgres:18-noble | PG18 streaming standby (slot chicago_standby); read-offload target; promoted at flip |
| heatwave-pgbouncer-replica (RO) | 6432 | …/heatwave-pgbouncer:1.25.2 | RO pooler behind the heatwave-db-ro tailnet VIP |
| heatwave-pg-health-replica | 8008 | …/heatwave-pg-health:v3 | Standby recovery-state probe for HAProxy |
| heatwave-netdata-replica | 19999 | netdata/netdata:v2.10.3 | Standby host + replica lag |
| Databasus controller | 4005 (UI) | databasus/databasus:v3.46.0 (official) | Agentless PITR backups (controller-driven pg_basebackup + pg_receivewal) → R2 |

Container images come from GHCR (ghcr.io/warmlyyours/…); registry auth is
per-developer via the gh CLI (no shared PAT). Secrets resolve through Kamal's
1Password adapter (vault IT) — only names are referenced in config, never
values.

Port-exposure map (the crux)

| Port | Bind | Service | Host | Notes |
| --- | --- | --- | --- | --- |
| 80/443 | CF tunnel only | web (kamal-proxy) | Dallas | No inbound web port open; DOCKER-USER drops public 80/443 |
| 2222 | public → PBX IP only | SFTPGo SSH | Dallas | Firewalled to 144.202.57.170 (Switchvox) via edge fw + DOCKER-USER --ctorigdstport |
| 22 | tailnet | SSH | both | Latitude edge fw + UFW allow only 100.64.0.0/10 |
| 5432 | tailnet | Postgres | both | primary (Dallas) / standby (Chicago) |
| 6432 | tailnet | pgbouncer | both | RW (Dallas) / RO (Chicago) |
| 6433 | tailnet | HAProxy write-VIP | Dallas | app DB target |
| 8008 | tailnet | pg-health | both | HAProxy httpchk |
| 8404 | tailnet | HAProxy stats / /metrics | Dallas | stats UI + netdata scrape |
| 8080 | tailnet | SFTPGo admin UI | Dallas | |
| 19999 | tailnet | Netdata | both | |
| 4005 | tailnet | Databasus UI | Chicago | |
| 8025 | tailnet | Mailpit UI (staging) | Dallas | SMTP 1025 stays kamal-net-internal |
| 6379 / 3000 | kamal-net only | Valkey ×3 / Playwright | Dallas | not host-published |
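
One way to keep this map honest is to encode it as data and assert the invariant that 2222 is the only public bind. The sketch below does that in Python. The `Exposure` type, the bind labels, and `unexpected_public_ports` are hypothetical names for this example, and the rows are hand-transcribed from the table rather than read from config/deploy.yml.

```python
from typing import NamedTuple


class Exposure(NamedTuple):
    port: int
    bind: str  # "tunnel", "public", "tailnet", or "kamal-net"
    service: str


# Transcribed from the table above; 80/443 and 6379/3000 are split per port.
EXPOSURES = [
    Exposure(80, "tunnel", "web (kamal-proxy)"),
    Exposure(443, "tunnel", "web (kamal-proxy)"),
    Exposure(2222, "public", "SFTPGo SSH"),
    Exposure(22, "tailnet", "SSH"),
    Exposure(5432, "tailnet", "Postgres"),
    Exposure(6432, "tailnet", "pgbouncer"),
    Exposure(6433, "tailnet", "HAProxy write-VIP"),
    Exposure(8008, "tailnet", "pg-health"),
    Exposure(8404, "tailnet", "HAProxy stats"),
    Exposure(8080, "tailnet", "SFTPGo admin UI"),
    Exposure(19999, "tailnet", "Netdata"),
    Exposure(4005, "tailnet", "Databasus UI"),
    Exposure(8025, "tailnet", "Mailpit UI (staging)"),
    Exposure(6379, "kamal-net", "Valkey x3"),
    Exposure(3000, "kamal-net", "Playwright"),
]

ALLOWED_PUBLIC = {2222}  # SFTP, firewalled to the PBX source IP


def unexpected_public_ports(rows):
    """Return public-bound ports that are not on the allow-list."""
    return sorted(
        r.port for r in rows if r.bind == "public" and r.port not in ALLOWED_PUBLIC
    )


if __name__ == "__main__":
    print(unexpected_public_ports(EXPOSURES))
```

An empty result means the transcribed table still matches the "only SFTP 2222 is public" claim; any other output names the ports that would need a firewall review.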

Data tier (DB read/write path)

  • Write path: app DATABASE_HOST=heatwave-haproxy, DATABASE_PORT=6433 →
    HAProxy (TCP passthrough) → the live primary's pgbouncer :6432
    (session-mode — the app uses session advisory locks) → Postgres :5432.
    Both heatwave and heatwave_versions share one FDW-linked cluster.
  • Routing decision: HAProxy httpchks each node's pg-health :8008; only
    the node answering 200 "primary" is "up", so a pg_promote re-routes the
    write path with no app redeploy.
  • Read offload: the heatwave-db-ro Tailscale VIP follows the standby's
    pgbouncer for read-only consumers.
  • Cache/queue: RedisConfig (config/initializers/100_redis_config.rb)
    routes by logical DB to heatwave-valkey-{cache,sessions,queue}:6379 — see
    Valkey 3-Flavor Split and DB Tier HA Architecture.
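
The routing decision in the list above reduces to "a node is up only if its pg-health probe returns 200". The Python sketch below models just that rule. It is a simplification, not HAProxy's actual check logic (rise/fall counters and timeouts are ignored), and the node names and the post-flip probe state are assumptions for the example.

```python
from typing import Dict, List, Optional

PRIMARY_STATUS = 200  # pg-health answers 200 on the primary, 503 on a standby


def up_backends(probe: Dict[str, Optional[int]]) -> List[str]:
    """Return the nodes a 200-only health check would mark "up".

    `probe` maps a node name to its pg-health HTTP status, or None when the
    probe could not connect at all.
    """
    return [node for node, status in probe.items() if status == PRIMARY_STATUS]


if __name__ == "__main__":
    before = {"dal-01": 200, "chi-02": 503}
    after_flip = {"dal-01": None, "chi-02": 200}  # assumed post-promotion state
    print(up_backends(before))
    print(up_backends(after_flip))
```

Because the app only ever addresses heatwave-haproxy:6433, the set of "up" nodes can change underneath it without touching DATABASE_HOST, which is why a promotion needs no redeploy.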

Failover & maintenance tooling

  • bin/recovery <env> {flip-db,rebuild-standby,topology} — promote the
    standby + reroute the write VIP (cross-DC), rebuild a wiped node from a fresh
    basebackup, or print the current topology. Snapshots the demoted dataset
    (zfs snapshot) before wiping, as a rollback net.
  • bin/maintenance {up,down} <env> — full maintenance window (proxy 503 →
    stop web+sidekiq → on prod also stop the Chicago databasus controller container),
    and the reverse. See the PG18 Failover Runbook and
    HAProxy Routing Layer.

Backups & disaster recovery

  • Databasus PITR (backup-of-record) — the agentless databasus controller
    container on Chicago streams an encrypted physical base + WAL to Cloudflare R2
    bucket heatwave-postgres-backups-production (ENAM region, off-Latitude for DR).
    Restore to any chosen second via the controller UI (:4005) or bin/restore's
    physical/PITR option (wraps config/databasus/databasus-recovery.sh).
    AES-256-GCM key at /data/databasus-data/secret.key.
  • Databasus is itself disaster-recoverable (config-as-code): the controller
    keeps its entire config (admin / workspace / R2 storage / sources / schedules /
    restore user) only in an embedded Postgres metadata DB on the host, so a host
    wipe loses it (as happened on 2026-06-15); only secret.key is in 1Password.
    script/setup_databasus.sh rebuilds the whole thing idempotently in one command;
    a nightly encrypted metadata snapshot (databasus-metadata-backup.timer → R2
    databasus-metadata/) is the turnkey alternative. See config/databasus/README.md.
  • heatwave_versions archive: bin/versions-partitions ships completed
    annual partitions (>5 yr) to R2 heatwave-versions-archive-production (cold).
  • bin/restore — pulls a recent logical pg_dump from Databasus→R2 to
    seed a dev database (this is the dev path, not DR). See the
    DR restore runbook and
    Databasus PITR.

Cloudflare R2 buckets (all off-Latitude; bucket-scoped S3 tokens in 1Password vault IT)

| Bucket | Region | Purpose | Producer |
| --- | --- | --- | --- |
| heatwave-postgres-backups-production | ENAM | Databasus PITR (physical base + WAL). databasus-metadata/ prefix = the encrypted controller-config snapshots | Databasus agent + backup-metadata.sh |
| heatwave-versions-archive-production | ENAM | Cold annual heatwave_versions partitions (>5 yr) | bin/versions-partitions |
| heatwave-frontend-assets-production | ENAM | Content-hashed webpack assets, served same-origin by the www-edge Worker (survives Kamal deploys) | webpack build / deploy |
| heatwave-call-recordings-production | | Switchvox call recordings (WarmlyYours/ prefix); Sidekiq imports from here | SFTPGo pbx user |
| heatwave-pbx-backups-production | ENAM | Switchvox PBX system backups (bucket root) | SFTPGo pbx-backup user |

R2 bucket location is pinned on first creation of a name — delete+recreate reuses the
original location (--location enam only applies to a fresh name). Tokens are minted via
script/setup_r2_* and stored in 1Password.

Edge & network protection (what protects it)

  • Ingress: Cloudflare → outbound-only cloudflared QUIC tunnel (host
    systemd) → kamal-proxy :80. No inbound web port is open on any host. TLS
    terminates at Cloudflare (proxy.ssl: false); the www-edge Worker fronts
    www/apex for locale redirects, R2 webpack-asset serving, and cache rules.
  • Firewall — three layers:
    1. Latitude edge firewall — allow SSH 22 from 100.64.0.0/10 (Tailscale)
      and SFTP 2222 from 144.202.57.170 (PBX); deny the rest.
    2. Host UFW: default deny incoming; allow lo, tailscale0, 22/tcp,
      and 2222 from the PBX IP.
    3. DOCKER-USER iptables (/usr/local/sbin/docker-user-fw.sh) — drops
      public 80/443, allows the tailnet, and handles the SFTP DNAT gotcha
      (match conntrack --ctorigdstport 2222, not --dport, because Docker
      DNATs 2222→2022 before the rule sees it).
  • Cloudflare Access: staging only. crm/www/api/mcp.warmlyyours.ws are
    gated to the wy-employees group (24 h session). Production has no CF
    Access by design (public site; Rails handles its own auth). The planned
    docs.warmlyyours.dev portal will be gated to @warmlyyours.com.
  • Tailnet: all operator surfaces (SSH, Netdata, HAProxy stats, SFTPGo UI,
    Databasus, Mailpit, psql) are Tailscale-only. Stable VIPs heatwave-db
    (→primary) and heatwave-db-ro (→standby) survive CHI↔DAL flips. The
    Tailscale ACL is Terraform-managed (infra/terraform/tailscale/).
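
The SFTP DNAT gotcha from the firewall layers above is easier to see in a toy model. The Python sketch below assumes the usual netfilter ordering (Docker's DNAT runs in nat PREROUTING, before the DOCKER-USER filter chain) and compares a rule keyed on --dport with one keyed on conntrack's --ctorigdstport. The `Packet` type and both rule functions are hypothetical stand-ins, not the contents of docker-user-fw.sh.

```python
from typing import NamedTuple

PBX_IP = "144.202.57.170"


class Packet(NamedTuple):
    src_ip: str
    dport: int       # destination port at filter time (after DNAT)
    orig_dport: int  # original destination port, as tracked by conntrack


def docker_dnat(src_ip: str, dport: int) -> Packet:
    """Model Docker's published-port DNAT: host 2222 becomes container 2022."""
    new_dport = 2022 if dport == 2222 else dport
    return Packet(src_ip, new_dport, dport)


def rule_dport(pkt: Packet) -> bool:
    """Naive rule: source is the PBX and --dport is 2222."""
    return pkt.src_ip == PBX_IP and pkt.dport == 2222


def rule_ctorigdstport(pkt: Packet) -> bool:
    """Working rule: source is the PBX and --ctorigdstport is 2222."""
    return pkt.src_ip == PBX_IP and pkt.orig_dport == 2222


if __name__ == "__main__":
    pkt = docker_dnat(PBX_IP, 2222)
    print("dport rule matches:", rule_dport(pkt))
    print("ctorigdstport rule matches:", rule_ctorigdstport(pkt))
```

In this model the --dport rule can never match SFTP traffic, because by the time the filter chain runs the packet is already addressed to 2022; matching on the original port is what lets the PBX allow-rule fire.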

Infrastructure as code (Terraform Cloud)

All infra is OpenTofu/Terraform under infra/terraform/, applied by HCP Terraform
(org warmlyyours), VCS-driven from this repo (auto-apply off — plans are reviewed,
applies are manual).

| Workspace | Module | Manages |
| --- | --- | --- |
| heatwave-latitude-production | latitude/ | The Latitude bare-metal server + cloud-init + per-host edge firewall |
| heatwave-host-config | host-config/ | Re-runs the idempotent provision-host.sh over SSH (postfix relay, pg-maintenance/logwatch timers, ZED); no reinstall |
| heatwave-tailscale | tailscale/ | Tailnet ACL + the heatwave-db / heatwave-db-ro VIP services |
| heatwave-cloudflare-zone-{production,staging} | cloudflare-zone-*/ | CF zone rulesets (WAF, cache, transforms). IaC is the source of truth; the dashboard is read-only |

  • TFC agent: a hashicorp/tfc-agent container on Chicago (--network host, pool
    heatwave-tailnet) runs the agent-mode workspaces' SSH provisioners — the one thing a
    hosted runner can't do to a tailnet-only host. Dials out to app.terraform.io only. It
    is re-established by re-running the container (token op://IT/TFC-agent-token (heatwave-tailnet)).
  • ⚠️ Latitude reinstall landmine (now guarded): a user_data change on
    latitudesh_server.host triggers a full server reinstall (all host data lost). On
    2026-06-15 an approved plan whose only effective diff was a provision-host.sh edit
    reinstalled the live Chicago standby — wiping the PG standby + the Databasus agent and
    killing the in-flight TFC agent (it runs on Chicago). latitude/main.tf now carries
    lifecycle { ignore_changes = [user_data, billing] }, so editing cloud-init /
    provision-host.sh is inert w.r.t. a running box (day-2 host config flows through
    host-config over SSH). A genuine rebuild is now explicit:
    tofu apply -replace=latitudesh_server.host. Full record:
    doc/tasks/202606151240_CHICAGO_REINSTALL_DR_RECOVERY.md.
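
As a complement to the lifecycle guard described above, a reviewer could scan a plan's JSON form for destructive actions before approving. The Python sketch below reads the `resource_changes[].change.actions` structure that `tofu show -json` (or `terraform show -json`) emits for a saved plan. The helper name `resources_to_destroy` and the sample plan are hypothetical; nothing like this is claimed to exist in the repo.

```python
import json
from typing import List


def resources_to_destroy(plan: dict) -> List[str]:
    """List addresses whose planned actions include "delete".

    A replacement shows up in plan JSON as a delete paired with a create, so
    this also catches reinstall-style replaces.
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


if __name__ == "__main__":
    # Hypothetical plan excerpt; a real one comes from the plan's JSON output.
    sample = json.loads("""
    {"resource_changes": [
      {"address": "latitudesh_server.host",
       "change": {"actions": ["delete", "create"]}},
      {"address": "example_resource.unchanged",
       "change": {"actions": ["no-op"]}}
    ]}
    """)
    print(resources_to_destroy(sample))
```

A non-empty list on a plan that was expected to be a config-only change is the signal to stop and read the diff again.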

Application observability

| Tool | Status | Notes |
| --- | --- | --- |
| AppSignal | Live | Current APM / exception / host-metric sink (Heatwave/production + /staging) |
| Netdata | Live | Per-second infra metrics; per-host agents (not a parent/child stream) |
| PgHero | Live | Rails-mounted PG performance dashboard |
| HyperDX | Planned | ClickHouse-backed traces/logs; intended to replace AppSignal; not in deploy.yml yet |

ZFS pool DEGRADED/FAULTED alerting is handled by ZED (ZFS Event Daemon)
over the postfix→SendGrid relay, not Netdata (the container→host firewall
blocks zpool collection).

Staging (summary)

Staging co-locates on the Dallas box under the heatwave-staging- service
prefix. It mirrors the full prod topology (primary + same-host standby for
HAProxy-failover rehearsal, Valkey ×3, pgbouncer, HAProxy), but every port binds
127.0.0.1 except Mailpit, which is on the tailnet at :8025; staging Netdata stays
on localhost because prod owns the tailnet :19999. Staging hostnames
crm/www/api/mcp.warmlyyours.ws are behind
Cloudflare Access (wy-employees). Staging sends no real mail — it's
captured by Mailpit (see the admin-services table above).

Keeping this current

When config/deploy.yml / deploy.staging.yml, config/netdata/,
config/haproxy/, or infra/terraform/ change, update this page. The facts
here were extracted from those files on the date above; they are the source of
truth, and this page is the index.