Valkey three-flavor split — cache / sessions / queue

Status: PR 1 implemented (config + code). The cutover is a separate, deliberate step on the team’s schedule and is NOT performed by this PR.

This change splits the single consolidated Valkey 9.1 accessory into three purpose-built instances. Each instance gets its own eviction/durability policy (maxmemory-policy is instance-wide, so one instance cannot satisfy both needs), and the eventual cross-DC replication only has to cover the tiny durable store. It supersedes the single-instance design in 202606062037_REDIS_VALKEY_CUTOVER_PLAN.md. Target HA model: active/passive (one region serves, the other is a warm standby).

0. Timing & safety (read first)

Production is now a single file, config/deploy.yml, deployed with a bare kamal deploy (the per-DC destination files were removed in the DB-tier merge). It is currently Dallas-primary (100.123.47.52); a CHI↔DAL flip swaps the host IPs in that file, and the valkey accessories’ host: lines move with the rest.
The app code is backward-compatible: with only the legacy REDIS_HOST set, all three flavors resolve to the one instance. The split activates only where the three accessories and REDIS_{CACHE,SESSIONS,QUEUE}_HOST are deployed.
kamal deploy never boots accessories, so a deploy alone does not start the three instances. The split goes live only when you boot the three and then deploy, in that order (§3).
Staging validation is independent of prod and can be done at any time.

1. Why split

One instance served both must-not-evict Sidekiq (DB 3, no TTL) and TTL’d caches/sessions under a single volatile-lru compromise. That policy only ever protected the no-TTL Sidekiq keys. Sessions carry a 7-day TTL (RedisCacheStore expires_in: 7.days), so they are eviction candidates under volatile-lru: a fragment/api-cache fill to maxmemory could silently evict cold sessions, which means involuntary logouts. Splitting lets each store get the right policy and shrinks PR 2’s replication surface to the ~11k-key queue.

flavor	logical DBs	policy	persistence	replicated	scope
cache	1 geocoder · 2 Action Cable · 4 Rails.cache · 5 api_cache	`allkeys-lru`	none	no	per app server
sessions	0 sessions	`noeviction`	none	no	per app server
queue	3 Sidekiq	`noeviction`	AOF + RDB	yes (PR 2)	one primary

Under active/passive the cross-region wrinkles vanish: only one region serves at a time, so per-region caches, Action Cable pub-sub, and rate-limit counters are all correct, and session loss on failover costs each user one re-login.

2. PR 1 — the split (this change, no replication yet)

App code

config/initializers/100_redis_config.rb — RedisConfig maps each logical DB → a flavor (FLAVORS) and resolves the per-flavor config block. Every caller already routed by DB, so no call site changed. Back-compat: a config with no flavor keys falls back to the top-level block; REDIS_*_HOST each fall back to REDIS_HOST.
config/redis_consolidated.example.yml — three blocks per env; Kamal drives REDIS_{CACHE,SESSIONS,QUEUE}_HOST. (Dockerfile copies → redis_consolidated.yml.)
test/lib/redis_config_test.rb — routing + fallback.
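
The routing and fallback described above can be sketched roughly as follows. This is a hypothetical illustration, not the contents of 100_redis_config.rb: the module name RedisFlavorSketch and the methods flavor_for, host_for, and block_for are invented here, the DB-to-flavor map is copied from the table in §1, and the per-flavor YAML keys (cache, sessions, queue) are assumed.

```ruby
# Hypothetical sketch of DB -> flavor routing with legacy fallback.
# Names here are illustrative; the real RedisConfig may differ.
module RedisFlavorSketch
  # Logical DB number => flavor, copied from the table in section 1.
  FLAVORS = {
    0 => :sessions,
    1 => :cache,   # geocoder
    2 => :cache,   # Action Cable
    3 => :queue,   # Sidekiq
    4 => :cache,   # Rails.cache
    5 => :cache    # api_cache
  }.freeze

  module_function

  def flavor_for(db)
    FLAVORS.fetch(db) { raise ArgumentError, "no flavor mapped for DB #{db}" }
  end

  # Per-flavor host from the environment; an unset or empty
  # REDIS_<FLAVOR>_HOST falls back to the legacy REDIS_HOST.
  def host_for(db, env = ENV)
    specific = env["REDIS_#{flavor_for(db).to_s.upcase}_HOST"]
    return specific unless specific.nil? || specific.empty?

    env.fetch("REDIS_HOST")
  end

  # Per-flavor config block; a config with no flavor keys (the legacy
  # single-instance shape) falls back to the top-level block.
  def block_for(db, config)
    config.fetch(flavor_for(db).to_s, config)
  end
end
```

With only REDIS_HOST set, host_for returns the same host for every DB, which is the back-compat path; setting the three REDIS_*_HOST variables is what activates the split.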

Valkey configs (env-agnostic, role-based)

config/valkey/cache.conf — allkeys-lru, save "", appendonly no.
config/valkey/sessions.conf — noeviction, save "", appendonly no.
config/valkey/queue.conf — noeviction, appendonly yes, RDB save rules.
config/valkey/production.conf — deleted (only the now-removed single valkey accessory mounted it).
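
One way to keep the three files honest against the table in §1 is a small check over their directives. A minimal sketch, assuming the files use the usual one-directive-per-line conf syntax; EXPECTED, parse_conf, and mismatches are hypothetical names, and only the directives named above are checked (the queue’s RDB save rules are not spelled out here, so they are left alone).

```ruby
# Hypothetical lint for config/valkey/{cache,sessions,queue}.conf.
EXPECTED = {
  "cache"    => { "maxmemory-policy" => "allkeys-lru", "save" => "", "appendonly" => "no" },
  "sessions" => { "maxmemory-policy" => "noeviction",  "save" => "", "appendonly" => "no" },
  "queue"    => { "maxmemory-policy" => "noeviction",  "appendonly" => "yes" }
}.freeze

# Parse "directive value" lines, skipping blanks and # comments.
# In this parser a repeated directive keeps its last value.
def parse_conf(text)
  text.each_line.with_object({}) do |line, directives|
    line = line.strip
    next if line.empty? || line.start_with?("#")

    name, value = line.split(/\s+/, 2)
    directives[name.downcase] = value.to_s.delete('"')
  end
end

# Returns human-readable mismatches; an empty array means the file matches.
def mismatches(flavor, text)
  actual = parse_conf(text)
  EXPECTED.fetch(flavor).filter_map do |name, want|
    got = actual[name]
    "#{flavor}: #{name} is #{got.inspect}, expected #{want.inspect}" unless got == want
  end
end
```

For example, mismatches("cache", File.read("config/valkey/cache.conf")) should come back empty if the file matches the list above.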

Deploy: the single valkey accessory becomes three internal-only accessories. Each uses an underscore key plus a service: override, which yields heatwave[-staging]-valkey-{cache,sessions,queue} and matches the pg_health/postgres_replica convention. None publishes a host port; ops go through kamal accessory exec. Files touched:

config/deploy.yml (prod, Dallas) — split + the three REDIS_*_HOST env pairs.
config/deploy.staging.yml — split (validation env).
docker-compose.yml (dev) — unchanged: one container, three flavors → 127.0.0.1:6379.

3. Cutover — boot accessories BEFORE the app picks up the new env

The three are internal-only and each has its own Kamal-managed volume, so they boot alongside the live valkey with no host-port or volume clash. Ops/inspection go through kamal accessory exec <name> -d <dest>.

Staging (validate first; independent of prod)

1. kamal accessory boot valkey_cache -d staging, then the same for valkey_sessions and valkey_queue (kamal accessory boot takes a single NAME, or all, per invocation).
2. kamal deploy -d staging (the app switches to the three).
3. Verify: the admin Redis panel (Admin::AdminController) shows each service on its own host; confirm a login (sessions), a cached page, and a Sidekiq job.
4. kamal accessory remove valkey -d staging (the now-orphaned single instance).

Production (`config/deploy.yml`; on the current primary = Dallas)

All three are internal-only with their own Kamal-managed volumes, so they boot while the live valkey is still serving, with no port/volume clash and minimal disruption:

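1. kamal accessory boot valkey_cache, then the same for valkey_sessions and valkey_queue (kamal accessory boot takes a single NAME, or all, per invocation). The live valkey keeps serving.
2. Copy the queue’s durable sets (DB 3: scheduled/retry/dead) from the live valkey into valkey_queue, using kamal accessory exec with either valkey-cli --rdb or DUMP + RESTORE. valkey-cli --rdb transfers the whole instance rather than DB 3 alone, so DUMP + RESTORE is the narrower option; either way the data is small. Cache and sessions need no copy.
3. kamal deploy: the app switches from REDIS_HOST to the three REDIS_*_HOST (sessions re-login once; caches re-warm).
4. kamal accessory remove valkey (the now-orphaned single instance).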
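
For the DUMP + RESTORE route in the copy step above, the work amounts to a few key-level transfers. A minimal sketch, assuming two clients already connected to DB 3 on the live valkey and on valkey_queue, each responding to dump and restore the way redis-rb’s client does; copy_keys and DURABLE_KEYS are hypothetical names, and the key names assume Sidekiq’s default un-namespaced sorted sets (schedule, retry, dead).

```ruby
# Hypothetical helper: copy whole keys between two Valkey/Redis clients.
# Assumption: both clients are already selected into DB 3.
DURABLE_KEYS = %w[schedule retry dead].freeze

# source and dest must respond to #dump(key) and
# #restore(key, ttl_ms, payload, replace:), as redis-rb's client does.
def copy_keys(source, dest, keys = DURABLE_KEYS)
  keys.each_with_object({}) do |key, report|
    payload = source.dump(key) # nil when the key does not exist
    if payload.nil?
      report[key] = :missing
      next
    end

    # ttl 0 = no expiry (these sets carry no TTL); replace: true makes
    # a re-run overwrite an earlier partial copy instead of failing.
    dest.restore(key, 0, payload, replace: true)
    report[key] = :copied
  end
end
```

The returned hash records which keys were copied and which were absent on the source. A copy like this is a point-in-time snapshot, so anything written to the live valkey afterwards is not carried over.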

Rollback: revert the env to the single REDIS_HOST and redeploy. The app falls back to the live valkey, which keeps running until the final kamal accessory remove valkey step (step 4 above). No code revert is needed.

4. PR 2 — replicate the queue (separate, after the split lands)

Only valkey_queue gets a cross-DC replica (replicaof), plus the tailnet bind and a requirepass/ACL. The bind and the auth are added together: never expose 6379 on the tailnet without auth. Failover is the Option-1 manual promote (REPLICAOF NO ONE, next to pg_promote() in the routing-layer runbook). valkey_cache and valkey_sessions are never replicated. Details will follow in a separate doc.