Valkey three-flavor split — cache / sessions / queue
Status: PR 1 implemented (config + code). The cutover is a separate, deliberate
step on the team’s schedule — NOT performed by this PR. Splits the single
consolidated Valkey 9.1 accessory into three purpose-built instances so each gets
its correct eviction/durability policy (maxmemory-policy is instance-wide), and
so the eventual cross-DC replication has to cover only the tiny durable store.
Supersedes the single-instance design in
202606062037_REDIS_VALKEY_CUTOVER_PLAN.md.
Target HA model = active/passive (one region serves, the other warm-standby).
0. Timing & safety (read first)
- Production is one file now: config/deploy.yml (bare kamal deploy; the per-DC destination files were removed in the DB-tier merge). It is currently Dallas-primary (100.123.47.52); a CHI↔DAL flip swaps the host IPs in that file (the valkey accessories’ host: lines move with the rest).
- The app code is backward-compatible: with only the legacy REDIS_HOST set, all three flavors resolve to the one instance. The split activates only where the three accessories and REDIS_{CACHE,SESSIONS,QUEUE}_HOST are deployed.
- Nothing activates on deploy alone: kamal deploy never boots accessories. The split goes live only when you boot the three and deploy (§3).
- Staging validation is independent of prod; do it whenever.
1. Why split
One instance served both must-not-evict Sidekiq (DB 3, no TTL) and TTL’d
caches/sessions under a single volatile-lru compromise. That policy only ever
protected the no-TTL Sidekiq keys: volatile-lru evicts only keys that carry a
TTL, and sessions carry a 7-day TTL (RedisCacheStore expires_in: 7.days), so a
fragment/api-cache fill to maxmemory could silently evict cold sessions, which
users see as an involuntary logout. Splitting lets each store get the right
policy, and shrinks PR 2’s replication surface to the ~11k-key queue.
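To make the failure mode concrete, here is a toy model of how each policy picks an eviction victim. It is a sketch only: Valkey uses an approximated (sampled) LRU, whereas this uses exact LRU, and the key names, the `Key` struct, and `pick_victim` are hypothetical.

```ruby
# Toy model of victim selection under three maxmemory policies.
# Not Valkey's real algorithm (which samples keys); exact LRU for clarity.
Key = Struct.new(:name, :ttl, :last_used)

def pick_victim(keys, policy)
  candidates =
    case policy
    when "allkeys-lru"  then keys                # every key is evictable
    when "volatile-lru" then keys.select(&:ttl)  # only keys that carry a TTL
    when "noeviction"   then []                  # writes fail instead of evicting
    else raise ArgumentError, "unknown policy: #{policy}"
    end
  candidates.min_by(&:last_used)                 # least recently used, or nil
end

keys = [
  Key.new("sidekiq:retry", nil,     10),  # no TTL, oldest access
  Key.new("session:cold",  604_800, 20),  # 7-day TTL, rarely touched
  Key.new("fragment:home", 3_600,   90)   # hot cache entry
]

puts pick_victim(keys, "volatile-lru").name  # the cold session goes first
puts pick_victim(keys, "allkeys-lru").name   # the Sidekiq key would go first
p    pick_victim(keys, "noeviction")         # nothing is evicted
```

Under the shared volatile-lru policy the Sidekiq key is safe but the cold session is the first victim; under allkeys-lru the Sidekiq key itself would be at risk. Neither policy suits both workloads on one instance.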
| flavor | logical DBs | policy | persistence | replicated | scope |
|---|---|---|---|---|---|
| cache | 1 geocoder · 2 Action Cable · 4 Rails.cache · 5 api_cache | allkeys-lru | none | no | per app server |
| sessions | 0 sessions | noeviction | none | no | per app server |
| queue | 3 Sidekiq | noeviction | AOF + RDB | yes (PR 2) | one primary |
Under active/passive the cross-region wrinkles vanish: only one region serves, so per-region cache / Action Cable pub-sub / rate-limit counters are all correct, and session loss on failover is one re-login.
2. PR 1 — the split (this change, no replication yet)
App code
- config/initializers/100_redis_config.rb: RedisConfig maps each logical DB → a flavor (FLAVORS) and resolves the per-flavor config block. Every caller already routed by DB, so no call site changed. Back-compat: a config with no flavor keys falls back to the top-level block; REDIS_*_HOST each fall back to REDIS_HOST.
- config/redis_consolidated.example.yml: three blocks per env; Kamal drives REDIS_{CACHE,SESSIONS,QUEUE}_HOST. (Dockerfile copies → redis_consolidated.yml.)
- test/lib/redis_config_test.rb: routing + fallback.
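The routing and env fallback can be sketched as follows. This is not the actual RedisConfig code: the module name, `flavor_for`, and `host_for` are hypothetical, the DB numbers are taken from the flavor table above, and the real initializer also resolves a per-flavor YAML block that is omitted here.

```ruby
# Hypothetical sketch of DB -> flavor routing with REDIS_HOST fallback.
# Only FLAVORS and the env var names come from the document.
module RedisRoutingSketch
  FLAVORS = {
    0 => :sessions,
    1 => :cache, 2 => :cache, 4 => :cache, 5 => :cache,
    3 => :queue
  }.freeze

  def self.flavor_for(db)
    FLAVORS.fetch(db) { raise ArgumentError, "unmapped logical DB #{db}" }
  end

  # Per-flavor host when set, otherwise the legacy single-instance host.
  def self.host_for(db, env = ENV)
    var  = "REDIS_#{flavor_for(db).to_s.upcase}_HOST"
    host = env[var]
    host = env["REDIS_HOST"] if host.nil? || host.empty?
    host || raise(KeyError, "neither #{var} nor REDIS_HOST is set")
  end
end

# Illustrative hosts only.
legacy = { "REDIS_HOST" => "valkey" }
split  = legacy.merge(
  "REDIS_CACHE_HOST"    => "heatwave-valkey-cache",
  "REDIS_SESSIONS_HOST" => "heatwave-valkey-sessions",
  "REDIS_QUEUE_HOST"    => "heatwave-valkey-queue"
)

[0, 3, 4].each do |db|
  puts "db #{db}: legacy=#{RedisRoutingSketch.host_for(db, legacy)} " \
       "split=#{RedisRoutingSketch.host_for(db, split)}"
end
```

Because callers pass only a DB number, the same call site works before and after the split; the environment alone decides whether the three flavors share one host.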
Valkey configs (env-agnostic, role-based)
- config/valkey/cache.conf: allkeys-lru, save "", appendonly no.
- config/valkey/sessions.conf: noeviction, save "", appendonly no.
- config/valkey/queue.conf: noeviction, appendonly yes, RDB save rules.
- config/valkey/production.conf: deleted (only the now-removed single valkey accessory mounted it).
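One way to guard against a role file drifting from its intended policy is a small lint over the conf text. The sketch below is an assumption, not part of this PR: `EXPECTED`, `parse_conf`, and `violations` are hypothetical names, and the sample contents are illustrative rather than copied from the real files. It checks only the directives listed above and treats the last occurrence of a directive as the effective one.

```ruby
# Hypothetical lint: parse "directive value" lines and compare them with the
# per-role directives listed in this document.
EXPECTED = {
  "cache"    => { "maxmemory-policy" => "allkeys-lru", "appendonly" => "no", "save" => '""' },
  "sessions" => { "maxmemory-policy" => "noeviction",  "appendonly" => "no", "save" => '""' },
  "queue"    => { "maxmemory-policy" => "noeviction",  "appendonly" => "yes" }
}.freeze

def parse_conf(text)
  text.each_line.with_object({}) do |line, conf|
    line = line.strip
    next if line.empty? || line.start_with?("#")   # skip blanks and comments
    directive, value = line.split(/\s+/, 2)
    conf[directive.downcase] = value.to_s          # last occurrence wins
  end
end

def violations(role, text)
  actual = parse_conf(text)
  EXPECTED.fetch(role)
          .reject { |directive, want| actual[directive] == want }
          .map { |directive, want| "#{role}: expected #{directive} #{want}, got #{actual[directive].inspect}" }
end

sample = <<~CONF
  # illustrative cache.conf contents
  maxmemory-policy allkeys-lru
  save ""
  appendonly no
CONF

p violations("cache", sample)     # matches the cache role
p violations("sessions", sample)  # reports the wrong eviction policy
```

Against the real files this would be called as `violations("cache", File.read("config/valkey/cache.conf"))`, for example from a test alongside test/lib/redis_config_test.rb.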
Deploy — the single valkey accessory becomes three internal-only accessories
(underscore key + service: override → heatwave[-staging]-valkey-{cache,sessions,queue},
matching the pg_health/postgres_replica convention; no host port, ops via
kamal accessory exec):
- config/deploy.yml (prod, Dallas): split + the three REDIS_*_HOST env pairs.
- config/deploy.staging.yml: split (validation env).
- docker-compose.yml (dev): unchanged; one container, three flavors → 127.0.0.1:6379.
3. Cutover — boot accessories BEFORE the app picks up the new env
The three are internal-only, so they boot alongside the live valkey (no host-port
clash). Ops/inspection via kamal accessory exec <name> -d <dest>.
Staging (validate first; independent of prod)
The three are internal-only with their own kamal volumes, so they boot alongside the
live valkey with no port/volume clash:
1. kamal accessory boot valkey_cache valkey_sessions valkey_queue -d staging.
2. kamal deploy -d staging (app switches to the three).
3. Verify: admin Redis panel (Admin::AdminController) shows each service on its own host; confirm login (sessions), a cached page, and a Sidekiq job.
4. kamal accessory remove valkey -d staging (the now-orphaned single instance).
Production (config/deploy.yml; on the current primary = Dallas)
All three are internal-only with their own kamal-managed volume, so they boot with the
live valkey still serving (no port/volume clash, minimal disruption):
1. kamal accessory boot valkey_cache valkey_sessions valkey_queue (live valkey keeps serving).
2. Copy the queue’s durable sets (DB 3: scheduled/retry/dead) from the live valkey into valkey_queue (kamal accessory exec → valkey-cli --rdb or DUMP + RESTORE; small). Cache + sessions need no copy.
3. kamal deploy: app switches REDIS_HOST → the three REDIS_*_HOST (sessions re-login once; caches re-warm).
4. kamal accessory remove valkey (the now-orphaned single instance).
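The DUMP + RESTORE variant of step 2 could look roughly like this. It is a sketch under stated assumptions: `copy_keys` and `FakeStore` are hypothetical, the clients are assumed to expose `dump`, `pttl`, and `restore` with redis-rb-style signatures (both connected to DB 3 in real use), and the Sidekiq set names `schedule`, `retry`, and `dead` should be confirmed against the live instance before relying on them.

```ruby
# Sketch: copy a fixed list of keys between two stores via DUMP + RESTORE.
# `src` and `dst` only need dump / pttl / restore (redis-rb-style, assumed).
def copy_keys(src, dst, keys)
  keys.each_with_object({ copied: [], missing: [] }) do |key, report|
    payload = src.dump(key)
    if payload.nil?
      report[:missing] << key          # absent on the source; nothing to copy
      next
    end
    ttl_ms = [src.pttl(key), 0].max    # PTTL is -1 for "no expiry"; RESTORE takes 0
    dst.restore(key, ttl_ms, payload, replace: true)
    report[:copied] << key
  end
end

# In-memory stand-in so the sketch runs without a server.
class FakeStore
  def initialize(data = {})
    @data = data
  end

  def dump(key)
    @data.key?(key) ? Marshal.dump(@data[key]) : nil
  end

  def pttl(key)
    @data.key?(key) ? -1 : -2
  end

  def restore(key, _ttl_ms, payload, replace: false)
    raise "BUSYKEY #{key}" if @data.key?(key) && !replace
    @data[key] = Marshal.load(payload)
    "OK"
  end

  def [](key)
    @data[key]
  end
end

live  = FakeStore.new("retry" => [["job-1", 1_700_000_000.0]])
queue = FakeStore.new
p copy_keys(live, queue, %w[schedule retry dead])
```

`replace: true` makes the copy safe to re-run before the deploy, since nothing writes to valkey_queue until the app switches over; the returned report shows which sets were empty on the source.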
Rollback: revert the env to the single REDIS_HOST + redeploy → the app falls
back to the live valkey (still running until step 4). No code revert.
4. PR 2 — replicate the queue (separate, after the split lands)
Only valkey_queue gets a cross-DC replica (replicaof) plus the tailnet bind +
a requirepass/ACL (added together — never expose 6379 on the tailnet without auth).
Option-1 manual promote (REPLICAOF NO ONE next to pg_promote() in the routing-layer
runbook). valkey_cache and valkey_sessions are never replicated. Detail in a
follow-up doc.