Valkey three-flavor split — cache / sessions / queue

Status: PR 1 implemented (config + code). The cutover is a separate, deliberate step on the team’s schedule and is NOT performed by this PR. PR 1 splits the single consolidated Valkey 9.1 accessory into three purpose-built instances, so that each gets its correct eviction/durability policy (maxmemory-policy is instance-wide) and so that the eventual cross-DC replication has to cover only the tiny durable store. It supersedes the single-instance design in 202606062037_REDIS_VALKEY_CUTOVER_PLAN.md. The target HA model is active/passive (one region serves, the other is a warm standby).

  • Production is one file now: config/deploy.yml (bare kamal deploy; the per-DC destination files were removed in the DB-tier merge). Currently Dallas-primary (100.123.47.52); a CHI↔DAL flip swaps the host IPs in that file (the valkey accessories’ host: lines move with the rest).
  • The app code is backward-compatible: with only the legacy REDIS_HOST set, all three flavors resolve to the one instance. The split activates only where the three accessories and REDIS_{CACHE,SESSIONS,QUEUE}_HOST are deployed.
  • Nothing activates on deploy alone: kamal deploy never boots accessories. The split goes live only when you boot the three accessories and then deploy (§3).
  • Staging validation is independent of prod — do it whenever.

Previously, one instance served both the must-not-evict Sidekiq data (DB 3, no TTL) and the TTL’d caches and sessions under a single volatile-lru compromise. That policy only ever protected the no-TTL Sidekiq keys: sessions carry a 7-day TTL (RedisCacheStore expires_in: 7.days), so a fragment or api-cache fill to maxmemory could silently evict cold sessions, causing involuntary logouts. Splitting lets each store get the right policy, and shrinks PR 2’s replication surface to the ~11k-key queue.

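| flavor   | logical DBs                                               | policy      | persistence | replicated | scope          |
|----------|-----------------------------------------------------------|-------------|-------------|------------|----------------|
| cache    | 1 geocoder · 2 Action Cable · 4 Rails.cache · 5 api_cache | allkeys-lru | none        | no         | per app server |
| sessions | 0 sessions                                                | noeviction  | none        | no         | per app server |
| queue    | 3 Sidekiq                                                 | noeviction  | AOF + RDB   | yes (PR 2) | one primary    |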

Under active/passive the cross-region wrinkles vanish: only one region serves, so per-region cache / Action Cable pub-sub / rate-limit counters are all correct, and session loss on failover is one re-login.

2. PR 1 — the split (this change, no replication yet)

App code

  • config/initializers/100_redis_config.rb: RedisConfig maps each logical DB → a flavor (FLAVORS) and resolves the per-flavor config block. Every caller already routed by DB, so no call site changed. Back-compat: a config with no flavor keys falls back to the top-level block; REDIS_*_HOST each fall back to REDIS_HOST.
  • config/redis_consolidated.example.yml — three blocks per env; Kamal drives REDIS_{CACHE,SESSIONS,QUEUE}_HOST. (Dockerfile copies → redis_consolidated.yml.)
  • test/lib/redis_config_test.rb — routing + fallback.
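
The routing and fallback rule described above can be sketched roughly as follows. This is not the contents of 100_redis_config.rb: the real RedisConfig resolves a per-flavor YAML block, whereas this sketch reduces it to a host lookup. The helper names flavor_for and host_for are hypothetical, and the DB-to-flavor mapping is copied from the flavor table in this document.

```ruby
# Illustrative sketch only: the real RedisConfig resolves a per-flavor YAML
# block; this reduces it to a host lookup so the fallback rule is visible.

# Logical DB -> flavor, following the flavor table in this document.
FLAVORS = {
  0 => :sessions,
  1 => :cache,   # geocoder
  2 => :cache,   # Action Cable
  3 => :queue,   # Sidekiq
  4 => :cache,   # Rails.cache
  5 => :cache    # api_cache
}.freeze

# Hypothetical helper: which flavor owns a logical DB.
def flavor_for(db)
  FLAVORS.fetch(db) { raise ArgumentError, "no flavor mapped for DB #{db}" }
end

# Hypothetical helper: REDIS_<FLAVOR>_HOST if set, otherwise legacy REDIS_HOST.
# `env` defaults to ENV but accepts any Hash-like object, which keeps it testable.
def host_for(db, env = ENV)
  specific = env["REDIS_#{flavor_for(db).to_s.upcase}_HOST"]
  return specific unless specific.nil? || specific.empty?

  env.fetch("REDIS_HOST")
end
```

With only REDIS_HOST set, every logical DB resolves to the same host, which is the back-compat behavior and also the path the rollback in §3 relies on.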

Valkey configs (env-agnostic, role-based)

  • config/valkey/cache.conf: allkeys-lru, save "", appendonly no.
  • config/valkey/sessions.conf: noeviction, save "", appendonly no.
  • config/valkey/queue.conf: noeviction, appendonly yes, RDB save rules.
  • config/valkey/production.conf: deleted (only the now-removed single valkey accessory mounted it).
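
One way to confirm that a booted instance actually picked up its role config is to compare CONFIG GET output against the directives listed above. The sketch below is an assumption, not part of PR 1: config_mismatches and EXPECTED_DIRECTIVES are hypothetical names, and client is any object whose config(:get, name) returns a Hash, as the redis-rb gem's Redis#config does. Because the accessories publish no host port, it would have to run from somewhere that can reach them, for example a console inside the app container.

```ruby
# Expected directives per flavor, copied from the role configs listed above.
EXPECTED_DIRECTIVES = {
  cache:    { "maxmemory-policy" => "allkeys-lru", "appendonly" => "no" },
  sessions: { "maxmemory-policy" => "noeviction",  "appendonly" => "no" },
  queue:    { "maxmemory-policy" => "noeviction",  "appendonly" => "yes" }
}.freeze

# Hypothetical helper: returns one message per directive that differs from
# the expected value; an empty array means the instance matches its role.
# `client` must respond to config(:get, name) and return a Hash such as
# { "maxmemory-policy" => "noeviction" } (redis-rb behaves this way).
def config_mismatches(flavor, client)
  EXPECTED_DIRECTIVES.fetch(flavor).filter_map do |directive, want|
    got = client.config(:get, directive)[directive]
    next if got == want

    "#{flavor}: #{directive} is #{got.inspect}, expected #{want.inspect}"
  end
end

# Usage against a real instance (host name taken from the env, assumed set):
#   require "redis"
#   puts config_mismatches(:queue, Redis.new(host: ENV["REDIS_QUEUE_HOST"]))
```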

Deploy — the single valkey accessory becomes three internal-only accessories (underscore key + service: override → heatwave[-staging]-valkey-{cache,sessions,queue}, matching the pg_health/postgres_replica convention; no host port, ops via kamal accessory exec):

  • config/deploy.yml (prod, Dallas) — split + the three REDIS_*_HOST env pairs.
  • config/deploy.staging.yml — split (validation env).
  • docker-compose.yml (dev) — unchanged: one container, three flavors → 127.0.0.1:6379.

3. Cutover — boot accessories BEFORE the app picks up the new env

The three are internal-only, so they boot alongside the live valkey (no host-port clash). Ops/inspection via kamal accessory exec <name> -d <dest>.

Staging (validate first; independent of prod)

The three are internal-only with their own kamal volumes, so they boot alongside the live valkey — no port/volume clash:

  1. kamal accessory boot valkey_cache valkey_sessions valkey_queue -d staging.
  2. kamal deploy -d staging (app switches to the three).
  3. Verify: admin Redis panel (Admin::AdminController) shows each service on its own host; confirm login (sessions), a cached page, and a Sidekiq job.
  4. kamal accessory remove valkey -d staging (the now-orphaned single instance).

Production (config/deploy.yml; on the current primary = Dallas)

All three are internal-only with their own kamal-managed volume, so they boot with the live valkey still serving — no port/volume clash, minimal disruption:

  1. kamal accessory boot valkey_cache valkey_sessions valkey_queue (live valkey keeps serving).
  2. Copy the queue’s durable sets (DB 3: scheduled/retry/dead) from the live valkey into valkey_queue via kamal accessory exec, using valkey-cli --rdb or DUMP+RESTORE; the data set is small. Cache + sessions need no copy.
  3. kamal deploy — app switches REDIS_HOST → the three REDIS_*_HOST (sessions re-login once; caches re-warm).
  4. kamal accessory remove valkey (the now-orphaned single instance).
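
For the DUMP+RESTORE option in step 2, a minimal sketch is shown below. It is an illustration under stated assumptions, not the team's copy procedure: copy_keys is a hypothetical helper, both clients are assumed to be already connected to logical DB 3, and the method names used (scan_each, dump, pttl, restore) are those of the redis-rb gem. It copies every key in the source DB and makes no attempt to handle keys written while the copy runs.

```ruby
# Hypothetical helper: copy every key from `source` to `dest` with
# DUMP + RESTORE, preserving remaining TTLs. Returns the number of
# RESTORE calls issued (SCAN may yield a key more than once).
def copy_keys(source, dest)
  copied = 0
  source.scan_each(count: 500) do |key|
    payload = source.dump(key)
    next if payload.nil? # key disappeared between SCAN and DUMP

    ttl_ms = source.pttl(key)
    ttl_ms = 0 if ttl_ms.negative? # PTTL is negative for "no expiry"; RESTORE uses 0

    # replace: true overwrites any existing key, so a rerun is harmless.
    dest.restore(key, ttl_ms, payload, replace: true)
    copied += 1
  end
  copied
end

# Usage with real clients (host names are placeholders, not from the deploy files):
#   require "redis"
#   live  = Redis.new(host: "<live-valkey-host>",  db: 3)
#   queue = Redis.new(host: "<valkey-queue-host>", db: 3)
#   puts "copied #{copy_keys(live, queue)} keys"
```

Comparing DBSIZE on DB 3 of both instances afterwards gives a quick sanity check before step 3.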

Rollback: revert the env to the single REDIS_HOST + redeploy → the app falls back to the live valkey (still running until step 4). No code revert.

4. PR 2 — replicate the queue (separate, after the split lands)

Only valkey_queue gets a cross-DC replica (replicaof) plus the tailnet bind + a requirepass/ACL (added together — never expose 6379 on the tailnet without auth). Option-1 manual promote (REPLICAOF NO ONE next to pg_promote() in the routing-layer runbook). valkey_cache and valkey_sessions are never replicated. Detail in a follow-up doc.