Sidekiq Pro Zero-Downtime Deployment Strategy

Overview

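This project uses Sidekiq Pro with a zero-downtime deployment strategy that prevents Sidekiq::Shutdown errors during deployments. Workers are quieted before the deploy starts and restarted only after it succeeds, so an in-flight job is interrupted only if it is still running when the 60-second shutdown timeout expires.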

How It Works

Deployment Flow

┌─────────────────────────────────────────────────────────────────┐
│ BEFORE DEPLOYMENT STARTS                                         │
│ ↓                                                                │
│ 1. Send TSTP signal (sidekiq:quiet)                             │
│    - Stops accepting NEW jobs immediately                        │
│    - Running jobs continue on OLD code                           │
│    - No interruptions, no Sidekiq::Shutdown errors              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ DURING DEPLOYMENT                                                │
│ ↓                                                                │
│ 2. Deploy new code                                               │
│    - Upload assets                                               │
│    - Run migrations                                              │
│    - Publish new release                                         │
│    - Existing Sidekiq jobs finish on old code (no interruption) │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ AFTER DEPLOYMENT SUCCEEDS                                        │
│ ↓                                                                │
│ 3. Restart Sidekiq (sidekiq:restart)                            │
│    - Stop old processes (graceful, 60s timeout)                 │
│    - Start new processes with new code                          │
│    - Begin accepting jobs again                                 │
└─────────────────────────────────────────────────────────────────┘

Timeline Example

T+0s    → Deployment starts
T+0s    → Send TSTP signal to all Sidekiq processes
T+0s    → Sidekiq stops fetching new jobs (new work waits in Redis)
T+0-45s → Deploy code (upload, migrate, publish)
T+10s   → Running EDI job continues uninterrupted ✅
T+35s   → EDI job completes successfully ✅
T+45s   → Deployment finished, trigger sidekiq:restart
T+45s   → Old Sidekiq processes shut down gracefully
T+46s   → New Sidekiq processes start with new code
T+46s   → Queue processing resumes ✅
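
The timeline reduces to simple arithmetic: a job that is already running when the quiet signal goes out is safe if it finishes before the restart time plus the shutdown timeout. The following is a minimal sketch of that reasoning, not part of the project's code. The helper name `job_outcome` and its parameters are hypothetical. The 60-second default comes from the timeout configured in this document.

```ruby
# Hypothetical helper: classify a job that is already running when TSTP is sent.
# All times are seconds measured from T+0 (the moment the quiet signal goes out).
def job_outcome(job_finishes_at:, restart_at:, shutdown_timeout: 60)
  if job_finishes_at <= restart_at
    :finished_during_deploy      # done before sidekiq:restart is triggered
  elsif job_finishes_at <= restart_at + shutdown_timeout
    :finished_during_shutdown    # done inside the graceful-shutdown window
  else
    :interrupted                 # still running when the timeout expires
  end
end

puts job_outcome(job_finishes_at: 35, restart_at: 45)    # the EDI job from the timeline
puts job_outcome(job_finishes_at: 90, restart_at: 45)    # finishes while shutting down
puts job_outcome(job_finishes_at: 200, restart_at: 45)   # outlives the 60s window
```

Only the third case would be expected to surface as a Sidekiq::Shutdown error.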

Configuration

Capistrano Deploy Configuration

File: config/deploy.rb

# Sidekiq Pro zero-downtime deployment strategy
# (short task names such as :starting resolve only inside `namespace :deploy do ... end`):
before :starting, 'sidekiq:quiet'    # Quiet before deployment starts
after :finished, 'sidekiq:restart'    # Restart after deployment succeeds

Sidekiq Timeout Configuration

File: config/initializers/sidekiq.rb

config[:timeout] = 60  # set inside Sidekiq.configure_server do |config| ... end

Gives jobs 60 seconds to complete during graceful shutdown
Prevents force-kill of jobs that are almost done
Applies to the restart phase (after deployment)

Systemd Service Configuration

File: config/deploy/templates/sidekiq.service.capistrano.erb

TimeoutStopSec=90

systemd waits up to 90 seconds for Sidekiq to exit after sending the stop signal
Must exceed the Sidekiq timeout (60s); 90s leaves a 30s buffer
Prevents systemd from sending SIGKILL prematurely

Capistrano Sidekiq Settings

File: config/deploy.rb

set :sidekiq_roles, :worker
set :sidekiq_default_hooks, false  # We control hooks manually
set :sidekiq_timeout, 60           # Matches Sidekiq initializer timeout
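
The three timeout settings above only work together if the Capistrano and Sidekiq timeouts match and the systemd value is strictly larger. The sketch below shows one way to check that relationship, for example in a test. The helper names `timeout_stop_sec` and `timeouts_consistent?` are hypothetical, and the parser assumes `TimeoutStopSec` is written as a plain number of seconds.

```ruby
# Hypothetical consistency check for the three timeout settings described above.
# Extracts TimeoutStopSec from unit-file text (plain-seconds values only).
def timeout_stop_sec(unit_text)
  match = unit_text.match(/^\s*TimeoutStopSec\s*=\s*(\d+)\s*$/)
  match && match[1].to_i
end

# True when Capistrano matches Sidekiq and systemd waits strictly longer.
def timeouts_consistent?(sidekiq_timeout:, capistrano_timeout:, unit_text:)
  stop_sec = timeout_stop_sec(unit_text)
  return false if stop_sec.nil?

  capistrano_timeout == sidekiq_timeout && stop_sec > sidekiq_timeout
end

unit = <<~UNIT
  [Service]
  TimeoutStopSec=90
UNIT

puts timeouts_consistent?(sidekiq_timeout: 60, capistrano_timeout: 60, unit_text: unit)
```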

Benefits of This Approach

✅ Zero Job Interruptions

Before deployment: Workers stop fetching new jobs; running jobs keep going
During deployment: No jobs are interrupted (they run on old code)
After deployment: New jobs run on new code

✅ No Sidekiq::Shutdown Errors

The old approach (after :finished, 'sidekiq:restart_noblock') would:

Let jobs continue during deployment
Interrupt them when restarting after deployment
Cause Sidekiq::Shutdown exceptions

The new approach:

Stops fetching new jobs before deployment starts
Lets running jobs finish on the old code while the new release is deployed
No interruptions (for jobs that finish within the 60s shutdown timeout) = no errors

✅ Graceful Queue Pause

The quiet signal (TSTP) is specifically designed for deployments:

Instant: Stops fetching new jobs immediately
Safe: Doesn't interrupt running jobs
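Not reversible: A quieted process cannot resume fetching and must be restarted; if the deployment fails, run cap production sidekiq:restart (or hook it to deploy:failed)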

✅ Predictable Behavior

Old jobs always run on old code (no mid-flight code changes)
New jobs always run on new code
Clear boundary between old and new

Available Capistrano Tasks

# View all Sidekiq tasks
cap -T sidekiq

# Common tasks
cap production sidekiq:quiet        # Stop accepting new jobs (TSTP signal)
cap production sidekiq:restart      # Graceful restart (stop + start)
cap production sidekiq:stop         # Graceful stop (60s timeout)
cap production sidekiq:start        # Start Sidekiq processes
cap production sidekiq:install      # Install systemd service
cap production sidekiq:status       # Check Sidekiq status

Sidekiq Signals Reference

| Signal | Command | Effect | Use Case |
|--------|---------|--------|----------|
| TSTP | `sidekiq:quiet` | Stop fetching new jobs, continue running jobs | Deployments (before code change) |
| TERM | `sidekiq:stop` | Graceful shutdown (60s timeout) | Normal shutdown |
| INT | Same as TERM | Graceful shutdown | Ctrl+C / manual stop |
| TTIN | N/A | Print thread backtraces to log | Debugging hung jobs |
| KILL | Force kill | Immediate termination (no cleanup) | Emergency only |

What Happens to Jobs During Quiet?

Jobs Already Running

✅ Continue uninterrupted; quiet itself has no timeout (the 60s timeout starts only when the restart sends TERM)

Jobs in Redis Queue

⏸️ Remain queued - will be processed after new Sidekiq starts

New Jobs Enqueued During Deployment

⏸️ Remain queued - will be processed after new Sidekiq starts
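
The three cases above can be illustrated with a toy model. This is a sketch for intuition only and does not reflect Sidekiq's real internals: `ToyProcess` is a hypothetical class, and a plain Ruby array stands in for the Redis queue.

```ruby
# Toy model of one Sidekiq process; not Sidekiq's real implementation.
class ToyProcess
  attr_reader :running

  def initialize(queue)
    @queue = queue      # stands in for a Redis list
    @quiet = false
    @running = []
  end

  # Models TSTP: stop fetching. There is deliberately no inverse method.
  def quiet!
    @quiet = true
  end

  # Pull the next job unless quieted; returns the job or nil.
  def fetch
    return nil if @quiet || @queue.empty?

    job = @queue.shift
    @running << job
    job
  end

  # A running job completes normally, quieted or not.
  def finish(job)
    @running.delete(job)
  end
end

queue = %w[job_a job_b]
process = ToyProcess.new(queue)

process.fetch            # job_a starts running
process.quiet!           # deployment begins
queue << "job_c"         # enqueued during the deployment
process.fetch            # returns nil: nothing new is picked up
process.finish("job_a")  # the in-flight job still completes

puts queue.inspect            # job_b and job_c are still waiting
puts process.running.inspect  # nothing left in flight
```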

Critical Jobs That Can't Wait

If you have truly critical jobs that must process immediately:

Option 1: Schedule around deployments

# Deploy during low-traffic periods
# Avoid deploying during critical job windows

Option 2: Run separate "critical" Sidekiq process

# config/sidekiq_critical.yml
:concurrency: 2
:queues:
  - [critical, 2]  # Only critical jobs

# Don't quiet this one during deployments
set :sidekiq_config_files, ['sidekiq.yml']  # Exclude critical

Option 3: Use scheduled jobs instead of immediate

# Instead of perform_async (immediate)
MyWorker.perform_in(5.minutes, args)  # Delayed

Deployment Timing Considerations

How Long Does Quiet Phase Last?

The quiet phase lasts as long as your deployment takes:

Deployment Duration = 
  Upload Assets (~10-30s) +
  Run Migrations (~5-60s) +
  Publish Release (~5s) +
  Other Hooks (~10s)
  ≈ 30-105 seconds typical

During this time:

⏸️ New jobs queue up in Redis (not lost)
✅ Running jobs complete
📊 Monitor queue depth in Sidekiq Web UI
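
The duration estimate, and the backlog it implies, can be worked through in a few lines. The phase ranges are copied from the breakdown above. The enqueue rate of 2 jobs per second is an illustrative assumption, not a measurement from this system, and the helper names are hypothetical.

```ruby
# Phase estimates in seconds, copied from the breakdown above.
PHASES = {
  upload_assets:   10..30,
  run_migrations:  5..60,
  publish_release: 5..5,
  other_hooks:     10..10
}.freeze

# Total quiet-phase duration as a min..max range.
def quiet_window(phases)
  phases.values.sum(&:min)..phases.values.sum(&:max)
end

# Jobs waiting in Redis at restart, assuming a steady enqueue rate.
def expected_backlog(jobs_per_second, window)
  (jobs_per_second * window.min).round..(jobs_per_second * window.max).round
end

window = quiet_window(PHASES)
puts window                          # 30..105
puts expected_backlog(2.0, window)   # 60..210
```

A backlog in that range should drain quickly once the new processes start, which is what the Web UI check is meant to confirm.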

If Queue Builds Up

Most jobs can wait 30-60 seconds, but if queues grow too large:

Solution 1: Faster deployments

Optimize asset compilation (already done with local builds)
Use zero-downtime migrations (already common practice)
Parallelize upload tasks

Solution 2: Multiple worker servers

# Deploy to servers one at a time (rolling deployment)
# Some workers always available

Solution 3: Pre-quiet strategy

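# Quiet, then wait 30 seconds so in-flight jobs can finish before deployment
before :starting, 'sidekiq:custom_quiet_and_wait'

# Define the task inside the sidekiq namespace so its name matches the hook
namespace :sidekiq do
  task :custom_quiet_and_wait do
    invoke 'sidekiq:quiet'
    puts "Waiting 30s for running jobs to finish..."
    sleep 30
  end
end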
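
A fixed sleep either waits too long or not long enough. One alternative is to poll until no jobs are running, with a deadline as a safety net. The sketch below shows only the polling helper. `wait_until` is a hypothetical name, and the `busy_counts` array is a stand-in for querying Sidekiq about running jobs, which is not shown here.

```ruby
# Poll until the block returns true or the deadline passes.
# Returns true if the condition was met, false on timeout.
def wait_until(timeout:, interval: 1.0)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep(interval)
  end
end

# Stand-in for "how many jobs are still running?" on successive polls.
busy_counts = [3, 1, 0]
drained = wait_until(timeout: 30, interval: 0.01) { busy_counts.shift.to_i.zero? }
puts drained
```

With a helper like this, the task could continue as soon as the workers are idle and fall back to proceeding (or aborting) when the deadline is reached.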

Monitoring and Verification

After Deployment

# SSH to production server
ssh deploy@chi-vultr-heatwave-util1

# Check all Sidekiq services are running
systemctl status 'sidekiq*.service' --no-pager

# Check processes are using new code
ps aux | grep sidekiq
# Look for new PID and recent start time

# Check logs for clean restart
journalctl -u sidekiq-heatwave-production-sidekiq -n 50

# Monitor queue in Sidekiq Web UI
# https://crm.warmlyyours.me:3000/sidekiq
# Check for:
# - Queue depth (should drain after restart)
# - No Sidekiq::Shutdown errors in dead jobs
# - Processed jobs resuming

In Rollbar

Before this change:

❌ Frequent Sidekiq::Shutdown exceptions
❌ Jobs interrupted during API calls
❌ Incomplete data synchronization

After this change:

✅ No Sidekiq::Shutdown during deployments
✅ Jobs complete or wait in queue
✅ Clean shutdowns only

Troubleshooting

Queue Not Processing After Deployment

Symptom: Jobs stuck in queue, not processing

Check:

# Are Sidekiq processes running?
systemctl status 'sidekiq*.service'

# If not running, start them
cap production sidekiq:start

# Check logs
journalctl -u sidekiq-heatwave-production-sidekiq -f

Jobs Still Being Interrupted

Symptom: Still seeing Sidekiq::Shutdown in Rollbar

Possible causes:

1. Jobs are still running when the 60s shutdown timeout expires
   - Solution: Increase timeout or break into smaller jobs
   - See doc/SIDEKIQ_GRACEFUL_SHUTDOWN.md for details
2. Manual restarts during deployment
   - Check: Are you running cap sidekiq:restart manually?
   - Solution: Let Capistrano handle restarts automatically
3. Systemd watchdog killing jobs
   - Check: journalctl for "Watchdog timeout"
   - Solution: Increase WatchdogSec in service file
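
For the first cause, "break into smaller jobs" usually means fanning one long job out into batches that each finish well inside the shutdown timeout. This is a minimal sketch of that idea. `enqueue_in_batches` is a hypothetical helper, and the `enqueue` callable is an assumption: in a Sidekiq app it could wrap a worker's `perform_async`.

```ruby
# Split a large unit of work into batches so each job stays short.
# `enqueue` is any callable that accepts one batch of IDs.
def enqueue_in_batches(record_ids, batch_size:, enqueue:)
  record_ids.each_slice(batch_size).map { |batch| enqueue.call(batch) }
end

enqueued = []
enqueue_in_batches((1..10).to_a, batch_size: 4, enqueue: ->(batch) { enqueued << batch })
puts enqueued.inspect
```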

Deployment Hangs at "Quieting Sidekiq"

Symptom: Deployment stuck at sidekiq:quiet task

Check:

# Are Sidekiq processes responding?
ssh deploy@server "systemctl is-active 'sidekiq*.service'"

# Can you manually quiet?
ssh deploy@server 'systemctl kill -s TSTP sidekiq-heatwave-production-sidekiq.service'

Solution:

Increase SSH timeout
Check network connectivity
Verify systemd is responsive

Rollback Strategy

If a deployment fails or needs rollback:

# Roll back to the previous release (run manually)
cap production deploy:rollback

# Sidekiq will restart with previous code version
# Jobs in queue will process with rolled-back code

Best Practices Summary

✅ Always use quiet before deployment (configured automatically)
✅ Let Capistrano manage Sidekiq lifecycle (don't restart manually)
✅ Keep jobs under 60 seconds when possible
✅ Make jobs idempotent (safe to retry)
✅ Monitor queue depth during deployments
✅ Deploy during low-traffic periods for critical systems
✅ Test deployments in staging with realistic job load
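
Idempotency matters here because a job that is interrupted at shutdown may run again from the start. The following toy sketch shows the pattern under stated assumptions: `SyncOrderJob` is a hypothetical class, the `Set` stands in for a durable "already done" marker, and the array stands in for an external API.

```ruby
require "set"

# Toy idempotent job: a retry after an interrupted run must not repeat side effects.
class SyncOrderJob
  def initialize(completed = Set.new, sent = [])
    @completed = completed   # stands in for a durable "already done" marker
    @sent = sent             # stands in for an external API
  end

  def perform(order_id)
    return :skipped if @completed.include?(order_id)

    @sent << order_id        # the side effect happens once per order
    @completed << order_id
    :done
  end
end

sent = []
job = SyncOrderJob.new(Set.new, sent)
puts job.perform(42)   # first run performs the work
puts job.perform(42)   # a retry is a no-op
puts sent.inspect
```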

Additional Resources

Last Updated: October 10, 2025
Configuration Version: Sidekiq Pro 7.3.x, Rails 7.0.8.7, Capistrano 3.19.2