Sidekiq Pro Zero-Downtime Deployment Strategy

Overview

This project uses Sidekiq Pro with a zero-downtime deployment strategy that eliminates Sidekiq::Shutdown errors during deployments.

How It Works

Deployment Flow

┌─────────────────────────────────────────────────────────────────┐
│ BEFORE DEPLOYMENT STARTS                                         │
│ ↓                                                                │
│ 1. Send TSTP signal (sidekiq:quiet)                             │
│    - Stops accepting NEW jobs immediately                        │
│    - Running jobs continue on OLD code                           │
│    - No interruptions, no Sidekiq::Shutdown errors              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ DURING DEPLOYMENT                                                │
│ ↓                                                                │
│ 2. Deploy new code                                               │
│    - Upload assets                                               │
│    - Run migrations                                              │
│    - Publish new release                                         │
│    - Existing Sidekiq jobs finish on old code (no interruption) │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ AFTER DEPLOYMENT SUCCEEDS                                        │
│ ↓                                                                │
│ 3. Restart Sidekiq (sidekiq:restart)                            │
│    - Stop old processes (graceful, 60s timeout)                 │
│    - Start new processes with new code                          │
│    - Begin accepting jobs again                                 │
└─────────────────────────────────────────────────────────────────┘

Timeline Example

T+0s    → Deployment starts
T+0s    → Send TSTP signal to all Sidekiq processes
T+0s    → Sidekiq stops fetching new jobs (queue paused)
T+0-45s → Deploy code (upload, migrate, publish)
T+10s   → Running EDI job continues uninterrupted ✅
T+35s   → EDI job completes successfully ✅
T+45s   → Deployment finished, trigger sidekiq:restart
T+45s   → Old Sidekiq processes shut down gracefully
T+46s   → New Sidekiq processes start with new code
T+46s   → Queue processing resumes ✅

Configuration

Capistrano Deploy Configuration

File: config/deploy.rb

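# Sidekiq Pro zero-downtime deployment strategy.
# These hooks assume they sit inside `namespace :deploy do ... end`;
# at the top level, use 'deploy:starting' and 'deploy:finished' instead.
before :starting, 'sidekiq:quiet'    # Quiet before deployment starts
after :finished, 'sidekiq:restart'    # Restart after deployment succeeds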

Sidekiq Timeout Configuration

File: config/initializers/sidekiq.rb

config[:timeout] = 60
  • Gives jobs 60 seconds to complete during graceful shutdown
  • Prevents force-kill of jobs that are almost done
  • Applies to the restart phase (after deployment)

Systemd Service Configuration

File: config/deploy/templates/sidekiq.service.capistrano.erb

TimeoutStopSec=90
  • Gives systemd 90 seconds to wait for Sidekiq shutdown
  • Covers the Sidekiq timeout (60s) plus a 30s buffer; must never be lower than the Sidekiq timeout
  • Prevents systemd from sending SIGKILL prematurely
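
The relationship between the two timeouts can be checked mechanically, for example in a deploy-time sanity check. The following is a minimal sketch only: the helper name stop_timeout_ok? is hypothetical (it is not part of Sidekiq, systemd, or Capistrano), and the 30s default buffer is simply the value used in this document.

```ruby
# Hypothetical helper: true when systemd waits at least as long as
# Sidekiq's own shutdown timeout plus a safety buffer (all in seconds).
def stop_timeout_ok?(sidekiq_timeout:, systemd_stop_sec:, buffer: 30)
  systemd_stop_sec >= sidekiq_timeout + buffer
end

# Values from this document: 60s Sidekiq timeout, TimeoutStopSec=90.
puts stop_timeout_ok?(sidekiq_timeout: 60, systemd_stop_sec: 90)

# Raising only the Sidekiq timeout would let systemd send SIGKILL too early.
puts stop_timeout_ok?(sidekiq_timeout: 120, systemd_stop_sec: 90)
```

If the Sidekiq timeout is ever raised, TimeoutStopSec likely needs to be raised with it.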

Capistrano Sidekiq Settings

File: config/deploy.rb

set :sidekiq_roles, :worker
set :sidekiq_default_hooks, false  # We control hooks manually
set :sidekiq_timeout, 60           # Matches Sidekiq initializer timeout

Benefits of This Approach

✅ Zero Job Interruptions

  • Before deployment: Sidekiq stops fetching new jobs, but running jobs finish
  • During deployment: No jobs are interrupted (they run on old code)
  • After deployment: New jobs run on new code

✅ No Sidekiq::Shutdown Errors

The old approach (after :finished, 'sidekiq:restart_noblock') would:

  • Let jobs continue during deployment
  • Interrupt them when restarting after deployment
  • Cause Sidekiq::Shutdown exceptions

The new approach:

  • Pauses queue before deployment starts
  • Lets running jobs finish before code changes
  • No interruptions = no errors

✅ Graceful Queue Pause

The quiet signal (TSTP) is specifically designed for deployments:

  • Instant: Stops fetching new jobs immediately
  • Safe: Doesn't interrupt running jobs
  • Recoverable: A quieted process cannot be un-quieted, so if deployment fails, run sidekiq:restart to resume processing

✅ Predictable Behavior

  • Old jobs always run on old code (no mid-flight code changes)
  • New jobs always run on new code
  • Clear boundary between old and new

Available Capistrano Tasks

# View all Sidekiq tasks
cap -T sidekiq

# Common tasks
cap production sidekiq:quiet        # Stop accepting new jobs (TSTP signal)
cap production sidekiq:restart      # Graceful restart (stop + start)
cap production sidekiq:stop         # Graceful stop (60s timeout)
cap production sidekiq:start        # Start Sidekiq processes
cap production sidekiq:install      # Install systemd service
cap production sidekiq:status       # Check Sidekiq status

Sidekiq Signals Reference

Signal  Command        Effect                                          Use Case
------  -------------  ----------------------------------------------  --------------------------------
TSTP    sidekiq:quiet  Stop accepting new jobs, continue running jobs  Deployments (before code change)
TERM    sidekiq:stop   Graceful shutdown (60s timeout)                 Normal shutdown
INT     Same as TERM   Graceful shutdown                               Ctrl+C / manual stop
TTIN    N/A            Print thread backtraces to log                  Debugging hung jobs
KILL    Force kill     Immediate termination (no cleanup)              Emergency only

What Happens to Jobs During Quiet?

Jobs Already Running

Continue uninterrupted until completion. The 60s timeout does not apply while quiet; it starts only when the restart sends TERM.

Jobs in Redis Queue

⏸️ Remain queued - will be processed after new Sidekiq starts

New Jobs Enqueued During Deployment

⏸️ Remain queued - will be processed after new Sidekiq starts
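
The three cases above can be illustrated with a toy model. This is a sketch of the behavior described in this section, not Sidekiq's real implementation: the ToyProcess class and the job names are hypothetical, and a plain array stands in for the Redis queue.

```ruby
# Toy model of one Sidekiq process; NOT Sidekiq's real implementation.
class ToyProcess
  attr_reader :running, :finished

  def initialize(queue)
    @queue = queue   # stands in for the Redis queue
    @running = []    # jobs currently executing
    @finished = []   # jobs that ran to completion
    @quiet = false
  end

  # Equivalent of receiving TSTP: stop fetching, keep working.
  def quiet!
    @quiet = true
  end

  # Pull one job off the queue unless the process is quiet.
  def fetch
    return if @quiet || @queue.empty?
    @running << @queue.shift
  end

  # Let every in-flight job run to completion.
  def finish_running
    @finished.concat(@running)
    @running.clear
  end
end

queue = ["edi_sync", "send_email", "rebuild_index"]
process = ToyProcess.new(queue)

process.fetch            # starts "edi_sync"
process.quiet!           # deployment begins
process.fetch            # ignored: the process is quiet
queue << "new_order"     # enqueued during deployment
process.finish_running   # "edi_sync" completes on old code

puts "finished: #{process.finished.inspect}"
puts "still queued: #{queue.inspect}"
```

The running job finishes, while both the previously queued jobs and the newly enqueued one stay in the queue for the restarted process.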

Critical Jobs That Can't Wait

If you have truly critical jobs that must process immediately:

Option 1: Schedule around deployments

# Deploy during low-traffic periods
# Avoid deploying during critical job windows

Option 2: Run separate "critical" Sidekiq process

# config/sidekiq_critical.yml
:concurrency: 2
:queues:
  - [critical, 2]  # Only critical jobs

# config/deploy.rb
# Don't quiet this one during deployments
set :sidekiq_config_files, ['sidekiq.yml']  # Exclude critical

Option 3: Use scheduled jobs instead of immediate

# Instead of perform_async (immediate)
MyWorker.perform_in(5.minutes, args)  # Delayed

Deployment Timing Considerations

How Long Does Quiet Phase Last?

The quiet phase lasts as long as your deployment takes:

Deployment Duration = 
  Upload Assets (~10-30s) +
  Run Migrations (~5-60s) +
  Publish Release (~5s) +
  Other Hooks (~10s)
  ≈ 30-105 seconds typical

During this time:

  • ⏸️ New jobs queue up in Redis (not lost)
  • ✅ Running jobs complete
  • 📊 Monitor queue depth in Sidekiq Web UI
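
To get a feel for how much work piles up, the estimate can be turned into a small calculation. This is a rough sketch: the step durations are the ones listed above, while the enqueue rate of 2 jobs per second and the helper names are assumptions for illustration only.

```ruby
# Per-step durations in seconds, taken from the estimate above.
STEPS = {
  upload_assets:   10..30,
  run_migrations:  5..60,
  publish_release: 5..5,
  other_hooks:     10..10
}.freeze

# Sum the lower and upper bounds to get the total quiet window.
def quiet_window(steps)
  steps.values.sum(&:min)..steps.values.sum(&:max)
end

# Jobs waiting in Redis when workers restart, for a given enqueue rate.
def expected_backlog(jobs_per_second, window)
  (jobs_per_second * window.min).ceil..(jobs_per_second * window.max).ceil
end

window = quiet_window(STEPS)
puts "quiet window: #{window.min}-#{window.max}s"

backlog = expected_backlog(2.0, window) # 2 jobs/s is an assumed rate
puts "backlog at restart: #{backlog.min}-#{backlog.max} jobs"
```

With these inputs the window is 30-105 seconds and the backlog at restart is 60-210 jobs; substitute your own measured enqueue rate.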

If Queue Builds Up

Most jobs can wait 30-60 seconds, but if queues grow too large:

Solution 1: Faster deployments

  • Optimize asset compilation (already done with local builds)
  • Use zero-downtime migrations (already common practice)
  • Parallelize upload tasks

Solution 2: Multiple worker servers

# Deploy to servers one at a time (rolling deployment)
# Some workers always available

Solution 3: Pre-quiet strategy

# Quiet, then wait 30 seconds so in-flight jobs can finish before deployment
before :starting, 'sidekiq:custom_quiet_and_wait'

namespace :sidekiq do
  task :custom_quiet_and_wait do
    invoke 'sidekiq:quiet'
    puts "Waiting 30s for running jobs to finish..."
    sleep 30
  end
end
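
A fixed sleep waits the full 30 seconds even when workers are already idle. One possible refinement is to poll until no jobs are running. The sketch below is an assumption-laden illustration: wait_until_idle is a hypothetical helper, and the busy counter is injected as a callable so the example runs without Redis. In a real task the callable might wrap something like Sidekiq::Workers.new.size, run on a host with Redis access.

```ruby
# Polls `busy_count` (any callable returning the number of in-flight jobs)
# until it reports zero or `timeout` seconds have been waited.
# Returns true if workers went idle, false on timeout.
def wait_until_idle(busy_count, timeout: 30, interval: 1, sleeper: ->(s) { sleep(s) })
  waited = 0
  loop do
    return true if busy_count.call.zero?
    return false if waited >= timeout
    sleeper.call(interval)
    waited += interval
  end
end

# Demo with a fake counter that drains over three polls, without real sleeping.
readings = [2, 1, 0]
fake_busy = -> { readings.shift || 0 }
idle = wait_until_idle(fake_busy, timeout: 30, interval: 1, sleeper: ->(_s) {})
puts "idle before timeout: #{idle}"
```

The timeout keeps the deployment from blocking indefinitely on a long-running job.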

Monitoring and Verification

After Deployment

# SSH to production server
ssh deploy@chi-vultr-heatwave-util1

# Check all Sidekiq services are running
systemctl status 'sidekiq*.service' --no-pager

# Check processes are using new code
ps aux | grep sidekiq
# Look for new PID and recent start time

# Check logs for clean restart
journalctl -u sidekiq-heatwave-production-sidekiq -n 50

# Monitor queue in Sidekiq Web UI
# https://crm.warmlyyours.me:3000/sidekiq
# Check for:
# - Queue depth (should drain after restart)
# - No Sidekiq::Shutdown errors in dead jobs
# - Processed jobs resuming

In Rollbar

Before this change:

  • Frequent Sidekiq::Shutdown exceptions
  • Jobs interrupted during API calls
  • Incomplete data synchronization

After this change:

  • No Sidekiq::Shutdown during deployments
  • Jobs complete or wait in queue
  • Clean shutdowns only

Troubleshooting

Queue Not Processing After Deployment

Symptom: Jobs stuck in queue, not processing

Check:

# Are Sidekiq processes running?
systemctl status 'sidekiq*.service'

# If not running, start them
cap production sidekiq:start

# Check logs
journalctl -u sidekiq-heatwave-production-sidekiq -f

Jobs Still Being Interrupted

Symptom: Still seeing Sidekiq::Shutdown in Rollbar

Possible causes:

  1. Jobs exceed 60s timeout

    • Solution: Increase timeout or break into smaller jobs
    • See doc/SIDEKIQ_GRACEFUL_SHUTDOWN.md for details
  2. Manual restarts during deployment

    • Check: Are you running cap sidekiq:restart manually?
    • Solution: Let Capistrano handle restarts automatically
  3. Systemd watchdog killing jobs

    • Check: journalctl for "Watchdog timeout"
    • Solution: Increase WatchdogSec in service file
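
For cause 1, breaking a long job into smaller ones could look like the sketch below. The class names, the batch size of 100, and the ENQUEUED array are all hypothetical stand-ins so the example runs on its own; in the real app the job classes would include Sidekiq::Job and perform_async would push to Redis.

```ruby
# Records what would be enqueued; stands in for Redis in this sketch.
ENQUEUED = []

# Hypothetical job that handles one small slice of ids.
class SyncBatchJob
  # Stub for Sidekiq's perform_async: just records the arguments.
  def self.perform_async(ids)
    ENQUEUED << ids
  end
end

# Hypothetical parent job that fans out instead of doing all the work itself.
class SyncAllJob
  BATCH_SIZE = 100

  # Splits a large id list into many short jobs, each intended to
  # finish well inside the shutdown timeout.
  def perform(ids)
    ids.each_slice(BATCH_SIZE) { |slice| SyncBatchJob.perform_async(slice) }
  end
end

SyncAllJob.new.perform((1..250).to_a)
puts "enqueued #{ENQUEUED.size} batches of sizes #{ENQUEUED.map(&:size).inspect}"
```

Each batch becomes an independent job, so a restart interrupts at most one short slice rather than the whole run.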

Deployment Hangs at "Quieting Sidekiq"

Symptom: Deployment stuck at sidekiq:quiet task

Check:

# Are Sidekiq processes responding?
ssh deploy@server "systemctl is-active 'sidekiq*.service'"

# Can you manually quiet?
ssh deploy@server 'systemctl kill -s TSTP sidekiq-heatwave-production-sidekiq.service'

Solution:

  • Increase SSH timeout
  • Check network connectivity
  • Verify systemd is responsive

Rollback Strategy

If a deployment fails or needs rollback:

# Roll back to the previous release
cap production deploy:rollback

# Sidekiq will restart with previous code version
# Jobs in queue will process with rolled-back code

Best Practices Summary

  1. Always use quiet before deployment (configured automatically)
  2. Let Capistrano manage Sidekiq lifecycle (don't restart manually)
  3. Keep jobs under 60 seconds when possible
  4. Make jobs idempotent (safe to retry)
  5. Monitor queue depth during deployments
  6. Deploy during low-traffic periods for critical systems
  7. Test deployments in staging with realistic job load
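
Practice 4 (idempotent jobs) is what makes a retry after an interrupted run harmless. A minimal sketch, assuming a hypothetical SendInvoiceJob and using in-memory collections where a real app would use a durable store such as a database unique index:

```ruby
require "set"

# PROCESSED stands in for a durable record of completed work;
# SENT records the side effect so it can be inspected.
PROCESSED = Set.new
SENT = []

# Hypothetical job; a real one would include Sidekiq::Job.
class SendInvoiceJob
  def perform(invoice_id)
    # A retry of already-completed work becomes a no-op.
    return :skipped if PROCESSED.include?(invoice_id)

    SENT << invoice_id       # the side effect
    PROCESSED << invoice_id  # record completion only after it succeeded
    :sent
  end
end

job = SendInvoiceJob.new
first  = job.perform(42)
second = job.perform(42) # simulated retry after a restart
puts "first=#{first} second=#{second} sent=#{SENT.inspect}"
```

The second call is skipped, so the invoice is sent once even though the job ran twice.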

Last Updated: October 10, 2025
Configuration Version: Sidekiq Pro 7.3.x, Rails 7.0.8.7, Capistrano 3.19.2