Sidekiq Graceful Shutdown Configuration

Problem

During deployments, Sidekiq workers were being forcibly terminated before jobs could complete, resulting in Sidekiq::Shutdown exceptions. This was particularly problematic for:

  • Long-running EDI API calls (Amazon, etc.)
  • HTTP requests with slow response times
  • Jobs that take longer than the default 25-second timeout

Example Error:

Sidekiq::Shutdown at HTTP::Connection#read_headers!

Solution

We've implemented a three-layer graceful shutdown strategy:

1. Sidekiq Timeout Configuration

File: config/initializers/sidekiq.rb

config[:timeout] = 60
  • What it does: Tells Sidekiq to wait up to 60 seconds for running jobs to complete before forcing shutdown
  • Why 60 seconds: Accommodates typical EDI API call times, including retries and network delays
  • How it works: When Sidekiq receives a TERM signal (during deployment), it:
    1. Stops accepting new jobs immediately
    2. Waits for running jobs to complete (up to 60 seconds)
    3. Raises Sidekiq::Shutdown in any job still running when the timeout expires, and pushes those jobs back onto their queue

2. Systemd Service Timeout

File: config/deploy/templates/sidekiq.service.capistrano.erb

TimeoutStopSec=90
  • What it does: Tells systemd to wait 90 seconds for Sidekiq to gracefully shut down
  • Why 90 seconds: Must be longer than Sidekiq's timeout (60s) + buffer for cleanup (30s)
  • How it works: If the Sidekiq process has not exited within 90 seconds of the stop request, systemd sends SIGKILL
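
The relationship between the two timeouts can be checked mechanically. The following is a minimal sketch, not part of the deployed code: `shutdown_timeout_problems` is a hypothetical helper, and the 30-second default buffer is taken from the reasoning above.

```ruby
# Hypothetical helper: reports problems with how the shutdown timeouts nest.
# All values are in seconds; `buffer` is the cleanup margin systemd should
# leave after Sidekiq's own timeout expires.
def shutdown_timeout_problems(sidekiq:, systemd:, buffer: 30)
  problems = []
  if systemd <= sidekiq
    problems << "TimeoutStopSec (#{systemd}s) must exceed the Sidekiq timeout (#{sidekiq}s)"
  elsif systemd < sidekiq + buffer
    problems << "TimeoutStopSec (#{systemd}s) leaves less than #{buffer}s of cleanup buffer"
  end
  problems
end

current = shutdown_timeout_problems(sidekiq: 60, systemd: 90)
risky   = shutdown_timeout_problems(sidekiq: 120, systemd: 90)
puts "current settings: #{current.empty? ? 'ok' : current.join('; ')}"
puts "risky settings:   #{risky.join('; ')}"
```

A check like this could run in CI or a deploy hook so that raising one timeout without the other is caught early.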

3. Capistrano Configuration

File: config/deploy.rb

set :sidekiq_timeout, 60
  • What it does: Ensures Capistrano waits for Sidekiq to shut down properly before proceeding with deployment
  • Why it matches: Should align with Sidekiq's internal timeout for consistency

How Graceful Shutdown Works

Normal Shutdown Flow (During Deployment)

  1. Capistrano triggers Sidekiq restart via sidekiq:restart_noblock
  2. Systemd sends SIGTERM to Sidekiq process
  3. Sidekiq enters shutdown mode:
    • Stops fetching new jobs from Redis
    • Marks itself as "quiet" (won't accept work)
    • Waits for currently executing jobs to complete
  4. Jobs have 60 seconds to finish:
    • Jobs that complete within 60s: ✅ Success, no errors
    • Jobs exceeding 60s: ⚠️ Receive Sidekiq::Shutdown exception
  5. Sidekiq exits once all jobs have completed or the timeout has elapsed
  6. Systemd starts new Sidekiq process with updated code
  7. Capistrano continues deployment
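
The flow above can be imitated in plain Ruby to see how the timeout separates jobs that finish from jobs that are interrupted. This is a simulation under stated assumptions, not Sidekiq's implementation: the `Shutdown` class, the `drain` method, and the job names are stand-ins, and the durations are scaled down from seconds to fractions of a second.

```ruby
# Stand-in for Sidekiq::Shutdown, which also inherits from Interrupt.
class Shutdown < Interrupt; end

# Runs each job in its own thread, then emulates a TERM: wait up to `timeout`
# seconds for jobs to finish and raise Shutdown in any thread still running.
def drain(jobs, timeout:)
  results = {}
  lock = Mutex.new
  threads = jobs.map do |name, duration|
    Thread.new do
      begin
        sleep duration # the "work"
        lock.synchronize { results[name] = :completed }
      rescue Shutdown
        lock.synchronize { results[name] = :interrupted }
      end
    end
  end

  # Wait for all threads, but never past the shared deadline.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  threads.each do |thread|
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    thread.join([remaining, 0].max)
  end

  # Interrupt whatever is still running, then collect the outcomes.
  threads.select(&:alive?).each { |thread| thread.raise(Shutdown) }
  threads.each(&:join)
  results
end

results = drain({ "fast_edi_call" => 0.05, "slow_edi_call" => 2.0 }, timeout: 0.3)
results.each { |name, outcome| puts "#{name}: #{outcome}" }
```

The deadline is measured from the moment shutdown begins, not from when each job started, which is why a job's total runtime matters less than how much work it has left when SIGTERM arrives.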

Timeline Example

T+0s   → Deployment starts, systemd sends SIGTERM
T+0s   → Sidekiq stops accepting new jobs
T+0s   → Running EDI job continues (waiting for API response)
T+45s  → EDI job completes successfully ✅
T+46s  → Sidekiq shuts down gracefully
T+47s  → New Sidekiq process starts with updated code

What Happens to Long-Running Jobs?

Jobs that finish within 60 seconds of SIGTERM:

  • Complete normally
  • No exceptions raised
  • Results saved successfully

Jobs still running 60 seconds after SIGTERM:

  • Receive Sidekiq::Shutdown exception at the 60-second mark (measured from SIGTERM, not from job start)
  • Are pushed back onto their queue by Sidekiq and run again from the beginning once the new process is up
  • Can rescue this exception to clean up, as long as they re-raise it:
    rescue Sidekiq::Shutdown
      # Log the interruption
      # Save partial progress if possible
      raise # Re-raise so Sidekiq can requeue the job
    end

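Processes still alive 90 seconds after SIGTERM:

  • Forcibly killed by systemd (SIGKILL)
  • No opportunity to handle gracefully
  • Should be rare, because Sidekiq normally exits shortly after its own 60-second timeout
  • Solution: Break long jobs into smaller jobs or use batch processing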

Deployment Impact

Before These Changes

  • ❌ Jobs killed immediately or within 25 seconds
  • ❌ Frequent Sidekiq::Shutdown exceptions in Rollbar
  • ❌ Incomplete EDI synchronizations
  • ❌ Lost API responses

After These Changes

  • ✅ Jobs have 60 seconds to complete gracefully
  • ✅ Significantly fewer Sidekiq::Shutdown errors
  • ✅ API calls can complete before shutdown
  • ✅ Better data consistency

Next Deployment Steps

Required Actions

When you deploy next, the systemd service files will be regenerated with the new TimeoutStopSec setting automatically by Capistrano.

No manual intervention required - the changes are applied automatically during deployment.

Verification

After deployment, verify the configuration:

# SSH to production server
ssh deploy@chi-vultr-heatwave-util1

# Check systemd service timeout
systemctl cat sidekiq-heatwave-production-sidekiq.service | grep TimeoutStopSec
# Should show: TimeoutStopSec=90

# Check Sidekiq is running
systemctl status sidekiq-heatwave-production-sidekiq.service

# Monitor next deployment logs
tail -f /var/www/heatwave/shared/log/sidekiq.log

Monitor for Success

After deployment, check Rollbar for:

  • Expected: Significant reduction in Sidekiq::Shutdown errors
  • Monitor: Any jobs that still exceed 60 seconds (may need timeout adjustment)

Tuning Recommendations

If Jobs Still Fail (Exceed 60 Seconds)

Consider these approaches:

Option 1: Increase Timeout (Simple)

# config/initializers/sidekiq.rb
config[:timeout] = 120  # Increase to 2 minutes

# config/deploy/templates/sidekiq.service.capistrano.erb
# Must be longer than the Sidekiq timeout.
# (Keep this comment on its own line: systemd does not support trailing comments.)
TimeoutStopSec=150

# config/deploy.rb
set :sidekiq_timeout, 120

When to use: Jobs legitimately need more time to complete

Option 2: Break Into Smaller Jobs (Better)

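# Instead of one long job:
def perform
  fetch_inventory    # 30s
  process_inventory  # 40s
  sync_to_database   # 30s
end

# Break into separate jobs, each enqueuing the next when it finishes.
# (Enqueuing all three at once would let them run in parallel, out of order.)
class FetchInventoryWorker
  include Sidekiq::Job

  def perform
    fetch_inventory  # 30s
    ProcessInventoryWorker.perform_async
  end
end
# ProcessInventoryWorker#perform likewise ends with SyncInventoryWorker.perform_async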

When to use: Jobs can be logically decomposed

Option 3: Handle Shutdown Gracefully (Best)

def perform
  long_running_operation
rescue Sidekiq::Shutdown
  # Save checkpoint/progress so the next run can resume from it
  store_partial_results

  # Re-raise so Sidekiq pushes this job back onto its queue.
  # Do not enqueue a separate resume job here: the requeued job
  # would run as well, and the work would be done twice.
  raise
end

When to use: Jobs can resume from a checkpoint
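
One way to make a job resumable is to record progress after every unit of work, so that a rerun skips what is already done. The sketch below is illustrative only: `process_with_checkpoint`, the hash used as a checkpoint store, and the `Shutdown` stand-in class are all hypothetical, and a real job would keep the checkpoint in the database or Redis.

```ruby
# Stand-in for Sidekiq::Shutdown so the sketch runs without the gem.
class Shutdown < Interrupt; end

# Processes `items` in order and records, after each item, the index of the
# next unprocessed one. A rerun with the same `store` and `job_key` resumes there.
def process_with_checkpoint(store, job_key, items)
  start = store.fetch(job_key, 0)
  items.each_with_index do |item, index|
    next if index < start      # already handled in an earlier run
    yield item                 # the actual work; may raise Shutdown
    store[job_key] = index + 1 # persist progress after each item
  end
  store.delete(job_key)        # finished: clear the checkpoint
end

store = {}
items = %w[a b c d]
processed = []

# First run: interrupted while working on "c".
begin
  process_with_checkpoint(store, "sync-1", items) do |item|
    raise Shutdown if item == "c"
    processed << item
  end
rescue Shutdown
  # In Sidekiq, the job would be pushed back onto its queue at this point.
end

# Second run (after the restart): resumes at "c".
process_with_checkpoint(store, "sync-1", items) { |item| processed << item }
puts processed.inspect
```

Because the checkpoint is written as the work proceeds, this variant needs no rescue clause inside the job itself; the exception simply propagates and the saved index survives.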

If Jobs Complete Too Quickly

Current timeout (60s) may be excessive if most jobs complete in < 10 seconds:

config[:timeout] = 30  # Faster restarts

Trade-off: Faster deployments vs. job completion safety
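
A timeout can be derived from observed job durations instead of being guessed. The following is a rough sketch under assumptions: `suggested_timeout` is a hypothetical helper, the 99th percentile and 25% headroom are arbitrary starting points, and the floor of 25 seconds mirrors Sidekiq's default timeout.

```ruby
# Hypothetical helper: suggests a shutdown timeout (seconds) from a sample of
# job durations by taking a high percentile and adding headroom.
def suggested_timeout(durations, percentile: 99, headroom: 1.25, floor: 25)
  raise ArgumentError, "no durations given" if durations.empty?

  sorted = durations.sort
  # Nearest-rank percentile: the smallest sample with at least
  # `percentile` percent of all samples at or below it.
  rank = (percentile * sorted.length / 100.0).ceil.clamp(1, sorted.length)
  [(sorted[rank - 1] * headroom).ceil, floor].max
end

mostly_fast = Array.new(99, 4) + [40] # 99 jobs of 4s, one outlier of 40s
mixed       = [10, 20, 30, 40, 50]

puts "mostly fast jobs: #{suggested_timeout(mostly_fast)}s"
puts "mixed jobs:       #{suggested_timeout(mixed)}s"
```

With few samples the percentile collapses to the maximum, so the suggestion is only as good as the duration data behind it.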

Configuration Reference

Current Settings

Setting             Value  Purpose
Sidekiq timeout     60s    Job completion grace period
Systemd timeout     90s    Service shutdown deadline
Capistrano timeout  60s    Deployment wait time

Shutdown Signal Handling

Sidekiq responds to these signals:

  • TERM (default): Graceful shutdown with timeout
  • INT: Same as TERM
  • TSTP: Quiet mode (stop accepting new jobs, continue running)
  • TTIN: Print thread backtraces to log (debugging)
  • KILL: Immediate termination by the kernel; a process cannot trap this signal, so no cleanup runs
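
For readers unfamiliar with signal handling in Ruby, the sketch below shows the general self-pipe pattern for reacting to TSTP and TERM. It is an illustration only, not Sidekiq's code: the `actions` mapping and its symbols are assumptions, and the process sends the signals to itself to keep the example self-contained (it needs a Unix-like system).

```ruby
# Self-pipe pattern: the trap handler only writes the signal name to a pipe,
# and the main flow reads it back and decides what to do.
reader, writer = IO.pipe
%w[TSTP TERM].each do |sig|
  Signal.trap(sig) { writer.puts(sig) }
end

# Assumed mapping from signal to action, following the list above.
actions = {
  "TSTP" => :quiet,   # stop fetching new jobs, keep running current ones
  "TERM" => :shutdown # begin graceful shutdown with the configured timeout
}

handled = []
Process.kill("TSTP", Process.pid)
handled << actions.fetch(reader.gets.chomp)
Process.kill("TERM", Process.pid)
handled << actions.fetch(reader.gets.chomp)

# Restore default handlers so the rest of the program is unaffected.
%w[TSTP TERM].each { |sig| Signal.trap(sig, "DEFAULT") }
puts handled.inspect
```

Sending TSTP first and TERM later is the usual way to quiet a process ahead of a restart, which gives running jobs extra time beyond the shutdown timeout.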

Troubleshooting

Jobs Still Getting Killed

Check systemd logs:

journalctl -u sidekiq-heatwave-production-sidekiq -n 100

Look for messages along these lines (exact wording varies by systemd and Sidekiq version):

  • "Timeout during operation" - Systemd killed it (increase TimeoutStopSec)
  • "SIGTERM received" - Check if jobs are honoring timeout
  • "Forcing shutdown" - Jobs exceeded Sidekiq timeout

Deployments Taking Too Long

If deployments hang waiting for Sidekiq:

  1. Check for stuck jobs on the Busy page of the Sidekiq Web UI (sidekiqctl was removed in Sidekiq 6, so it is not available on 7.3.9)
  2. Consider reducing timeout if jobs normally complete quickly
  3. Verify no infinite loops in worker code

Jobs Appearing as Failed

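Jobs interrupted by Sidekiq::Shutdown can show up as errors in Rollbar. Sidekiq pushes these jobs back onto their queue instead of moving them to the retry queue, so they run again after the restart:

  • Expected behavior for jobs exceeding timeout
  • Solution: Review job duration, break into smaller jobs, or increase timeout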

Best Practices

Job Design for Graceful Shutdown

  1. Keep jobs short: Target < 30 seconds when possible
  2. Make jobs idempotent: Running the same job twice must not duplicate side effects
  3. Checkpoint progress: Save intermediate state for long jobs
  4. Handle interruptions:
    def perform
      work
    rescue Sidekiq::Shutdown
      cleanup_and_save_progress
      raise  # Allow Sidekiq to requeue the job
    end

Monitoring Recommendations

  1. Track job duration: Alert on jobs approaching timeout
  2. Monitor shutdown errors: Rollbar Sidekiq::Shutdown count
  3. Review retry queue: Jobs repeatedly interrupted may need redesign
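
Job duration can be tracked with a small server middleware. The class below is a sketch rather than existing project code: `JobDurationWarning` and its 30-second threshold are assumptions, while the `call(job_instance, job_hash, queue)` shape is what Sidekiq expects of server middleware. It is written as a plain Ruby class so it runs without the gem.

```ruby
require "logger"

# Logs a warning for any job whose runtime reaches `warn_after` seconds.
# Assumed registration, in config/initializers/sidekiq.rb:
#   Sidekiq.configure_server do |config|
#     config.server_middleware { |chain| chain.add JobDurationWarning }
#   end
class JobDurationWarning
  def initialize(logger = Logger.new($stdout), warn_after = 30)
    @logger = logger
    @warn_after = warn_after
  end

  # Sidekiq passes the job instance, the job payload hash, and the queue name;
  # the block runs the job itself.
  def call(_job_instance, job_hash, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    if elapsed >= @warn_after
      @logger.warn("#{job_hash['class']} on #{queue} ran #{elapsed.round(2)}s (threshold #{@warn_after}s)")
    end
  end
end

# Stand-alone demonstration with a tiny threshold and a fake job.
demo = JobDurationWarning.new(Logger.new($stdout), 0.01)
demo.call(nil, { "class" => "EdiSyncWorker" }, "default") { sleep 0.05 }
```

Setting the threshold to about half the shutdown timeout gives early notice of jobs that are drifting toward the point where a deployment would interrupt them.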

Last Updated: October 9, 2025
Configuration Version: Sidekiq 7.3.9, Rails 7.0.8.7