Sidekiq Graceful Shutdown Configuration

Problem

During deployments, Sidekiq workers were being forcibly terminated before jobs could complete, resulting in Sidekiq::Shutdown exceptions. This was particularly problematic for:

Long-running EDI API calls (Amazon, etc.)
HTTP requests with slow response times
Jobs that take longer than the default 25-second timeout

Example Error:

Sidekiq::Shutdown at HTTP::Connection#read_headers!

Solution

We've implemented a three-layer graceful shutdown strategy:

1. Sidekiq Timeout Configuration

File: config/initializers/sidekiq.rb

config[:timeout] = 60

What it does: Tells Sidekiq to wait up to 60 seconds for running jobs to complete before forcing shutdown
Why 60 seconds: Accommodates typical EDI API call times, including retries and network delays
How it works: When Sidekiq receives a TERM signal (during deployment), it:
1. Stops accepting new jobs immediately
2. Waits for running jobs to complete (up to 60 seconds)
3. Interrupts only the jobs still running at the timeout (they receive Sidekiq::Shutdown and are pushed back onto their queue)
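
The three steps above can be modelled in a few lines of plain Ruby. This is a simplified sketch, not Sidekiq's implementation: SimulatedShutdown, run_job, and drain are hypothetical names, and the timings are scaled down from 60 seconds to fractions of a second so the example finishes quickly.

```ruby
# Simplified model of "wait up to a timeout, then interrupt what is left".
# SimulatedShutdown is a hypothetical stand-in for Sidekiq::Shutdown.
class SimulatedShutdown < Interrupt; end

# Runs a fake job in its own thread and records how it ended.
def run_job(name, seconds, results)
  Thread.new do
    begin
      sleep seconds                 # pretend to wait on a slow API
      results[name] = :completed
    rescue SimulatedShutdown
      results[name] = :interrupted  # the timeout expired first
    end
  end
end

# Waits for all threads until a shared deadline, then interrupts stragglers.
# Returns the number of jobs that had to be interrupted.
def drain(threads, timeout:)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  threads.each do |thread|
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    thread.join([remaining, 0].max)
  end
  stragglers = threads.select(&:alive?)
  stragglers.each { |thread| thread.raise(SimulatedShutdown) }
  stragglers.each(&:join)
  stragglers.size
end

results = {}
threads = [run_job(:fast, 0.05, results), run_job(:slow, 5, results)]
interrupted = drain(threads, timeout: 0.3)
puts "interrupted=#{interrupted} results=#{results}"
```

The fast job finishes inside the grace period and is left alone; only the slow job is interrupted, which is the behavior the timeout is meant to give real jobs.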

2. Systemd Service Timeout

File: config/deploy/templates/sidekiq.service.capistrano.erb

TimeoutStopSec=90

What it does: Tells systemd to wait 90 seconds for Sidekiq to gracefully shut down
Why 90 seconds: Must be longer than Sidekiq's timeout (60s) + buffer for cleanup (30s)
How it works: If the Sidekiq process has not exited within 90 seconds of the stop request, systemd sends SIGKILL

3. Capistrano Configuration

File: config/deploy.rb

set :sidekiq_timeout, 60

What it does: Ensures Capistrano waits for Sidekiq to shut down properly before proceeding with deployment
Why it matches: Should align with Sidekiq's internal timeout for consistency
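
Because the three values depend on each other, it can help to check them together whenever one changes. The helper below is a hypothetical sketch (the name shutdown_timeout_problems and the 30-second default buffer are assumptions taken from the reasoning above, not part of Sidekiq, systemd, or Capistrano):

```ruby
# Hypothetical helper: checks that the three timeouts are mutually consistent.
# Returns a list of problems; an empty list means the settings agree.
def shutdown_timeout_problems(sidekiq:, systemd:, capistrano:, buffer: 30)
  problems = []
  if systemd < sidekiq + buffer
    problems << "TimeoutStopSec (#{systemd}s) should be at least #{sidekiq + buffer}s"
  end
  if capistrano != sidekiq
    problems << ":sidekiq_timeout (#{capistrano}s) should match Sidekiq's timeout (#{sidekiq}s)"
  end
  problems
end

# Current settings: no problems reported.
puts shutdown_timeout_problems(sidekiq: 60, systemd: 90, capistrano: 60).inspect

# Sidekiq timeout raised to 120 without touching the other two: both checks fail.
puts shutdown_timeout_problems(sidekiq: 120, systemd: 90, capistrano: 60).inspect
```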

How Graceful Shutdown Works

Normal Shutdown Flow (During Deployment)

1. Capistrano triggers Sidekiq restart via sidekiq:restart_noblock
2. Systemd sends SIGTERM to Sidekiq process
3. Sidekiq enters shutdown mode:
   - Stops fetching new jobs from Redis
   - Marks itself as "quiet" (won't accept work)
   - Waits for currently executing jobs to complete
4. Jobs have 60 seconds to finish:
   - Jobs that complete within 60s: ✅ Success, no errors
   - Jobs exceeding 60s: ⚠️ Receive Sidekiq::Shutdown exception
5. Sidekiq exits cleanly after all jobs complete or timeout
6. Systemd starts new Sidekiq process with updated code
7. Capistrano continues deployment

Timeline Example

T+0s   → Deployment starts, systemd sends SIGTERM
T+0s   → Sidekiq stops accepting new jobs
T+0s   → Running EDI job continues (waiting for API response)
T+45s  → EDI job completes successfully ✅
T+46s  → Sidekiq shuts down gracefully
T+47s  → New Sidekiq process starts with updated code

What Happens to Long-Running Jobs?

Jobs that finish within 60 seconds of SIGTERM:

Complete normally
No exceptions raised
Results saved successfully

Jobs still running 60 seconds after SIGTERM:

Receive a Sidekiq::Shutdown exception at the 60-second mark

Can rescue this exception to clean up, as long as they re-raise it:

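def perform(*args)
  call_slow_api(*args)  # placeholder for the job's real work
rescue Sidekiq::Shutdown => e
  # Log the interruption (e.message) and save partial progress if possible.
  # No manual re-enqueue is needed: Sidekiq pushes jobs that are still
  # running at the timeout back onto their queue.
  raise  # Always re-raise so Sidekiq can finish shutting down
end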
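
One detail matters when writing this handler: Sidekiq::Shutdown inherits from Interrupt rather than StandardError, so a bare rescue (or rescue => e) does not catch it and the class has to be named explicitly. The sketch below shows the difference using a stand-in class with the same ancestry so it runs without Sidekiq loaded; FakeShutdown and caught_by are hypothetical names.

```ruby
# Stand-in with the same ancestry as Sidekiq::Shutdown (an Interrupt subclass).
# FakeShutdown is hypothetical and exists only for this sketch.
class FakeShutdown < Interrupt; end

# Runs the block and reports which rescue clause, if any, caught the error.
def caught_by
  yield
  :nothing_raised
rescue StandardError
  :bare_rescue        # this is what `rescue => e` would catch
rescue FakeShutdown
  :explicit_rescue    # only reached because the class is named
end

puts caught_by { raise ArgumentError, "bad input" }  # handled by the bare rescue
puts caught_by { raise FakeShutdown }                # skips the bare rescue
```

In practice this means existing error handling written as rescue => e will not accidentally swallow a shutdown, but it also will not run any cleanup for it.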

Jobs still running 90 seconds after SIGTERM (for example, because they swallowed Sidekiq::Shutdown):

Forcibly killed by systemd (SIGKILL)
No opportunity to handle gracefully
Solution: Break these into smaller jobs or use batch processing

Deployment Impact

Before These Changes

❌ Jobs killed immediately or within 25 seconds
❌ Frequent Sidekiq::Shutdown exceptions in Rollbar
❌ Incomplete EDI synchronizations
❌ Lost API responses

After These Changes

✅ Jobs have 60 seconds to complete gracefully
✅ Significantly fewer Sidekiq::Shutdown errors
✅ API calls can complete before shutdown
✅ Better data consistency

Next Deployment Steps

Required Actions

On the next deployment, Capistrano regenerates the systemd service file with the new TimeoutStopSec setting.

No manual intervention is required.

Verification

After deployment, verify the configuration:

# SSH to production server
ssh deploy@chi-vultr-heatwave-util1

# Check systemd service timeout
systemctl cat sidekiq-heatwave-production-sidekiq.service | grep TimeoutStopSec
# Should show: TimeoutStopSec=90

# Check Sidekiq is running
systemctl status sidekiq-heatwave-production-sidekiq.service

# Monitor next deployment logs
tail -f /var/www/heatwave/shared/log/sidekiq.log

Monitor for Success

After deployment, check Rollbar for:

Expected: Significant reduction in Sidekiq::Shutdown errors
Monitor: Any jobs that still exceed 60 seconds (may need timeout adjustment)

Tuning Recommendations

If Jobs Still Fail (Exceed 60 Seconds)

Consider these approaches:

Option 1: Increase Timeout (Simple)

# config/initializers/sidekiq.rb
config[:timeout] = 120  # Increase to 2 minutes

# config/deploy/templates/sidekiq.service.capistrano.erb
TimeoutStopSec=150  # Must be longer than Sidekiq timeout

# config/deploy.rb
set :sidekiq_timeout, 120

When to use: Jobs legitimately need more time to complete

Option 2: Break Into Smaller Jobs (Better)

# Instead of one long job:
def perform
  fetch_inventory    # 30s
  process_inventory  # 40s
  sync_to_database   # 30s
end

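# Break into separate jobs that run in sequence. Start the chain with:
FetchInventoryWorker.perform_async
# FetchInventoryWorker#perform ends by calling ProcessInventoryWorker.perform_async,
# and ProcessInventoryWorker#perform ends by calling SyncInventoryWorker.perform_async.
# (Enqueuing all three at once would let them run in parallel, out of order.)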

When to use: Jobs can be logically decomposed
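
The chaining idea can be sketched without Sidekiq or Redis. The class below is an in-memory model only: InventoryChain, its queue, and its step names are hypothetical, and a real implementation would use three worker classes that each call the next worker's perform_async at the end of perform.

```ruby
# In-memory model of a job chain, so it runs without Sidekiq or Redis.
# InventoryChain and its step names are hypothetical.
class InventoryChain
  attr_reader :log

  def initialize
    @queue = []   # stands in for the Sidekiq queue
    @log = []     # records the order in which steps ran
  end

  # Only the first step is enqueued up front.
  def start
    enqueue(:fetch_inventory)
    self
  end

  # Runs queued steps one at a time, like a single worker thread would.
  def run_all
    send(@queue.shift) until @queue.empty?
    self
  end

  private

  def enqueue(step)
    @queue << step
  end

  def fetch_inventory
    @log << :fetched
    enqueue(:process_inventory)   # enqueued only after this step succeeds
  end

  def process_inventory
    @log << :processed
    enqueue(:sync_to_database)
  end

  def sync_to_database
    @log << :synced               # last step, nothing further to enqueue
  end
end

puts InventoryChain.new.start.run_all.log.inspect
```

With this shape, a shutdown can interrupt at most one short step, and the steps that already finished do not have to be repeated.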

Option 3: Handle Shutdown Gracefully (Best)

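def perform
  # On a re-run, long_running_operation should pick up from the saved checkpoint
  long_running_operation
rescue Sidekiq::Shutdown
  # Save checkpoint/progress
  store_partial_results

  # Do not enqueue a separate resume job here: Sidekiq pushes the interrupted
  # job back onto its queue, so it will run again after the restart.

  # Re-raise so Sidekiq can finish shutting down
  raise
end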

When to use: Jobs can resume from a checkpoint
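
A minimal sketch of the checkpoint idea, assuming the job reads its own checkpoint when it runs again. Everything here is hypothetical: StubShutdown stands in for Sidekiq::Shutdown, ResumableImport is an illustrative class rather than a real worker, and a plain Hash stands in for durable storage such as a database row.

```ruby
# Hypothetical stand-in for Sidekiq::Shutdown so the sketch runs without Sidekiq.
class StubShutdown < Interrupt; end

# Illustrative resumable job; `store` is any Hash-like checkpoint storage.
class ResumableImport
  def initialize(store)
    @store = store
  end

  # Processes items in order, starting from the saved checkpoint if there is one.
  def perform(job_id, items)
    index = @store.fetch(job_id, 0)   # resume point, 0 on the first run
    while index < items.size
      yield items[index]
      index += 1
    end
    @store.delete(job_id)             # finished: clear the checkpoint
    :done
  rescue StubShutdown
    # Save where to resume. The interrupted item is retried on the next run,
    # so processing an item must be safe to repeat.
    @store[job_id] = index if index
    raise
  end
end

store = {}
job = ResumableImport.new(store)
processed = []

begin
  job.perform("import-1", %w[a b c d]) do |item|
    raise StubShutdown if item == "c"   # simulate a deployment mid-job
    processed << item
  end
rescue StubShutdown
  # In production Sidekiq would run the job again; the sketch does so below.
end

job.perform("import-1", %w[a b c d]) { |item| processed << item }
puts processed.inspect
```

The second run skips the items completed before the interruption and continues from the saved index.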

If Jobs Complete Too Quickly

Current timeout (60s) may be excessive if most jobs complete in < 10 seconds:

config[:timeout] = 30  # Faster restarts

Trade-off: Faster deployments vs. job completion safety

Configuration Reference

Current Settings

Setting                                Value  Purpose
Sidekiq timeout (config[:timeout])     60s    Job completion grace period
Systemd timeout (TimeoutStopSec)       90s    Service shutdown deadline
Capistrano timeout (:sidekiq_timeout)  60s    Deployment wait time

Shutdown Signal Handling

Signals relevant to Sidekiq:

TERM (systemd's default stop signal): Graceful shutdown with timeout
INT: Same as TERM
TSTP: Quiet mode (stop fetching new jobs, keep running current ones)
TTIN: Print thread backtraces to log (debugging)
KILL: Immediate termination (no cleanup); cannot be caught or handled by the process
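
For readers unfamiliar with how a Ruby process reacts to these signals, the sketch below shows one common pattern, the self-pipe: the trap handler only writes the signal name to a pipe, and ordinary code reads it and decides what to do. It is an illustration under assumptions, not Sidekiq's source; install_signal_pipe and SIGNAL_ACTIONS are hypothetical names, and the example needs a Unix-like system.

```ruby
# Maps each catchable signal from the list above to the action it stands for.
# KILL is absent because a process cannot trap it.
SIGNAL_ACTIONS = {
  "TERM" => :graceful_shutdown,   # wait for jobs up to the timeout, then exit
  "INT"  => :graceful_shutdown,   # treated the same as TERM
  "TSTP" => :quiet,               # stop fetching new jobs, keep running
  "TTIN" => :dump_backtraces      # debugging aid
}.freeze

# Installs handlers that write the signal name to a pipe. Returns the readable
# end of the pipe and the previous handlers so they can be restored later.
def install_signal_pipe(signals)
  reader, writer = IO.pipe
  previous = signals.to_h do |sig|
    [sig, Signal.trap(sig) { writer.puts(sig) }]
  end
  [reader, previous]
end

reader, previous = install_signal_pipe(SIGNAL_ACTIONS.keys)
Process.kill("TERM", Process.pid)           # same effect as `kill -TERM <pid>`

if IO.select([reader], nil, nil, 5)         # wait up to 5 seconds for delivery
  sig = reader.gets.chomp
  puts "#{sig} -> #{SIGNAL_ACTIONS.fetch(sig)}"
end

# Put the original handlers back.
previous.each { |sig, handler| Signal.trap(sig, handler || "DEFAULT") }
```

Keeping the handler this small matters because very little is safe to do inside a trap block; the real decision is made afterwards in normal code.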

Troubleshooting

Jobs Still Getting Killed

Check systemd logs:

journalctl -u sidekiq-heatwave-production-sidekiq -n 100

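Look for (exact wording varies by systemd and Sidekiq version):

A stop-timeout message from systemd, such as "State 'stop-sigterm' timed out. Killing." - Systemd killed the process (increase TimeoutStopSec)
Sidekiq's shutdown messages after SIGTERM - Confirm when shutdown began and whether jobs finished within the timeout
A warning that Sidekiq is terminating busy threads - Jobs exceeded Sidekiq timeout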

Deployments Taking Too Long

If deployments hang waiting for Sidekiq:

Check for stuck jobs in the Busy tab of the Sidekiq Web UI (sidekiqctl was removed in Sidekiq 6, so it is not available with Sidekiq 7.3.9)
Consider reducing timeout if jobs normally complete quickly
Verify no infinite loops in worker code

Jobs Appearing as Failed

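Jobs interrupted by Sidekiq::Shutdown may be reported as errors (for example in Rollbar). Sidekiq pushes them back onto their original queue, not the retry set, so they run again after the restart:

Expected behavior for jobs still running when the timeout expires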
Solution: Review job duration, break into smaller jobs, or increase timeout

Best Practices

Job Design for Graceful Shutdown

Keep jobs short: Target < 30 seconds when possible
Make jobs idempotent: Can safely retry without side effects
Checkpoint progress: Save intermediate state for long jobs

Handle interruptions:

def perform
  work
rescue Sidekiq::Shutdown
  cleanup_and_save_progress
  raise  # Re-raise so Sidekiq can requeue the job and exit
end

Monitoring Recommendations

Track job duration: Alert on jobs approaching timeout
Monitor shutdown errors: Rollbar Sidekiq::Shutdown count
Review retry queue: Jobs repeatedly interrupted may need redesign
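
Job duration can be tracked with a small server middleware. The sketch below is an assumption-laden example rather than an existing component: SlowJobWarning, its options, and the 80% threshold are hypothetical. The call(job_instance, job_payload, queue) signature with a yield is the shape Sidekiq expects from server middleware, and the class takes a single options hash so it can be constructed the same way directly or through the middleware chain.

```ruby
# Sketch of a Sidekiq server middleware that reports jobs whose runtime gets
# close to the shutdown timeout. SlowJobWarning and its options are hypothetical.
class SlowJobWarning
  def initialize(options = {})
    timeout = options.fetch(:timeout, 60)          # Sidekiq's shutdown timeout
    ratio = options.fetch(:warn_ratio, 0.8)        # warn at 80% of the timeout
    @threshold = timeout * ratio
    @reporter = options.fetch(:reporter, ->(message) { warn message })
  end

  # Sidekiq calls this around every job; the block runs the job itself.
  def call(_job_instance, job_payload, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    if elapsed >= @threshold
      @reporter.call("#{job_payload['class']} on #{queue} ran #{elapsed.round(1)}s " \
                     "(threshold #{@threshold}s)")
    end
  end
end

# Registration in config/initializers/sidekiq.rb, shown as a comment because it
# needs the sidekiq gem:
#   Sidekiq.configure_server do |config|
#     config.server_middleware { |chain| chain.add SlowJobWarning, timeout: 60 }
#   end
```

The reporter defaults to Kernel#warn; pointing it at the error tracker instead would surface slow jobs before a deployment interrupts them.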

Last Updated: October 9, 2025
Configuration Version: Sidekiq 7.3.9, Rails 7.0.8.7