Sidekiq Graceful Shutdown Configuration
Problem
During deployments, Sidekiq workers were being forcibly terminated before jobs could complete, resulting in Sidekiq::Shutdown exceptions. This was particularly problematic for:
- Long-running EDI API calls (Amazon, etc.)
- HTTP requests with slow response times
- Jobs that take longer than the default 25-second timeout
Example Error:
Sidekiq::Shutdown at HTTP::Connection#read_headers!
Solution
We've implemented a three-layer graceful shutdown strategy:
1. Sidekiq Timeout Configuration
File: config/initializers/sidekiq.rb
config[:timeout] = 60
- What it does: Tells Sidekiq to wait up to 60 seconds for running jobs to complete before forcing shutdown
- Why 60 seconds: Accommodates typical EDI API call times, including retries and network delays
- How it works: When Sidekiq receives a TERM signal (during deployment), it:
  - Stops fetching new jobs immediately
  - Waits up to 60 seconds for running jobs to complete
  - Interrupts only the jobs that are still running when the timeout expires
2. Systemd Service Timeout
File: config/deploy/templates/sidekiq.service.capistrano.erb
TimeoutStopSec=90
- What it does: Tells systemd to wait 90 seconds for Sidekiq to gracefully shut down
- Why 90 seconds: Must exceed Sidekiq's timeout (60s) and leave a buffer for cleanup (30s)
- How it works: If the Sidekiq process has not exited within 90 seconds, systemd sends SIGKILL
3. Capistrano Configuration
File: config/deploy.rb
set :sidekiq_timeout, 60
- What it does: Ensures Capistrano waits for Sidekiq to shut down properly before proceeding with deployment
- Why it matches: Should align with Sidekiq's internal timeout for consistency
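The relationship between the three values can be checked mechanically. The following is a minimal sketch, not part of the deployed configuration: `timeout_problems` and its 30-second default buffer are hypothetical names chosen for illustration, and the numbers passed in are the ones from the three files above.

```ruby
# Hypothetical helper: returns a list of human-readable problems with a set
# of shutdown timeouts (all values in seconds). An empty list means the
# layers are consistent with the rules described above.
def timeout_problems(sidekiq:, systemd:, capistrano:, buffer: 30)
  problems = []
  if systemd < sidekiq + buffer
    problems << "systemd TimeoutStopSec (#{systemd}s) should be at least " \
                "#{sidekiq + buffer}s (Sidekiq timeout + #{buffer}s buffer)"
  end
  if capistrano != sidekiq
    problems << "Capistrano sidekiq_timeout (#{capistrano}s) should match " \
                "the Sidekiq timeout (#{sidekiq}s)"
  end
  problems
end

current = timeout_problems(sidekiq: 60, systemd: 90, capistrano: 60)
puts(current.empty? ? "timeouts are consistent" : current)
```

A check like this could run in CI or as a deploy pre-flight step, so that raising one timeout without the others is caught before it reaches production.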
How Graceful Shutdown Works
Normal Shutdown Flow (During Deployment)
- Capistrano triggers Sidekiq restart via sidekiq:restart_noblock
- Systemd sends SIGTERM to Sidekiq process
- Sidekiq enters shutdown mode:
  - Stops fetching new jobs from Redis
  - Marks itself as "quiet" (won't accept work)
  - Waits for currently executing jobs to complete
- Jobs have 60 seconds to finish:
  - Jobs that complete within 60s: ✅ Success, no errors
  - Jobs exceeding 60s: ⚠️ Receive Sidekiq::Shutdown exception
- Sidekiq exits cleanly after all jobs complete or timeout
- Systemd starts new Sidekiq process with updated code
- Capistrano continues deployment
Timeline Example
T+0s → Deployment starts, systemd sends SIGTERM
T+0s → Sidekiq stops accepting new jobs
T+0s → Running EDI job continues (waiting for API response)
T+45s → EDI job completes successfully ✅
T+46s → Sidekiq shuts down gracefully
T+47s → New Sidekiq process starts with updated code
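The wait-then-interrupt behavior can be imitated in plain Ruby to see what a job experiences. This is a simplified sketch, not Sidekiq's implementation: `run_with_shutdown` is a hypothetical helper, the timeouts are scaled down to fractions of a second, and the `Sidekiq::Shutdown` stand-in is defined only when the gem is not loaded (the real class also inherits from `Interrupt`).

```ruby
# Stand-in so the sketch runs outside a Sidekiq process.
unless defined?(Sidekiq::Shutdown)
  module Sidekiq
    class Shutdown < Interrupt; end
  end
end

# Runs the block in a worker thread, then simulates TERM: wait up to
# `timeout` seconds, and raise Sidekiq::Shutdown inside the thread if the
# job is still busy when the deadline passes.
def run_with_shutdown(timeout:, &job)
  outcome = nil
  worker = Thread.new do
    job.call
    outcome = :completed
  rescue Sidekiq::Shutdown
    outcome = :interrupted
  end
  # Thread#join returns nil when the thread is still alive after `timeout`.
  unless worker.join(timeout)
    worker.raise(Sidekiq::Shutdown)
    worker.join
  end
  outcome
end

fast = run_with_shutdown(timeout: 0.5) { sleep 0.05 } # finishes in time
slow = run_with_shutdown(timeout: 0.1) { sleep 2 }    # still busy at deadline
puts "fast job: #{fast}, slow job: #{slow}"
```

The slow job is interrupted in the middle of `sleep`, which mirrors the example error at the top of this document, where the exception surfaced inside a blocking HTTP read.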
What Happens to Long-Running Jobs?
Jobs that finish within 60 seconds of SIGTERM:
- Complete normally
- No exceptions raised
- Results saved successfully
Jobs still running 60 seconds after SIGTERM:
- Receive a Sidekiq::Shutdown exception at the 60-second mark
- Are pushed back onto their queue by Sidekiq and run again after the restart
- Can catch this exception and handle gracefully:
  rescue Sidekiq::Shutdown => e
    # Log the interruption
    # Save partial progress if possible
    # No manual re-enqueue needed: Sidekiq pushes the job back onto its queue
    raise # Re-raise so Sidekiq can finish shutting down
  end
If the Sidekiq process has not exited after 90 seconds:
- It is forcibly killed by systemd (SIGKILL)
- No opportunity to handle gracefully
- Solution: Break long jobs into smaller jobs or use batch processing
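One detail matters when writing the rescue clause: `Sidekiq::Shutdown` inherits from `Interrupt`, not `StandardError`, so a bare `rescue` does not catch it and the class has to be named explicitly. The sketch below demonstrates this with a stand-in class that is defined only when the gem is not loaded; `handled_by` is a hypothetical helper.

```ruby
# Stand-in so the sketch runs outside a Sidekiq process.
unless defined?(Sidekiq::Shutdown)
  module Sidekiq
    class Shutdown < Interrupt; end
  end
end

# Hypothetical helper: reports which rescue clause sees the exception
# raised by the block.
def handled_by
  begin
    yield
    :nothing_raised
  rescue StandardError # what a bare `rescue` catches
    :bare_rescue
  end
rescue Sidekiq::Shutdown
  :explicit_rescue
end

shutdown_result = handled_by { raise Sidekiq::Shutdown }
io_result = handled_by { raise IOError, "read timed out" }
puts "Sidekiq::Shutdown -> #{shutdown_result}, IOError -> #{io_result}"
```

This is also why generic error handling around an HTTP call does not swallow the shutdown signal by accident.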
Deployment Impact
Before These Changes
- ❌ Jobs killed immediately or within 25 seconds
- ❌ Frequent Sidekiq::Shutdown exceptions in Rollbar
- ❌ Incomplete EDI synchronizations
- ❌ Lost API responses
After These Changes
- ✅ Jobs have 60 seconds to complete gracefully
- ✅ Significantly fewer Sidekiq::Shutdown errors
- ✅ API calls can complete before shutdown
- ✅ Better data consistency
Next Deployment Steps
Required Actions
When you deploy next, the systemd service files will be regenerated with the new TimeoutStopSec setting automatically by Capistrano.
No manual intervention required - the changes are applied automatically during deployment.
Verification
After deployment, verify the configuration:
# SSH to production server
ssh deploy@chi-vultr-heatwave-util1
# Check systemd service timeout
systemctl cat sidekiq-heatwave-production-sidekiq.service | grep TimeoutStopSec
# Should show: TimeoutStopSec=90
# Check Sidekiq is running
systemctl status sidekiq-heatwave-production-sidekiq.service
# Monitor next deployment logs
tail -f /var/www/heatwave/shared/log/sidekiq.log
Monitor for Success
After deployment, check Rollbar for:
- Expected: Significant reduction in Sidekiq::Shutdown errors
- Monitor: Any jobs that still exceed 60 seconds (may need timeout adjustment)
Tuning Recommendations
If Jobs Still Fail (Exceed 60 Seconds)
Consider these approaches:
Option 1: Increase Timeout (Simple)
# config/initializers/sidekiq.rb
config[:timeout] = 120 # Increase to 2 minutes
# config/deploy/templates/sidekiq.service.capistrano.erb
# Must be longer than Sidekiq timeout (systemd unit files do not allow trailing comments)
TimeoutStopSec=150
# config/deploy.rb
set :sidekiq_timeout, 120
When to use: Jobs legitimately need more time to complete
Option 2: Break Into Smaller Jobs (Better)
# Instead of one long job:
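def perform
  fetch_inventory    # 30s
  process_inventory  # 40s
  sync_to_database   # 30s
end
# Break into separate jobs that run in sequence. Enqueue only the first;
# each job enqueues the next as its last step (enqueueing all three
# at once would let them run concurrently):
FetchInventoryWorker.perform_async      # kicked off by the caller
# ProcessInventoryWorker.perform_async  # called at the end of FetchInventoryWorker#perform
# SyncInventoryWorker.perform_async     # called at the end of ProcessInventoryWorker#perform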
When to use: Jobs can be logically decomposed
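As a rough illustration of the decomposition, the sketch below chains the three workers so that each one hands off to the next when it finishes. It makes several assumptions: `FakeAsync` is a hypothetical in-memory stand-in for Sidekiq's `perform_async` so the example runs on its own, `batch_id` is an invented argument, and real workers would include `Sidekiq::Job` instead.

```ruby
# Minimal stand-in for perform_async: records the job on an in-memory
# queue instead of pushing it to Redis.
module FakeAsync
  QUEUE = []

  def perform_async(*args)
    QUEUE << [self, args]
  end
end

LOG = []

class SyncInventoryWorker
  extend FakeAsync

  def perform(batch_id)
    LOG << "sync #{batch_id}"
  end
end

class ProcessInventoryWorker
  extend FakeAsync

  def perform(batch_id)
    LOG << "process #{batch_id}"
    SyncInventoryWorker.perform_async(batch_id) # hand off to the last stage
  end
end

class FetchInventoryWorker
  extend FakeAsync

  def perform(batch_id)
    LOG << "fetch #{batch_id}"
    ProcessInventoryWorker.perform_async(batch_id) # hand off to the next stage
  end
end

# The caller enqueues only the first stage.
FetchInventoryWorker.perform_async(42)

# Drain the queue the way a worker process would, one short job at a time.
until FakeAsync::QUEUE.empty?
  klass, args = FakeAsync::QUEUE.shift
  klass.new.perform(*args)
end
puts LOG
```

With this shape, a shutdown in the middle of the pipeline interrupts at most one short stage, and the remaining stages simply run after the restart.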
Option 3: Handle Shutdown Gracefully (Best)
def perform
  long_running_operation # should resume from the saved checkpoint, if any
rescue Sidekiq::Shutdown
  # Save checkpoint/progress
  store_partial_results
  # No manual re-enqueue: Sidekiq pushes the interrupted job back onto
  # its queue, so enqueueing a resume job here would run the work twice
  # Re-raise so Sidekiq can finish shutting down
  raise
end
When to use: Jobs can resume from a checkpoint
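A checkpointed job might look roughly like the following sketch. Everything in it is illustrative: `InventorySyncJob`, the `CHECKPOINTS` hash (a real job would use the database or Redis), and the `interrupt_at` argument, which exists only to simulate a shutdown. The `Sidekiq::Shutdown` stand-in is defined only when the gem is not loaded.

```ruby
# Stand-in so the sketch runs outside a Sidekiq process.
unless defined?(Sidekiq::Shutdown)
  module Sidekiq
    class Shutdown < Interrupt; end
  end
end

# Hypothetical checkpoint store keyed by job.
CHECKPOINTS = {}

class InventorySyncJob
  # Processes items in order, recording how far it got after each one.
  def perform(job_key, items, interrupt_at = nil)
    start = CHECKPOINTS.fetch(job_key, 0)
    processed = []
    items.each_with_index do |item, index|
      next if index < start                          # done in an earlier run
      raise Sidekiq::Shutdown if index == interrupt_at # simulated shutdown
      processed << item
      CHECKPOINTS[job_key] = index + 1               # checkpoint after each item
    end
    CHECKPOINTS.delete(job_key)                      # finished, nothing to resume
    processed
  end
end

job = InventorySyncJob.new
items = %w[a b c d]

begin
  job.perform("sync-1", items, 2) # interrupted before the third item
rescue Sidekiq::Shutdown
  puts "interrupted, checkpoint at #{CHECKPOINTS.fetch("sync-1")}"
end

resumed = job.perform("sync-1", items) # the re-run after the restart
puts "resumed run processed #{resumed.inspect}"
```

Because the checkpoint is written as the job goes, the job does not need to do any work inside a rescue clause during shutdown; the re-run picks up where the first run stopped.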
If Jobs Complete Too Quickly
Current timeout (60s) may be excessive if most jobs complete in < 10 seconds:
config[:timeout] = 30 # Faster restarts
Trade-off: Faster deployments vs. job completion safety
Configuration Reference
Current Settings
| Setting | Value | Purpose |
|---|---|---|
| Sidekiq timeout | 60s | Job completion grace period |
| Systemd timeout | 90s | Service shutdown deadline |
| Capistrano timeout | 60s | Deployment wait time |
Shutdown Signal Handling
Sidekiq responds to these signals:
- TERM (default): Graceful shutdown with timeout
- INT: Same as TERM
- TSTP: Quiet mode (stop accepting new jobs, continue running)
- TTIN: Print thread backtraces to log (debugging)
- KILL: Cannot be caught by Sidekiq; the process is terminated immediately (no cleanup)
Troubleshooting
Jobs Still Getting Killed
Check systemd logs:
journalctl -u sidekiq-heatwave-production-sidekiq -n 100
Look for:
- "Timeout during operation" - Systemd killed it (increase TimeoutStopSec)
- "SIGTERM received" - Check if jobs are honoring timeout
- "Forcing shutdown" - Jobs exceeded Sidekiq timeout
Deployments Taking Too Long
If deployments hang waiting for Sidekiq:
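- Check for stuck jobs in the Busy tab of the Sidekiq Web UI, if it is mounted (sidekiqctl was removed in Sidekiq 6 and is not available with Sidekiq 7.3.9)
- Consider reducing timeout if jobs normally complete quickly
- Verify no infinite loops in worker code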
Jobs Appearing as Failed
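A job interrupted by Sidekiq::Shutdown is pushed back onto its queue and runs again from the start after the restart, so the same job can show an error followed by a second run:
- Expected behavior for jobs exceeding timeout
- Solution: Review job duration, break into smaller jobs, or increase timeout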
Best Practices
Job Design for Graceful Shutdown
- Keep jobs short: Target < 30 seconds when possible
- Make jobs idempotent: Can safely retry without side effects
- Checkpoint progress: Save intermediate state for long jobs
- Handle interruptions:
  def perform
    work
  rescue Sidekiq::Shutdown
    cleanup_and_save_progress
    raise # Allow Sidekiq to handle retry
  end
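Idempotency matters here because an interrupted job runs again after the restart. The following is a minimal sketch of one way to guard a side effect; `sync_order`, `COMPLETED`, and `SENT` are hypothetical, and in production the record of completed work would be a unique database constraint or a Redis key rather than an in-memory Set.

```ruby
require "set"

# Hypothetical record of completed work and of calls made to the partner API.
COMPLETED = Set.new
SENT = []

# Sends each order update at most once, even when the job is run again
# after an interrupted shutdown.
def sync_order(order_id)
  return :skipped if COMPLETED.include?(order_id)

  SENT << order_id      # stands in for the EDI API call
  COMPLETED << order_id # record success only after the call returns
  :sent
end

first = sync_order(1001)
second = sync_order(1001) # the re-run after a restart
puts [first, second].inspect
```

A shutdown between the API call and the completion record can still cause one repeated call, so this pattern works best when the remote API also tolerates duplicates.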
Monitoring Recommendations
- Track job duration: Alert on jobs approaching timeout
- Monitor shutdown errors: Rollbar Sidekiq::Shutdown count
- Review retry queue: Jobs repeatedly interrupted may need redesign
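Job duration can be tracked with a small server middleware. The sketch below is one possible shape under stated assumptions: the class name `SlowJobWarning`, the 80% threshold, and the injectable logger are hypothetical, while `call(job_instance, job_payload, queue)` with a `yield` is the signature Sidekiq uses for server middleware. The example exercises the class directly with scaled-down numbers so it runs without Sidekiq.

```ruby
# Warns when a job runs longer than a fraction of the shutdown timeout.
class SlowJobWarning
  def initialize(timeout: 60, threshold: 0.8, logger: ->(msg) { warn msg })
    @limit = timeout * threshold
    @logger = logger
  end

  def call(_job_instance, job_payload, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    if elapsed > @limit
      @logger.call("#{job_payload["class"]} on #{queue} took " \
                   "#{elapsed.round(2)}s (limit #{@limit}s)")
    end
  end
end

# In the initializer this would be registered with the defaults, e.g.:
#   config.server_middleware { |chain| chain.add SlowJobWarning }

# Direct usage with a 0.1s "timeout" so the example finishes quickly.
messages = []
middleware = SlowJobWarning.new(timeout: 0.1, threshold: 0.5,
                                logger: ->(msg) { messages << msg })
middleware.call(nil, { "class" => "EdiSyncJob" }, "default") { sleep 0.1 }
middleware.call(nil, { "class" => "QuickJob" }, "default") { }
puts messages
```

Routing the warning to Rollbar or a metrics system instead of the log would make it possible to alert before jobs start hitting the timeout during deployments.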
Last Updated: October 9, 2025
Configuration Version: Sidekiq 7.3.9, Rails 7.0.8.7