Sidekiq Graceful Shutdown Configuration
Problem
During deployments, Sidekiq workers were being forcibly terminated before jobs could complete, resulting in Sidekiq::Shutdown exceptions. This was particularly problematic for:
- Long-running EDI API calls (Amazon, etc.)
- HTTP requests with slow response times
- Jobs that take longer than the default 25-second timeout
Example Error:
```
Sidekiq::Shutdown at HTTP::Connection#read_headers!
```
Solution
We’ve implemented a three-layer graceful shutdown strategy:
1. Sidekiq Timeout Configuration
File: config/initializers/sidekiq.rb
```ruby
config[:timeout] = 60
```
- What it does: Tells Sidekiq to wait up to 60 seconds for running jobs to complete before forcing shutdown
- Why 60 seconds: Accommodates typical EDI API call times, including retries and network delays
- How it works: When Sidekiq receives a TERM signal (during deployment), it:
  - Stops fetching new jobs immediately
  - Waits up to 60 seconds for running jobs to complete
  - Interrupts only the jobs still running at that deadline, raising Sidekiq::Shutdown in their threads and pushing them back to Redis so they run again after the restart
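For context, here is a minimal sketch of the initializer, assuming the one-line setting above lives inside the usual Sidekiq.configure_server block (the rest of this app's initializer is not shown here). Sidekiq also accepts the same value as -t 60 on the command line or as :timeout: 60 in config/sidekiq.yml.

```ruby
# config/initializers/sidekiq.rb (sketch; placement inside configure_server is assumed)
Sidekiq.configure_server do |config|
  # Seconds to wait after TERM before interrupting busy jobs and pushing them
  # back to Redis (Sidekiq's default is 25).
  config[:timeout] = 60
end
```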
2. Systemd Service Timeout
File: config/deploy/templates/sidekiq.service.capistrano.erb
```ini
TimeoutStopSec=90
```
- What it does: Tells systemd to wait up to 90 seconds after SIGTERM for Sidekiq to shut down gracefully
- Why 90 seconds: It must be longer than Sidekiq’s timeout; 90 seconds is the 60-second Sidekiq timeout plus a 30-second buffer for cleanup
- How it works: If the Sidekiq process has not exited within 90 seconds of SIGTERM, systemd sends SIGKILL
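Because the 90-second value only makes sense relative to the 60-second Sidekiq timeout, one option is to compute it when the template is rendered instead of hard-coding both numbers. The sketch below uses plain ERB with hypothetical sidekiq_timeout and stop_buffer variables. Whether this project's Capistrano template can read the timeout (for example through fetch(:sidekiq_timeout)) is an assumption to verify.

```ruby
require "erb"

# Hypothetical template text; the real .erb file has other [Service] settings.
UNIT_SNIPPET = <<~UNIT
  [Service]
  TimeoutStopSec=<%= sidekiq_timeout + stop_buffer %>
UNIT

# Renders the snippet so TimeoutStopSec always sits stop_buffer seconds
# above the Sidekiq timeout.
def render_unit_snippet(sidekiq_timeout:, stop_buffer: 30)
  ERB.new(UNIT_SNIPPET).result_with_hash(sidekiq_timeout: sidekiq_timeout, stop_buffer: stop_buffer)
end

puts render_unit_snippet(sidekiq_timeout: 60)
# [Service]
# TimeoutStopSec=90
```

Raising the Sidekiq timeout to 120 would then render TimeoutStopSec=150 without a second edit.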
3. Capistrano Configuration
File: config/deploy.rb
```ruby
set :sidekiq_timeout, 60
```
- What it does: Ensures Capistrano waits for Sidekiq to shut down properly before proceeding with deployment
- Why it matches: Should align with Sidekiq’s internal timeout for consistency
How Graceful Shutdown Works
Normal Shutdown Flow (During Deployment)
- Capistrano triggers a Sidekiq restart via sidekiq:restart_noblock
- Systemd sends SIGTERM to the Sidekiq process
- Sidekiq enters shutdown mode:
  - Stops fetching new jobs from Redis
  - Marks itself as “quiet” (won’t accept work)
  - Waits for currently executing jobs to complete
- Jobs have 60 seconds, counted from SIGTERM, to finish:
  - Jobs that complete within 60s: ✅ Success, no errors
  - Jobs still running at 60s: ⚠️ Receive a Sidekiq::Shutdown exception and are pushed back to Redis to run again
- Sidekiq exits once all jobs have completed or the timeout has expired
- Systemd starts a new Sidekiq process with the updated code
- Capistrano continues the deployment
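To make the sequence above concrete, here is a plain-Ruby simulation of it. It involves no Sidekiq or Redis and is not Sidekiq's actual implementation. Running jobs get until a shared deadline. Whatever is still running at that point is recorded as requeued and interrupted with an exception, the way Sidekiq uses Sidekiq::Shutdown. ShutdownSignal and simulate_term are hypothetical names.

```ruby
# Simulation only. ShutdownSignal stands in for Sidekiq::Shutdown
# (which is also an Interrupt subclass).
class ShutdownSignal < Interrupt; end

# jobs: { "job name" => seconds of work still left when TERM arrives }
def simulate_term(jobs, timeout:)
  finished = Queue.new

  # Quiet: nothing new is fetched, so only the already-running jobs get threads.
  running = jobs.to_h do |name, seconds|
    thread = Thread.new do
      sleep(seconds) # stand-in for an EDI API call
      finished << name
    rescue ShutdownSignal
      # Interrupted at the deadline; the job is never acknowledged.
    end
    [name, thread]
  end

  # Wait until the shared deadline for running jobs to finish on their own.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  running.each_value do |thread|
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    thread.join([remaining, 0].max)
  end

  # Hard shutdown: record what is still running as requeued, then interrupt it.
  requeued = running.select { |_name, thread| thread.alive? }.keys
  requeued.each { |name| running[name].raise(ShutdownSignal) }
  running.each_value { |thread| thread.join(1) }

  { finished: Array.new(finished.size) { finished.pop }, requeued: requeued }
end

result = simulate_term({ "fast_edi_call" => 0.05, "slow_edi_call" => 10 }, timeout: 1.0)
result[:finished] # => ["fast_edi_call"]
result[:requeued] # => ["slow_edi_call"]
```

With a 1-second timeout standing in for 60 seconds, the short job finishes on its own and the long one is interrupted after about a second instead of running for 10.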
Timeline Example
```
T+0s  → Deployment starts, systemd sends SIGTERM
T+0s  → Sidekiq stops accepting new jobs
T+0s  → Running EDI job continues (waiting for API response)
T+45s → EDI job completes successfully ✅
T+46s → Sidekiq shuts down gracefully
T+47s → New Sidekiq process starts with updated code
```
What Happens to Long-Running Jobs?
Jobs that finish within 60 seconds of SIGTERM:
- Complete normally
- No exceptions raised
- Results saved successfully
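Jobs still running 60 seconds after SIGTERM:
- Receive a Sidekiq::Shutdown exception at the 60-second mark
- Are pushed back to Redis by Sidekiq and run again from the start after the restart
- Can catch this exception and clean up, but only briefly, since Sidekiq gives interrupted threads just a few seconds before the process exits. A plain rescue => e will not catch it, because Sidekiq::Shutdown is an Interrupt rather than a StandardError:
```ruby
def perform(*args)
  do_work(*args) # hypothetical job body
rescue Sidekiq::Shutdown
  # Log the interruption and save partial progress if possible (keep it fast)
  logger.warn("#{self.class.name} interrupted by Sidekiq shutdown")
  # No manual re-enqueue: Sidekiq has already pushed this job back to Redis
  raise # Re-raise so Sidekiq does not acknowledge the job
end
```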
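Sidekiq process still alive 90 seconds after SIGTERM:
- Forcibly killed by systemd (SIGKILL)
- No opportunity to handle gracefully
- Should be rare, because Sidekiq exits shortly after its own 60-second deadline; it usually means the process hung during shutdown (for example, a thread stuck in a call that Ruby cannot interrupt)
- Solution: Break long jobs into smaller jobs or use batch processing, and find what blocks the shutdown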
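The key detail is that the 60 seconds run from SIGTERM, not from when the job started. What matters is the work a job still has left at that moment. Here is a tiny sketch of that arithmetic. shutdown_outcome is a hypothetical helper. It ignores the rare SIGKILL case and treats an exact tie as completing, although in practice that edge is a race.

```ruby
# Sketch: what happens to one in-flight job during a deploy. Sidekiq's timeout
# runs from SIGTERM, so only the work remaining at that moment matters.
def shutdown_outcome(total_seconds:, elapsed_at_term:, sidekiq_timeout: 60)
  remaining = total_seconds - elapsed_at_term
  remaining <= sidekiq_timeout ? :completes : :interrupted_and_requeued
end

shutdown_outcome(total_seconds: 100, elapsed_at_term: 50) # => :completes
shutdown_outcome(total_seconds: 120, elapsed_at_term: 10) # => :interrupted_and_requeued
```

So a 100-second job can survive a deploy if it was already halfway done, while a 120-second job that had just started cannot.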
Deployment Impact
Before These Changes
- ❌ Jobs killed immediately or within 25 seconds
- ❌ Frequent Sidekiq::Shutdown exceptions in Rollbar
- ❌ Incomplete EDI synchronizations
- ❌ Lost API responses
After These Changes
- ✅ Jobs have 60 seconds to complete gracefully
- ✅ Significantly fewer Sidekiq::Shutdown errors
- ✅ API calls can complete before shutdown
- ✅ Better data consistency
Next Deployment Steps
Required Actions
When you next deploy, Capistrano will automatically regenerate the systemd service files with the new TimeoutStopSec setting.
No manual intervention is required: the changes are applied automatically during deployment.
Verification
After deployment, verify the configuration:
```bash
# SSH to production server
ssh deploy@chi-vultr-heatwave-util1

# Check systemd service timeout
systemctl cat sidekiq-heatwave-production-sidekiq.service | grep TimeoutStopSec
# Should show: TimeoutStopSec=90

# Check Sidekiq is running
systemctl status sidekiq-heatwave-production-sidekiq.service

# Monitor next deployment logs
tail -f /var/www/heatwave/shared/log/sidekiq.log
```
Monitor for Success
After deployment, check Rollbar for:
- Expected: Significant reduction in Sidekiq::Shutdown errors
- Monitor: Any jobs that still exceed 60 seconds (may need timeout adjustment)
Tuning Recommendations
If Jobs Still Fail (Exceed 60 Seconds)
Consider these approaches:
Option 1: Increase Timeout (Simple)
```ruby
# config/initializers/sidekiq.rb
config[:timeout] = 120 # Increase to 2 minutes
```
```ini
# config/deploy/templates/sidekiq.service.capistrano.erb
# Must be longer than the Sidekiq timeout (systemd only accepts comments on their own line)
TimeoutStopSec=150
```
```ruby
# config/deploy.rb
set :sidekiq_timeout, 120
```
When to use: Jobs legitimately need more time to complete
Option 2: Break Into Smaller Jobs (Better)
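```ruby
# Instead of one long job:
def perform
  fetch_inventory   # 30s
  process_inventory # 40s
  sync_to_database  # 30s
end

# Break into separate jobs. If each step depends on the previous one, have each
# job enqueue the next when it finishes; calling all three perform_async methods
# up front would run them concurrently.
class FetchInventoryWorker
  include Sidekiq::Job

  def perform
    fetch_inventory # 30s
    ProcessInventoryWorker.perform_async
  end
end
# ProcessInventoryWorker and SyncInventoryWorker follow the same pattern.
```
When to use: Jobs can be logically decomposed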
Option 3: Handle Shutdown Gracefully (Best)
```ruby
def perform(checkpoint_id)
  # Resume from the last saved checkpoint, if any (load_checkpoint is hypothetical)
  long_running_operation(resume_from: load_checkpoint(checkpoint_id))
rescue Sidekiq::Shutdown
  # Save checkpoint/progress (must be quick)
  store_partial_results(checkpoint_id)
  # No separate resume job needed: Sidekiq has already pushed this job back
  # to Redis, and the re-run resumes from the checkpoint saved above
  raise # Re-raise so the job is not acknowledged
end
```
When to use: Jobs can resume from a checkpoint
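Here is a plain-Ruby sketch of the checkpoint-and-resume idea, assuming progress can be recorded per item in some durable store. CHECKPOINTS, Interrupted and InventorySyncJob are hypothetical stand-ins for a database or Redis record, Sidekiq::Shutdown, and a real worker. The second run plays the role of the requeued job after the deploy.

```ruby
# Hypothetical names throughout: CHECKPOINTS stands in for a durable store
# (a database row or Redis key), Interrupted for Sidekiq::Shutdown.
class Interrupted < Interrupt; end

CHECKPOINTS = {}

class InventorySyncJob
  attr_reader :synced

  def initialize(interrupt_at: nil)
    @interrupt_at = interrupt_at # item index at which to simulate a deploy
    @synced = []
  end

  def perform(batch_id, items)
    start = CHECKPOINTS.fetch(batch_id, 0) # resume where the last run stopped
    items.each_with_index.drop(start).each do |item, index|
      raise Interrupted if index == @interrupt_at
      @synced << item                   # stand-in for the real work
      CHECKPOINTS[batch_id] = index + 1 # progress saved after every item
    end
    CHECKPOINTS.delete(batch_id)        # done: the next batch starts fresh
  end
end

items = %w[sku-1 sku-2 sku-3 sku-4]
first_run = InventorySyncJob.new(interrupt_at: 2)
begin
  first_run.perform("batch-42", items)
rescue Interrupted
  # In Sidekiq the job would now be pushed back to Redis and run again.
end
second_run = InventorySyncJob.new
second_run.perform("batch-42", items)
second_run.synced # => ["sku-3", "sku-4"]
```

Saving progress after every item, rather than only inside rescue Sidekiq::Shutdown, keeps the rescue path trivial. This matters because interrupted threads get very little time. An item processed just before the interruption but not yet checkpointed would run again, so the per-item work should also be idempotent.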
If Jobs Complete Too Quickly
Because Sidekiq exits as soon as its running jobs finish, a long timeout only slows restarts when a job is still busy at shutdown. If most jobs complete in < 10 seconds, a shorter timeout caps that worst case:
```ruby
config[:timeout] = 30 # Faster restarts
```
Trade-off: Faster deployments vs. job completion safety
Configuration Reference
Current Settings
| Setting | Value | Purpose |
|---|---|---|
| Sidekiq timeout | 60s | Job completion grace period |
| Systemd timeout | 90s | Service shutdown deadline |
| Capistrano timeout | 60s | Deployment wait time |
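The relationships in this table can be checked mechanically: systemd’s limit must sit above the Sidekiq timeout, and Capistrano’s value should equal it. Here is a sketch, assuming the unit text comes from something like systemctl cat sidekiq-heatwave-production-sidekiq.service and that TimeoutStopSec is written as plain seconds. The 10-second min_buffer is an assumed margin, not a systemd or Sidekiq rule.

```ruby
# Sketch: sanity-check the three shutdown settings. Only plain-seconds
# TimeoutStopSec values (e.g. "TimeoutStopSec=90") are parsed.
def shutdown_setting_problems(sidekiq_timeout:, capistrano_timeout:, unit_text:, min_buffer: 10)
  raw = unit_text[/^TimeoutStopSec=(\d+)\s*$/, 1]
  return ["TimeoutStopSec (plain seconds) not found in unit"] unless raw

  stop_sec = raw.to_i
  problems = []
  if stop_sec < sidekiq_timeout + min_buffer
    problems << "TimeoutStopSec=#{stop_sec} leaves less than #{min_buffer}s after the #{sidekiq_timeout}s Sidekiq timeout"
  end
  if capistrano_timeout != sidekiq_timeout
    problems << "Capistrano timeout #{capistrano_timeout}s differs from Sidekiq timeout #{sidekiq_timeout}s"
  end
  problems
end

unit = "[Service]\nTimeoutStopSec=90\n"
shutdown_setting_problems(sidekiq_timeout: 60, capistrano_timeout: 60, unit_text: unit) # => []
```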
Shutdown Signal Handling
Sidekiq responds to these signals:
- TERM (systemd’s default stop signal): Graceful shutdown with timeout
- INT: Same as TERM
- TSTP: Quiet mode (stop fetching new jobs; running jobs continue)
- TTIN: Print thread backtraces to log (debugging)
- KILL: Immediate termination by the operating system; it cannot be caught, so no cleanup runs
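Signal handlers in Ruby run in a restricted context. A common pattern, and roughly what Sidekiq's CLI does, is to have the trap only write the signal name to a pipe and handle it in normal code. Here is a minimal sketch with hypothetical action names. KILL is absent because no process can trap it.

```ruby
# Sketch of the self-pipe pattern: the trap handler only records the signal,
# and ordinary code decides what to do with it. Not Sidekiq's code.
SIGNAL_ACTIONS = {
  "TERM" => :shutdown,
  "INT" => :shutdown,
  "TSTP" => :quiet,
  "TTIN" => :dump_backtraces
}.freeze

def install_signal_pipe(signals)
  reader, writer = IO.pipe
  signals.each do |sig|
    # Keep trap handlers tiny: some operations (e.g. Mutex#lock) are not allowed here.
    Signal.trap(sig) { writer.puts(sig) }
  end
  reader
end

# Blocks until a trapped signal arrives, then maps it to an action.
def next_signal_action(reader)
  SIGNAL_ACTIONS.fetch(reader.gets.chomp)
end

reader = install_signal_pipe(%w[TERM TSTP])
Process.kill("TSTP", Process.pid) # same effect as `kill -TSTP <pid>` from a shell
next_signal_action(reader) # => :quiet
```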
Troubleshooting
Jobs Still Getting Killed
Check systemd logs:
```bash
journalctl -u sidekiq-heatwave-production-sidekiq -n 100
```
Look for:
- A systemd stop timeout, such as “State 'stop-sigterm' timed out. Killing.”: systemd killed the process (increase TimeoutStopSec, or find what kept Sidekiq from exiting)
- Sidekiq’s “Shutting down” message: Sidekiq received the signal; compare it with the exit time to see whether the timeout was honored
- A Sidekiq warning about terminating busy threads: jobs exceeded the Sidekiq timeout and were pushed back to Redis
Deployments Taking Too Long
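If deployments hang waiting for Sidekiq:
- Check for stuck jobs on the Busy page of the Sidekiq Web UI (if mounted) or with Sidekiq::WorkSet from a Rails console; the sidekiqctl command was removed in Sidekiq 6, so bundle exec sidekiqctl busy does not work on Sidekiq 7
- Consider reducing the timeout if jobs normally complete quickly
- Verify there are no infinite loops in worker code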
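As a console-friendly way to see which jobs are holding up a shutdown (sidekiqctl no longer exists on Sidekiq 7), a small filter over the busy list can help. The helper is hypothetical and works on plain hashes. The commented lines show how those hashes might be built from Sidekiq::WorkSet, assuming Sidekiq 7.x exposes #job, #queue and #run_at on each Sidekiq::Work entry.

```ruby
# Sketch: busy jobs that have run longer than threshold_seconds, oldest first.
# Each entry is assumed to be a Hash with :class, :queue and :run_at (a Time).
def long_running_jobs(busy, now: Time.now, threshold_seconds: 60)
  busy.select { |job| now - job[:run_at] > threshold_seconds }
      .sort_by { |job| job[:run_at] }
end

# In a Rails console on Sidekiq 7.x (assumed Sidekiq::Work readers):
# busy = Sidekiq::WorkSet.new.map do |_process_id, _thread_id, work|
#   { class: work.job.display_class, queue: work.queue, run_at: work.run_at }
# end
# long_running_jobs(busy).each { |job| puts "#{job[:class]} on #{job[:queue]} since #{job[:run_at]}" }
```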
Jobs Appearing as Failed
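Sidekiq::Shutdown exceptions can still show up as errors (for example in Rollbar), but the interrupted jobs do not land in the retry queue. Sidekiq pushes them back onto their original queue, and they run again from the start after the restart:
- Expected behavior for jobs still running at the timeout
- Solution: Review job duration, break into smaller jobs, or increase timeout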
Best Practices
Job Design for Graceful Shutdown
- Keep jobs short: Target < 30 seconds when possible
- Make jobs idempotent: Running the same job twice must not duplicate its side effects
- Checkpoint progress: Save intermediate state for long jobs
- Handle interruptions:
```ruby
def perform
  work
rescue Sidekiq::Shutdown
  cleanup_and_save_progress # keep this fast
  raise # Re-raise: Sidekiq has already requeued the job and will run it again
end
```
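Idempotency matters here because Sidekiq runs an interrupted job again from the beginning. Here is a minimal sketch, assuming completed work can be recorded durably. OrderPusher is hypothetical, a Set stands in for the durable record (for example a unique database index), and a counter stands in for the external API call.

```ruby
require "set"

# Sketch of an idempotent step. The Set stands in for a durable record of
# completed work; the counter stands in for external API calls.
class OrderPusher
  attr_reader :api_calls

  def initialize(pushed_ids)
    @pushed_ids = pushed_ids
    @api_calls = 0
  end

  def push(order_ids)
    order_ids.each do |id|
      next if @pushed_ids.include?(id) # already pushed by an earlier, interrupted run
      @api_calls += 1                  # stand-in for the external API call
      @pushed_ids << id                # record completion right after the call
    end
  end
end

pushed = Set.new
OrderPusher.new(pushed).push([101, 102]) # first run gets through two orders before the deploy
rerun = OrderPusher.new(pushed)
rerun.push([101, 102, 103])              # re-run after the restart
rerun.api_calls # => 1
```

Only the new order triggers a call on the re-run. A job interrupted between the API call and recording it would still repeat that one call. Closing that gap needs support on the API side, such as an idempotency key.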
Monitoring Recommendations
- Track job duration: Alert on jobs approaching timeout
- Monitor shutdown errors: Rollbar Sidekiq::Shutdown count
- Watch for repeat interruptions: Jobs interrupted on every deploy may need redesign (interrupted jobs are requeued, so they do not show up in the retry queue)
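One way to get the job-duration signal is a small server middleware that reports jobs whose runtime reaches a share of the shutdown timeout. This is a sketch under stated assumptions: ShutdownBudgetWarning, the 80% ratio and the reporter callback are hypothetical. It relies on Sidekiq accepting any class that responds to call(job_instance, job_hash, queue) as server middleware. Total runtime is not the same as time left after SIGTERM, but jobs whose runtime nears the timeout are the ones a deploy can interrupt.

```ruby
# Hypothetical server middleware: reports jobs whose runtime reaches a share
# of the shutdown timeout.
class ShutdownBudgetWarning
  def initialize(timeout_seconds = 60, ratio = 0.8, reporter = ->(message) { warn(message) })
    @threshold = timeout_seconds * ratio
    @reporter = reporter
  end

  def call(_job_instance, job_hash, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    if elapsed >= @threshold
      @reporter.call("#{job_hash['class']} on #{queue} ran #{elapsed.round(1)}s " \
                     "(warning threshold #{@threshold.round(1)}s)")
    end
  end
end

# Registration sketch for config/initializers/sidekiq.rb:
# Sidekiq.configure_server do |config|
#   config.server_middleware { |chain| chain.add ShutdownBudgetWarning, 60 }
# end
```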
References
- Sidekiq Signals Documentation
- Sidekiq Deployment Best Practices
- systemd Service Configuration
- Capistrano-Sidekiq Documentation
Last Updated: October 9, 2025
Configuration Version: Sidekiq 7.3.9, Rails 7.0.8.7