Incident Post-Mortem: Passenger Crash Loop — February 20, 2026

Status: Resolved
Severity: P1 — Full application outage (intermittent, recurring)
Duration: ~5 hours (approx. 10:00–15:00 CST)
Affected systems: App server (chi-vultr-heatwave-web1), Util/Sidekiq server (chi-vultr-heatwave-util1)
Author: Engineering (AI-assisted RCA)

Summary

Following a series of deployments on February 19–20, 2026 — primarily the new 3D bin-packing algorithm for packaging, ShipEngine concurrency improvements, and N+1 query fixes — Phusion Passenger on the web server began crash-looping. Workers grew beyond the configured passenger_memory_limit, which triggered a known segfault bug in Passenger Enterprise 6.1.1, taking down the entire Nginx process tree repeatedly. Simultaneously, all four Sidekiq processes on the util server entered a 244-restart crash loop due to a missing explicit require for the sidekiq-worker-killer gem.

Timeline

Time (CST)	Event
~Feb 19	`shipengine_rb` updated: Faraday retry logic + `ConcurrentRails::Promises` parallel label/rate calls deployed
~Feb 20 09:00	Packaging overhaul deployed: `DeterminePackaging` now runs `PackingCalculator` in hot path; N+1 eager-loads for `inventory_commits`, `catalog_item`, `store_item`
~10:00	Users report slowness; Passenger shows queued requests in APM
~10:15	Passenger workers begin reaching 1.2–1.5 GB RSS and being killed by `passenger_memory_limit`
~10:20	Nginx crash loop begins; kernel OOM killer eventually takes down master process
~10:30	Investigation starts; AppSignal shows rapid RSS growth per worker
~11:00	Root cause #1 identified: O(n³) fallback in `PackingCalculator#catalog_box_for` removed
~11:30	`passenger_memory_limit` adjusted (900 → 1200 → 1500 → 2000 MB over several iterations)
~12:00	`passenger_thread_count` reduced from 6 → 3; `MALLOC_ARENA_MAX=2` added
~12:30	`google-ads-googleads` gem set to `require: false` (saves ~150–250 MB per worker)
~13:00	Passenger 6.1.1 segfault confirmed via `dmesg`; upgrade to 6.1.2 initiated
~14:00	Passenger 6.1.2 activated via `sudo systemctl restart nginx`; segfaults stop
~14:30	Deployment of all fixes; Sidekiq enters crash loop (244 restarts)
~15:14	Root cause identified: `Sidekiq::WorkerKiller` uninitialized constant due to Bundler auto-require failure
~15:20	Server patched with explicit `require 'sidekiq/worker_killer'`; all Sidekiq services running
~15:22	App server confirmed stable: 0 queue, 6 workers healthy, no segfaults, swap near-zero

Root Causes

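There were six independent contributing factors. Each was manageable on its own, but on the web server the combination of RC-1 through RC-4 drove workers past the memory limit, where a Passenger bug (RC-5) turned individual worker kills into full-service crashes. RC-6 was a separate failure that took down Sidekiq on the util server once the fixes were deployed.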

RC-1: O(n³) unconstrained box search in `PackingCalculator` (PRIMARY)

File: app/services/shipping/packing_calculator.rb

The #catalog_box_for method had a two-stage fallback:

# Before fix
def catalog_box_for(dims)
  Item::ShippingBoxCalculator.call(dims, candidate_boxes: @catalog) ||
    Item::ShippingBoxCalculator.call(dims)   # ← unconstrained fallback
end

When the catalog lookup returned nil, it called Item::ShippingBoxCalculator.call(dims) without candidate_boxes:, which triggered a mathematical search over all possible box dimensions. This is an O(n³) or worse operation that allocates thousands of intermediate Ruby objects per invocation. On any delivery with an unusual item dimension, a single request could spike a worker’s RSS by 300–500 MB in milliseconds.

Fix: Remove the unconstrained fallback entirely. If the catalog cannot find a fit, the calculator returns no solution and the caller falls through to the next strategy.

# After fix
def catalog_box_for(dims)
  Item::ShippingBoxCalculator.call(dims, candidate_boxes: @catalog)
end
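The difference between a catalog-bounded lookup and an open-ended dimension search can be illustrated with a minimal sketch. The Box struct and smallest_fitting_box helper below are hypothetical and are not the real Item::ShippingBoxCalculator; the sketch assumes a single item that may be rotated along its axes, so dimensions are compared after sorting.

```ruby
# Hypothetical illustration only; not the real Item::ShippingBoxCalculator.
Box = Struct.new(:name, :length, :width, :height) do
  def volume
    length * width * height
  end

  def sorted_dims
    [length, width, height].sort
  end
end

# Returns the smallest catalog box that fits dims, or nil when none fits.
# Work is linear in the catalog size; nothing searches arbitrary dimensions.
def smallest_fitting_box(dims, candidate_boxes)
  item = dims.sort
  candidate_boxes
    .select { |box| box.sorted_dims.zip(item).all? { |b, i| b >= i } }
    .min_by(&:volume)
end

catalog = [
  Box.new("small", 8, 6, 4),
  Box.new("medium", 12, 10, 8),
  Box.new("large", 24, 18, 12)
]

fit  = smallest_fitting_box([9, 7, 5], catalog)
none = smallest_fitting_box([30, 30, 30], catalog)

puts fit.name      # the smallest box that fits
puts none.inspect  # nil, so the caller falls through to the next strategy
```

Returning nil when nothing fits keeps the cost bounded by the catalog and leaves the fallback decision to the caller.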

RC-2: `DeterminePackaging` put `PackingCalculator` in the hot path for all deliveries

File: app/services/shipping/determine_packaging.rb (commit f14e53f2f8)

Previously, PackingCalculator was only invoked when no Packing history record existed. After this commit it was invoked as the first strategy for all parcel deliveries without existing packing records — which was most deliveries on first deployment since the from_calculator enum value was brand new (only 1 record in the DB at the time of deployment).

This meant the expensive bin-packing algorithm ran on nearly every request for several hours post-deploy, before the Packing cache table populated, rather than being gradually introduced. This caused a sustained high allocation rate across all workers simultaneously.

Lesson: When introducing a new expensive code path, consider a gradual rollout or a flag to limit it to a percentage of traffic initially. Pre-seeding the cache before deploying the code that depends on it would also have prevented the cold-cache surge.
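A percentage gate of the kind suggested in this lesson could look roughly like the following. It is a sketch under assumptions: ROLLOUT_PERCENT, in_rollout?, and the use of a delivery ID as the bucketing key are all hypothetical and do not exist in the codebase described here.

```ruby
require "zlib"

# Hypothetical rollout gate; the constant and method names are illustrative.
ROLLOUT_PERCENT = 10

# Deterministically maps an ID to a bucket in 0..99, so the same delivery
# gets the same decision on every worker and after every restart.
def in_rollout?(delivery_id, percent = ROLLOUT_PERCENT)
  Zlib.crc32(delivery_id.to_s) % 100 < percent
end

enabled = (1..10_000).count { |id| in_rollout?(id) }
puts "#{enabled} of 10000 deliveries would use the new strategy"
```

Hashing the ID rather than calling rand keeps the decision stable per delivery, which makes a slow ramp (10, 25, 50, 100 percent) easier to reason about while the Packing table fills.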

RC-3: N+1 eager-load fix significantly increased objects per request

File: app/services/shipping/determine_packaging.rb (commit dd28319379)

The N+1 query fix was correct — it eliminated 6 hotspot queries. However, the fix used includes(:inventory_commits, :catalog_item, :store_item) on line items, which hydrates significantly more ActiveRecord objects into memory per request. Before the fix, records were loaded lazily (and often not loaded at all); after, they were always loaded upfront.

On a delivery with 20 line items, this can mean loading 60–100 additional records per request, each holding Ruby object memory. This raised the baseline allocation per request and caused workers to accumulate dirty RSS pages faster.

Lesson: N+1 fixes are usually the right call, but profile their memory impact alongside query impact. rack-mini-profiler or AppSignal custom instrumentation can capture both.
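A coarse way to compare allocation counts before and after such a change, using only core Ruby, is sketched below. It is a stand-in for a real profiler, not a replacement; the LineItem struct and the 20-item, 3-association shape are hypothetical placeholders for ActiveRecord rows.

```ruby
# Counts Ruby objects allocated while the block runs, using GC.stat.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Plain structs stand in for ActiveRecord rows (hypothetical data).
LineItem = Struct.new(:id, :associations)

# Lazy style: associations are never materialized.
lazy = allocated_objects { Array.new(20) { |i| LineItem.new(i, nil) } }

# Eager style: three associated objects are built for every line item.
eager = allocated_objects do
  Array.new(20) { |i| LineItem.new(i, Array.new(3) { Object.new }) }
end

puts "lazy-style load:  #{lazy} objects"
puts "eager-style load: #{eager} objects"
```

The absolute numbers vary by Ruby version; the useful signal is the ratio between the two paths for a representative request.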

RC-4: ShipEngine concurrency changes increased peak memory and thread hold time

Files: app/services/shipping/shipengine_base.rb (commit 209beba90d), shipengine_rb gem (commit fcd3cf7d4b)

Two changes compounded each other:

ConcurrentRails::Promises.future was added to parallelize label PDF downloads and USPS/CanadaPost MPS rate calls. During a rate fetch, the worker holds 2–4 HTTP response payloads simultaneously in memory (instead of sequentially).
Faraday retry logic in shipengine_rb retries timed-out or failed connections up to 3 times with exponential backoff. A single label download failure now holds a Faraday response object + connection in memory for 10–30 seconds during retries.

Combined: a single labelling request that hits a ShipEngine timeout can temporarily spike a worker by 50–150 MB while retries are in flight. With 3 threads per worker process, three simultaneous retry storms can spike a worker by up to 450 MB (3 × 150 MB).

Lesson: When adding concurrency via futures/promises, calculate worst-case peak memory as (payload_size × concurrent_futures × retry_attempts) and ensure it fits within the memory budget before deploying.
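The budget calculation from this lesson can be written down as a small helper. This is a minimal sketch: the method names are hypothetical, and the 5 MB payload with 4 futures is an illustrative input rather than a measured value. The 150 MB per-request spike and the 3 threads come from the figures above.

```ruby
# Worst-case peak for one request, following the formula in the lesson:
# payload_size x concurrent_futures x retry_attempts.
def worst_case_request_mb(payload_mb:, concurrent_futures:, retry_attempts:)
  payload_mb * concurrent_futures * retry_attempts
end

# Worst case for a whole worker if every thread hits a retry storm at once.
def worst_case_worker_mb(per_request_mb:, threads:)
  per_request_mb * threads
end

# Illustrative inputs (assumed, not measured).
per_request = worst_case_request_mb(payload_mb: 5, concurrent_futures: 4, retry_attempts: 3)

# Upper bound observed in this incident: 150 MB per request, 3 threads.
per_worker = worst_case_worker_mb(per_request_mb: 150, threads: 3)

puts "per request: #{per_request} MB"
puts "per worker:  #{per_worker} MB"
```

Comparing the per-worker figure against the gap between the plateau RSS and passenger_memory_limit shows whether a retry storm alone could push a worker over the limit.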

RC-5: Passenger Enterprise 6.1.1 segfault bug (CRITICAL MULTIPLIER)

When a Passenger worker exceeded passenger_memory_limit, Passenger 6.1.1 attempted a graceful worker kill. A NULL pointer dereference bug in PassengerAgent (segfault at address 0x8) caused the entire Nginx process tree to crash — not just the over-limit worker. This turned what should have been routine worker recycling into a full service outage.

dmesg evidence:

PassengerAgent[xxx]: segfault at 8 ip ...+0x341 error 4 in PassengerAgent

This bug was consistently triggered when:

passenger_concurrency_model thread was in use (multi-threaded workers)
A worker was killed for exceeding passenger_memory_limit
Passenger version was exactly 6.1.1 Enterprise

Fix: Upgrade to Passenger Enterprise 6.1.2 (released shortly before the incident). The segfault is absent in 6.1.2.

Lesson: Subscribe to Phusion Passenger Enterprise release notes. Stay within one patch release of current. Test Passenger upgrades in staging before production.

RC-6: Sidekiq crash loop — Bundler auto-require naming mismatch

File: config/initializers/sidekiq.rb

After deploying the Sidekiq::WorkerKiller middleware configuration, all four Sidekiq services entered a continuous crash loop (244 restarts over ~45 minutes):

Error during initialization: uninitialized constant Sidekiq::WorkerKiller
config/initializers/sidekiq.rb:56

The sidekiq-worker-killer gem was correctly in the Gemfile and Gemfile.lock. However, Bundler’s auto-require mechanism derives the require path from the gem name: it tries the name verbatim (require 'sidekiq-worker-killer') and, when no such file exists, retries with every hyphen replaced by a slash (require 'sidekiq/worker/killer'). The actual file in the gem is at lib/sidekiq/worker_killer.rb (underscore, not a nested directory). Neither attempt finds a file, Bundler swallows the resulting LoadError, and the constant is never defined.

Verification:

# Auto-require: silently fails — constant undefined
Bundler.require(:default)
defined?(Sidekiq::WorkerKiller)  # => nil

# Explicit require: works correctly
require 'sidekiq/worker_killer'
defined?(Sidekiq::WorkerKiller)  # => "constant"

Fix: Add an explicit require at the top of the initializer:

require 'sidekiq/worker_killer'

Lesson: A gem whose entry file matches neither the gem name nor the gem name with hyphens replaced by slashes will fail Bundler auto-require silently. Any time you reference a constant from a gem in an initializer, verify before deploying that Bundler.require on its own defines that constant.
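The mismatch can be made concrete with a simplified model of the paths auto-require attempts for a gem that has no explicit require: option. This is an approximation of Bundler's behavior as understood here, not Bundler's own code; both helper names are hypothetical.

```ruby
# Simplified model of Bundler auto-require: try the gem name verbatim,
# then the name with hyphens turned into directory separators.
def auto_require_candidates(gem_name)
  candidates = [gem_name]
  candidates << gem_name.tr("-", "/") if gem_name.include?("-")
  candidates
end

# True when the gem's real entry file is one of the attempted paths.
def auto_required?(gem_name, actual_path)
  auto_require_candidates(gem_name).include?(actual_path)
end

puts auto_require_candidates("sidekiq-worker-killer").inspect
puts auto_required?("sidekiq-worker-killer", "sidekiq/worker_killer")
```

Because sidekiq/worker_killer is not among the attempted paths, the constant stays undefined until something requires that file explicitly.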

Contributing Factors (not root causes)

`google-ads-googleads` eagerly loaded

This 44 MB, 5,481-file gem was unconditionally required on app boot, adding ~150–250 MB of baseline RSS per worker. This didn’t cause the crash, but across 6 HTTP workers it consumed roughly 0.9–1.5 GB of total memory that could have been available for request handling.

Fix: Added require: false in Gemfile; the gem is now loaded on-demand only in the services that need it.

`MALLOC_ARENA_MAX` not set

On 64-bit systems the default glibc allocator permits up to 8 memory arenas per CPU core in a multi-threaded process. With passenger_thread_count 6, the threads in each worker could spread their allocations across several arenas, causing heap fragmentation that retains dirty pages long after objects are freed. This inflated RSS measurements and caused workers to hit passenger_memory_limit sooner than their actual live object count warranted.

Fix: Added passenger_env_var MALLOC_ARENA_MAX 2; to the Nginx config, limiting glibc to 2 arenas per process.
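A boot-time guard could catch a server where this variable was never provisioned. The sketch below is one possible shape for such a check; arena_max_warning is a hypothetical helper, and where it would be called (an initializer, a deploy check) is left open.

```ruby
# Returns a warning string when MALLOC_ARENA_MAX is missing, otherwise nil.
# Accepts any hash-like object so it can be exercised without touching ENV.
def arena_max_warning(env = ENV)
  value = env["MALLOC_ARENA_MAX"]
  return nil unless value.nil? || value.empty?

  "MALLOC_ARENA_MAX is not set; the glibc default arena count applies"
end

warning = arena_max_warning
warn warning if warning
```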

`Restart=on-failure` in Sidekiq systemd units

Sidekiq::WorkerKiller sends SIGTERM to the Sidekiq process, which exits cleanly with code 0. The Restart=on-failure policy only restarts on non-zero exits, so WorkerKiller-triggered shutdowns would leave Sidekiq dead permanently.

Fix: Changed to Restart=always in config/deploy/templates/sidekiq.service.capistrano.erb.

Configuration Changes Made

Setting	Before	After	Rationale
`passenger_memory_limit`	1500 MB	2000 MB	Provide headroom above 1088 MB peak; prevent workers hitting the limit during normal operation
`passenger_thread_count`	6	3	Reduce concurrent allocations per worker; fewer threads = less simultaneous memory pressure
`passenger_env_var MALLOC_ARENA_MAX`	(unset)	`2`	Limit glibc heap fragmentation in multi-threaded workers
`google-ads-googleads`	`require: true` (default)	`require: false`	Save 150–250 MB per worker on boot

Permanent Code Fixes

File	Change
`app/services/shipping/packing_calculator.rb`	Removed O(n³) unconstrained `ShippingBoxCalculator` fallback
`config/initializers/sidekiq.rb`	Added explicit `require 'sidekiq/worker_killer'`
`config/deploy/templates/sidekiq.service.capistrano.erb`	Changed `Restart=on-failure` → `Restart=always`
`Gemfile`	Added `require: false` to `google-ads-googleads`

Memory Budget Analysis (16 GB server)

Understanding how memory is consumed helps size passenger_memory_limit correctly.

Component	Memory
OS + kernel	~500 MB
Nginx master + 4 workers	~160 MB
Passenger watchdog + core	~3 GB (shared CoW from preloader)
AppPreloader (preloader process)	~80 MB
Per HTTP worker — preloaded baseline (shared CoW)	~1.2 GB (but only ~80 MB dirty initially)
Per HTTP worker — dirty RSS growth per 100 requests	~80–120 MB
Per HTTP worker — plateau after 500–1000 requests	~900 MB–1.1 GB dirty RSS
ActionCable workers (2×)	~1 GB combined
Total at steady state (6 HTTP + 2 AC workers)	~7–8 GB

With 16 GB of physical memory, 7–8 GB in use leaves roughly 8 GB for the buffer/cache and headroom for GC spikes. passenger_memory_limit 2000 ensures workers are recycled before they approach a level that would threaten the system.
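As a sanity check on the limit, the ceiling reached if every HTTP worker sat at passenger_memory_limit simultaneously can be computed from the table's figures. This is a rough sketch under stated assumptions: shared copy-on-write pages are not counted separately, the approximate table values are taken at face value, and the 16 GB total comes from the section heading.

```ruby
MB_PER_GB = 1024

# Fixed costs in MB, taken from the budget table.
fixed_costs = {
  os_and_kernel: 500,
  nginx: 160,
  action_cable_workers: 1000
}

http_workers    = 6
memory_limit_mb = 2000             # passenger_memory_limit
physical_mb     = 16 * MB_PER_GB

# Assumption: every HTTP worker is exactly at the limit at the same moment.
ceiling_mb  = http_workers * memory_limit_mb + fixed_costs.values.sum
headroom_mb = physical_mb - ceiling_mb

puts "worst-case ceiling: #{ceiling_mb} MB"
puts "headroom:           #{headroom_mb} MB"
```

Under these assumptions the ceiling stays below physical memory, which is the property the limit is meant to guarantee; rerunning the arithmetic whenever the worker count or limit changes keeps that guarantee explicit.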

Lessons Learned

Profile memory before deploying algorithm-heavy code. Run MemoryProfiler.report or use AppSignal’s heap profiling on any service that introduces new data structures or recursive search. The O(n³) fallback should have been caught in code review.
Warm caches before switching hot paths. When DeterminePackaging was changed to use PackingCalculator as the primary strategy, the Packing table had exactly 1 from_calculator record. A pre-deploy migration or background worker to seed the cache would have avoided the cold-cache surge.
Test Bundler auto-require for gems with compound names. Any gem whose entry file matches neither the gem name nor the gem name with hyphens replaced by slashes needs an explicit require in the code that uses it. The pattern to verify: bundle exec ruby -e "Bundler.require(:default); p defined?(SomeGem::SomeConstant)", substituting the constant you depend on; a result of nil means auto-require did not load it.
Keep Passenger Enterprise within one patch release of current. The 6.1.1 segfault was a known regression that 6.1.2 fixed. A passenger --version check in the deploy script or a Renovate/Dependabot rule for the apt package would have flagged the upgrade.
Calculate concurrent memory budgets for futures/promises. When parallelizing HTTP calls with ConcurrentRails::Promises.future, the peak memory is payload_size × concurrent_futures × max_retries. Add this to the per-request budget before setting max_rss in WorkerKiller.
MALLOC_ARENA_MAX=2 should be the default on all Ruby app servers. This is a well-known tuning for glibc-based systems running multi-threaded Ruby. It should be part of the standard server provisioning playbook.
Set Restart=always (not on-failure) for Sidekiq. WorkerKiller graceful shutdown exits with code 0. on-failure silently leaves Sidekiq dead. always covers both crash and clean-exit scenarios.
Monitor Sidekiq restart counters. systemctl status heatwave_sidekiq_production* shows restart counter is at N. A counter above 5 within a short window should trigger a PagerDuty/AppSignal alert. At counter 244, the service had been looping for ~45 minutes undetected.
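The restart-counter check in the last lesson could be automated along these lines. The sketch assumes that systemctl show with --property=NRestarts prints a line of the form NRestarts=244, which should be verified against the systemd version on the util server. Both helper names are hypothetical, and the "within a short window" condition is not modeled here.

```ruby
# Extracts the restart count from `systemctl show <unit> --property=NRestarts`
# output. Returns nil when the expected line is absent.
def parse_restart_count(output)
  match = output.match(/^NRestarts=(\d+)$/)
  match && match[1].to_i
end

# The threshold of 5 comes from the lesson above.
def restart_alert?(count, threshold: 5)
  !count.nil? && count > threshold
end

sample = "NRestarts=244\n"
count  = parse_restart_count(sample)
puts "restarts=#{count} alert=#{restart_alert?(count)}"
```

In production the sample string would be replaced by the captured command output for each Sidekiq unit, with the boolean forwarded to whatever alerting channel is in use.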

Prevention Checklist for Future Packaging/Memory-Heavy Deployments

Run MemoryProfiler.report on any new service that processes line items or performs combinatorial search
Check AppSignal process_rss trend in staging 24h before deploying to production
Verify Packing table pre-populated or add a feature flag for gradual rollout
For any new gem added to an initializer: verify Bundler.require(:default) defines the constant in isolation
For any concurrent futures added: document worst-case peak memory in the PR description
After deploy, watch passenger-memory-stats for 15 minutes on the first batch of workers

References

AppSignal incident #1371 (closed)
Passenger 6.1.2 release notes — NULL pointer dereference fix in PassengerAgent
config/deploy/templates/sidekiq.service.capistrano.erb — Restart policy
doc/deployment/SIDEKIQ_GRACEFUL_SHUTDOWN.md — WorkerKiller configuration guide
doc/features/SHIPPING_PACKAGING_ALGORITHM.md — PackingCalculator architecture