Class: CatalogItemUrlWorker
- Inherits:
- Object
  - Object
    - CatalogItemUrlWorker
- Includes:
- Sidekiq::Job
- Defined in:
- app/workers/catalog_item_url_worker.rb
Overview
Sidekiq worker that validates a CatalogItem's external retailer URL.
Catalogs in the Oxylabs price-check rotation defer to the authoritative
retailer probe (see #perform); everything else gets a cheap Googlebot-shaped
HTTP GET and records +url_valid+ from the status.
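The routing described above can be sketched in plain Ruby. This is a simplified illustration, not the worker itself: the `route` method, its keyword arguments, and the returned symbols are hypothetical names chosen for the example, and the order of checks mirrors the #perform source shown further down this page.

```ruby
# Hypothetical helper that mirrors the order of checks in #perform.
# Returns a symbol naming the branch the worker would take.
def route(blank_url:, in_rotation:, recent_probe:)
  return :noop if blank_url                    # nothing to validate
  return :funnel_through_probe if in_rotation  # Oxylabs-rotation catalog
  return :skip_recent_probe if recent_probe    # a recent probe is authoritative
  :naive_get                                   # cheap Googlebot-shaped GET
end

puts route(blank_url: false, in_rotation: true,  recent_probe: false)
puts route(blank_url: false, in_rotation: false, recent_probe: true)
puts route(blank_url: false, in_rotation: false, recent_probe: false)
```

Note that rotation membership is checked before the recent-probe shortcut, so a rotation catalog never reaches the naive GET.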
Constant Summary
- HEADERS =
These headers were extracted from a Chrome inspector request and reproduced as a curl request.
{ 'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' }.freeze
- FUNNEL_ENQUEUE_PAUSED_CATALOG_IDS =
Oxylabs-rotation catalogs whose probe is currently unreliable: we still
defer to the probe (skip the naive GET) but do NOT actively enqueue a new
one — it would just burn Oxylabs budget on failures. Remove a catalog here
once its probe is healthy again.
rona.ca re-enabled: it now probes via the universal source + JS render
against sitemap-discovered product URLs (Retailer::RonaUrlDiscovery), and a
stale URL self-heals via the on-failure rediscovery path
(Retailer::UrlRediscovery) before ProbeAutoSkipper would give up on it.
[ CatalogConstants::LOWES_CANADA ].freeze
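The pause-list behaviour described for FUNNEL_ENQUEUE_PAUSED_CATALOG_IDS might be gated roughly as below. The code that consumes the constant is not shown on this page, so everything here is an assumption for illustration: the `funnel_action` method, the numeric catalog IDs, and the returned symbols are hypothetical.

```ruby
# Hypothetical catalog IDs; the real values live in CatalogConstants.
LOWES_CANADA_ID   = 101
OTHER_ROTATION_ID = 202

PAUSED_CATALOG_IDS = [LOWES_CANADA_ID].freeze

# Decide what to do for an item in an Oxylabs-rotation catalog.
# The naive GET is skipped either way; the only question is whether a
# fresh probe gets enqueued.
def funnel_action(catalog_id, paused_ids: PAUSED_CATALOG_IDS)
  return :defer_without_enqueue if paused_ids.include?(catalog_id)

  :enqueue_probe
end

puts funnel_action(LOWES_CANADA_ID)
puts funnel_action(OTHER_ROTATION_ID)
```

Keeping the paused catalogs in a frozen constant means re-enabling a retailer is a one-line change, which matches the "remove a catalog here" instruction above.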
Instance Method Summary
-
#http_get(url) ⇒ Faraday::Response
Fetches +url+ with the Googlebot-shaped User-Agent and a 10s open/read timeout.
-
#perform(catalog_item_id) ⇒ Boolean?
Validates the catalog item's URL.
-
#test_costco ⇒ Integer
Manual smoke-test helper: probe a known Costco product page.
-
#test_walmart ⇒ Integer
Manual smoke-test helper: probe a known Walmart product page.
Instance Method Details
#http_get(url) ⇒ Faraday::Response
Fetches +url+ with the Googlebot-shaped User-Agent and a 10s open/read timeout.
# File 'app/workers/catalog_item_url_worker.rb', line 39
def http_get(url)
  headers = HEADERS.transform_keys(&:to_s)
                   .merge(WebBotAuth::Signer.headers_for(url: url))
  Faraday.new(
    headers: headers,
    request: { open_timeout: 10, timeout: 10 }
  ) { |f| f.adapter Faraday.default_adapter }.get(url)
end
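The `transform_keys(&:to_s)` step is needed because the `'User-Agent':` literal in HEADERS produces a Symbol key, not a String. The sketch below shows the conversion and merge in isolation. `fake_signature_headers` and its `X-Example-Signature` header are hypothetical stand-ins, since the real output of WebBotAuth::Signer.headers_for is not documented on this page.

```ruby
HEADERS = { 'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' }.freeze

# Hypothetical stand-in for WebBotAuth::Signer.headers_for(url:).
# Assumed to return a Hash with String keys.
def fake_signature_headers(_url)
  { 'X-Example-Signature' => 'placeholder' }
end

# The quoted-symbol literal yields a Symbol key.
puts HEADERS.keys.first.class

# transform_keys returns a new, unfrozen Hash with String keys, so the
# merged result has a single, consistent key type.
string_keyed = HEADERS.transform_keys(&:to_s)
merged = string_keyed.merge(fake_signature_headers('https://example.com/p/1'))
puts merged.keys.inspect
```

Because `transform_keys` and `merge` both return new hashes, the frozen HEADERS constant is never mutated.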
#perform(catalog_item_id) ⇒ Boolean?
Validates the catalog item's URL. Oxylabs-rotation catalogs are funnelled
through the authoritative retailer probe (the naive GET is skipped); other
catalogs get the naive GET, still deferring to any recent probe. Blank URLs
and skip_url_checks items are no-ops.
# File 'app/workers/catalog_item_url_worker.rb', line 71
def perform(catalog_item_id)
  catalog_item = CatalogItem.where.not(skip_url_checks: true).find(catalog_item_id)
  return if catalog_item.url.blank?

  # Catalogs in the Oxylabs price-check rotation: the retailer probe (JS
  # render + rotating geo-IP) is authoritative for `url_valid`, and a naive
  # Googlebot-spoofed GET from our single server IP is doomed against their
  # anti-bot — it only produces Faraday transport-error noise on AppSignal
  # (#5565 / #5386). Skip the GET and funnel through Oxylabs instead.
  return funnel_through_oxylabs(catalog_item) if catalog_item.catalog&.external_price_check_enabled?

  # Other catalogs keep the cheap naive GET, but still defer to a recent
  # probe when one happens to exist (same authoritative-result reasoning:
  # if Oxylabs with JS + geo-IP couldn't reach the page, our single-IP
  # Googlebot GET won't either, so re-probing only adds AppSignal noise).
  if recent_probe?(catalog_item)
    logger.info "Skipping URL check for #{catalog_item.id} — recent probe exists (authoritative)"
    return
  end

  res = nil
  begin
    res = http_get(catalog_item.url)
    logger.info "#{catalog_item.url} result was #{res.status}"
  rescue Faraday::ConnectionFailed, Faraday::TimeoutError, Faraday::SSLError => e
    # Network-level failure — couldn't determine whether the page exists.
    # Don't flip url_valid to false; the URL might be fine and the next
    # probe will tell us. Log and bail without touching the row.
    # Intentionally don't re-raise — Sidekiq retries would just hit the
    # same flaky upstream without new info.
    #
    # Report as `informational` (background_info, "Never Notify"), not
    # `warning`: this worker probes ~thousands of third-party retailer URLs
    # daily, many of which block our Googlebot-spoofed User-Agent or briefly
    # fail DNS. A ~33/day baseline of transport failures is expected
    # operational noise, not actionable per-occurrence. AppSignal incidents
    # #5012 (ConnectionFailed) and #5013 (TimeoutError) were closed 8 times
    # between May 4 and May 23 each with notes saying "working as designed"
    # — they kept reopening because background_warning is configured "First
    # Occurrence" notifications. background_info is the right channel:
    # still searchable, still trended in the dashboard, no pages.
    logger.warn "Transport error retrieving #{catalog_item.url}: #{e.class} #{e.message}"
    ErrorReporting.informational(e, source: :background, catalog_item_id: catalog_item.id, url: catalog_item.url)
    return nil
  end

  valid = res&.status == 200
  catalog_item.update_columns(url_valid: valid, url_last_checked: Time.current)
  valid
end
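The naive-GET branch has three possible outcomes, matching the Boolean? return type: true for a 200, false for any other status, and nil when the request fails at the transport level and the row is left untouched. A minimal sketch of that mapping, using plain Ruby in place of Faraday and ActiveRecord; `TransportError`, `check_url`, and the lambda fetchers are hypothetical names for illustration only.

```ruby
# Hypothetical stand-in for Faraday::ConnectionFailed / TimeoutError / SSLError.
class TransportError < StandardError; end

# Returns true or false when a status was obtained, and nil when the
# fetch failed before any status could be read.
def check_url(url, fetch:)
  status = fetch.call(url)
  status == 200
rescue TransportError
  nil
end

ok       = ->(_url) { 200 }
gone     = ->(_url) { 404 }
redirect = ->(_url) { 301 }
flaky    = ->(_url) { raise TransportError, 'connection refused' }

p check_url('https://example.com/a', fetch: ok)
p check_url('https://example.com/b', fetch: gone)
p check_url('https://example.com/c', fetch: redirect)
p check_url('https://example.com/d', fetch: flaky)
```

Only an exact 200 counts as valid in this mapping, so a redirect status is recorded as invalid, just as in the `res&.status == 200` comparison above.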
#test_costco ⇒ Integer
Manual smoke-test helper: probe a known Costco product page. If http_get
works against Costco it will work against most anyone.
# File 'app/workers/catalog_item_url_worker.rb', line 52
def test_costco
  http_get('https://www.costco.ca/warmlyyours-riviera-towel-warmer.product.100802733.html').status
end
#test_walmart ⇒ Integer
Manual smoke-test helper: probe a known Walmart product page.
# File 'app/workers/catalog_item_url_worker.rb', line 59
def test_walmart
  http_get('https://www.walmart.com/ip/Grande-10-Towel-Warmer-Black-Hardwired-10-Bars/595189227').status
end