Class: CatalogItemUrlWorker

Inherits:
Object
Includes:
Sidekiq::Job
Defined in:
app/workers/catalog_item_url_worker.rb

Overview

Sidekiq worker that validates a CatalogItem's external retailer URL.

Catalogs in the Oxylabs price-check rotation defer to the authoritative
retailer probe (see #perform); every other catalog gets a cheap
Googlebot-shaped HTTP GET, and +url_valid+ is recorded from the response
status (true only for a 200).

Constant Summary

HEADERS =

Request headers originally extracted from a Chrome inspector request and replayed as a curl request.

{
  'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
}.freeze
FUNNEL_ENQUEUE_PAUSED_CATALOG_IDS =

Oxylabs-rotation catalogs whose probe is currently unreliable: we still
defer to the probe (skip the naive GET) but do NOT actively enqueue a new
one — it would just burn Oxylabs budget on failures. Remove a catalog here
once its probe is healthy again.
rona.ca re-enabled: it now probes via the universal source + JS render
against sitemap-discovered product URLs (Retailer::RonaUrlDiscovery), and a
stale URL self-heals via the on-failure rediscovery path
(Retailer::UrlRediscovery) before ProbeAutoSkipper would give up on it.

See Also:

  • Oxylabs follow-up (Basecamp todolist 498301955)
[
  CatalogConstants::LOWES_CANADA
].freeze
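
The gating this constant describes can be sketched roughly as below. This is only an illustration under assumptions: funnel_through_oxylabs is not shown on this page, so EXAMPLE_PAUSED_CATALOG_IDS, enqueue_probe? and the catalog IDs are hypothetical stand-ins rather than the worker's real code.

```ruby
# Hypothetical stand-in for FUNNEL_ENQUEUE_PAUSED_CATALOG_IDS; the IDs are made up.
EXAMPLE_PAUSED_CATALOG_IDS = [42].freeze

# Returns true when a new probe should be enqueued for a rotation catalog.
# A paused catalog still skips the naive GET, but no new probe is queued for it.
def enqueue_probe?(catalog_id, paused: EXAMPLE_PAUSED_CATALOG_IDS)
  !paused.include?(catalog_id)
end

enqueue_probe?(42) # paused catalog: no new probe is enqueued
enqueue_probe?(7)  # healthy catalog: a probe is enqueued
```

Removing an ID from the list is then the only change needed to resume enqueueing for that catalog.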

Instance Method Details

#http_get(url) ⇒ Faraday::Response

Fetches +url+ with the Googlebot-shaped User-Agent, the headers returned by
WebBotAuth::Signer.headers_for, and 10-second open and request timeouts.

Parameters:

  • url (String)

    the absolute URL to probe

Returns:

  • (Faraday::Response)


# File 'app/workers/catalog_item_url_worker.rb', line 39

def http_get(url)
  headers = HEADERS.transform_keys(&:to_s)
                   .merge(WebBotAuth::Signer.headers_for(url: url))
  Faraday.new(
    headers: headers,
    request: { open_timeout: 10, timeout: 10 }
  ) { |f| f.adapter Faraday.default_adapter }.get(url)
end
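
The header handling in #http_get can be tried in isolation. The sketch below uses only plain Ruby: EXAMPLE_HEADERS copies the HEADERS value shown above, while fake_signed_headers and its X-Example-Signature header are hypothetical placeholders, because the real output of WebBotAuth::Signer.headers_for is not shown on this page.

```ruby
# Same literal as HEADERS: the `'User-Agent':` syntax creates a Symbol key.
EXAMPLE_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
}.freeze

# Hypothetical placeholder for WebBotAuth::Signer.headers_for(url: ...);
# the header name and value are invented for illustration.
def fake_signed_headers(url:)
  { 'X-Example-Signature' => "signed:#{url}" }
end

# Mirrors the merge in #http_get: stringify the keys, then add the signed headers.
def request_headers(url)
  EXAMPLE_HEADERS.transform_keys(&:to_s).merge(fake_signed_headers(url: url))
end

request_headers('https://example.com/p/1').keys
```

Stringifying the keys first keeps the merged hash uniform, so a signed header with a String key cannot end up alongside a Symbol key of the same name.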

#perform(catalog_item_id) ⇒ Boolean?

Validates the catalog item's URL. Oxylabs-rotation catalogs are funnelled
through the authoritative retailer probe (the naive GET is skipped); other
catalogs get the naive GET unless a recent probe exists, in which case the
probe result stands. Blank URLs are no-ops; items flagged skip_url_checks are
filtered out by the lookup scope and never probed.

Parameters:

  • catalog_item_id (Integer)

    the CatalogItem to probe

Returns:

  • (Boolean, nil)

    +url_valid+ (true/false) after a completed naive GET;
    nil when funnelled to Oxylabs, skipped, or left inconclusive by a transport error



# File 'app/workers/catalog_item_url_worker.rb', line 71

def perform(catalog_item_id)
  catalog_item = CatalogItem.where.not(skip_url_checks: true).find(catalog_item_id)
  return if catalog_item.url.blank?

  # Catalogs in the Oxylabs price-check rotation: the retailer probe (JS
  # render + rotating geo-IP) is authoritative for `url_valid`, and a naive
  # Googlebot-spoofed GET from our single server IP is doomed against their
  # anti-bot — it only produces Faraday transport-error noise on AppSignal
  # (#5565 / #5386). Skip the GET and funnel through Oxylabs instead.
  return funnel_through_oxylabs(catalog_item) if catalog_item.catalog&.external_price_check_enabled?

  # Other catalogs keep the cheap naive GET, but still defer to a recent
  # probe when one happens to exist (same authoritative-result reasoning:
  # if Oxylabs with JS + geo-IP couldn't reach the page, our single-IP
  # Googlebot GET won't either, so re-probing only adds AppSignal noise).
  if recent_probe?(catalog_item)
    logger.info "Skipping URL check for #{catalog_item.id} — recent probe exists (authoritative)"
    return
  end

  res = nil
  begin
    res = http_get(catalog_item.url)
    logger.info "#{catalog_item.url} result was #{res.status}"
  rescue Faraday::ConnectionFailed, Faraday::TimeoutError, Faraday::SSLError => e
    # Network-level failure — couldn't determine whether the page exists.
    # Don't flip url_valid to false; the URL might be fine and the next
    # probe will tell us. Log and bail without touching the row.
    # Intentionally don't re-raise — Sidekiq retries would just hit the
    # same flaky upstream without new info.
    #
    # Report as `informational` (background_info, "Never Notify"), not
    # `warning`: this worker probes ~thousands of third-party retailer URLs
    # daily, many of which block our Googlebot-spoofed User-Agent or briefly
    # fail DNS. A ~33/day baseline of transport failures is expected
    # operational noise, not actionable per-occurrence. AppSignal incidents
    # #5012 (ConnectionFailed) and #5013 (TimeoutError) were closed 8 times
    # between May 4 and May 23 each with notes saying "working as designed"
    # — they kept reopening because background_warning is configured "First
    # Occurrence" notifications. background_info is the right channel:
    # still searchable, still trended in the dashboard, no pages.
    logger.warn "Transport error retrieving #{catalog_item.url}: #{e.class} #{e.message}"
    ErrorReporting.informational(e, source: :background, catalog_item_id: catalog_item.id, url: catalog_item.url)
    return nil
  end

  valid = res&.status == 200
  catalog_item.update_columns(url_valid: valid, url_last_checked: Time.current)
  valid
end
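
The branching in #perform can be summarised as a small decision function. This is a simplified sketch, not the worker itself: ItemSketch, perform_branch and url_valid_for are hypothetical names, and the boolean fields stand in for catalog&.external_price_check_enabled? and recent_probe?.

```ruby
# Simplified stand-in for a CatalogItem; all field names are illustrative.
ItemSketch = Struct.new(:url, :in_rotation, :recent_probe, keyword_init: true)

# Returns the branch of #perform that an item would take, in the same order
# as the guards in the method body.
def perform_branch(item)
  return :blank_url      if item.url.nil? || item.url.strip.empty?
  return :funnel_oxylabs if item.in_rotation
  return :defer_to_probe if item.recent_probe

  :naive_get
end

# Maps the status of a completed GET to the value recorded in url_valid.
def url_valid_for(status)
  status == 200
end

perform_branch(ItemSketch.new(url: 'https://example.com', in_rotation: false, recent_probe: false))
```

Only the :naive_get branch writes url_valid; a transport error on that branch leaves the row untouched and returns nil, and any status other than 200 (for example 404 or 301) is recorded as false.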

#test_costco ⇒ Integer

Manual smoke-test helper: probe a known Costco product page. If http_get
works against Costco it will work against most anyone.

Returns:

  • (Integer)

    the HTTP status code



# File 'app/workers/catalog_item_url_worker.rb', line 52

def test_costco
  http_get('https://www.costco.ca/warmlyyours-riviera-towel-warmer.product.100802733.html').status
end

#test_walmart ⇒ Integer

Manual smoke-test helper: probe a known Walmart product page.

Returns:

  • (Integer)

    the HTTP status code



# File 'app/workers/catalog_item_url_worker.rb', line 59

def test_walmart
  http_get('https://www.walmart.com/ip/Grande-10-Towel-Warmer-Black-Hardwired-10-Bars/595189227').status
end