Class: Retailer::RonaUrlDiscovery

Inherits:
Object
Defined in:
app/services/retailer/rona_url_discovery.rb

Overview

Discovers rona.ca product URLs for WarmlyYours catalog items and seeds each
item's scraped price and availability, using rona's public XML sitemap as the
source of candidate URLs.

Why the sitemap: rona.ca sits behind DataDome, which serves empty HTTP 200
pages for the on-site search to datacenter IPs, so the legacy search-based
discovery no longer works. The sitemap (advertised in robots.txt) is
bot-friendly and not gated, so it yields the full set of WarmlyYours product
URLs at no scraping cost. We then fetch each candidate product page through
Oxylabs (universal source with JS rendering, via the async endpoint, because
the realtime endpoint's 120s budget times out on rona) and read the
schema.org microdata:

# our manufacturer SKU

Matching the +mpn+ to our CatalogItem (by its item SKU) lets us store the
canonical product URL plus the authoritative scraped +retail_price+.
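
The microdata read described above could be sketched roughly as follows. This is an illustration only: the class's own extraction helper is not shown on this page, `extract_microdata` is a hypothetical name, and the markup shape it expects (`itemprop` attributes named `mpn`, `price`, and `priceCurrency`, with the value in either a `content` attribute or the element text) is an assumption rather than rona's confirmed markup. It uses plain regexes to stay self-contained; a real implementation may well use an HTML parser.

```ruby
# Hypothetical helper: collect schema.org microdata from rendered product HTML.
# Assumes values live in a `content` attribute or in the element's text.
def extract_microdata(html)
  props = {}
  # Match any opening tag that carries an itemprop attribute, plus trailing text.
  html.scan(/<(\w+)\b([^>]*\bitemprop=["'][^"']+["'][^>]*)>([^<]*)/i) do |_tag, attrs, text|
    name  = attrs[/\bitemprop=["']([^"']+)["']/i, 1]
    value = attrs[/\bcontent=["']([^"']*)["']/i, 1] || text.strip
    props[name] ||= value unless value.empty?
  end
  return nil unless props['mpn'] # without an mpn there is nothing to match on

  {
    mpn: props['mpn'],
    price: props['price'] && Float(props['price'], exception: false),
    currency: props['priceCurrency']
  }
end
```

Returning nil when no `mpn` is found mirrors the documented behaviour of treating a blocked or unparseable page as "no result".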

Coverage note: only ~22% of WarmlyYours rona URLs embed our SKU in the slug
(towel warmers, snow-melt cable); the TempZone heating lines use descriptive
slugs (e.g. "tempzone-3-ft-x-48-ft-240v-green-flex-roll"). Reading the +mpn+
off the rendered page is therefore the only reliable mapping for the full set.

rona's DataDome returns HTTP 403 for ~25% of product-page fetches, but the
full product HTML — including the schema.org price — is still in the body
(confirmed by Oxylabs), so we extract regardless of the status code. Geo is
country-level "Canada" (Oxylabs-recommended; consistent national pricing).

Examples:

Backfill every WarmlyYours rona item from the sitemap

Retailer::RonaUrlDiscovery.new.run

Preview without writing (and cap the candidate count)

Retailer::RonaUrlDiscovery.new.run(dry_run: true, limit: 10)

Constant Summary

SITEMAP_INDEX =

rona's sitemap index (advertised in https://www.rona.ca/robots.txt).

'https://www.rona.ca/sitemap.xml'
CRAWLER_UA =

Googlebot-style User-Agent. The sitemap and robots.txt are served to crawlers
without a DataDome challenge, so a plain HTTP GET suffices (no Oxylabs spend).

'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
PRODUCT_SITEMAP_RE =

The gzipped English product sub-sitemaps inside the index.

%r{<loc>(https://www\.rona\.ca/sitemap-products-en[^<]*\.xml\.gz)</loc>}
WARMLY_URL_RE =

WarmlyYours product URLs (rona renders the brand as "warmlyyours" in slugs).

%r{<loc>(https://www\.rona\.ca/en/product/[^<]*warmly[^<]*)</loc>}i
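
As a quick illustration of how the two patterns above behave, the sketch below applies them to sitemap fragments. The regexes are copied from the constants; the helper names `product_sitemaps` and `warmly_urls` are hypothetical, and any file names or product slugs passed in are up to the caller.

```ruby
# Hypothetical helpers wrapping the two documented patterns.

# Returns the gzipped English product sub-sitemap URLs found in a sitemap index.
def product_sitemaps(index_xml)
  index_xml.scan(%r{<loc>(https://www\.rona\.ca/sitemap-products-en[^<]*\.xml\.gz)</loc>}).flatten
end

# Returns the unique WarmlyYours product URLs found in one product sitemap.
# The /i flag makes "WarmlyYours", "warmlyyours", etc. all match.
def warmly_urls(sitemap_xml)
  sitemap_xml.scan(%r{<loc>(https://www\.rona\.ca/en/product/[^<]*warmly[^<]*)</loc>}i).flatten.uniq
end
```

Because each pattern has a single capture group, `scan` yields one-element arrays, which is why the results are flattened.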
MAX_RETRIES =

A genuinely empty body (Oxylabs fault / no HTML) is retried once; a 403 is
NOT retried — rona's DataDome still returns the full product HTML on a 403.

1
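
The retry rule above (retry an empty body once, never retry on status alone) could look something like the sketch below. `fetch_with_retry` and the `fetcher` callable are hypothetical stand-ins; the real fetch goes through Retailer::OxylabsApi, whose interface is not shown on this page.

```ruby
# Hypothetical sketch of the retry policy. `fetcher` is any callable that
# returns [status, body]; it stands in for the real Oxylabs request.
def fetch_with_retry(url, fetcher, max_retries: 1)
  attempts = 0
  loop do
    attempts += 1
    _status, body = fetcher.call(url)
    # The status is ignored on purpose: a 403 can still carry the product HTML.
    return body unless body.nil? || body.strip.empty?
    # Only a genuinely empty body counts as a failure worth retrying.
    return nil if attempts > max_retries
  end
end
```

With `max_retries: 1` an empty body results in at most two requests, while a 403 that carries HTML returns after the first.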
POLL_ATTEMPTS =

Async poll budget — rona's JS render can take well past the realtime 120s.

90
POLL_INTERVAL =
4
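
To make the poll budget concrete: assuming POLL_INTERVAL is in seconds (the page does not state the unit), 90 attempts spaced 4 seconds apart allow roughly six minutes of waiting, well past the realtime endpoint's 120s. A bounded polling loop under that assumption might be sketched as below; `poll_until_done` and its injectable `sleeper` are hypothetical and not part of the documented class.

```ruby
# Hypothetical bounded poll: yields the attempt index until the block returns
# a truthy result or the attempt budget is exhausted.
def poll_until_done(attempts: 90, interval: 4, sleeper: ->(seconds) { sleep(seconds) })
  attempts.times do |i|
    result = yield(i)
    return result if result
    # No point sleeping after the final attempt.
    sleeper.call(interval) if i < attempts - 1
  end
  nil
end
```

Injecting the sleeper keeps the loop testable without real delays.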


Constructor Details

#initialize(catalog_id: CatalogConstants::RONA_CANADA, logger: Rails.logger, api: nil) ⇒ RonaUrlDiscovery

Returns a new instance of RonaUrlDiscovery.

Parameters:

  • catalog_id (Integer) (defaults to: CatalogConstants::RONA_CANADA)

    the rona catalog (defaults to RONA_CANADA)

  • logger (Logger) (defaults to: Rails.logger)
  • api (Retailer::OxylabsApi, nil) (defaults to: nil)

    injectable for tests



# File 'app/services/retailer/rona_url_discovery.rb', line 63

def initialize(catalog_id: CatalogConstants::RONA_CANADA, logger: Rails.logger, api: nil)
  @catalog_id = catalog_id
  @logger = logger
  @api = api || Retailer::OxylabsApi.new(timeout: 60)
  @results = { candidates: 0, matched: 0, updated: 0, unmatched: 0, blocked: 0, errors: [] }
end

Instance Attribute Details

#results ⇒ Object (readonly)

Returns the value of attribute results.



# File 'app/services/retailer/rona_url_discovery.rb', line 58

def results
  @results
end

Instance Method Details

#discover_page(url) ⇒ Hash?

Discover (URL, price) for one product page. Reused by the on-failure
rediscovery worker. Returns nil when the page is blocked or unparseable.

Parameters:

  • url (String)

Returns:

  • (Hash, nil)

    mpn:, price:, currency:, available:, url:



# File 'app/services/retailer/rona_url_discovery.rb', line 113

def discover_page(url)
  content = fetch_product_page(url)
  return nil if content.blank?

  data = extract(content)
  data&.merge(url: url)
end

#discover_url(catalog_item) ⇒ String?

Re-find the product URL for a single item from the sitemap, with NO Oxylabs
spend — for the on-failure rediscovery path (Retailer::UrlRediscovery).
Matches by rona's product id (the stable trailing number in the item's
current URL, which survives slug renames), then falls back to a SKU-in-slug
match. Returns the URL or nil.

Parameters:

  • catalog_item (CatalogItem)

    the item to re-find; its current url and its item SKU drive the match

Returns:

  • (String, nil)


# File 'app/services/retailer/rona_url_discovery.rb', line 129

def discover_url(catalog_item)
  urls = sitemap_product_urls
  pid = catalog_item.url.to_s[%r{-(\d+)(?:[/?#]|$)}, 1]
  by_pid = pid && urls.find { |u| u.include?("-#{pid}") }
  return by_pid if by_pid

  sku = normalize(catalog_item.item&.sku)
  return nil if sku.blank?

  urls.find do |u|
    n = normalize(u)
    n.include?("-#{sku}-") || n.end_with?("-#{sku}")
  end
end
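
One thing to be aware of in the product-id step: a substring check such as `include?("-#{pid}")` also matches a URL whose trailing number merely starts with the same digits. If that matters in practice, a stricter comparison could extract the trailing id from each candidate with the same regex and compare for equality. The sketch below shows that alternative; `product_id` and `find_by_product_id` are hypothetical helpers, not methods of this class.

```ruby
# Hypothetical stricter variant of the product-id match.

# Extracts the trailing numeric id, using the same pattern as discover_url.
def product_id(url)
  url.to_s[%r{-(\d+)(?:[/?#]|$)}, 1]
end

# Finds the candidate whose trailing id equals pid exactly.
def find_by_product_id(urls, pid)
  return nil if pid.nil?

  urls.find { |u| product_id(u) == pid }
end
```

Whether the looser substring match ever produces a wrong hit depends on how rona assigns ids, which this page does not say.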

#run(dry_run: false, limit: nil) ⇒ Hash

Backfill product URLs (and prices) for every WarmlyYours rona item.

Parameters:

  • dry_run (Boolean) (defaults to: false)

    when true, log intended changes without writing

  • limit (Integer, nil) (defaults to: nil)

    cap candidate URLs processed (smoke tests)

Returns:

  • (Hash)

    results summary



# File 'app/services/retailer/rona_url_discovery.rb', line 75

def run(dry_run: false, limit: nil)
  @dry_run = dry_run
  urls = sitemap_product_urls
  urls = urls.first(limit) if limit
  @results[:candidates] = urls.size
  items_by_sku = active_items_by_sku

  @logger.info "[RonaUrlDiscovery] #{urls.size} candidate URLs, #{items_by_sku.size} active items#{' [DRY RUN]' if dry_run}"

  urls.each_with_index do |url, index|
    process_candidate(url, items_by_sku)
    sleep(1) if index < urls.size - 1
  end

  @logger.info "[RonaUrlDiscovery] Done: #{@results.except(:errors)}"
  @results
end

#sitemap_product_urls ⇒ Array<String>

Fetch + parse the WarmlyYours product URLs from rona's sitemap. Public so the
failure-triggered rediscovery path and tests can reuse the candidate list.

Returns:

  • (Array<String>)

    absolute product URLs



# File 'app/services/retailer/rona_url_discovery.rb', line 97

def sitemap_product_urls
  index = http_get(SITEMAP_INDEX)
  index.scan(PRODUCT_SITEMAP_RE).flatten.flat_map do |gz_url|
    http_get(gz_url, gzip: true).scan(WARMLY_URL_RE).flatten
  end.uniq
rescue StandardError => e
  @logger.error "[RonaUrlDiscovery] sitemap fetch failed: #{e.class} #{e.message}"
  @results[:errors] << "sitemap: #{e.message}"
  []
end
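
The private `http_get` helper called above is not shown on this page. A minimal sketch of what a plain GET with optional gunzip could look like, using only the Ruby standard library, is given below. The names `http_get_sketch` and `gunzip_if_needed` are hypothetical, and the error handling is an assumption.

```ruby
require 'net/http'
require 'uri'
require 'zlib'

# Decompresses gzip data; returns the input unchanged if it is not gzip.
# Net::HTTP may already have decoded a gzip Content-Encoding, so the gzip
# magic bytes (0x1F 0x8B) are checked before inflating.
def gunzip_if_needed(bytes)
  return bytes unless bytes.getbyte(0) == 0x1F && bytes.getbyte(1) == 0x8B

  Zlib.gunzip(bytes)
end

# Hypothetical stand-in for the class's private http_get.
def http_get_sketch(url, user_agent:, gzip: false)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri, 'User-Agent' => user_agent)
  end
  raise "HTTP #{response.code} for #{url}" unless response.is_a?(Net::HTTPSuccess)

  gzip ? gunzip_if_needed(response.body) : response.body
end
```

Raising on a non-2xx response fits the `rescue StandardError` in `sitemap_product_urls`, which logs the failure and returns an empty list.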