Class: Retailer::RonaUrlDiscovery
- Inherits: Object
- Defined in: app/services/retailer/rona_url_discovery.rb
Overview
Discovers rona.ca product URLs (and seeds the scraped price/availability) for
WarmlyYours catalog items, seeded from rona's public XML sitemap.
Why the sitemap: rona.ca sits behind DataDome, which serves empty HTTP-200
pages for the on-site search to datacenter IPs — so the legacy search-based
discovery is dead. The sitemap (advertised in robots.txt) is bot-friendly and
NOT gated, giving the full universe of WarmlyYours product URLs for free. We
then fetch each candidate product page (universal source + JS render, via the
ASYNC endpoint — the realtime endpoint's 120s budget times out on rona) and
read the schema.org microdata:
# our manufacturer SKU
Matching the +mpn+ to our CatalogItem (by its item SKU) lets us store the
canonical product URL plus the authoritative scraped +retail_price+.
Coverage note: only ~22% of WarmlyYours rona URLs embed our SKU in the slug
(towel warmers, snow-melt cable); the TempZone heating lines use descriptive
slugs (e.g. "tempzone-3-ft-x-48-ft-240v-green-flex-roll"). Reading the +mpn+
off the rendered page is therefore the only reliable mapping for the full set.
rona's DataDome returns HTTP 403 for ~25% of product-page fetches, but the
full product HTML — including the schema.org price — is still in the body
(confirmed by Oxylabs), so we extract regardless of the status code. Geo is
country-level "Canada" (Oxylabs-recommended; consistent national pricing).
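As a rough illustration of the microdata read described above, the sketch below pulls +mpn+ and +price+ out of a rendered HTML body with plain regexes. The helper names (microdata_value, extract_product), the sample SKU, and the assumed markup shapes (a content="..." attribute or plain element text) are all hypothetical; the class's private extractor is not shown on this page, and rona's real markup may differ. The functions look only at the body, which mirrors the note that a 403 response can still carry usable product HTML.

```ruby
# Hypothetical helper, not part of Retailer::RonaUrlDiscovery.
# Assumes a schema.org property is exposed either as
#   <meta itemprop="price" content="249.99">   (content attribute), or
#   <span itemprop="mpn">SKU</span>            (element text).
def microdata_value(html, prop)
  name = Regexp.escape(prop)
  from_attr = html[/itemprop=["']#{name}["'][^>]*\bcontent=["']([^"']*)["']/i, 1]
  return from_attr.strip if from_attr

  from_text = html[/itemprop=["']#{name}["'][^>]*>([^<]*)</i, 1]
  from_text&.strip
end

# Returns { mpn:, retail_price: } or nil when no mpn is present.
# The HTTP status is deliberately not consulted here.
def extract_product(html)
  mpn = microdata_value(html, 'mpn')
  price = microdata_value(html, 'price')
  return nil if mpn.nil? || mpn.empty?

  { mpn: mpn, retail_price: price && Float(price, exception: false) }
end

# Made-up sample body for illustration only.
sample = '<div itemscope itemtype="https://schema.org/Product">' \
         '<span itemprop="mpn"> TWS1-ABC </span>' \
         '<meta itemprop="price" content="249.99"></div>'
product = extract_product(sample)
```

A production extractor would more likely use an HTML parser than regexes; the regex form is used here only to keep the sketch dependency-free.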
Constant Summary
- SITEMAP_INDEX =
rona's sitemap index (advertised in https://www.rona.ca/robots.txt).
'https://www.rona.ca/sitemap.xml'
- CRAWLER_UA =
Googlebot-shaped UA: the sitemap and robots.txt are served to crawlers without a
DataDome challenge, so a plain HTTP GET suffices (no Oxylabs spend).
'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
- PRODUCT_SITEMAP_RE =
The gzipped English product sub-sitemaps inside the index.
%r{<loc>(https://www\.rona\.ca/sitemap-products-en[^<]*\.xml\.gz)</loc>}
- WARMLY_URL_RE =
WarmlyYours product URLs (rona renders the brand as "warmlyyours" in slugs).
%r{<loc>(https://www\.rona\.ca/en/product/[^<]*warmly[^<]*)</loc>}i
- MAX_RETRIES =
A genuinely empty body (Oxylabs fault / no HTML) is retried once; a 403 is
NOT retried, because rona's DataDome still returns the full product HTML on a 403.
1
- POLL_ATTEMPTS =
Async poll budget: rona's JS render can take well past the realtime 120s.
90
- POLL_INTERVAL =
4
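To make the poll budget concrete: if POLL_INTERVAL is in seconds (an assumption; the unit is not stated here), 90 attempts at 4 s apart allow roughly 90 × 4 = 360 s, about three times the realtime endpoint's 120 s. The loop below is a minimal sketch of such a budgeted poll. The method name poll_until_done and the injectable sleeper are hypothetical and do not reflect how the class or Retailer::OxylabsApi actually polls.

```ruby
POLL_ATTEMPTS = 90
POLL_INTERVAL = 4 # assumed to be seconds

# Hypothetical poll loop: the block returns the finished result,
# or nil/false while the async job is still pending.
# `sleeper` is injectable so the loop can be exercised without real waits.
def poll_until_done(attempts: POLL_ATTEMPTS, interval: POLL_INTERVAL, sleeper: method(:sleep))
  attempts.times do |i|
    result = yield
    return result if result

    # No point sleeping after the final attempt.
    sleeper.call(interval) if i < attempts - 1
  end
  nil # budget exhausted
end

# Usage with a fake job that completes on the third check.
checks = 0
waits = []
outcome = poll_until_done(attempts: 5, sleeper: ->(s) { waits << s }) do
  checks += 1
  checks == 3 ? :done : nil
end
```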
Instance Attribute Summary
- #results ⇒ Object (readonly)
Returns the value of attribute results.
Instance Method Summary
- #discover_page(url) ⇒ Hash?
Discover (URL, price) for one product page.
- #discover_url(catalog_item) ⇒ String?
Re-find the product URL for a single item from the sitemap, with NO Oxylabs spend; used by the on-failure rediscovery path (Retailer::UrlRediscovery).
- #initialize(catalog_id: CatalogConstants::RONA_CANADA, logger: Rails.logger, api: nil) ⇒ RonaUrlDiscovery (constructor)
A new instance of RonaUrlDiscovery.
- #run(dry_run: false, limit: nil) ⇒ Hash
Backfill product URLs (and prices) for every WarmlyYours rona item.
- #sitemap_product_urls ⇒ Array<String>
Fetch + parse the WarmlyYours product URLs from rona's sitemap.
Constructor Details
#initialize(catalog_id: CatalogConstants::RONA_CANADA, logger: Rails.logger, api: nil) ⇒ RonaUrlDiscovery
Returns a new instance of RonaUrlDiscovery.
# File 'app/services/retailer/rona_url_discovery.rb', line 63
def initialize(catalog_id: CatalogConstants::RONA_CANADA, logger: Rails.logger, api: nil)
  @catalog_id = catalog_id
  @logger = logger
  @api = api || Retailer::OxylabsApi.new(timeout: 60)
  @results = { candidates: 0, matched: 0, updated: 0, unmatched: 0, blocked: 0, errors: [] }
end
Instance Attribute Details
#results ⇒ Object (readonly)
Returns the value of attribute results.
# File 'app/services/retailer/rona_url_discovery.rb', line 58
def results
  @results
end
Instance Method Details
#discover_page(url) ⇒ Hash?
Discover (URL, price) for one product page. Reused by the on-failure
rediscovery worker. Returns nil when the page is blocked or unparseable.
# File 'app/services/retailer/rona_url_discovery.rb', line 113
def discover_page(url)
  content = fetch_product_page(url)
  return nil if content.blank?

  data = extract(content)
  data&.merge(url: url)
end
#discover_url(catalog_item) ⇒ String?
Re-find the product URL for a single item from the sitemap, with NO Oxylabs
spend — for the on-failure rediscovery path (Retailer::UrlRediscovery).
Matches by rona's product id (the stable trailing number in the item's
current URL, which survives slug renames), then falls back to a SKU-in-slug
match. Returns the URL or nil.
# File 'app/services/retailer/rona_url_discovery.rb', line 129
def discover_url(catalog_item)
  urls = sitemap_product_urls
  pid = catalog_item.url.to_s[%r{-(\d+)(?:[/?#]|$)}, 1]
  by_pid = pid && urls.find { |u| u.include?("-#{pid}") }
  return by_pid if by_pid

  sku = normalize(catalog_item.item&.sku)
  return nil if sku.blank?

  urls.find do |u|
    n = normalize(u)
    n.include?("-#{sku}-") || n.end_with?("-#{sku}")
  end
end
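The two-stage match above (product id first, SKU-in-slug second) can be tried in isolation with the sketch below. The +normalize+ stand-in is an assumption, since the class's private helper is not shown on this page; here it lowercases and collapses non-alphanumerics to hyphens. The URLs, product ids, and SKU are made up, and match_url and product_id are hypothetical names.

```ruby
# Assumed stand-in for the class's private `normalize` helper.
def normalize(value)
  value.to_s.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
end

# The trailing numeric id of a product URL, using the same pattern as discover_url.
def product_id(url)
  url.to_s[%r{-(\d+)(?:[/?#]|$)}, 1]
end

# Hypothetical free-function version of the matching logic.
def match_url(urls, current_url:, sku:)
  pid = product_id(current_url)
  by_pid = pid && urls.find { |u| u.include?("-#{pid}") }
  return by_pid if by_pid

  key = normalize(sku)
  return nil if key.empty?

  urls.find do |u|
    n = normalize(u)
    n.include?("-#{key}-") || n.end_with?("-#{key}")
  end
end

# Made-up candidate list for illustration.
candidates = [
  'https://www.rona.ca/en/product/warmlyyours-towel-warmer-tw-f10bs-hw-330001',
  'https://www.rona.ca/en/product/warmlyyours-tempzone-flex-roll-330002'
]

# Slug was renamed but the trailing id survived: matched by id.
renamed = match_url(candidates,
                    current_url: 'https://www.rona.ca/en/product/old-slug-330002',
                    sku: nil)

# No stored URL at all: falls back to the SKU embedded in the slug.
by_sku = match_url(candidates, current_url: nil, sku: 'TW-F10BS-HW')
```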
#run(dry_run: false, limit: nil) ⇒ Hash
Backfill product URLs (and prices) for every WarmlyYours rona item.
# File 'app/services/retailer/rona_url_discovery.rb', line 75
def run(dry_run: false, limit: nil)
  @dry_run = dry_run
  urls = sitemap_product_urls
  urls = urls.first(limit) if limit
  @results[:candidates] = urls.size

  items_by_sku = active_items_by_sku
  @logger.info "[RonaUrlDiscovery] #{urls.size} candidate URLs, #{items_by_sku.size} active items#{' [DRY RUN]' if dry_run}"

  urls.each_with_index do |url, index|
    process_candidate(url, items_by_sku)
    sleep(1) if index < urls.size - 1
  end

  @logger.info "[RonaUrlDiscovery] Done: #{@results.except(:errors)}"
  @results
end
#sitemap_product_urls ⇒ Array<String>
Fetch + parse the WarmlyYours product URLs from rona's sitemap. Public so the
failure-triggered rediscovery path and tests can reuse the candidate list.
# File 'app/services/retailer/rona_url_discovery.rb', line 97
def sitemap_product_urls
  index = http_get(SITEMAP_INDEX)
  index.scan(PRODUCT_SITEMAP_RE).flatten.flat_map do |gz_url|
    http_get(gz_url, gzip: true).scan(WARMLY_URL_RE).flatten
  end.uniq
rescue StandardError => e
  @logger.error "[RonaUrlDiscovery] sitemap fetch failed: #{e.class} #{e.message}"
  @results[:errors] << "sitemap: #{e.message}"
  []
end
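The index-then-sub-sitemap scan can be exercised offline with the sketch below, which reuses the two regex constants from this page against in-memory XML. The fetcher lambda stands in for the private +http_get+ helper, whose real behaviour (headers, gzip handling) is not shown here and is assumed; the XML snippets and the name warmly_urls are made up for illustration.

```ruby
require 'zlib'

PRODUCT_SITEMAP_RE = %r{<loc>(https://www\.rona\.ca/sitemap-products-en[^<]*\.xml\.gz)</loc>}
WARMLY_URL_RE = %r{<loc>(https://www\.rona\.ca/en/product/[^<]*warmly[^<]*)</loc>}i

# Hypothetical helper: `fetch_gz` is any callable that returns the gzipped
# bytes for a sub-sitemap URL (an assumed stand-in for http_get(url, gzip: true)).
def warmly_urls(index_xml, fetch_gz)
  index_xml.scan(PRODUCT_SITEMAP_RE).flatten.flat_map do |gz_url|
    Zlib.gunzip(fetch_gz.call(gz_url)).scan(WARMLY_URL_RE).flatten
  end.uniq
end

# Made-up index: one English product sub-sitemap and one French one.
index_xml = '<sitemapindex>' \
            '<sitemap><loc>https://www.rona.ca/sitemap-products-en-1.xml.gz</loc></sitemap>' \
            '<sitemap><loc>https://www.rona.ca/sitemap-products-fr-1.xml.gz</loc></sitemap>' \
            '</sitemapindex>'

# Made-up sub-sitemap containing a duplicate and an unrelated brand.
sub_xml = '<urlset>' \
          '<url><loc>https://www.rona.ca/en/product/warmlyyours-sample-1</loc></url>' \
          '<url><loc>https://www.rona.ca/en/product/other-brand-2</loc></url>' \
          '<url><loc>https://www.rona.ca/en/product/warmlyyours-sample-1</loc></url>' \
          '</urlset>'

requested = []
fake_fetch = lambda do |url|
  requested << url
  Zlib.gzip(sub_xml)
end

found = warmly_urls(index_xml, fake_fetch)
```

Only the English sub-sitemap is requested, and the duplicate product URL collapses to a single entry through +uniq+.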