Class: Retailer::Extractors::Base
- Inherits:
- Object
- Includes:
- CatalogConstants
- Defined in:
- app/services/retailer/extractors/base.rb
Overview
Base class for retailer data extractors.
Uses Nokogiri for HTML parsing as recommended by Oxylabs:
https://github.com/oxylabs/webscraping-with-ruby
Direct Known Subclasses
Amazon, BestbuyCanada, BuildCom, CanadianTire, Costco, Generic, HomeDepot, Houzz, Lowes, Rona, Walmart, Wayfair
Constant Summary
- RENDER_REQUIRED =
Whether this retailer requires JavaScript rendering for price extraction.
Override in subclasses to opt out (or override the constant value to false).
render: 'html' is roughly 5x more expensive at Oxylabs than non-rendered
requests. Most extractors currently set true to preserve historical
behavior; flipping to false per-retailer should be done one at a time
alongside a manual probe to confirm the page still parses.
true
Constants included from CatalogConstants
CatalogConstants::ALL_MAIN_CATALOG_IDS, CatalogConstants::AMAZON_CATALOG_IDS, CatalogConstants::AMAZON_CA_CATALOG_IDS, CatalogConstants::AMAZON_EU_CATALOG_IDS, CatalogConstants::AMAZON_NA_SELLER_IDS, CatalogConstants::AMAZON_SC_BE_CATALOG_ID, CatalogConstants::AMAZON_SC_CATALOG_IDS, CatalogConstants::AMAZON_SC_CA_CATALOG_ID, CatalogConstants::AMAZON_SC_DE_CATALOG_ID, CatalogConstants::AMAZON_SC_ES_CATALOG_ID, CatalogConstants::AMAZON_SC_FR_CATALOG_ID, CatalogConstants::AMAZON_SC_IT_CATALOG_ID, CatalogConstants::AMAZON_SC_NL_CATALOG_ID, CatalogConstants::AMAZON_SC_PL_CATALOG_ID, CatalogConstants::AMAZON_SC_SE_CATALOG_ID, CatalogConstants::AMAZON_SC_UK_CATALOG_ID, CatalogConstants::AMAZON_SC_US_CATALOG_ID, CatalogConstants::AMAZON_SELLER_IDS, CatalogConstants::AMAZON_US_CATALOG_IDS, CatalogConstants::AMAZON_VC_CATALOG_IDS, CatalogConstants::AMAZON_VC_CA_CATALOG_ID, CatalogConstants::AMAZON_VC_CA_CATALOG_IDS, CatalogConstants::AMAZON_VC_DIRECT_FULFILLMENT_CATALOG_IDS, CatalogConstants::AMAZON_VC_US_CATALOG_IDS, CatalogConstants::AMAZON_VC_US_WASN4_CATALOG_ID, CatalogConstants::AMAZON_VC_US_WAX7V_CATALOG_ID, CatalogConstants::AMAZON_VC_WAT0F_CA_CATALOG_ID, CatalogConstants::AMAZON_VC_WAT4D_CA_CATALOG_ID, CatalogConstants::AMAZON_VENDOR_CODE_TO_CATALOG_ID, CatalogConstants::BESTBUY_CANADA, CatalogConstants::BUILD_COM, CatalogConstants::CANADIAN_TIRE, CatalogConstants::CA_CATALOG_ID, CatalogConstants::COSTCO_CANADA, CatalogConstants::COSTCO_CATALOGS, CatalogConstants::COSTCO_USA, CatalogConstants::EU_CATALOG_ID, CatalogConstants::HOME_DEPOT_CANADA, CatalogConstants::HOME_DEPOT_CATALOGS, CatalogConstants::HOME_DEPOT_USA, CatalogConstants::HOUZZ, CatalogConstants::LOCALE_TO_CATALOG, CatalogConstants::LOWES_CANADA, CatalogConstants::LOWES_USA, CatalogConstants::RONA_CANADA, CatalogConstants::US_CATALOG_ID, CatalogConstants::WALMART_CATALOGS, CatalogConstants::WALMART_SELLER_CANADA, CatalogConstants::WALMART_SELLER_USA, CatalogConstants::WAYFAIR_CANADA, 
CatalogConstants::WAYFAIR_CATALOGS, CatalogConstants::WAYFAIR_GERMANY, CatalogConstants::WAYFAIR_USA
Instance Attribute Summary
-
#catalog ⇒ Object
readonly
Returns the value of attribute catalog.
-
#discovered_url ⇒ String?
readonly
Returns a discovered direct URL if found during extraction.
Class Method Summary
-
.render_value ⇒ String?
Returns the Oxylabs render payload value for this extractor.
Instance Method Summary
-
#catalog_base_url ⇒ String?
protected
Get the base URL for this catalog's retailer. Override in subclasses for specific domains.
-
#check_availability(html, unavailable_phrases = []) ⇒ Boolean
protected
Check if page indicates product is available.
-
#collect_product_identifiers(catalog_item) ⇒ Array<String>
Collect all product identifiers that can be used to validate the page.
-
#extract(check, content) ⇒ void
Extract data from content and populate the check record.
-
#extract_canonical_url(doc) ⇒ String?
protected
Extract canonical URL from page head or og:url meta tag.
-
#extract_json_ld_price(check, doc) ⇒ Object
protected
Extract price from JSON-LD structured data (schema.org). Most reliable method across retailers.
-
#extract_numeric_price(text) ⇒ Float?
protected
Extract numeric price from text.
-
#extract_price_from_selectors(check, doc, selectors) ⇒ Object
protected
Extract price from common CSS selectors.
-
#extract_product_link_from_search(doc, selectors) ⇒ String?
protected
Extract product link from search results.
-
#extract_title(doc) ⇒ String?
protected
Extract title from common selectors.
-
#initialize(catalog) ⇒ Base
constructor
A new instance of Base.
-
#make_absolute_url(href) ⇒ String
protected
Convert relative URL to absolute.
-
#parse_html(html) ⇒ Nokogiri::HTML::Document
protected
Parse HTML content with Nokogiri.
-
#source_name ⇒ String
Identifier for this extractor (used in check.scraper_source).
-
#valid_html?(content) ⇒ Boolean
protected
Validate that content is a non-blank String (the precondition for HTML parsing).
-
#valid_price?(price) ⇒ Boolean
protected
Validate that a price is reasonable (greater than 1 and less than 100,000).
-
#validate_product_identity(check, content, catalog_item) ⇒ Boolean
Validate that the scraped page actually contains our product identifiers.
Methods included from CatalogConstants
amazon_catalog?, amazon_seller_catalog?, costco_catalog?, home_depot_catalog?, walmart_catalog?, wayfair_catalog?
Constructor Details
#initialize(catalog) ⇒ Base
Returns a new instance of Base.
    # File 'app/services/retailer/extractors/base.rb', line 36
    def initialize(catalog)
      @catalog = catalog
    end
Instance Attribute Details
#catalog ⇒ Object (readonly)
Returns the value of attribute catalog.
    # File 'app/services/retailer/extractors/base.rb', line 34
    def catalog
      @catalog
    end
#discovered_url ⇒ String? (readonly)
Returns a discovered direct URL if found during extraction.
Used to capture and store canonical URLs for future direct access.
Override in subclasses that can discover URLs (e.g., from search results).
    # File 'app/services/retailer/extractors/base.rb', line 58
    def discovered_url
      @discovered_url
    end
Class Method Details
.render_value ⇒ String?
Returns the Oxylabs render payload value for this extractor.
    # File 'app/services/retailer/extractors/base.rb', line 30
    def self.render_value
      self::RENDER_REQUIRED ? 'html' : nil
    end
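A minimal sketch of how a subclass could opt out of rendering. The `RenderSketch` module, its stand-in `Base`, and the `StaticRetailer` and `RenderedRetailer` subclasses are hypothetical names used only for illustration; the real class lives in the Rails app. Because `render_value` reads `self::RENDER_REQUIRED`, the constant is resolved on the receiving class, so redefining it in a subclass should be enough.

```ruby
module RenderSketch
  # Stand-in for Retailer::Extractors::Base, reduced to the render logic.
  class Base
    RENDER_REQUIRED = true

    def self.render_value
      # self:: resolves the constant on the receiver, so subclass overrides win.
      self::RENDER_REQUIRED ? 'html' : nil
    end
  end

  # Hypothetical extractor whose pages parse without JavaScript rendering.
  class StaticRetailer < Base
    RENDER_REQUIRED = false
  end

  # Hypothetical extractor that keeps the inherited default.
  class RenderedRetailer < Base
  end
end

p RenderSketch::Base.render_value             # "html"
p RenderSketch::StaticRetailer.render_value   # nil
p RenderSketch::RenderedRetailer.render_value # "html"
```

A `nil` result presumably means the `render` key is left out of the Oxylabs payload; that caller-side handling is not shown in this page.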
Instance Method Details
#catalog_base_url ⇒ String? (protected)
Get base URL for this catalog's retailer
Override in subclasses for specific domains
    # File 'app/services/retailer/extractors/base.rb', line 118
    def catalog_base_url
      nil
    end
#check_availability(html, unavailable_phrases = []) ⇒ Boolean (protected)
Check if page indicates product is available
    # File 'app/services/retailer/extractors/base.rb', line 209
    def check_availability(html, unavailable_phrases = [])
      default_phrases = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available']
      phrases = unavailable_phrases + default_phrases

      phrases.none? { |phrase| html.include?(phrase) }
    end
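The phrase check can be tried in isolation. The sketch below copies the same logic into a standalone method (`available?` and the sample HTML strings are illustrative names, not part of the class) and shows that the match is case-sensitive, which is one reason a subclass might pass its own `unavailable_phrases`.

```ruby
DEFAULT_UNAVAILABLE_PHRASES = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available'].freeze

# Same logic as check_availability, as a standalone method.
def available?(html, unavailable_phrases = [])
  phrases = unavailable_phrases + DEFAULT_UNAVAILABLE_PHRASES
  phrases.none? { |phrase| html.include?(phrase) }
end

in_stock   = '<button>Add to Cart</button>'
sold_out   = '<span>Sold Out</span>'
lower_case = '<span>sold out</span>'

puts available?(in_stock)                 # true
puts available?(sold_out)                 # false
puts available?(lower_case)               # true, because String#include? is case-sensitive
puts available?(lower_case, ['sold out']) # false, retailer-specific phrase supplied
```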
#collect_product_identifiers(catalog_item) ⇒ Array<String>
Collect all product identifiers that can be used to validate the page
    # File 'app/services/retailer/extractors/base.rb', line 263
    def collect_product_identifiers(catalog_item)
      identifiers = []

      # Our internal SKU
      identifiers << catalog_item.sku

      # UPC from the parent item
      identifiers << catalog_item.store_item&.item&.upc

      # Third party number (retailer's part number)
      identifiers << catalog_item.third_party_part_number

      # Third party SKU (retailer-assigned / our marketplace SKU)
      identifiers << catalog_item.third_party_sku

      # Variant selector: the Wayfair piid that pins the exact size on a shared PDP
      identifiers << catalog_item.third_party_sku_variant_id

      # Parent SKU (e.g., WRM1245 for Wayfair variants)
      # This is used in search URLs and should appear on the page
      identifiers << catalog_item.parent_sku

      identifiers.compact.compact_blank.uniq
    end
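To illustrate the final clean-up step, here is a plain-Ruby sketch that does not depend on ActiveSupport. `SketchItem` is a hypothetical stand-in for the catalog item record with only a few of its fields, and the `reject` call is an approximation of `compact_blank`, not the library implementation.

```ruby
# Hypothetical stand-in for a catalog item; the real record is a Rails model.
SketchItem = Struct.new(:sku, :upc, :third_party_part_number, :third_party_sku, :parent_sku)

def collect_identifiers(item)
  candidates = [item.sku, item.upc, item.third_party_part_number,
                item.third_party_sku, item.parent_sku]
  # Approximation of compact_blank: drop nil and whitespace-only values.
  candidates.reject { |value| value.nil? || value.to_s.strip.empty? }.uniq
end

item = SketchItem.new('WRM1245-BLU', nil, '', 'WRM1245-BLU', 'WRM1245')
p collect_identifiers(item) # ["WRM1245-BLU", "WRM1245"]
```

The missing UPC, the empty part number, and the duplicated SKU are all dropped, leaving only values worth searching for on the page.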
#extract(check, content) ⇒ void
This method returns an undefined value.
Extract data from content and populate the check record
    # File 'app/services/retailer/extractors/base.rb', line 44
    def extract(check, content)
      raise NotImplementedError, 'Subclasses must implement #extract'
    end
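A sketch of the subclass contract, under stated assumptions: `ExtractSketch`, its `Check` struct, the `ExampleShop` class, and the `data-price` attribute are all hypothetical and exist only to show the shape of an override. Real extractors would normally parse the page with Nokogiri instead of a regular expression.

```ruby
module ExtractSketch
  # Hypothetical stand-in for the check record.
  Check = Struct.new(:price)

  # Stand-in for the abstract base behaviour.
  class Base
    def extract(check, content)
      raise NotImplementedError, 'Subclasses must implement #extract'
    end
  end

  # Hypothetical retailer that exposes the price in a data attribute.
  class ExampleShop < Base
    def extract(check, content)
      match = content.match(/data-price="([\d.]+)"/)
      check.price = match[1].to_f if match
      nil # the method is documented as void; callers read the check record
    end
  end
end

check = ExtractSketch::Check.new
ExtractSketch::ExampleShop.new.extract(check, '<span data-price="42.50">$42.50</span>')
p check.price # 42.5
```

Note that `NotImplementedError` descends from `ScriptError`, not `StandardError`, so a bare `rescue` in calling code will not catch an extractor that forgot to override `#extract`.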
#extract_canonical_url(doc) ⇒ String? (protected)
Extract canonical URL from page head or og:url meta tag
    # File 'app/services/retailer/extractors/base.rb', line 65
    def extract_canonical_url(doc)
      # Try rel="canonical" first (most reliable)
      canonical_el = doc.at_css('link[rel="canonical"]')
      if canonical_el
        url = canonical_el['href']
        return url if url.present? && url.start_with?('http')
      end

      # Try og:url meta tag
      og_url = doc.at_css('meta[property="og:url"]')
      if og_url
        url = og_url['content']
        return url if url.present? && url.start_with?('http')
      end

      nil
    end
#extract_json_ld_price(check, doc) ⇒ Object (protected)
Extract price from JSON-LD structured data (schema.org)
Most reliable method across retailers
    # File 'app/services/retailer/extractors/base.rb', line 140
    def extract_json_ld_price(check, doc)
      doc.css('script[type="application/ld+json"]').each do |script|
        data = JSON.parse(script.text)

        # Handle @graph structure
        data = data['@graph'].find { |item| item['offers'] } || data if data['@graph'].is_a?(Array)

        offers = data['offers']
        next unless offers

        # Handle array of offers
        offers = offers.first if offers.is_a?(Array)

        price = offers['price']
        check.price = extract_numeric_price(price) if price

        # Also try to get regular/high price
        high_price = offers['highPrice'] || data['highPrice']
        if high_price && check.price.present?
          regular = extract_numeric_price(high_price)
          check.regular_price = regular if regular && regular > check.price
        end

        break if check.price.present?
      rescue JSON::ParserError
        next
      end
    end
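The JSON-LD handling can be exercised without Nokogiri by feeding the script contents directly. The sketch below is an illustration, not the class's code: `json_ld_price` is a hypothetical helper, the sample payloads are made up, and the `is_a?(Hash)` guards are an addition so that top-level JSON arrays are skipped quietly.

```ruby
require 'json'

# Returns the raw offers price from one JSON-LD payload, or nil.
def json_ld_price(json_text)
  data = JSON.parse(json_text)

  # Some pages wrap entities in an @graph array; pick the one carrying offers.
  if data.is_a?(Hash) && data['@graph'].is_a?(Array)
    data = data['@graph'].find { |item| item['offers'] } || data
  end
  return nil unless data.is_a?(Hash) # added guard, not in the original method

  offers = data['offers']
  offers = offers.first if offers.is_a?(Array) # array of offers: use the first
  offers.is_a?(Hash) ? offers['price'] : nil
rescue JSON::ParserError
  nil
end

flat   = '{"@type":"Product","offers":{"price":"19.99","priceCurrency":"USD"}}'
graph  = '{"@graph":[{"@type":"WebPage"},{"@type":"Product","offers":[{"price":24.5}]}]}'
list   = '[{"@type":"BreadcrumbList"}]'
broken = '{not json'

p json_ld_price(flat)   # "19.99"
p json_ld_price(graph)  # 24.5
p json_ld_price(list)   # nil
p json_ld_price(broken) # nil
```

The price may arrive as a string or a number depending on the retailer, which is why the class passes it through `extract_numeric_price` before storing it.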
#extract_numeric_price(text) ⇒ Float? (protected)
Extract numeric price from text
    # File 'app/services/retailer/extractors/base.rb', line 172
    def extract_numeric_price(text)
      price_val = if text.is_a?(Numeric)
                    text.to_f
                  else
                    text.to_s.delete('^0-9.').to_f
                  end
      price_val if valid_price?(price_val)
    end
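A standalone sketch of the parsing rule, useful for seeing which inputs survive. The two methods mirror the class's helpers, with a plain `nil?` check standing in for ActiveSupport's `present?`; the sample inputs are illustrative.

```ruby
# Plain-Ruby copy of the range check (nil? replaces ActiveSupport's present?).
def valid_price?(price)
  !price.nil? && price > 1 && price < 100_000
end

# Keep only digits and dots, convert to Float, then apply the range check.
def extract_numeric_price(text)
  price_val = text.is_a?(Numeric) ? text.to_f : text.to_s.delete('^0-9.').to_f
  price_val if valid_price?(price_val)
end

p extract_numeric_price('$1,299.99')  # 1299.99
p extract_numeric_price(49)           # 49.0
p extract_numeric_price('Free')       # nil
p extract_numeric_price('$0.99')      # nil
p extract_numeric_price('1.299,99 €') # 1.29999
```

The last two cases are worth noting: prices of 1 or less are rejected by the range check, and European-style formatting (dot as thousands separator, comma as decimal) is misread because the comma is stripped and the dot is kept.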
#extract_price_from_selectors(check, doc, selectors) ⇒ Object (protected)
Extract price from common CSS selectors
    # File 'app/services/retailer/extractors/base.rb', line 192
    def extract_price_from_selectors(check, doc, selectors)
      selectors.each do |selector|
        el = doc.at_css(selector)
        next unless el

        price_val = extract_numeric_price(el['content'] || el.text)
        if valid_price?(price_val)
          check.price = price_val
          break
        end
      end
    end
#extract_product_link_from_search(doc, selectors) ⇒ String? (protected)
Extract product link from search results
    # File 'app/services/retailer/extractors/base.rb', line 87
    def extract_product_link_from_search(doc, selectors)
      selectors.each do |selector|
        link = doc.at_css(selector)
        next unless link

        href = link['href']
        next if href.blank?

        # Make absolute URL if relative
        return make_absolute_url(href) if href.present?
      end
      nil
    end
#extract_title(doc) ⇒ String? (protected)
Extract title from common selectors
    # File 'app/services/retailer/extractors/base.rb', line 219
    def extract_title(doc)
      title_el = doc.at_css('h1') || doc.at_css('[data-testid="product-title"]')
      title_el&.text&.strip&.truncate(255)
    end
#make_absolute_url(href) ⇒ String (protected)
Convert relative URL to absolute
    # File 'app/services/retailer/extractors/base.rb', line 104
    def make_absolute_url(href)
      return href if href.start_with?('http')

      base_url = catalog_base_url
      return href unless base_url

      URI.join(base_url, href).to_s
    rescue URI::InvalidURIError
      href
    end
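The fallbacks in this method are easier to see with concrete inputs. In the sketch below the base URL is passed as a second argument instead of coming from `catalog_base_url`, and `https://www.example.com` is a placeholder domain, not a real retailer setting.

```ruby
require 'uri'

# Standalone version: base_url is an argument rather than catalog_base_url.
def make_absolute_url(href, base_url)
  return href if href.start_with?('http')
  return href unless base_url

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href
end

base = 'https://www.example.com' # placeholder retailer domain

puts make_absolute_url('/p/widget-123', base)             # https://www.example.com/p/widget-123
puts make_absolute_url('https://cdn.example.com/x', base) # https://cdn.example.com/x
puts make_absolute_url('/p/widget-123', nil)              # /p/widget-123
puts make_absolute_url('/p/bad path', base)               # /p/bad path
```

The last two lines show that the return value is not guaranteed to be absolute: with no base URL, or with an href that `URI` cannot parse, the input comes back unchanged.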
#parse_html(html) ⇒ Nokogiri::HTML::Document (protected)
Parse HTML content with Nokogiri
    # File 'app/services/retailer/extractors/base.rb', line 125
    def parse_html(html)
      Nokogiri::HTML(html)
    end
#source_name ⇒ String
Identifier for this extractor (used in check.scraper_source)
    # File 'app/services/retailer/extractors/base.rb', line 50
    def source_name
      self.class.name.demodulize.underscore
    end
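For readers without a Rails console at hand, the naming rule can be approximated in plain Ruby. `source_name_for` is a hypothetical helper, and the two `gsub` calls only approximate ActiveSupport's `demodulize.underscore` (they ignore configured acronyms, for example).

```ruby
# Approximates "Some::Module::ClassName".demodulize.underscore without ActiveSupport.
def source_name_for(class_name)
  class_name.split('::').last
            .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')
            .gsub(/([a-z\d])([A-Z])/, '\1_\2')
            .downcase
end

puts source_name_for('Retailer::Extractors::HomeDepot')     # home_depot
puts source_name_for('Retailer::Extractors::BestbuyCanada') # bestbuy_canada
puts source_name_for('Retailer::Extractors::Amazon')        # amazon
```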
#valid_html?(content) ⇒ Boolean (protected)
Validate that content is a non-blank String (the precondition for HTML parsing)
    # File 'app/services/retailer/extractors/base.rb', line 132
    def valid_html?(content)
      content.is_a?(String) && content.present?
    end
#valid_price?(price) ⇒ Boolean (protected)
Validate that a price is reasonable (greater than 1 and less than 100,000)
    # File 'app/services/retailer/extractors/base.rb', line 184
    def valid_price?(price)
      price.present? && price > 1 && price < 100_000
    end
#validate_product_identity(check, content, catalog_item) ⇒ Boolean
Validate that the scraped page actually contains our product identifiers.
This prevents false positives where a retailer redirects to a different product.
    # File 'app/services/retailer/extractors/base.rb', line 236
    def validate_product_identity(check, content, catalog_item)
      identifiers = collect_product_identifiers(catalog_item)
      return true if identifiers.empty? # Skip validation if no identifiers available

      # Check if ANY of our identifiers appear in the page content or URL
      content_to_check = content.to_s.downcase
      url_to_check = check.url.to_s.downcase

      found = identifiers.any? do |identifier|
        next false if identifier.blank?

        normalized = identifier.to_s.downcase.strip
        content_to_check.include?(normalized) || url_to_check.include?(normalized)
      end

      unless found
        check.status = 'product_mismatch'
        check. = "Product identity validation failed: none of our identifiers (#{identifiers.compact.join(', ')}) found on page"
        Rails.logger.warn "[#{source_name}] Product mismatch for catalog_item #{catalog_item.id}: #{check.}"
      end

      found
    end
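The matching rule itself is a case-insensitive substring search over the page content and the URL. The sketch below isolates that rule; `SketchCheck`, `identity_found?`, and the sample URLs are hypothetical, and logging and the message assignment are left out.

```ruby
# Hypothetical stand-in for the check record; only the fields used here.
SketchCheck = Struct.new(:url, :status)

# True when any identifier appears in the page content or the URL.
def identity_found?(check, content, identifiers)
  return true if identifiers.empty? # nothing to validate against

  content_to_check = content.to_s.downcase
  url_to_check = check.url.to_s.downcase

  identifiers.any? do |identifier|
    normalized = identifier.to_s.downcase.strip
    next false if normalized.empty?

    content_to_check.include?(normalized) || url_to_check.include?(normalized)
  end
end

by_url     = SketchCheck.new('https://www.example.com/p/wrm1245', nil)
by_content = SketchCheck.new('https://www.example.com/p/other', nil)
wrong_page = SketchCheck.new('https://www.example.com/p/xyz999', nil)

p identity_found?(by_url, '<h1>Rug</h1>', ['WRM1245'])         # true
p identity_found?(by_content, '<p>SKU: abc-1</p>', ['ABC-1'])  # true

unless identity_found?(wrong_page, '<h1>Another rug</h1>', ['WRM1245'])
  wrong_page.status = 'product_mismatch'
end
p wrong_page.status # "product_mismatch"
```

Because this is a substring match, a very short identifier can match unrelated text on the page, so the check is best read as a guard against obvious redirects and not as proof of identity.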