Class: Retailer::Extractors::Base
- Inherits:
- Object
- Retailer::Extractors::Base
- Includes:
- CatalogConstants
- Defined in:
- app/services/retailer/extractors/base.rb
Overview
Base class for retailer data extractors.
Uses Nokogiri for HTML parsing as recommended by Oxylabs:
https://github.com/oxylabs/webscraping-with-ruby
Direct Known Subclasses
Amazon, BestbuyCanada, BuildCom, CanadianTire, Costco, Generic, HomeDepot, Houzz, Lowes, Rona, Walmart, Wayfair
Constant Summary
Constants included from CatalogConstants
CatalogConstants::ALL_MAIN_CATALOG_IDS, CatalogConstants::AMAZON_CATALOG_IDS, CatalogConstants::AMAZON_CA_CATALOG_IDS, CatalogConstants::AMAZON_EU_CATALOG_IDS, CatalogConstants::AMAZON_NA_SELLER_IDS, CatalogConstants::AMAZON_SC_BE_CATALOG_ID, CatalogConstants::AMAZON_SC_CATALOG_IDS, CatalogConstants::AMAZON_SC_CA_CATALOG_ID, CatalogConstants::AMAZON_SC_DE_CATALOG_ID, CatalogConstants::AMAZON_SC_ES_CATALOG_ID, CatalogConstants::AMAZON_SC_FR_CATALOG_ID, CatalogConstants::AMAZON_SC_IT_CATALOG_ID, CatalogConstants::AMAZON_SC_NL_CATALOG_ID, CatalogConstants::AMAZON_SC_PL_CATALOG_ID, CatalogConstants::AMAZON_SC_SE_CATALOG_ID, CatalogConstants::AMAZON_SC_UK_CATALOG_ID, CatalogConstants::AMAZON_SC_US_CATALOG_ID, CatalogConstants::AMAZON_SELLER_IDS, CatalogConstants::AMAZON_US_CATALOG_IDS, CatalogConstants::AMAZON_VC_CATALOG_IDS, CatalogConstants::AMAZON_VC_CA_CATALOG_ID, CatalogConstants::AMAZON_VC_CA_CATALOG_IDS, CatalogConstants::AMAZON_VC_DIRECT_FULFILLMENT_CATALOG_IDS, CatalogConstants::AMAZON_VC_US_CATALOG_IDS, CatalogConstants::AMAZON_VC_US_WASN4_CATALOG_ID, CatalogConstants::AMAZON_VC_US_WAX7V_CATALOG_ID, CatalogConstants::AMAZON_VC_WAT0F_CA_CATALOG_ID, CatalogConstants::AMAZON_VC_WAT4D_CA_CATALOG_ID, CatalogConstants::AMAZON_VENDOR_CODE_TO_CATALOG_ID, CatalogConstants::BESTBUY_CANADA, CatalogConstants::BUILD_COM, CatalogConstants::CANADIAN_TIRE, CatalogConstants::CA_CATALOG_ID, CatalogConstants::COSTCO_CANADA, CatalogConstants::COSTCO_CATALOGS, CatalogConstants::COSTCO_USA, CatalogConstants::EU_CATALOG_ID, CatalogConstants::HOME_DEPOT_CANADA, CatalogConstants::HOME_DEPOT_CATALOGS, CatalogConstants::HOME_DEPOT_USA, CatalogConstants::HOUZZ, CatalogConstants::LOCALE_TO_CATALOG, CatalogConstants::LOWES_CANADA, CatalogConstants::LOWES_USA, CatalogConstants::MARKETPLACE_CATALOGS, CatalogConstants::PRICE_CHECK_ENABLED_CATALOGS, CatalogConstants::RONA_CANADA, CatalogConstants::US_CATALOG_ID, CatalogConstants::VENDOR_CATALOGS, CatalogConstants::WALMART_CATALOGS, 
CatalogConstants::WALMART_SELLER_CANADA, CatalogConstants::WALMART_SELLER_USA, CatalogConstants::WAYFAIR_CANADA, CatalogConstants::WAYFAIR_CATALOGS, CatalogConstants::WAYFAIR_GERMANY, CatalogConstants::WAYFAIR_USA
Instance Attribute Summary
-
#catalog ⇒ Object
readonly
Returns the value of attribute catalog.
-
#discovered_url ⇒ String?
readonly
Returns a discovered direct URL if found during extraction.
Instance Method Summary
-
#catalog_base_url ⇒ String?
protected
Get the base URL for this catalog's retailer. Override in subclasses for specific domains.
-
#check_availability(html, unavailable_phrases = []) ⇒ Boolean
protected
Check if page indicates product is available.
-
#collect_product_identifiers(catalog_item) ⇒ Array<String>
Collect all product identifiers that can be used to validate the page.
-
#extract(check, content) ⇒ void
Extract data from content and populate the check record.
-
#extract_canonical_url(doc) ⇒ String?
protected
Extract canonical URL from page head or og:url meta tag.
-
#extract_json_ld_price(check, doc) ⇒ Object
protected
Extract price from JSON-LD structured data (schema.org). Most reliable method across retailers.
-
#extract_numeric_price(text) ⇒ Float?
protected
Extract numeric price from text.
-
#extract_price_from_selectors(check, doc, selectors) ⇒ Object
protected
Extract price from common CSS selectors.
-
#extract_product_link_from_search(doc, selectors) ⇒ String?
protected
Extract product link from search results.
-
#extract_title(doc) ⇒ String?
protected
Extract title from common selectors.
-
#initialize(catalog) ⇒ Base
constructor
A new instance of Base.
-
#make_absolute_url(href) ⇒ String
protected
Convert relative URL to absolute.
-
#parse_html(html) ⇒ Nokogiri::HTML::Document
protected
Parse HTML content with Nokogiri.
-
#source_name ⇒ String
Identifier for this extractor (used in check.scraper_source).
-
#valid_html?(content) ⇒ Boolean
protected
Validate that content is HTML string.
-
#valid_price?(price) ⇒ Boolean
protected
Validate that a price is reasonable.
-
#validate_product_identity(check, content, catalog_item) ⇒ Boolean
Validate that the scraped page actually contains our product identifiers.
Methods included from CatalogConstants
amazon_catalog?, amazon_seller_catalog?, costco_catalog?, home_depot_catalog?, marketplace_catalog?, price_check_enabled?, vendor_catalog?, walmart_catalog?, wayfair_catalog?
Constructor Details
#initialize(catalog) ⇒ Base
Returns a new instance of Base.
# File 'app/services/retailer/extractors/base.rb', line 19

def initialize(catalog)
  @catalog = catalog
end
Instance Attribute Details
#catalog ⇒ Object (readonly)
Returns the value of attribute catalog.
# File 'app/services/retailer/extractors/base.rb', line 17

def catalog
  @catalog
end
#discovered_url ⇒ String? (readonly)
Returns a discovered direct URL if found during extraction.
Used to capture and store canonical URLs for future direct access.
Override in subclasses that can discover URLs (e.g., from search results).
# File 'app/services/retailer/extractors/base.rb', line 41

def discovered_url
  @discovered_url
end
Instance Method Details
#catalog_base_url ⇒ String? (protected)
Get base URL for this catalog's retailer
Override in subclasses for specific domains
# File 'app/services/retailer/extractors/base.rb', line 101

def catalog_base_url
  nil
end
#check_availability(html, unavailable_phrases = []) ⇒ Boolean (protected)
Check if page indicates product is available
# File 'app/services/retailer/extractors/base.rb', line 192

def check_availability(html, unavailable_phrases = [])
  default_phrases = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available']
  phrases = unavailable_phrases + default_phrases

  phrases.none? { |phrase| html.include?(phrase) }
end
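The check above is a plain substring match, so it is case-sensitive and depends on the exact phrases a retailer uses. The following is a minimal standalone sketch of the same logic; the helper name `available?`, the constant name, and the sample HTML strings are assumptions made for illustration and are not part of the class.

```ruby
# Standalone sketch of the availability check. `available?` is a hypothetical
# free function, not a method of Retailer::Extractors::Base.
DEFAULT_UNAVAILABLE_PHRASES = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available'].freeze

def available?(html, unavailable_phrases = [])
  phrases = unavailable_phrases + DEFAULT_UNAVAILABLE_PHRASES
  # String#include? is case-sensitive, so 'out of stock' does not match 'Out of Stock'.
  phrases.none? { |phrase| html.include?(phrase) }
end

in_stock   = available?('<button>Add to Cart</button>')        # no phrase present
sold_out   = available?('<span>Sold Out</span>')               # default phrase present
custom     = available?('<p>Discontinued</p>', ['Discontinued']) # retailer-specific phrase
lower_case = available?('<span>out of stock</span>')           # casing differs from the defaults
```

In this sketch `lower_case` comes out as available, which suggests that subclasses may want to pass retailer-specific casings through `unavailable_phrases`.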
#collect_product_identifiers(catalog_item) ⇒ Array<String>
Collect all product identifiers that can be used to validate the page
# File 'app/services/retailer/extractors/base.rb', line 246

def collect_product_identifiers(catalog_item)
  identifiers = []

  # Our internal SKU
  identifiers << catalog_item.sku

  # UPC from the parent item
  identifiers << catalog_item.store_item&.item&.upc

  # Third party number (retailer's part number)
  identifiers << catalog_item.third_party_part_number

  # Third party SKU (e.g., Wayfair piid)
  identifiers << catalog_item.third_party_sku

  # Parent SKU (e.g., WRM1245 for Wayfair variants)
  # This is used in search URLs and should appear on the page
  identifiers << catalog_item.parent_sku

  identifiers.compact.reject(&:blank?).uniq
end
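To illustrate the final `compact.reject(&:blank?).uniq` step, here is a small sketch that runs without Rails. The `Item` struct, the helper name `collect_identifiers`, and the sample values are hypothetical; the UPC is flattened onto the struct for brevity, and `blank?` (an ActiveSupport method) is replaced by a plain-Ruby emptiness check.

```ruby
# Hypothetical stand-in for a catalog item; only the attribute names are taken
# from the method above.
Item = Struct.new(:sku, :upc, :third_party_part_number, :third_party_sku, :parent_sku)

def collect_identifiers(item)
  [item.sku, item.upc, item.third_party_part_number, item.third_party_sku, item.parent_sku]
    .compact                              # drop nil values
    .reject { |id| id.to_s.strip.empty? } # plain-Ruby substitute for ActiveSupport's blank?
    .uniq                                 # the same value may be stored in several fields
end

item = Item.new('WRM1245-B', nil, '', 'WRM1245-B', 'WRM1245')
identifiers = collect_identifiers(item)
```

With these sample values, `identifiers` holds the SKU and the parent SKU once each; the nil UPC, the empty part number, and the duplicate third-party SKU are dropped.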
#extract(check, content) ⇒ void
This method returns an undefined value.
Extract data from content and populate the check record
# File 'app/services/retailer/extractors/base.rb', line 27

def extract(check, content)
  raise NotImplementedError, 'Subclasses must implement #extract'
end
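The contract above is a template method: the base class raises, and each subclass populates the check record in place. A minimal sketch of that pattern follows. `StubBase`, `ExampleExtractor`, and the `Check` struct are hypothetical stand-ins rather than the application's classes, and the regex-based title lookup is only there to keep the sketch free of dependencies (the real extractors use Nokogiri).

```ruby
# Hypothetical stand-ins: StubBase mirrors only the #extract contract shown
# above, and Check is a plain Struct rather than the application's check record.
Check = Struct.new(:price, :title)

class StubBase
  def extract(check, content)
    raise NotImplementedError, 'Subclasses must implement #extract'
  end
end

class ExampleExtractor < StubBase
  # Populates the check in place; callers should not rely on the return value.
  def extract(check, content)
    check.title = content[%r{<h1>(.*?)</h1>}m, 1]&.strip
  end
end

check = Check.new
ExampleExtractor.new.extract(check, '<h1> Example Faucet </h1>')

# Calling the base implementation directly raises NotImplementedError.
base_error =
  begin
    StubBase.new.extract(Check.new, '')
    nil
  rescue NotImplementedError => e
    e.message
  end
```

One design detail worth keeping in mind: in Ruby, `NotImplementedError` descends from `ScriptError`, not `StandardError`, so a bare `rescue` around an extractor call will not catch it.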
#extract_canonical_url(doc) ⇒ String? (protected)
Extract canonical URL from page head or og:url meta tag
# File 'app/services/retailer/extractors/base.rb', line 48

def extract_canonical_url(doc)
  # Try rel="canonical" first (most reliable)
  canonical_el = doc.at_css('link[rel="canonical"]')
  if canonical_el
    url = canonical_el['href']
    return url if url.present? && url.start_with?('http')
  end

  # Try og:url meta tag
  og_url = doc.at_css('meta[property="og:url"]')
  if og_url
    url = og_url['content']
    return url if url.present? && url.start_with?('http')
  end

  nil
end
#extract_json_ld_price(check, doc) ⇒ Object (protected)
Extract price from JSON-LD structured data (schema.org)
Most reliable method across retailers
# File 'app/services/retailer/extractors/base.rb', line 123

def extract_json_ld_price(check, doc)
  doc.css('script[type="application/ld+json"]').each do |script|
    data = JSON.parse(script.text)

    # Handle @graph structure
    data = data['@graph'].find { |item| item['offers'] } || data if data['@graph'].is_a?(Array)

    offers = data['offers']
    next unless offers

    # Handle array of offers
    offers = offers.first if offers.is_a?(Array)

    price = offers['price']
    check.price = extract_numeric_price(price) if price

    # Also try to get regular/high price
    high_price = offers['highPrice'] || data['highPrice']
    if high_price && check.price.present?
      regular = extract_numeric_price(high_price)
      check.regular_price = regular if regular && regular > check.price
    end

    break if check.price.present?
  rescue JSON::ParserError
    next
  end
end
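The `@graph` and offers-array handling above can be easier to follow on a concrete payload. The sketch below works on an already-extracted JSON-LD string using only the standard library. The helper name `first_offer` and the sample payload are assumptions, and the guards for non-Hash values are additions in this sketch that the method above does not contain.

```ruby
require 'json'

# Hypothetical helper: returns the first offer hash from a parsed JSON-LD
# payload, or nil when none is present.
def first_offer(data)
  # Added guard (not in the documented method): a top-level JSON array has no 'offers' key.
  return nil unless data.is_a?(Hash)

  # Some pages wrap several entities in @graph; pick the one that carries offers.
  if data['@graph'].is_a?(Array)
    data = data['@graph'].find { |item| item.is_a?(Hash) && item['offers'] } || data
  end

  offers = data['offers']
  # Offers may be a single object or an array of objects.
  offers = offers.first if offers.is_a?(Array)
  offers
end

json_ld = <<~JSON
  {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "BreadcrumbList" },
      { "@type": "Product", "name": "Example faucet",
        "offers": [{ "@type": "Offer", "price": "129.99", "priceCurrency": "USD" }] }
    ]
  }
JSON

offer = first_offer(JSON.parse(json_ld))
price = offer && offer['price'].to_f
```

Here the price arrives as a string, which is why the documented method passes it through `extract_numeric_price` instead of using it directly.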
#extract_numeric_price(text) ⇒ Float? (protected)
Extract numeric price from text
# File 'app/services/retailer/extractors/base.rb', line 155

def extract_numeric_price(text)
  price_val = if text.is_a?(Numeric)
                text.to_f
              else
                text.to_s.delete('^0-9.').to_f
              end
  price_val if valid_price?(price_val)
end
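A few sample inputs show what the digit-and-dot filter does in practice. This is a standalone sketch: `parse_price` is a hypothetical name, and the range check is inlined from `#valid_price?` (greater than 1, less than 100,000) with the nil check dropped because the value is always a Float at that point.

```ruby
# Standalone sketch of the numeric price parsing; `parse_price` is a
# hypothetical name used only for this example.
def parse_price(text)
  value = text.is_a?(Numeric) ? text.to_f : text.to_s.delete('^0-9.').to_f
  # Same bounds as #valid_price?: anything outside (1, 100_000) is rejected.
  value if value > 1 && value < 100_000
end

us_format   = parse_price('$1,299.99')      # thousands separator is stripped
with_prefix = parse_price('CAD 24.50')      # currency code is stripped
numeric     = parse_price(49)               # numerics are converted to Float
too_small   = parse_price('0.99')           # not greater than 1, so rejected
no_digits   = parse_price('Call for price') # parses as 0.0, so rejected
eu_format   = parse_price('1.299,00 EUR')   # comma removed, dot kept
```

The last case is a caveat of this approach: a price written with a decimal comma, such as `1.299,00`, is read as `1.299` and still passes the range check, so locale-specific formats may need handling in the relevant subclass.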
#extract_price_from_selectors(check, doc, selectors) ⇒ Object (protected)
Extract price from common CSS selectors
# File 'app/services/retailer/extractors/base.rb', line 175

def extract_price_from_selectors(check, doc, selectors)
  selectors.each do |selector|
    el = doc.at_css(selector)
    next unless el

    price_val = extract_numeric_price(el['content'] || el.text)
    if valid_price?(price_val)
      check.price = price_val
      break
    end
  end
end
#extract_product_link_from_search(doc, selectors) ⇒ String? (protected)
Extract product link from search results
# File 'app/services/retailer/extractors/base.rb', line 70

def extract_product_link_from_search(doc, selectors)
  selectors.each do |selector|
    link = doc.at_css(selector)
    next unless link

    href = link['href']
    next if href.blank?

    # Make absolute URL if relative
    return make_absolute_url(href) if href.present?
  end
  nil
end
#extract_title(doc) ⇒ String? (protected)
Extract title from common selectors
# File 'app/services/retailer/extractors/base.rb', line 202

def extract_title(doc)
  title_el = doc.at_css('h1') || doc.at_css('[data-testid="product-title"]')
  title_el&.text&.strip&.truncate(255)
end
#make_absolute_url(href) ⇒ String (protected)
Convert relative URL to absolute
# File 'app/services/retailer/extractors/base.rb', line 87

def make_absolute_url(href)
  return href if href.start_with?('http')

  base_url = catalog_base_url
  return href unless base_url

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href
end
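The resolution rules above can be tried out in isolation. In this sketch the base URL is passed as an argument instead of coming from `#catalog_base_url`; the helper name `absolutize` and the `example.com` URLs are assumptions for illustration.

```ruby
require 'uri'

# Standalone sketch of the relative-to-absolute conversion; `absolutize` is a
# hypothetical name and takes the base URL explicitly.
def absolutize(href, base_url)
  return href if href.start_with?('http') # already absolute
  return href unless base_url             # nothing to resolve against

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href # fall back to the raw value when it cannot be parsed
end

base = 'https://www.example.com'

relative          = absolutize('/pdp/widget-123', base)
already_absolute  = absolutize('https://cdn.example.com/a', base)
protocol_relative = absolutize('//cdn.example.com/x', base)
no_base           = absolutize('/pdp/widget-123', nil)
```

Because the base class returns `nil` from `#catalog_base_url`, relative links stay relative (as in the `no_base` case) until a subclass overrides it.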
#parse_html(html) ⇒ Nokogiri::HTML::Document (protected)
Parse HTML content with Nokogiri
# File 'app/services/retailer/extractors/base.rb', line 108

def parse_html(html)
  Nokogiri::HTML(html)
end
#source_name ⇒ String
Identifier for this extractor (used in check.scraper_source)
# File 'app/services/retailer/extractors/base.rb', line 33

def source_name
  self.class.name.demodulize.underscore
end
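`demodulize` and `underscore` come from ActiveSupport. To show roughly what values end up in `check.scraper_source`, the sketch below approximates both in plain Ruby. The helper name `approximate_source_name` is hypothetical, and the regular expressions are a simplified approximation that ignores ActiveSupport's acronym handling.

```ruby
# Plain-Ruby approximation of ActiveSupport's demodulize + underscore.
# `approximate_source_name` is a hypothetical helper for illustration only.
def approximate_source_name(class_name)
  class_name.split('::').last                      # demodulize: keep the last constant
            .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2') # split runs like "HTMLParser"
            .gsub(/([a-z\d])([A-Z])/, '\1_\2')     # split "HomeDepot" into "Home_Depot"
            .downcase
end

names = %w[Amazon BestbuyCanada BuildCom HomeDepot].map do |subclass|
  approximate_source_name("Retailer::Extractors::#{subclass}")
end
```

For the four subclass names used here, `names` contains `amazon`, `bestbuy_canada`, `build_com`, and `home_depot`.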
#valid_html?(content) ⇒ Boolean (protected)
Validate that content is HTML string
# File 'app/services/retailer/extractors/base.rb', line 115

def valid_html?(content)
  content.is_a?(String) && content.present?
end
#valid_price?(price) ⇒ Boolean (protected)
Validate that a price is reasonable
# File 'app/services/retailer/extractors/base.rb', line 167

def valid_price?(price)
  price.present? && price > 1 && price < 100_000
end
#validate_product_identity(check, content, catalog_item) ⇒ Boolean
Validate that the scraped page actually contains our product identifiers.
This prevents false positives where a retailer redirects to a different product.
# File 'app/services/retailer/extractors/base.rb', line 219

def validate_product_identity(check, content, catalog_item)
  identifiers = collect_product_identifiers(catalog_item)
  return true if identifiers.empty? # Skip validation if no identifiers available

  # Check if ANY of our identifiers appear in the page content or URL
  content_to_check = content.to_s.downcase
  url_to_check = check.url.to_s.downcase

  found = identifiers.any? do |identifier|
    next false if identifier.blank?

    normalized = identifier.to_s.downcase.strip
    content_to_check.include?(normalized) || url_to_check.include?(normalized)
  end

  unless found
    check.status = 'product_mismatch'
    check. = "Product identity validation failed: none of our identifiers (#{identifiers.compact.join(', ')}) found on page"
    Rails.logger.warn "[#{source_name}] Product mismatch for catalog_item #{catalog_item.id}: #{check.}"
  end

  found
end
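The matching step above is a case-insensitive substring search over the page content and the URL. The sketch below isolates that step without the check record or logging. The helper name `identity_found?` and all sample values are assumptions for illustration.

```ruby
# Standalone sketch of the identifier matching; `identity_found?` is a
# hypothetical name and does not update any check record.
def identity_found?(identifiers, content, url)
  haystacks = [content.to_s.downcase, url.to_s.downcase]

  identifiers.any? do |identifier|
    needle = identifier.to_s.downcase.strip
    next false if needle.empty?

    # A hit in either the page content or the URL counts as a match.
    haystacks.any? { |text| text.include?(needle) }
  end
end

page = '<h1>Example Faucet</h1><span>Model: WRM1245-B</span>'

in_content = identity_found?(['wrm1245-b'], page, 'https://www.example.com/pdp/123')
in_url     = identity_found?(['SKU-777'], '<h1>Other</h1>', 'https://www.example.com/pdp/sku-777')
mismatch   = identity_found?(['ABC999'], page, 'https://www.example.com/pdp/123')
accidental = identity_found?(['123'], '<h1>Other product</h1>', 'https://www.example.com/pdp/123')
```

The `accidental` case matches only because `123` happens to appear in the URL path. Since this is a substring search, very short identifiers can produce false matches, which may be worth considering when deciding which identifiers to collect.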