Class: Retailer::Extractors::Base

Inherits:
Object
Includes:
CatalogConstants
Defined in:
app/services/retailer/extractors/base.rb

Overview

Base class for retailer data extractors.
Uses Nokogiri for HTML parsing as recommended by Oxylabs:
https://github.com/oxylabs/webscraping-with-ruby

Examples:

Subclass implementation

class Retailer::Extractors::Amazon < Retailer::Extractors::Base
  def extract(check, content)
    # Amazon-specific extraction logic
  end
end

Constant Summary

RENDER_REQUIRED = true

Whether this retailer requires JavaScript rendering for price extraction.
Override the constant with false in a subclass to opt out.

render: 'html' is roughly 5x more expensive at Oxylabs than non-rendered
requests. Most extractors currently set true to preserve historical
behavior; flip to false one retailer at a time, alongside a manual probe
to confirm the page still parses.

Returns:

  • (Boolean)

Constants included from CatalogConstants

CatalogConstants::ALL_MAIN_CATALOG_IDS, CatalogConstants::AMAZON_CATALOG_IDS, CatalogConstants::AMAZON_CA_CATALOG_IDS, CatalogConstants::AMAZON_EU_CATALOG_IDS, CatalogConstants::AMAZON_NA_SELLER_IDS, CatalogConstants::AMAZON_SC_BE_CATALOG_ID, CatalogConstants::AMAZON_SC_CATALOG_IDS, CatalogConstants::AMAZON_SC_CA_CATALOG_ID, CatalogConstants::AMAZON_SC_DE_CATALOG_ID, CatalogConstants::AMAZON_SC_ES_CATALOG_ID, CatalogConstants::AMAZON_SC_FR_CATALOG_ID, CatalogConstants::AMAZON_SC_IT_CATALOG_ID, CatalogConstants::AMAZON_SC_NL_CATALOG_ID, CatalogConstants::AMAZON_SC_PL_CATALOG_ID, CatalogConstants::AMAZON_SC_SE_CATALOG_ID, CatalogConstants::AMAZON_SC_UK_CATALOG_ID, CatalogConstants::AMAZON_SC_US_CATALOG_ID, CatalogConstants::AMAZON_SELLER_IDS, CatalogConstants::AMAZON_US_CATALOG_IDS, CatalogConstants::AMAZON_VC_CATALOG_IDS, CatalogConstants::AMAZON_VC_CA_CATALOG_ID, CatalogConstants::AMAZON_VC_CA_CATALOG_IDS, CatalogConstants::AMAZON_VC_DIRECT_FULFILLMENT_CATALOG_IDS, CatalogConstants::AMAZON_VC_US_CATALOG_IDS, CatalogConstants::AMAZON_VC_US_WASN4_CATALOG_ID, CatalogConstants::AMAZON_VC_US_WAX7V_CATALOG_ID, CatalogConstants::AMAZON_VC_WAT0F_CA_CATALOG_ID, CatalogConstants::AMAZON_VC_WAT4D_CA_CATALOG_ID, CatalogConstants::AMAZON_VENDOR_CODE_TO_CATALOG_ID, CatalogConstants::BESTBUY_CANADA, CatalogConstants::BUILD_COM, CatalogConstants::CANADIAN_TIRE, CatalogConstants::CA_CATALOG_ID, CatalogConstants::COSTCO_CANADA, CatalogConstants::COSTCO_CATALOGS, CatalogConstants::COSTCO_USA, CatalogConstants::EU_CATALOG_ID, CatalogConstants::HOME_DEPOT_CANADA, CatalogConstants::HOME_DEPOT_CATALOGS, CatalogConstants::HOME_DEPOT_USA, CatalogConstants::HOUZZ, CatalogConstants::LOCALE_TO_CATALOG, CatalogConstants::LOWES_CANADA, CatalogConstants::LOWES_USA, CatalogConstants::RONA_CANADA, CatalogConstants::US_CATALOG_ID, CatalogConstants::WALMART_CATALOGS, CatalogConstants::WALMART_SELLER_CANADA, CatalogConstants::WALMART_SELLER_USA, CatalogConstants::WAYFAIR_CANADA, 
CatalogConstants::WAYFAIR_CATALOGS, CatalogConstants::WAYFAIR_GERMANY, CatalogConstants::WAYFAIR_USA


Methods included from CatalogConstants

amazon_catalog?, amazon_seller_catalog?, costco_catalog?, home_depot_catalog?, walmart_catalog?, wayfair_catalog?

Constructor Details

#initialize(catalog) ⇒ Base

Returns a new instance of Base.



# File 'app/services/retailer/extractors/base.rb', line 36

def initialize(catalog)
  @catalog = catalog
end

Instance Attribute Details

#catalog ⇒ Object (readonly)

Returns the value of attribute catalog.



# File 'app/services/retailer/extractors/base.rb', line 34

def catalog
  @catalog
end

#discovered_url ⇒ String? (readonly)

Returns a discovered direct URL if found during extraction.
Used to capture and store canonical URLs for future direct access.
Override in subclasses that can discover URLs (e.g., from search results).

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 58

def discovered_url
  @discovered_url
end

Class Method Details

.render_value ⇒ String?

Returns the Oxylabs render payload value for this extractor.

Returns:

  • (String, nil)

    'html' when RENDER_REQUIRED is true, otherwise nil


# File 'app/services/retailer/extractors/base.rb', line 30

def self.render_value
  self::RENDER_REQUIRED ? 'html' : nil
end
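
To illustrate how `self::RENDER_REQUIRED` resolves per subclass, here is a minimal standalone sketch. The class names `SketchBase` and `StaticRetailer` are hypothetical stand-ins and are not part of this codebase.

```ruby
# Hypothetical stand-in for the base class; only the constant lookup is reproduced.
class SketchBase
  RENDER_REQUIRED = true

  # self:: looks the constant up on the receiving class first, so a subclass
  # that redefines RENDER_REQUIRED gets its own value.
  def self.render_value
    self::RENDER_REQUIRED ? 'html' : nil
  end
end

# Hypothetical subclass for a retailer whose pages parse without JavaScript.
class StaticRetailer < SketchBase
  RENDER_REQUIRED = false
end

p SketchBase.render_value      # "html"
p StaticRetailer.render_value  # nil
```

A bare `RENDER_REQUIRED` inside the method would resolve lexically to the base class's constant, so the `self::` prefix is what makes the per-subclass override take effect.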

Instance Method Details

#catalog_base_url ⇒ String? (protected)

Returns the base URL for this catalog's retailer, used to resolve relative links.
The base implementation returns nil; override in subclasses for specific domains.

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 118

def catalog_base_url
  nil
end

#check_availability(html, unavailable_phrases = []) ⇒ Boolean (protected)

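Checks whether the page indicates the product is available. Returns true
unless the HTML contains one of the unavailable phrases; the match is an
exact, case-sensitive substring test.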

Parameters:

  • html (String)

    HTML content

  • unavailable_phrases (Array<String>) (defaults to: [])

    Phrases indicating unavailability

Returns:

  • (Boolean)


# File 'app/services/retailer/extractors/base.rb', line 209

def check_availability(html, unavailable_phrases = [])
  default_phrases = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available']
  phrases = unavailable_phrases + default_phrases

  phrases.none? { |phrase| html.include?(phrase) }
end
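
The phrase check can be tried outside the class with the sketch below. The names `available?` and `DEFAULT_UNAVAILABLE` are hypothetical; the phrase list is copied from the method above.

```ruby
# Default phrases copied from check_availability.
DEFAULT_UNAVAILABLE = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available'].freeze

# Hypothetical standalone version of the availability check.
def available?(html, extra_phrases = [])
  (extra_phrases + DEFAULT_UNAVAILABLE).none? { |phrase| html.include?(phrase) }
end

p available?('<p>In stock, ships today</p>')        # true
p available?('<p>Sold Out</p>')                     # false
p available?('<p>SOLD OUT</p>')                     # true, because the match is case-sensitive
p available?('<p>Backordered</p>', ['Backordered']) # false
```

Because the comparison is case-sensitive, a subclass that needs to catch other casings would have to pass those variants in `unavailable_phrases`.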

#collect_product_identifiers(catalog_item) ⇒ Array<String>

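Collects all product identifiers that can be used to validate the page.
Nil and blank values are dropped and duplicates are removed.

Parameters:

  • catalog_item (CatalogItem)

    The catalog item being checked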

Returns:

  • (Array<String>)

    List of identifiers (SKU, UPC, third party number, etc.)



# File 'app/services/retailer/extractors/base.rb', line 263

def collect_product_identifiers(catalog_item)
  identifiers = []

  # Our internal SKU
  identifiers << catalog_item.sku

  # UPC from the parent item
  identifiers << catalog_item.store_item&.item&.upc

  # Third party number (retailer's part number)
  identifiers << catalog_item.third_party_part_number

  # Third party SKU (retailer-assigned / our marketplace SKU)
  identifiers << catalog_item.third_party_sku

  # Variant selector — the Wayfair piid that pins the exact size on a shared PDP
  identifiers << catalog_item.third_party_sku_variant_id

  # Parent SKU (e.g., WRM1245 for Wayfair variants)
  # This is used in search URLs and should appear on the page
  identifiers << catalog_item.parent_sku

  identifiers.compact.compact_blank.uniq
end
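
The clean-up at the end of the method can be sketched without Rails as follows. `SketchItem` and `sketch_identifiers` are hypothetical names, only three of the fields are modeled, the SKU value `WRM1245-36` is made up for illustration, and the `reject` call is a plain-Ruby approximation of ActiveSupport's `compact_blank`.

```ruby
# Hypothetical stand-in for a catalog item; only three fields are modeled.
SketchItem = Struct.new(:sku, :third_party_part_number, :parent_sku, keyword_init: true)

def sketch_identifiers(item)
  [item.sku, item.third_party_part_number, item.parent_sku]
    .reject { |value| value.nil? || value.to_s.strip.empty? } # approximates compact_blank
    .uniq
end

blank_part = SketchItem.new(sku: 'WRM1245-36', third_party_part_number: '', parent_sku: 'WRM1245')
p sketch_identifiers(blank_part) # ["WRM1245-36", "WRM1245"]

duplicate = SketchItem.new(sku: 'WRM1245', third_party_part_number: nil, parent_sku: 'WRM1245')
p sketch_identifiers(duplicate)  # ["WRM1245"]
```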

#extract(check, content) ⇒ void

This method returns an undefined value.

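Extracts data from content and populates the check record. The base
implementation always raises; subclasses must override it.

Parameters:

  • check (CatalogItemRetailerProbe)

    The probe record to update

  • content (Object)

    Scraped page content passed to the extractor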

Raises:

  • (NotImplementedError)


# File 'app/services/retailer/extractors/base.rb', line 44

def extract(check, content)
  raise NotImplementedError, 'Subclasses must implement #extract'
end

#extract_canonical_url(doc) ⇒ String? (protected)

Extracts the canonical URL from the link[rel="canonical"] element, falling
back to the og:url meta tag. Only values that start with http are returned.

Parameters:

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 65

def extract_canonical_url(doc)
  # Try rel="canonical" first (most reliable)
  canonical_el = doc.at_css('link[rel="canonical"]')
  if canonical_el
    url = canonical_el['href']
    return url if url.present? && url.start_with?('http')
  end

  # Try og:url meta tag
  og_url = doc.at_css('meta[property="og:url"]')
  if og_url
    url = og_url['content']
    return url if url.present? && url.start_with?('http')
  end

  nil
end

#extract_json_ld_price(check, doc) ⇒ Object (protected)

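Extracts the price from JSON-LD structured data (schema.org) and assigns it
to check.price. When a highPrice greater than the price is present, it is
stored in check.regular_price. Most reliable method across retailers.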

Parameters:

  • check (CatalogItemRetailerProbe)

    The probe record to update

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document



# File 'app/services/retailer/extractors/base.rb', line 140

def extract_json_ld_price(check, doc)
  doc.css('script[type="application/ld+json"]').each do |script|
    data = JSON.parse(script.text)

    # Handle @graph structure
    data = data['@graph'].find { |item| item['offers'] } || data if data['@graph'].is_a?(Array)

    offers = data['offers']
    next unless offers

    # Handle array of offers
    offers = offers.first if offers.is_a?(Array)

    price = offers['price']
    check.price = extract_numeric_price(price) if price

    # Also try to get regular/high price
    high_price = offers['highPrice'] || data['highPrice']
    if high_price && check.price.present?
      regular = extract_numeric_price(high_price)
      check.regular_price = regular if regular && regular > check.price
    end

    break if check.price.present?
  rescue JSON::ParserError
    next
  end
end
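
The traversal above (optional @graph, then offers, then the first offer) can be exercised on raw JSON strings with the standard library alone. This is a simplified sketch, not the extractor itself: `sketch_json_ld_price` is a hypothetical helper, it returns the raw price instead of updating a check record, and the `is_a?(Hash)` guards are an added assumption for payloads whose top level is an array.

```ruby
require 'json'

# Returns the raw "price" of the first offer in a JSON-LD payload, or nil.
def sketch_json_ld_price(json_text)
  data = JSON.parse(json_text)
  return nil unless data.is_a?(Hash) # top-level arrays are skipped in this sketch

  # Prefer the @graph node that carries offers, when a graph is present.
  nodes = data['@graph']
  data = nodes.find { |node| node.is_a?(Hash) && node['offers'] } || data if nodes.is_a?(Array)

  offers = data['offers']
  offers = offers.first if offers.is_a?(Array)
  offers.is_a?(Hash) ? offers['price'] : nil
rescue JSON::ParserError
  nil
end

flat_doc  = '{"@type":"Product","offers":{"price":"129.99"}}'
graph_doc = '{"@graph":[{"@type":"WebPage"},{"@type":"Product","offers":[{"price":89.5}]}]}'

p sketch_json_ld_price(flat_doc)   # "129.99"
p sketch_json_ld_price(graph_doc)  # 89.5
p sketch_json_ld_price('not json') # nil
```

The price may arrive as a string or a number, which is why the real method passes it through extract_numeric_price before assigning it.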

#extract_numeric_price(text) ⇒ Float? (protected)

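Extracts a numeric price from text by deleting every character except
digits and the decimal point. Returns nil when the result fails valid_price?.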

Parameters:

  • text (String, Numeric)

    Price text or number

Returns:

  • (Float, nil)


# File 'app/services/retailer/extractors/base.rb', line 172

def extract_numeric_price(text)
  price_val = if text.is_a?(Numeric)
                text.to_f
              else
                text.to_s.delete('^0-9.').to_f
              end
  price_val if valid_price?(price_val)
end
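
A few inputs make the stripping behavior concrete. The sketch below reproduces the logic with plain Ruby; `sketch_numeric_price` and `sketch_valid_price?` are hypothetical names, and `!price.nil?` stands in for the Rails `present?` check.

```ruby
# Mirrors the bounds used by valid_price?: strictly between 1 and 100_000.
def sketch_valid_price?(price)
  !price.nil? && price > 1 && price < 100_000
end

def sketch_numeric_price(text)
  # delete('^0-9.') removes every character that is not a digit or a dot.
  value = text.is_a?(Numeric) ? text.to_f : text.to_s.delete('^0-9.').to_f
  value if sketch_valid_price?(value)
end

p sketch_numeric_price('$1,299.99')       # 1299.99
p sketch_numeric_price(45)                # 45.0
p sketch_numeric_price('0.99')            # nil, not greater than 1
p sketch_numeric_price('Call for price')  # nil, the empty string parses to 0.0
p sketch_numeric_price('$10.00 - $12.00') # 10.0012, a price range collapses into one number
```

The last case suggests that selectors should target a single price element, since range text is not split before parsing.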

#extract_price_from_selectors(check, doc, selectors) ⇒ Object (protected)

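Tries each CSS selector in order and assigns the first valid price to
check.price. The element's content attribute is read before its text.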

Parameters:

  • check (CatalogItemRetailerProbe)

    The probe record to update

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

  • selectors (Array<String>)

    CSS selectors to try



# File 'app/services/retailer/extractors/base.rb', line 192

def extract_price_from_selectors(check, doc, selectors)
  selectors.each do |selector|
    el = doc.at_css(selector)
    next unless el

    price_val = extract_numeric_price(el['content'] || el.text)
    if valid_price?(price_val)
      check.price = price_val
      break
    end
  end
end

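#extract_product_link_from_search(doc, selectors) ⇒ String? (protected)

Extracts the first non-blank product link href from search results, trying
each selector in order, and converts it to an absolute URL.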

Parameters:

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

  • selectors (Array<String>)

    CSS selectors for product links

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 87

def extract_product_link_from_search(doc, selectors)
  selectors.each do |selector|
    link = doc.at_css(selector)
    next unless link

    href = link['href']
    next if href.blank?

    # Make absolute URL if relative
    return make_absolute_url(href) if href.present?
  end
  nil
end

#extract_title(doc) ⇒ String? (protected)

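Extracts the title from the first h1 element, falling back to
[data-testid="product-title"]. The text is stripped and truncated to 255 characters.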

Parameters:

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 219

def extract_title(doc)
  title_el = doc.at_css('h1') || doc.at_css('[data-testid="product-title"]')
  title_el&.text&.strip&.truncate(255)
end

#make_absolute_url(href) ⇒ String (protected)

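Converts a relative URL to an absolute one using catalog_base_url. The href
is returned unchanged when it already starts with http, when no base URL is
defined, or when it cannot be parsed as a URI.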

Parameters:

  • href (String)

    Relative or absolute URL

Returns:

  • (String)


# File 'app/services/retailer/extractors/base.rb', line 104

def make_absolute_url(href)
  return href if href.start_with?('http')

  base_url = catalog_base_url
  return href unless base_url

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href
end
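
The resolution rules can be checked with the standard library. In this sketch the base URL is passed as an argument instead of coming from catalog_base_url; `sketch_absolute_url` and the example.com URLs are hypothetical.

```ruby
require 'uri'

# Hypothetical standalone version; base_url replaces the catalog_base_url call.
def sketch_absolute_url(href, base_url)
  return href if href.start_with?('http')
  return href unless base_url

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href
end

p sketch_absolute_url('/p/widget-123', 'https://www.example.com')
# "https://www.example.com/p/widget-123"
p sketch_absolute_url('https://cdn.example.com/x', 'https://www.example.com')
# "https://cdn.example.com/x"
p sketch_absolute_url('/p/widget-123', nil)
# "/p/widget-123"
```

Since the base class returns nil from catalog_base_url, relative links stay relative until a subclass supplies a domain.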

#parse_html(html) ⇒ Nokogiri::HTML::Document (protected)

Parse HTML content with Nokogiri

Parameters:

  • html (String)

    HTML content

Returns:

  • (Nokogiri::HTML::Document)


# File 'app/services/retailer/extractors/base.rb', line 125

def parse_html(html)
  Nokogiri::HTML(html)
end

#source_name ⇒ String

Identifier for this extractor (used in check.scraper_source)

Returns:

  • (String)


# File 'app/services/retailer/extractors/base.rb', line 50

def source_name
  self.class.name.demodulize.underscore
end
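
For readers without Rails at hand, the effect of `demodulize.underscore` on simple CamelCase class names can be approximated in plain Ruby. `sketch_source_name` is a hypothetical helper, `HomeDepot` is an assumed class name used only for illustration, and the regex does not cover every case ActiveSupport handles (acronyms, for example).

```ruby
# Approximates ActiveSupport's demodulize + underscore for simple CamelCase names.
def sketch_source_name(class_name)
  class_name.split('::').last                  # drop the namespace
            .gsub(/([a-z\d])([A-Z])/, '\1_\2') # insert "_" at lower-to-upper boundaries
            .downcase
end

p sketch_source_name('Retailer::Extractors::Amazon')    # "amazon"
p sketch_source_name('Retailer::Extractors::HomeDepot') # "home_depot"
```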

#valid_html?(content) ⇒ Boolean (protected)

Validates that content is a non-blank String. The markup itself is not inspected.

Parameters:

  • content (Object)

Returns:

  • (Boolean)


# File 'app/services/retailer/extractors/base.rb', line 132

def valid_html?(content)
  content.is_a?(String) && content.present?
end

#valid_price?(price) ⇒ Boolean (protected)

Validates that a price is reasonable: present and strictly between 1 and 100,000.

Parameters:

  • price (Float)

    Price value

Returns:

  • (Boolean)


# File 'app/services/retailer/extractors/base.rb', line 184

def valid_price?(price)
  price.present? && price > 1 && price < 100_000
end

#validate_product_identity(check, content, catalog_item) ⇒ Boolean

Validate that the scraped page actually contains our product identifiers.
This prevents false positives where a retailer redirects to a different product.

Parameters:

  • check (CatalogItemRetailerProbe)

    The probe record to update

  • content (String)

    HTML content (or URL for URL-based validation)

  • catalog_item (CatalogItem)

    The catalog item being checked

Returns:

  • (Boolean)

    true if validation passed, false if product mismatch detected



# File 'app/services/retailer/extractors/base.rb', line 236

def validate_product_identity(check, content, catalog_item)
  identifiers = collect_product_identifiers(catalog_item)
  return true if identifiers.empty? # Skip validation if no identifiers available

  # Check if ANY of our identifiers appear in the page content or URL
  content_to_check = content.to_s.downcase
  url_to_check = check.url.to_s.downcase

  found = identifiers.any? do |identifier|
    next false if identifier.blank?

    normalized = identifier.to_s.downcase.strip
    content_to_check.include?(normalized) || url_to_check.include?(normalized)
  end

  unless found
    check.status = 'product_mismatch'
    check.error_message = "Product identity validation failed: none of our identifiers (#{identifiers.compact.join(', ')}) found on page"
    Rails.logger.warn "[#{source_name}] Product mismatch for catalog_item #{catalog_item.id}: #{check.error_message}"
  end

  found
end
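
The matching rule at the core of this method is a case-insensitive substring test against both the page content and the URL. The sketch below isolates that rule; `sketch_identity_match?` is a hypothetical helper, the URLs and the UPC value are made up, and the status and logging side effects are left out.

```ruby
# Returns true when any identifier appears in the content or the URL,
# compared case-insensitively as a substring.
def sketch_identity_match?(identifiers, content, url)
  haystacks = [content.to_s.downcase, url.to_s.downcase]

  identifiers.any? do |identifier|
    needle = identifier.to_s.downcase.strip
    next false if needle.empty?

    haystacks.any? { |text| text.include?(needle) }
  end
end

p sketch_identity_match?(['WRM1245'], '<h1>Bath Mat WRM1245</h1>', nil)
# true, found in the content
p sketch_identity_match?(['WRM1245'], '<h1>Bath Mat</h1>', 'https://www.example.com/pdp/wrm1245')
# true, found in the URL
p sketch_identity_match?(['WRM1245', '012345678905'], '<h1>Different product</h1>', 'https://www.example.com/pdp/other')
# false
```

Because this is a substring test, very short identifiers can match unrelated text on the page, so a pass indicates that an identifier string is present, not that the product is certainly the right one.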