Class: Retailer::Extractors::Base

Inherits: Object

Includes: CatalogConstants

Defined in: app/services/retailer/extractors/base.rb

Overview

Base class for retailer data extractors.
Uses Nokogiri for HTML parsing as recommended by Oxylabs:
https://github.com/oxylabs/webscraping-with-ruby

Examples:

Subclass implementation

class Retailer::Extractors::Amazon < Retailer::Extractors::Base
  def extract(check, content)
    # Amazon-specific extraction logic
  end
end

Direct Known Subclasses

Amazon, BestbuyCanada, BuildCom, CanadianTire, Costco, Generic, HomeDepot, Houzz, Lowes, Rona, Walmart, Wayfair

Constant Summary

RENDER_REQUIRED = true

Whether this retailer requires JavaScript rendering for price extraction. Subclasses opt out by overriding the constant with false. Requests with render: 'html' are roughly 5x more expensive at Oxylabs than non-rendered requests. Most extractors currently set true to preserve historical behavior; flip to false one retailer at a time, alongside a manual probe to confirm the page still parses.

Returns: (Boolean)

Constants included from CatalogConstants

Instance Attribute Summary

#catalog ⇒ Object readonly
Returns the value of attribute catalog.
#discovered_url ⇒ String? readonly
Returns a discovered direct URL if found during extraction.

Class Method Summary

.render_value ⇒ String?
Returns the Oxylabs render payload value for this extractor.

Instance Method Summary

#catalog_base_url ⇒ String? protected
Get base URL for this catalog's retailer. Override in subclasses for specific domains.
#check_availability(html, unavailable_phrases = []) ⇒ Boolean protected
Check if page indicates product is available.
#collect_product_identifiers(catalog_item) ⇒ Array<String>
Collect all product identifiers that can be used to validate the page.
#extract(check, content) ⇒ void
Extract data from content and populate the check record.
#extract_canonical_url(doc) ⇒ String? protected
Extract canonical URL from page head or og:url meta tag.
#extract_json_ld_price(check, doc) ⇒ Object protected
Extract price from JSON-LD structured data (schema.org). Most reliable method across retailers.
#extract_numeric_price(text) ⇒ Float? protected
Extract numeric price from text.
#extract_price_from_selectors(check, doc, selectors) ⇒ Object protected
Extract price from common CSS selectors.
#extract_product_link_from_search(doc, selectors) ⇒ String? protected
Extract product link from search results.
#extract_title(doc) ⇒ String? protected
Extract title from common selectors.
#initialize(catalog) ⇒ Base constructor
A new instance of Base.
#make_absolute_url(href) ⇒ String protected
Convert relative URL to absolute.
#parse_html(html) ⇒ Nokogiri::HTML::Document protected
Parse HTML content with Nokogiri.
#source_name ⇒ String
Identifier for this extractor (used in check.scraper_source).
#valid_html?(content) ⇒ Boolean protected
Validate that content is HTML string.
#valid_price?(price) ⇒ Boolean protected
Validate that a price is reasonable.
#validate_product_identity(check, content, catalog_item) ⇒ Boolean
Validate that the scraped page actually contains our product identifiers.

Methods included from CatalogConstants

amazon_catalog?, amazon_seller_catalog?, costco_catalog?, home_depot_catalog?, walmart_catalog?, wayfair_catalog?

Constructor Details

#initialize(catalog) ⇒ `Base`

Returns a new instance of Base.




# File 'app/services/retailer/extractors/base.rb', line 36

def initialize(catalog)
  @catalog = catalog
end

Instance Attribute Details

#catalog ⇒ `Object` (readonly)

Returns the value of attribute catalog.




# File 'app/services/retailer/extractors/base.rb', line 34

def catalog
  @catalog
end

#discovered_url ⇒ `String?` (readonly)

Returns a discovered direct URL if found during extraction.
Used to capture and store canonical URLs for future direct access.
Override in subclasses that can discover URLs (e.g., from search results).

Returns:

(String, nil)




# File 'app/services/retailer/extractors/base.rb', line 58

def discovered_url
  @discovered_url
end

Class Method Details

.render_value ⇒ `String?`

Returns the Oxylabs render payload value for this extractor.

Returns:

(String, nil) —
'html' if RENDER_REQUIRED, otherwise nil




# File 'app/services/retailer/extractors/base.rb', line 30

def self.render_value
  self::RENDER_REQUIRED ? 'html' : nil
end
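
To illustrate how `self::RENDER_REQUIRED` resolves per subclass, here is a minimal standalone sketch. The class names `SketchBase`, `RenderedRetailer`, and `StaticRetailer` are hypothetical stand-ins and are not part of this codebase.

```ruby
# Hypothetical stand-ins for Retailer::Extractors::Base and two subclasses.
class SketchBase
  RENDER_REQUIRED = true

  # self:: starts the constant lookup at the receiving class, so a subclass
  # that redefines RENDER_REQUIRED changes the result.
  def self.render_value
    self::RENDER_REQUIRED ? 'html' : nil
  end
end

# Inherits the default and therefore asks for rendering.
class RenderedRetailer < SketchBase; end

# Opts out of rendering by redefining the constant.
class StaticRetailer < SketchBase
  RENDER_REQUIRED = false
end

p RenderedRetailer.render_value
p StaticRetailer.render_value
```

A bare `RENDER_REQUIRED` inside the method would resolve lexically to the base class's value for every subclass, which is why the `self::` prefix matters here.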

Instance Method Details

#catalog_base_url ⇒ `String?` (protected)

Get base URL for this catalog's retailer.
Override in subclasses for specific domains.

Returns:

(String, nil)




# File 'app/services/retailer/extractors/base.rb', line 118

def catalog_base_url
  nil
end

#check_availability(html, unavailable_phrases = []) ⇒ `Boolean` (protected)

Check if page indicates product is available

Parameters:

html (String) —
HTML content
unavailable_phrases (Array<String>) (defaults to: []) —
Phrases indicating unavailability

Returns:

(Boolean)

# File 'app/services/retailer/extractors/base.rb', line 209

def check_availability(html, unavailable_phrases = [])
  default_phrases = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available']
  phrases = unavailable_phrases + default_phrases

  phrases.none? { |phrase| html.include?(phrase) }
end
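
The phrase check can be tried outside the class with a standalone copy. This is a sketch only: `available?` and `DEFAULT_PHRASES` are hypothetical names, and the HTML snippets are made up.

```ruby
# Standalone copy of the phrase check, for illustration only.
DEFAULT_PHRASES = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available'].freeze

def available?(html, unavailable_phrases = [])
  phrases = unavailable_phrases + DEFAULT_PHRASES
  phrases.none? { |phrase| html.include?(phrase) }
end

in_stock  = '<button>Add to Cart</button>'
sold_out  = '<span class="badge">Sold Out</span>'
lowercase = '<span class="badge">sold out</span>'

p available?(in_stock)                 # no unavailable phrase present
p available?(sold_out)                 # matches a default phrase
p available?(lowercase)                # String#include? is case-sensitive
p available?(lowercase, ['sold out'])  # caller-supplied phrase matches
```

Because the match is case-sensitive, a subclass whose retailer uses different capitalization would need to pass its own phrases.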

#collect_product_identifiers(catalog_item) ⇒ `Array<String>`

Collect all product identifiers that can be used to validate the page

Parameters:

catalog_item (CatalogItem)

Returns:

(Array<String>) —
List of identifiers (SKU, UPC, third party number, etc.)

# File 'app/services/retailer/extractors/base.rb', line 263

def collect_product_identifiers(catalog_item)
  identifiers = []

  # Our internal SKU
  identifiers << catalog_item.sku

  # UPC from the parent item
  identifiers << catalog_item.store_item&.item&.upc

  # Third party number (retailer's part number)
  identifiers << catalog_item.third_party_part_number

  # Third party SKU (retailer-assigned / our marketplace SKU)
  identifiers << catalog_item.third_party_sku

  # Variant selector — the Wayfair piid that pins the exact size on a shared PDP
  identifiers << catalog_item.third_party_sku_variant_id

  # Parent SKU (e.g., WRM1245 for Wayfair variants)
  # This is used in search URLs and should appear on the page
  identifiers << catalog_item.parent_sku

  identifiers.compact.compact_blank.uniq
end
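
The sketch below shows the filtering and de-duplication on made-up values. `SketchItem` is a hypothetical stand-in for `CatalogItem` (with the UPC flattened onto the item), and the `reject` call is a plain-Ruby substitute for ActiveSupport's `compact_blank`.

```ruby
# Hypothetical stand-in for CatalogItem; the real record reaches the UPC
# through store_item and item.
SketchItem = Struct.new(:sku, :upc, :third_party_part_number,
                        :third_party_sku, :third_party_sku_variant_id,
                        :parent_sku, keyword_init: true)

def sketch_identifiers(item)
  values = [
    item.sku, item.upc, item.third_party_part_number,
    item.third_party_sku, item.third_party_sku_variant_id, item.parent_sku
  ]
  # Drop nil and blank strings, then remove duplicates.
  values.reject { |value| value.nil? || value.to_s.strip.empty? }.uniq
end

item = SketchItem.new(sku: 'WRM1245-BLK', upc: nil, third_party_part_number: '',
                      third_party_sku: 'WRM1245-BLK',
                      third_party_sku_variant_id: '98765',
                      parent_sku: 'WRM1245')

p sketch_identifiers(item)
```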

#extract(check, content) ⇒ `void`

This method returns an undefined value.

Extract data from content and populate the check record

Parameters:

check (CatalogItemRetailerProbe) —
The probe record to update
content (String, Hash) —
HTML string or parsed data

Raises:

(NotImplementedError)




# File 'app/services/retailer/extractors/base.rb', line 44

def extract(check, content)
  raise NotImplementedError, 'Subclasses must implement #extract'
end

#extract_canonical_url(doc) ⇒ `String?` (protected)

Extract canonical URL from page head or og:url meta tag

Parameters:

doc (Nokogiri::HTML::Document) —
Parsed HTML document

Returns:

(String, nil)

# File 'app/services/retailer/extractors/base.rb', line 65

def extract_canonical_url(doc)
  # Try rel="canonical" first (most reliable)
  canonical_el = doc.at_css('link[rel="canonical"]')
  if canonical_el
    url = canonical_el['href']
    return url if url.present? && url.start_with?('http')
  end

  # Try og:url meta tag
  og_url = doc.at_css('meta[property="og:url"]')
  if og_url
    url = og_url['content']
    return url if url.present? && url.start_with?('http')
  end

  nil
end

#extract_json_ld_price(check, doc) ⇒ `Object` (protected)

Extract price from JSON-LD structured data (schema.org)
Most reliable method across retailers

Parameters:

check (CatalogItemRetailerProbe) —
The probe record to update
doc (Nokogiri::HTML::Document) —
Parsed HTML document

# File 'app/services/retailer/extractors/base.rb', line 140

def extract_json_ld_price(check, doc)
  doc.css('script[type="application/ld+json"]').each do |script|
    data = JSON.parse(script.text)

    # Handle @graph structure
    data = data['@graph'].find { |item| item['offers'] } || data if data['@graph'].is_a?(Array)

    offers = data['offers']
    next unless offers

    # Handle array of offers
    offers = offers.first if offers.is_a?(Array)

    price = offers['price']
    check.price = extract_numeric_price(price) if price

    # Also try to get regular/high price
    high_price = offers['highPrice'] || data['highPrice']
    if high_price && check.price.present?
      regular = extract_numeric_price(high_price)
      check.regular_price = regular if regular && regular > check.price
    end

    break if check.price.present?
  rescue JSON::ParserError
    next
  end
end
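
The `@graph` and offers handling can be followed on a small payload using only the `json` standard library. The JSON-LD below is made up, and the sketch uses `to_f` where the method calls `extract_numeric_price`.

```ruby
require 'json'

# Made-up JSON-LD of the kind found in <script type="application/ld+json">.
json_ld = <<~JSON
  {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "BreadcrumbList" },
      {
        "@type": "Product",
        "name": "Example Faucet",
        "offers": [{ "price": "129.99", "highPrice": "159.99" }]
      }
    ]
  }
JSON

data = JSON.parse(json_ld)

# Same narrowing as the method: pick the first @graph node that carries offers.
data = data['@graph'].find { |item| item['offers'] } || data if data['@graph'].is_a?(Array)

offers = data['offers']
offers = offers.first if offers.is_a?(Array)

price = offers['price'].to_f
high = (offers['highPrice'] || data['highPrice']).to_f

# The higher value is treated as the regular price only when it exceeds the price.
regular_price = high if high > price

p price
p regular_price
```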

#extract_numeric_price(text) ⇒ `Float?` (protected)

Extract numeric price from text

Parameters:

text (String, Numeric) —
Price text or number

Returns:

(Float, nil)

# File 'app/services/retailer/extractors/base.rb', line 172

def extract_numeric_price(text)
  price_val = if text.is_a?(Numeric)
                text.to_f
              else
                text.to_s.delete('^0-9.').to_f
              end
  price_val if valid_price?(price_val)
end
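
A few inputs show how the digit filter and the `valid_price?` bounds interact. This is a standalone sketch: the `sketch_` method names are hypothetical, and the nil check replaces ActiveSupport's `present?`.

```ruby
# Standalone copies of the two helpers, with nil handled explicitly because
# ActiveSupport is not loaded here.
def sketch_valid_price?(price)
  !price.nil? && price > 1 && price < 100_000
end

def sketch_numeric_price(text)
  # Keep only digits and dots, then parse as a float.
  value = text.is_a?(Numeric) ? text.to_f : text.to_s.delete('^0-9.').to_f
  value if sketch_valid_price?(value)
end

p sketch_numeric_price('$1,299.99')     # thousands separator is stripped
p sketch_numeric_price(49)              # numerics are converted directly
p sketch_numeric_price('Free')          # no digits, parses as 0.0, rejected
p sketch_numeric_price('$0.99')         # rejected by the > 1 lower bound
p sketch_numeric_price('1.299,99 EUR')  # decimal comma is not understood
```

The last call returns 1.29999 rather than 1299.99, so retailers that format prices with a decimal comma would likely need their own parsing in a subclass.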

#extract_price_from_selectors(check, doc, selectors) ⇒ `Object` (protected)

Extract price from common CSS selectors

Parameters:

check (CatalogItemRetailerProbe) —
The probe record to update
doc (Nokogiri::HTML::Document) —
Parsed HTML document
selectors (Array<String>) —
CSS selectors to try

# File 'app/services/retailer/extractors/base.rb', line 192

def extract_price_from_selectors(check, doc, selectors)
  selectors.each do |selector|
    el = doc.at_css(selector)
    next unless el

    price_val = extract_numeric_price(el['content'] || el.text)
    if valid_price?(price_val)
      check.price = price_val
      break
    end
  end
end

#extract_product_link_from_search(doc, selectors) ⇒ `String?` (protected)

Extract product link from search results

Parameters:

doc (Nokogiri::HTML::Document) —
Parsed HTML document
selectors (Array<String>) —
CSS selectors for product links

Returns:

(String, nil)

# File 'app/services/retailer/extractors/base.rb', line 87

def extract_product_link_from_search(doc, selectors)
  selectors.each do |selector|
    link = doc.at_css(selector)
    next unless link

    href = link['href']
    next if href.blank?

    # Make absolute URL if relative
    return make_absolute_url(href) if href.present?
  end
  nil
end

#extract_title(doc) ⇒ `String?` (protected)

Extract title from common selectors

Parameters:

doc (Nokogiri::HTML::Document) —
Parsed HTML document

Returns:

(String, nil)

# File 'app/services/retailer/extractors/base.rb', line 219

def extract_title(doc)
  title_el = doc.at_css('h1') || doc.at_css('[data-testid="product-title"]')
  title_el&.text&.strip&.truncate(255)
end

#make_absolute_url(href) ⇒ `String` (protected)

Convert relative URL to absolute

Parameters:

href (String) —
Relative or absolute URL

Returns:

(String)

# File 'app/services/retailer/extractors/base.rb', line 104

def make_absolute_url(href)
  return href if href.start_with?('http')

  base_url = catalog_base_url
  return href unless base_url

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href
end
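
The three return paths can be exercised with a standalone copy. In this sketch `BASE_URL` is a hypothetical value standing in for whatever a subclass returns from `catalog_base_url`, and `sketch_absolute_url` is a hypothetical name.

```ruby
require 'uri'

# Hypothetical base URL; in the class this comes from catalog_base_url.
BASE_URL = 'https://www.example.com'

def sketch_absolute_url(href, base_url = BASE_URL)
  return href if href.start_with?('http') # already absolute
  return href unless base_url             # no base known, leave untouched

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href
end

p sketch_absolute_url('/p/widget-123')
p sketch_absolute_url('https://cdn.example.com/x')
p sketch_absolute_url('/p/widget-123', nil)
```

Since the base class returns nil from `catalog_base_url`, relative links stay relative until a subclass supplies a domain.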

#parse_html(html) ⇒ `Nokogiri::HTML::Document` (protected)

Parse HTML content with Nokogiri

Parameters:

html (String) —
HTML content

Returns:

(Nokogiri::HTML::Document)




# File 'app/services/retailer/extractors/base.rb', line 125

def parse_html(html)
  Nokogiri::HTML(html)
end

#source_name ⇒ `String`

Identifier for this extractor (used in check.scraper_source)

Returns:

(String)




# File 'app/services/retailer/extractors/base.rb', line 50

def source_name
  self.class.name.demodulize.underscore
end
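
For a sense of the values this produces, the sketch below approximates ActiveSupport's `demodulize` and `underscore` in plain Ruby. `sketch_source_name` is a hypothetical helper that only covers simple CamelCase names such as the subclasses listed on this page.

```ruby
# Plain-Ruby approximation of demodulize + underscore for simple class names.
def sketch_source_name(class_name)
  class_name.split('::').last
            .gsub(/([a-z\d])([A-Z])/, '\1_\2')
            .downcase
end

p sketch_source_name('Retailer::Extractors::HomeDepot')
p sketch_source_name('Retailer::Extractors::BestbuyCanada')
p sketch_source_name('Retailer::Extractors::Amazon')
```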

#valid_html?(content) ⇒ `Boolean` (protected)

Validate that content is HTML string

Parameters:

content (Object)

Returns:

(Boolean)




# File 'app/services/retailer/extractors/base.rb', line 132

def valid_html?(content)
  content.is_a?(String) && content.present?
end

#valid_price?(price) ⇒ `Boolean` (protected)

Validate that a price is reasonable

Parameters:

price (Float) —
Price value

Returns:

(Boolean)




# File 'app/services/retailer/extractors/base.rb', line 184

def valid_price?(price)
  price.present? && price > 1 && price < 100_000
end

#validate_product_identity(check, content, catalog_item) ⇒ `Boolean`

Validate that the scraped page actually contains our product identifiers.
This prevents false positives where a retailer redirects to a different product.

Parameters:

check (CatalogItemRetailerProbe) —
The probe record to update
content (String) —
HTML content (or URL for URL-based validation)
catalog_item (CatalogItem) —
The catalog item being checked

Returns:

(Boolean) —
true if validation passed, false if product mismatch detected

# File 'app/services/retailer/extractors/base.rb', line 236

def validate_product_identity(check, content, catalog_item)
  identifiers = collect_product_identifiers(catalog_item)
  return true if identifiers.empty? # Skip validation if no identifiers available

  # Check if ANY of our identifiers appear in the page content or URL
  content_to_check = content.to_s.downcase
  url_to_check = check.url.to_s.downcase

  found = identifiers.any? do |identifier|
    next false if identifier.blank?

    normalized = identifier.to_s.downcase.strip
    content_to_check.include?(normalized) || url_to_check.include?(normalized)
  end

  unless found
    check.status = 'product_mismatch'
    check.error_message = "Product identity validation failed: none of our identifiers (#{identifiers.compact.join(', ')}) found on page"
    Rails.logger.warn "[#{source_name}] Product mismatch for catalog_item #{catalog_item.id}: #{check.error_message}"
  end

  found
end
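
The matching rule on its own can be sketched without the probe record or logging. `sketch_identity_match?` is a hypothetical helper, and the identifiers, HTML, and URLs are made up.

```ruby
# An identifier counts as found when it appears, case-insensitively, in
# either the page content or the URL.
def sketch_identity_match?(identifiers, content, url)
  return true if identifiers.empty? # nothing to validate against

  haystacks = [content.to_s.downcase, url.to_s.downcase]
  identifiers.any? do |identifier|
    needle = identifier.to_s.downcase.strip
    next false if needle.empty?

    haystacks.any? { |text| text.include?(needle) }
  end
end

ids = ['WRM1245', '98765']

p sketch_identity_match?(ids, '<h1>Rug wrm1245</h1>', 'https://www.example.com/p/1')
p sketch_identity_match?(ids, '<h1>Other rug</h1>', 'https://www.example.com/p/1?piid=98765')
p sketch_identity_match?(ids, '<h1>Other rug</h1>', 'https://www.example.com/p/2')
p sketch_identity_match?([], '<h1>Anything</h1>', nil)
```

Because this is a substring test, a short identifier can match incidentally inside a longer number or word, so a pass is weaker evidence than a mismatch.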
