Class: Retailer::Extractors::Base

Inherits:
Object
Includes:
CatalogConstants
Defined in:
app/services/retailer/extractors/base.rb

Overview

Base class for retailer data extractors.
Uses Nokogiri for HTML parsing as recommended by Oxylabs:
https://github.com/oxylabs/webscraping-with-ruby

Examples:

Subclass implementation

class Retailer::Extractors::Amazon < Retailer::Extractors::Base
  def extract(check, content)
    # Amazon-specific extraction logic
  end
end

Constant Summary

Constants included from CatalogConstants

CatalogConstants::ALL_MAIN_CATALOG_IDS, CatalogConstants::AMAZON_CATALOG_IDS, CatalogConstants::AMAZON_CA_CATALOG_IDS, CatalogConstants::AMAZON_EU_CATALOG_IDS, CatalogConstants::AMAZON_NA_SELLER_IDS, CatalogConstants::AMAZON_SC_BE_CATALOG_ID, CatalogConstants::AMAZON_SC_CATALOG_IDS, CatalogConstants::AMAZON_SC_CA_CATALOG_ID, CatalogConstants::AMAZON_SC_DE_CATALOG_ID, CatalogConstants::AMAZON_SC_ES_CATALOG_ID, CatalogConstants::AMAZON_SC_FR_CATALOG_ID, CatalogConstants::AMAZON_SC_IT_CATALOG_ID, CatalogConstants::AMAZON_SC_NL_CATALOG_ID, CatalogConstants::AMAZON_SC_PL_CATALOG_ID, CatalogConstants::AMAZON_SC_SE_CATALOG_ID, CatalogConstants::AMAZON_SC_UK_CATALOG_ID, CatalogConstants::AMAZON_SC_US_CATALOG_ID, CatalogConstants::AMAZON_SELLER_IDS, CatalogConstants::AMAZON_US_CATALOG_IDS, CatalogConstants::AMAZON_VC_CATALOG_IDS, CatalogConstants::AMAZON_VC_CA_CATALOG_ID, CatalogConstants::AMAZON_VC_CA_CATALOG_IDS, CatalogConstants::AMAZON_VC_DIRECT_FULFILLMENT_CATALOG_IDS, CatalogConstants::AMAZON_VC_US_CATALOG_IDS, CatalogConstants::AMAZON_VC_US_WASN4_CATALOG_ID, CatalogConstants::AMAZON_VC_US_WAX7V_CATALOG_ID, CatalogConstants::AMAZON_VC_WAT0F_CA_CATALOG_ID, CatalogConstants::AMAZON_VC_WAT4D_CA_CATALOG_ID, CatalogConstants::AMAZON_VENDOR_CODE_TO_CATALOG_ID, CatalogConstants::BESTBUY_CANADA, CatalogConstants::BUILD_COM, CatalogConstants::CANADIAN_TIRE, CatalogConstants::CA_CATALOG_ID, CatalogConstants::COSTCO_CANADA, CatalogConstants::COSTCO_CATALOGS, CatalogConstants::COSTCO_USA, CatalogConstants::EU_CATALOG_ID, CatalogConstants::HOME_DEPOT_CANADA, CatalogConstants::HOME_DEPOT_CATALOGS, CatalogConstants::HOME_DEPOT_USA, CatalogConstants::HOUZZ, CatalogConstants::LOCALE_TO_CATALOG, CatalogConstants::LOWES_CANADA, CatalogConstants::LOWES_USA, CatalogConstants::MARKETPLACE_CATALOGS, CatalogConstants::PRICE_CHECK_ENABLED_CATALOGS, CatalogConstants::RONA_CANADA, CatalogConstants::US_CATALOG_ID, CatalogConstants::VENDOR_CATALOGS, CatalogConstants::WALMART_CATALOGS, 
CatalogConstants::WALMART_SELLER_CANADA, CatalogConstants::WALMART_SELLER_USA, CatalogConstants::WAYFAIR_CANADA, CatalogConstants::WAYFAIR_CATALOGS, CatalogConstants::WAYFAIR_GERMANY, CatalogConstants::WAYFAIR_USA

Instance Attribute Summary

Instance Method Summary

Methods included from CatalogConstants

amazon_catalog?, amazon_seller_catalog?, costco_catalog?, home_depot_catalog?, marketplace_catalog?, price_check_enabled?, vendor_catalog?, walmart_catalog?, wayfair_catalog?

Constructor Details

#initialize(catalog) ⇒ Base

Returns a new instance of Base.



# File 'app/services/retailer/extractors/base.rb', line 19

def initialize(catalog)
  @catalog = catalog
end

Instance Attribute Details

#catalog ⇒ Object (readonly)

Returns the value of attribute catalog.



# File 'app/services/retailer/extractors/base.rb', line 17

def catalog
  @catalog
end

#discovered_url ⇒ String? (readonly)

Returns a discovered direct URL if found during extraction.
Used to capture and store canonical URLs for future direct access.
Override in subclasses that can discover URLs (e.g., from search results).

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 41

def discovered_url
  @discovered_url
end

Instance Method Details

#catalog_base_url ⇒ String? (protected)

Returns the base URL for this catalog's retailer, used to resolve relative links.
The base implementation returns nil; override in subclasses to supply a retailer-specific domain.

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 101

def catalog_base_url
  nil
end

#check_availability(html, unavailable_phrases = []) ⇒ Boolean (protected)

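Checks whether the page indicates that the product is available.
Returns false if the HTML contains any of the supplied phrases or any default phrase
('Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available'). Matching is case-sensitive.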

Parameters:

  • html (String)

    HTML content

  • unavailable_phrases (Array<String>) (defaults to: [])

    Phrases indicating unavailability

Returns:

  • (Boolean)


# File 'app/services/retailer/extractors/base.rb', line 192

def check_availability(html, unavailable_phrases = [])
  default_phrases = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available']
  phrases = unavailable_phrases + default_phrases

  phrases.none? { |phrase| html.include?(phrase) }
end
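
As a rough illustration of the phrase check above, the following plain-Ruby sketch mirrors the same logic outside the class. The `page_available?` name and the sample HTML strings are hypothetical; the default phrases are copied from the method body. Because `String#include?` is case-sensitive, a lowercase "out of stock" is not detected.

```ruby
# Hypothetical standalone version of the availability check (not part of the class).
def page_available?(html, unavailable_phrases = [])
  default_phrases = ['Out of Stock', 'Sold Out', 'Currently Unavailable', 'Not Available']
  phrases = unavailable_phrases + default_phrases

  # Available only when none of the phrases occurs in the raw HTML.
  phrases.none? { |phrase| html.include?(phrase) }
end

in_stock  = page_available?('<button>Add to Cart</button>')          # => true
sold_out  = page_available?('<span>Sold Out</span>')                 # => false
custom    = page_available?('<p>Discontinued</p>', ['Discontinued']) # => false
lowercase = page_available?('<span>out of stock</span>')             # => true (case-sensitive)
```

Subclasses that need case-insensitive matching would have to normalize the HTML and the phrases themselves; the base method does not.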

#collect_product_identifiers(catalog_item) ⇒ Array<String>

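Collects all product identifiers that can be used to validate the page.
Nil, blank, and duplicate values are removed.

Parameters:

  • catalog_item (CatalogItem)

    The catalog item being checked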

Returns:

  • (Array<String>)

    List of identifiers (SKU, UPC, third party number, etc.)



# File 'app/services/retailer/extractors/base.rb', line 246

def collect_product_identifiers(catalog_item)
  identifiers = []

  # Our internal SKU
  identifiers << catalog_item.sku

  # UPC from the parent item
  identifiers << catalog_item.store_item&.item&.upc

  # Third party number (retailer's part number)
  identifiers << catalog_item.third_party_part_number

  # Third party SKU (e.g., Wayfair piid)
  identifiers << catalog_item.third_party_sku

  # Parent SKU (e.g., WRM1245 for Wayfair variants)
  # This is used in search URLs and should appear on the page
  identifiers << catalog_item.parent_sku

  identifiers.compact.reject(&:blank?).uniq
end
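
To make the filtering step concrete, here is a minimal plain-Ruby sketch. The `item_class` Struct is a hypothetical stand-in for the catalog item (the UPC is flattened onto it rather than reached through `store_item.item`), and `to_s.strip.empty?` stands in for ActiveSupport's `blank?`, which is not available outside Rails.

```ruby
# Hypothetical stand-in for a catalog item; the real model has more fields.
item_class = Struct.new(:sku, :upc, :third_party_part_number, :third_party_sku, :parent_sku,
                        keyword_init: true)

item = item_class.new(sku: 'WRM1245-BLK', upc: nil, third_party_part_number: '',
                      third_party_sku: 'WRM1245-BLK', parent_sku: 'WRM1245')

candidates = [item.sku, item.upc, item.third_party_part_number,
              item.third_party_sku, item.parent_sku]

# Plain-Ruby approximation of `.compact.reject(&:blank?).uniq`:
# drop nil and whitespace-only values, then remove duplicates.
identifiers = candidates.reject { |value| value.to_s.strip.empty? }.uniq
# => ["WRM1245-BLK", "WRM1245"]
```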

#extract(check, content) ⇒ void

This method returns an undefined value.

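Extracts data from content and populates the check record.
The base implementation always raises NotImplementedError; subclasses must override it.

Parameters:

  • check (CatalogItemRetailerProbe)

    The probe record to update

  • content (Object)

    Content to extract from (typically an HTML string; see #valid_html?)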

Raises:

  • (NotImplementedError)


# File 'app/services/retailer/extractors/base.rb', line 27

def extract(check, content)
  raise NotImplementedError, 'Subclasses must implement #extract'
end

#extract_canonical_url(doc) ⇒ String? (protected)

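Extracts the canonical URL from the link[rel="canonical"] element, falling back to the og:url meta tag.
Only values that start with http are accepted; otherwise returns nil.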

Parameters:

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 48

def extract_canonical_url(doc)
  # Try rel="canonical" first (most reliable)
  canonical_el = doc.at_css('link[rel="canonical"]')
  if canonical_el
    url = canonical_el['href']
    return url if url.present? && url.start_with?('http')
  end

  # Try og:url meta tag
  og_url = doc.at_css('meta[property="og:url"]')
  if og_url
    url = og_url['content']
    return url if url.present? && url.start_with?('http')
  end

  nil
end

#extract_json_ld_price(check, doc) ⇒ Object (protected)

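Extracts the price from JSON-LD structured data (schema.org) and assigns it to check.price.
When a highPrice greater than the price is also present, assigns it to check.regular_price.
Generally the most reliable method across retailers.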

Parameters:

  • check (CatalogItemRetailerProbe)

    The probe record to update

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document



# File 'app/services/retailer/extractors/base.rb', line 123

def extract_json_ld_price(check, doc)
  doc.css('script[type="application/ld+json"]').each do |script|
    data = JSON.parse(script.text)

    # A script may hold a top-level array of nodes; use the first one with offers
    data = data.find { |item| item.is_a?(Hash) && item['offers'] } if data.is_a?(Array)
    next unless data.is_a?(Hash)

    # Handle @graph structure
    if data['@graph'].is_a?(Array)
      data = data['@graph'].find { |item| item.is_a?(Hash) && item['offers'] } || data
    end

    offers = data['offers']

    # Handle array of offers
    offers = offers.first if offers.is_a?(Array)
    next unless offers.is_a?(Hash)

    price = offers['price']
    check.price = extract_numeric_price(price) if price

    # Also try to get regular/high price
    high_price = offers['highPrice'] || data['highPrice']
    if high_price && check.price.present?
      regular = extract_numeric_price(high_price)
      check.regular_price = regular if regular && regular > check.price
    end

    break if check.price.present?
  rescue JSON::ParserError
    next
  end
end
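
The lookup path through a JSON-LD document can be sketched with the standard library alone. The payload below is hypothetical, and the sketch skips the Nokogiri step that locates the script tag; it only shows how a node with offers is picked out of an @graph array and how the first offer's price is read.

```ruby
require 'json'

# Hypothetical JSON-LD payload of the kind found in a
# <script type="application/ld+json"> tag.
json_ld = <<~JSON
  {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "BreadcrumbList" },
      { "@type": "Product", "name": "Example Vanity",
        "offers": [{ "@type": "Offer", "price": "249.99", "priceCurrency": "USD" }] }
    ]
  }
JSON

data = JSON.parse(json_ld)

# Pick the first @graph node that carries offers, falling back to the top-level object.
graph = data['@graph']
node = graph.is_a?(Array) ? (graph.find { |n| n['offers'] } || data) : data

offers = node['offers']
offers = offers.first if offers.is_a?(Array) # use the first offer when several are listed

price = offers && offers['price'].to_f
# => 249.99
```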

#extract_numeric_price(text) ⇒ Float? (protected)

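Extracts a numeric price from text.
For non-numeric input, every character except digits and '.' is stripped before conversion.
Returns nil unless the result passes #valid_price?.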

Parameters:

  • text (String, Numeric)

    Price text or number

Returns:

  • (Float, nil)


# File 'app/services/retailer/extractors/base.rb', line 155

def extract_numeric_price(text)
  price_val = if text.is_a?(Numeric)
                text.to_f
              else
                text.to_s.delete('^0-9.').to_f
              end
  price_val if valid_price?(price_val)
end
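
The following sketch copies the two price helpers into plain Ruby so their behavior on a few inputs can be checked in isolation. `!price.nil?` stands in for ActiveSupport's `present?`; the sample strings are hypothetical.

```ruby
# Standalone copies of the two helpers, for experimentation only.
def valid_price?(price)
  !price.nil? && price > 1 && price < 100_000
end

def extract_numeric_price(text)
  price_val = text.is_a?(Numeric) ? text.to_f : text.to_s.delete('^0-9.').to_f
  price_val if valid_price?(price_val)
end

with_separator = extract_numeric_price('$1,299.99')      # => 1299.99
numeric_input  = extract_numeric_price(45)               # => 45.0
below_minimum  = extract_numeric_price('$0.99')          # => nil (not greater than 1)
no_digits      = extract_numeric_price('Call for price') # => nil (parses as 0.0)

# Caveat: a price range collapses into one digit string ("19.9929.99"),
# so the result is neither of the two prices.
price_range = extract_numeric_price('$19.99 - $29.99')
```

Text that may contain ranges or European-style separators (for example "1.299,99") would need normalizing before it reaches this helper.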

#extract_price_from_selectors(check, doc, selectors) ⇒ Object (protected)

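Extracts the price using a list of CSS selectors.
Tries each selector in order and sets check.price from the first matching element that yields a valid price,
reading the element's content attribute before falling back to its text.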

Parameters:

  • check (CatalogItemRetailerProbe)

    The probe record to update

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

  • selectors (Array<String>)

    CSS selectors to try



# File 'app/services/retailer/extractors/base.rb', line 175

def extract_price_from_selectors(check, doc, selectors)
  selectors.each do |selector|
    el = doc.at_css(selector)
    next unless el

    price_val = extract_numeric_price(el['content'] || el.text)
    if valid_price?(price_val)
      check.price = price_val
      break
    end
  end
end

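#extract_product_link_from_search(doc, selectors) ⇒ String? (protected)

Extracts a product link from a search results page.
Tries each selector in order and returns the first non-blank href as an absolute URL, or nil if none matches.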

Parameters:

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

  • selectors (Array<String>)

    CSS selectors for product links

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 70

def extract_product_link_from_search(doc, selectors)
  selectors.each do |selector|
    link = doc.at_css(selector)
    next unless link

    href = link['href']
    next if href.blank?

    # Make absolute URL if relative (href is known to be non-blank here)
    return make_absolute_url(href)
  end
  nil
end

#extract_title(doc) ⇒ String? (protected)

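Extracts the product title from the first h1 element, falling back to [data-testid="product-title"].
The text is stripped and truncated to 255 characters.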

Parameters:

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

Returns:

  • (String, nil)


# File 'app/services/retailer/extractors/base.rb', line 202

def extract_title(doc)
  title_el = doc.at_css('h1') || doc.at_css('[data-testid="product-title"]')
  title_el&.text&.strip&.truncate(255)
end

#make_absolute_url(href) ⇒ String (protected)

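Converts a relative URL to an absolute one using #catalog_base_url.
Returns href unchanged when it already starts with http, when no base URL is defined, or when the URL cannot be parsed.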

Parameters:

  • href (String)

    Relative or absolute URL

Returns:

  • (String)


# File 'app/services/retailer/extractors/base.rb', line 87

def make_absolute_url(href)
  return href if href.start_with?('http')

  base_url = catalog_base_url
  return href unless base_url

  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href
end

#parse_html(html) ⇒ Nokogiri::HTML::Document (protected)

Parse HTML content with Nokogiri

Parameters:

  • html (String)

    HTML content

Returns:

  • (Nokogiri::HTML::Document)


# File 'app/services/retailer/extractors/base.rb', line 108

def parse_html(html)
  Nokogiri::HTML(html)
end

#source_name ⇒ String

Identifier for this extractor (used in check.scraper_source)

Returns:

  • (String)


# File 'app/services/retailer/extractors/base.rb', line 33

def source_name
  self.class.name.demodulize.underscore
end
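
For a sense of what the resulting identifier looks like, the sketch below approximates `demodulize.underscore` in plain Ruby. This is not ActiveSupport's implementation (it ignores acronyms and other inflection rules), and `approximate_source_name` and the `HomeDepot` class name are hypothetical.

```ruby
# Rough plain-Ruby approximation of ActiveSupport's `demodulize.underscore`.
def approximate_source_name(class_name)
  class_name.split('::').last              # drop the module path
            .gsub(/([a-z\d])([A-Z])/, '\1_\2') # insert "_" at lower-to-upper boundaries
            .downcase
end

amazon_name     = approximate_source_name('Retailer::Extractors::Amazon')    # => "amazon"
home_depot_name = approximate_source_name('Retailer::Extractors::HomeDepot') # => "home_depot"
```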

#valid_html?(content) ⇒ Boolean (protected)

Validates that content is a non-blank String.
Does not check that the string is well-formed HTML.

Parameters:

  • content (Object)

Returns:

  • (Boolean)


# File 'app/services/retailer/extractors/base.rb', line 115

def valid_html?(content)
  content.is_a?(String) && content.present?
end

#valid_price?(price) ⇒ Boolean (protected)

Validates that a price is reasonable: present, greater than 1, and less than 100,000 (both bounds exclusive).

Parameters:

  • price (Float)

    Price value

Returns:

  • (Boolean)


# File 'app/services/retailer/extractors/base.rb', line 167

def valid_price?(price)
  price.present? && price > 1 && price < 100_000
end

#validate_product_identity(check, content, catalog_item) ⇒ Boolean

Validate that the scraped page actually contains our product identifiers.
This prevents false positives where a retailer redirects to a different product.

Parameters:

  • check (CatalogItemRetailerProbe)

    The probe record to update

  • content (String)

    HTML content (or URL for URL-based validation)

  • catalog_item (CatalogItem)

    The catalog item being checked

Returns:

  • (Boolean)

    true if validation passed, false if product mismatch detected



# File 'app/services/retailer/extractors/base.rb', line 219

def validate_product_identity(check, content, catalog_item)
  identifiers = collect_product_identifiers(catalog_item)
  return true if identifiers.empty? # Skip validation if no identifiers available

  # Check if ANY of our identifiers appear in the page content or URL
  content_to_check = content.to_s.downcase
  url_to_check = check.url.to_s.downcase

  found = identifiers.any? do |identifier|
    next false if identifier.blank?

    normalized = identifier.to_s.downcase.strip
    content_to_check.include?(normalized) || url_to_check.include?(normalized)
  end

  unless found
    check.status = 'product_mismatch'
    check.error_message = "Product identity validation failed: none of our identifiers (#{identifiers.compact.join(', ')}) found on page"
    Rails.logger.warn "[#{source_name}] Product mismatch for catalog_item #{catalog_item.id}: #{check.error_message}"
  end

  found
end
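
The matching rule can be tried out in isolation with the plain-Ruby sketch below. The identifiers, page content, and URL are hypothetical; the comparison logic follows the method body (case-insensitive substring match against the content or the URL).

```ruby
# Hypothetical identifiers, page content, and URL for illustration.
identifiers = ['WRM1245', '012345678905']
content     = '<h1>Vanity Set</h1><span class="sku">SKU: wrm1245</span>'
url         = 'https://www.example.com/pdp/vanity-set'

page = content.downcase
link = url.downcase

# A page passes when ANY identifier occurs in the content or the URL, ignoring case.
found = identifiers.any? do |identifier|
  normalized = identifier.to_s.downcase.strip
  page.include?(normalized) || link.include?(normalized)
end
# => true ("wrm1245" appears in the page content)

# Caveat of substring matching: a short identifier can match unrelated text.
short_match = 'item 31245 in stock'.include?('124')
# => true
```

Because any single match is enough, very short identifiers make the check more permissive than the longer ones.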