Class: Retailer::Extractors::Wayfair

Inherits: Base

Inheritance chain: Retailer::Extractors::Wayfair < Base < Object

Defined in: app/services/retailer/extractors/wayfair.rb

Overview

Wayfair data extractor.
Uses data-test-id attributes for reliable price extraction.

Wayfair Variant Handling:
When searching by internal SKU (e.g., TCT240-3.7W-749-FS), Wayfair redirects
to the parent product page with URL params like ?redir=SKU&piid=123,456.
The page initially shows the LOWEST variant price, then JavaScript updates
the selection based on URL parameters. We use browser_instructions to wait
for the variant selection to complete before extracting the price.
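To make the redirect parameters concrete, here is a minimal sketch that pulls `redir` and `piid` out of a redirected URL. The helper name `wayfair_variant_params` is hypothetical and not part of the extractor, and the host and path in the example are placeholders; only the query string mirrors the example above.

```ruby
require 'uri'

# Hypothetical helper (not part of the extractor): read the variant-related
# query params from a redirected product URL.
def wayfair_variant_params(url)
  query = URI.parse(url).query.to_s
  params = URI.decode_www_form(query).to_h
  {
    redir: params['redir'],                     # the SKU that was searched for
    piids: params.fetch('piid', '').split(',')  # selected option ids, if any
  }
end

# Placeholder host/path; the query string follows the ?redir=SKU&piid=123,456 shape.
params = wayfair_variant_params(
  'https://www.example.com/pdp/item.html?redir=TCT240-3.7W-749-FS&piid=123,456'
)
puts params.inspect
```

A URL without a query string yields a nil `redir` and an empty `piids` list, which is a cheap signal that no variant was preselected.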

Constant Summary

RENDER_REQUIRED =
Wayfair pricing is JS-driven (variant selection via URL params runs after initial page load). browser_instructions below also assume rendering, so this MUST stay true.

true

Class Method Summary

.browser_instructions ⇒ Array<Hash>
Browser instructions to wait for Wayfair's variant selection to complete.
.build_payload(url:, geo_location: nil) ⇒ Hash
Build Oxylabs payload for Wayfair product scraping. Uses 'universal' source with JS rendering and browser_instructions to wait for variant-specific pricing to load.

Instance Method Summary

#extract(check, content) ⇒ Object
#validate_product_identity(check, content, catalog_item) ⇒ Boolean
Wayfair never shows our manufacturer SKU, so the base identity check (which looks for our SKU/UPC on the page) always fails for it.

Class Method Details

.browser_instructions ⇒ `Array<Hash>`

Browser instructions to wait for Wayfair's variant selection to complete.
Wayfair uses JavaScript to update pricing based on URL params (redir, piid).
We wait for the price element to stabilize after redirect/variant selection.
Reference: https://github.com/oxylabs/how-to-scrape-wayfair

Returns:

(Array<Hash>) —
Oxylabs browser instructions

# File 'app/services/retailer/extractors/wayfair.rb', line 47

def self.browser_instructions
  [
    # Wait for initial page load and price element to appear
    # Primary selector from data-test-id (most reliable)
    {
      type: 'wait_for_element',
      selector: {
        type: 'css',
        value: '[data-test-id="PriceDisplay"]'
      },
      timeout_s: 10
    },
    # Additional wait for variant selection JavaScript to complete
    # Wayfair's redirect/variant selection takes ~2-5 seconds
    { type: 'wait', wait_time_s: 5 }
  ]
end

.build_payload(url:, geo_location: nil) ⇒ `Hash`

Build Oxylabs payload for Wayfair product scraping.
Uses 'universal' source with JS rendering and browser_instructions
to wait for variant-specific pricing to load.
Reference: https://github.com/oxylabs/how-to-scrape-wayfair

Parameters:

url (String) —
Full product URL
geo_location (String, nil) (defaults to: nil) —
Country for pricing (default: United States)

Returns:

(Hash) —
Oxylabs API payload

# File 'app/services/retailer/extractors/wayfair.rb', line 27

def self.build_payload(url:, geo_location: nil)
  {
    source: 'universal',
    url: url,
    render: render_value,
    user_agent_type: 'desktop_safari',
    geo_location: geo_location || 'United States',
    context: [
      { key: 'follow_redirects', value: true }
    ],
    browser_instructions: browser_instructions
  }.compact
end
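The effect of the `geo_location` fallback and the trailing `.compact` can be illustrated with a reduced, standalone sketch. It is not the extractor's code: `sketch_render_value` stands in for the inherited `render_value`, and the assumption that it returns `'html'` when rendering is required and `nil` otherwise is a guess, not something this page confirms. The real payload also carries `user_agent_type`, `context`, and `browser_instructions`, which are left out here.

```ruby
require 'json'

# Stand-in for the inherited render_value.
# ASSUMPTION: returns 'html' when rendering is required, nil otherwise.
def sketch_render_value(render_required)
  render_required ? 'html' : nil
end

# Reduced version of build_payload, keeping only the keys needed to show
# the default geo_location and how compact drops nil values.
def sketch_payload(url:, geo_location: nil, render_required: true)
  {
    source: 'universal',
    url: url,
    render: sketch_render_value(render_required),
    geo_location: geo_location || 'United States'
  }.compact
end

puts JSON.pretty_generate(sketch_payload(url: 'https://www.example.com/pdp/item.html'))
```

Because `geo_location` always falls back to a string, `.compact` can only ever remove `render` in this sketch; with `render_required: false` the key disappears from the payload instead of being sent as null.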

Instance Method Details

#extract(check, content) ⇒ `Object`

# File 'app/services/retailer/extractors/wayfair.rb', line 65

def extract(check, content)
  return unless valid_html?(content)

  check.scraper_source = source_name
  check.currency = catalog.id == WAYFAIR_CANADA ? 'CAD' : 'USD'

  doc = parse_html(content)

  # Check availability
  check.product_available = doc.at_css('[data-test-id="AddToCartButton"]').present? ||
                            content.exclude?('Out of Stock')

  # IMPORTANT: Scope price extraction to the main product pricing section only.
  # Wayfair's "Compare Similar Items" carousel + sponsored ads reuse the same
  # PriceDisplay markup, so anything outside this container risks a wrong price.
  pricing_section = find_main_pricing_section(doc)
  if pricing_section
    # Sale price: data-test-id="StandardPricingPrice-SALE" (when on sale)
    # Primary price: data-test-id="StandardPricingPrice-PRIMARY" (otherwise)
    extract_current_price(check, pricing_section)

    # Original/was price: data-test-id="StandardPricingPrice-PREVIOUS"
    extract_previous_price(check, pricing_section)

    # Fallback: first PriceDisplay WITHIN the main pricing section
    extract_fallback_prices(check, pricing_section) if check.price.blank?
  end

  # Last fallback: JSON-LD schema.org offers (structured, main-product scoped)
  extract_json_ld_price(check, doc) if check.price.blank?
end
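The last fallback in the method above reads the price from schema.org JSON-LD. The helper `extract_json_ld_price` is not shown on this page, so the following is only a sketch of what such a fallback could look like, using the standard library and a regex instead of the HTML parser. The name `sketch_json_ld_price` and the choice to return a BigDecimal are assumptions made for illustration.

```ruby
require 'json'
require 'bigdecimal'

# Hypothetical sketch of a JSON-LD price fallback (not the extractor's helper).
# Scans ld+json script blocks and returns the first Product offer price found.
def sketch_json_ld_price(html)
  pattern = %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi
  html.to_s.scan(pattern).each do |(raw)|
    begin
      data = JSON.parse(raw)
    rescue JSON::ParserError
      next # skip malformed blocks rather than failing the whole extraction
    end

    nodes = data.is_a?(Array) ? data : [data]
    nodes.each do |node|
      next unless node.is_a?(Hash) && node['@type'] == 'Product'

      offers = node['offers']
      offer = offers.is_a?(Array) ? offers.first : offers
      price = offer.is_a?(Hash) ? offer['price'] : nil
      # BigDecimal raises ArgumentError if the price is not numeric.
      return BigDecimal(price.to_s) unless price.nil?
    end
  end
  nil
end

html = '<script type="application/ld+json">' \
       '{"@type":"Product","offers":{"price":"199.99","priceCurrency":"USD"}}' \
       '</script>'
puts sketch_json_ld_price(html).to_s('F')
```

Restricting the match to nodes typed `Product` is what keeps this fallback scoped to the main product in the sketch; a page that embeds several Product nodes would need a stricter match.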

#validate_product_identity(check, content, catalog_item) ⇒ `Boolean`

Wayfair never shows our manufacturer SKU, so the base identity check (which
looks for our SKU/UPC on the page) always fails for it. The stored URL is the
Wayfair Catalog API's canonical PDP, so identity is confirmed by the page
carrying the URL's Wayfair SKU — which also catches a redirect to a different
product. Falls back to the base check when the URL has no Wayfair SKU.

Parameters:

check (CatalogItemRetailerProbe)
content (String) —
page HTML
catalog_item (CatalogItem)

Returns:

(Boolean)

# File 'app/services/retailer/extractors/wayfair.rb', line 107

def validate_product_identity(check, content, catalog_item)
  wf_sku = wayfair_sku_from_url(check.url)
  return super if wf_sku.blank?
  # Match case-insensitively: the URL may carry a lowercase SKU (slugged
  # redirect target) while the page prints it upper-cased, or vice versa.
  return true if content.to_s.match?(/#{Regexp.escape(wf_sku)}/i)

  check.status = 'product_mismatch'
  check.error_message = "Wayfair SKU #{wf_sku} (from URL) not found on page"
  Rails.logger.warn "[#{source_name}] Wayfair mismatch for catalog_item #{catalog_item.id}: #{check.error_message}"
  false
end
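The helper `wayfair_sku_from_url` is not shown on this page. As a rough sketch of the identity check, the code below assumes the product URL ends in `-<sku>.html`, where the SKU is letters followed by digits; that URL shape, the helper names, and the placeholder host are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for wayfair_sku_from_url.
# ASSUMPTION: the PDP path ends with "-<sku>.html", e.g. "...-w001234567.html".
def sketch_sku_from_url(url)
  url.to_s[/-([a-z]+\d+)\.html(?:\?|\z)/i, 1]
end

# Case-insensitive presence check, mirroring the match in the method above.
def sketch_sku_on_page?(content, sku)
  content.to_s.match?(/#{Regexp.escape(sku)}/i)
end

url  = 'https://www.example.com/furniture/pdp/some-chair-w001234567.html?piid=1'
page = '<span>SKU: W001234567</span>'

sku = sketch_sku_from_url(url)
puts sku.inspect
puts sketch_sku_on_page?(page, sku)
```

When the URL has no recognizable SKU the sketch returns nil, which corresponds to the branch where the real method defers to the base check via `super`.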
