Class: Retailer::Extractors::Wayfair

Inherits:
Base
  • Object
Defined in:
app/services/retailer/extractors/wayfair.rb

Overview

Wayfair data extractor.
Uses data-test-id attributes for reliable price extraction.

Wayfair Variant Handling:
When searching by internal SKU (e.g., TCT240-3.7W-749-FS), Wayfair redirects
to the parent product page with URL params like ?redir=SKU&piid=123,456.
The page initially shows the LOWEST variant price, then JavaScript updates
the selection based on URL parameters. We use browser_instructions to wait
for the variant selection to complete before extracting the price.
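
To make the redirect parameters concrete, here is a minimal sketch of reading `redir` and `piid` out of a redirected URL. `variant_params` is a hypothetical helper, not part of this extractor, and the URL path is a made-up placeholder; only the `redir` and `piid` parameter names come from the description above.

```ruby
require 'uri'

# Hypothetical helper (not part of the extractor): read the variant hints
# that Wayfair appends to the parent product URL after a SKU redirect.
def variant_params(url)
  query  = URI.parse(url).query.to_s          # nil query becomes ''
  params = URI.decode_www_form(query).to_h    # {'redir' => ..., 'piid' => ...}
  {
    redir: params['redir'],                    # the SKU that was searched for
    piids: params.fetch('piid', '').split(',') # selected option ids, if any
  }
end

# Placeholder URL for illustration only.
result = variant_params(
  'https://www.wayfair.com/pdp/example.html?redir=TCT240-3.7W-749-FS&piid=123,456'
)
puts result.inspect
```

A URL without these parameters yields `redir: nil` and an empty `piids` list, which is one way a caller could tell that no variant redirect took place.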

Constant Summary

RENDER_REQUIRED =

Wayfair pricing is JS-driven (variant selection via URL params runs after
initial page load). browser_instructions below also assume rendering, so
this MUST stay true.

true

Class Method Summary

Instance Method Summary

Class Method Details

.browser_instructions ⇒ Array<Hash>

Browser instructions to wait for Wayfair's variant selection to complete.
Wayfair uses JavaScript to update pricing based on URL params (redir, piid).
We wait for the price element to stabilize after redirect/variant selection.
Reference: https://github.com/oxylabs/how-to-scrape-wayfair

Returns:

  • (Array<Hash>)

    Oxylabs browser instructions



# File 'app/services/retailer/extractors/wayfair.rb', line 47

def self.browser_instructions
  [
    # Wait for initial page load and price element to appear
    # Primary selector from data-test-id (most reliable)
    {
      type: 'wait_for_element',
      selector: {
        type: 'css',
        value: '[data-test-id="PriceDisplay"]'
      },
      timeout_s: 10
    },
    # Additional wait for variant selection JavaScript to complete
    # Wayfair's redirect/variant selection takes ~2-5 seconds
    { type: 'wait', wait_time_s: 5 }
  ]
end

.build_payload(url:, geo_location: nil) ⇒ Hash

Builds the Oxylabs payload for Wayfair product scraping.
Uses the 'universal' source with JS rendering and browser_instructions
to wait for variant-specific pricing to load.
Reference: https://github.com/oxylabs/how-to-scrape-wayfair

Parameters:

  • url (String)

    Full product URL

  • geo_location (String, nil) (defaults to: nil)

    Country for pricing; falls back to 'United States' when nil

Returns:

  • (Hash)

    Oxylabs API payload



# File 'app/services/retailer/extractors/wayfair.rb', line 27

def self.build_payload(url:, geo_location: nil)
  {
    source: 'universal',
    url: url,
    render: render_value,
    user_agent_type: 'desktop_safari',
    geo_location: geo_location || 'United States',
    context: [
      { key: 'follow_redirects', value: true }
    ],
    browser_instructions: browser_instructions
  }.compact
end
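
The effect of the `geo_location` fallback and the trailing `.compact` can be seen in a reduced sketch. `sketch_payload` is a hypothetical stand-in that keeps only a few of the keys above, and the `'html'` render value is an assumption: the real value comes from the inherited `render_value` helper, which is not shown on this page.

```ruby
# Hypothetical, reduced stand-in for build_payload. The `render:` argument
# replaces the inherited render_value helper, whose definition is not shown;
# 'html' is an assumed value used only for illustration.
def sketch_payload(url:, geo_location: nil, render: 'html')
  {
    source: 'universal',
    url: url,
    render: render,
    geo_location: geo_location || 'United States' # nil falls back to the US
  }.compact                                        # drops keys whose value is nil
end

url = 'https://www.wayfair.com/pdp/example.html'   # placeholder URL

default_payload = sketch_payload(url: url)
canada_payload  = sketch_payload(url: url, geo_location: 'Canada')
no_render       = sketch_payload(url: url, render: nil)

puts default_payload.inspect
puts canada_payload.inspect
puts no_render.inspect
```

Because `geo_location` always receives a fallback, `.compact` can only ever remove keys such as `render` whose value may be nil.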

Instance Method Details

#extract(check, content) ⇒ Object



# File 'app/services/retailer/extractors/wayfair.rb', line 65

def extract(check, content)
  return unless valid_html?(content)

  check.scraper_source = source_name
  check.currency = catalog.id == WAYFAIR_CANADA ? 'CAD' : 'USD'

  doc = parse_html(content)

  # Check availability
  check.product_available = doc.at_css('[data-test-id="AddToCartButton"]').present? ||
                            content.exclude?('Out of Stock')

  # IMPORTANT: Scope price extraction to the main product pricing section only.
  # Wayfair's "Compare Similar Items" carousel + sponsored ads reuse the same
  # PriceDisplay markup, so anything outside this container risks a wrong price.
  pricing_section = find_main_pricing_section(doc)
  if pricing_section
    # Sale price: data-test-id="StandardPricingPrice-SALE" (when on sale)
    # Primary price: data-test-id="StandardPricingPrice-PRIMARY" (otherwise)
    extract_current_price(check, pricing_section)

    # Original/was price: data-test-id="StandardPricingPrice-PREVIOUS"
    extract_previous_price(check, pricing_section)

    # Fallback: first PriceDisplay WITHIN the main pricing section
    extract_fallback_prices(check, pricing_section) if check.price.blank?
  end

  # Last fallback: JSON-LD schema.org offers (structured, main-product scoped)
  extract_json_ld_price(check, doc) if check.price.blank?
end
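
The last fallback reads the price from JSON-LD `offers` data. The helper `extract_json_ld_price` is not shown on this page, so the following is only a sketch of the general idea under stated assumptions: `json_ld_offer_price` is a hypothetical name, it takes raw HTML rather than the parsed `doc`, and it locates script tags with a regular expression purely to stay dependency-free. The real method may differ in all of these respects.

```ruby
require 'json'

# Hypothetical sketch of a JSON-LD price lookup. Returns the first numeric
# offer price found on a schema.org Product node, or nil if there is none.
def json_ld_offer_price(html)
  pattern = %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi
  html.to_s.scan(pattern).flatten.each do |body|
    data = begin
      JSON.parse(body)
    rescue JSON::ParserError
      next # skip malformed blocks rather than failing the whole check
    end

    nodes = data.is_a?(Array) ? data : [data]
    nodes.each do |node|
      next unless node.is_a?(Hash) && node['@type'] == 'Product'

      offers = node['offers']
      offers = [offers] unless offers.is_a?(Array)
      offers.each do |offer|
        next unless offer.is_a?(Hash)

        price = Float(offer['price'].to_s, exception: false)
        return price if price
      end
    end
  end
  nil
end

sample = <<~HTML
  <script type="application/ld+json">
    {"@type":"Product","offers":{"@type":"Offer","price":"499.99","priceCurrency":"USD"}}
  </script>
HTML

puts json_ld_offer_price(sample).inspect
```

Only `Product` nodes are considered here, which mirrors the "main-product scoped" intent noted in the source comment.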

#validate_product_identity(check, content, catalog_item) ⇒ Boolean

Wayfair never shows our manufacturer SKU, so the base identity check (which
looks for our SKU/UPC on the page) always fails for it. The stored URL is the
Wayfair Catalog API's canonical PDP, so identity is confirmed by the page
carrying the URL's Wayfair SKU — which also catches a redirect to a different
product. Falls back to the base check when the URL has no Wayfair SKU.

Parameters:

Returns:

  • (Boolean)


# File 'app/services/retailer/extractors/wayfair.rb', line 107

def validate_product_identity(check, content, catalog_item)
  wf_sku = wayfair_sku_from_url(check.url)
  return super if wf_sku.blank?
  # Match case-insensitively: the URL may carry a lowercase SKU (slugged
  # redirect target) while the page prints it upper-cased, or vice versa.
  return true if content.to_s.match?(/#{Regexp.escape(wf_sku)}/i)

  check.status = 'product_mismatch'
  check.error_message = "Wayfair SKU #{wf_sku} (from URL) not found on page"
  Rails.logger.warn "[#{source_name}] Wayfair mismatch for catalog_item #{catalog_item.id}: #{check.error_message}"
  false
end
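
The match above relies on two details: `Regexp.escape` keeps characters such as `.` in a SKU literal, and the `i` flag ignores case. A minimal sketch of both follows; `sku_on_page?` is a hypothetical wrapper around the same expression, and the SKU values are made up for illustration.

```ruby
# Hypothetical wrapper around the matching expression used above.
def sku_on_page?(content, wf_sku)
  content.to_s.match?(/#{Regexp.escape(wf_sku)}/i)
end

# Made-up SKU values for illustration only.
same_sku_other_case = sku_on_page?('<span>SKU: W100.5-AB</span>', 'w100.5-ab')
different_sku       = sku_on_page?('<span>SKU: W10055-AB</span>', 'w100.5-ab')

# Without escaping, the "." would match any character and accept the wrong SKU.
unescaped_match = 'W10055-AB'.match?(/w100.5-ab/i)

puts same_sku_other_case # true: case differences are ignored
puts different_sku       # false: the escaped "." must match a literal dot
puts unescaped_match     # true: the false positive that escaping prevents
```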