Class: Retailer::Extractors::Wayfair
- Inherits: Base
  - Object
  - Base
  - Retailer::Extractors::Wayfair
- Defined in: app/services/retailer/extractors/wayfair.rb
Overview
Wayfair data extractor.
Uses data-test-id attributes for reliable price extraction.
Wayfair Variant Handling:
When searching by internal SKU (e.g., TCT240-3.7W-749-FS), Wayfair redirects
to the parent product page with URL params like ?redir=SKU&piid=123,456.
The page initially shows the LOWEST variant price, then JavaScript updates
the selection based on URL parameters. We use browser_instructions to wait
for the variant selection to complete before extracting the price.
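To make the redirect parameters concrete, here is a minimal sketch of reading them from a redirected URL. The helper name `wayfair_variant_params` and the example URL path are hypothetical and not part of the extractor; only the `redir` and `piid` parameter names come from the description above.

```ruby
require 'uri'

# Hypothetical helper (not part of the extractor): read the variant hints
# that Wayfair appends to the parent product URL after a SKU search.
def wayfair_variant_params(url)
  query  = URI.parse(url).query.to_s
  params = URI.decode_www_form(query).to_h
  {
    redir: params['redir'],                     # the SKU that was searched for
    piids: params.fetch('piid', '').split(',')  # selected option ids, if any
  }
end

# Example URL path is made up for illustration.
url = 'https://www.wayfair.com/example/pdp/item.html?redir=TCT240-3.7W-749-FS&piid=123,456'
variant = wayfair_variant_params(url)
puts variant[:redir]
puts variant[:piids].inspect
```

A URL without a query string yields a nil `redir` and an empty `piids` list, which may be a useful signal that no variant was pre-selected.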
Constant Summary
- RENDER_REQUIRED = true
  Wayfair pricing is JS-driven (variant selection via URL params runs after
  initial page load). browser_instructions below also assume rendering, so
  this MUST stay true.
Class Method Summary
- .browser_instructions ⇒ Array<Hash>
  Browser instructions to wait for Wayfair's variant selection to complete.
- .build_payload(url:, geo_location: nil) ⇒ Hash
  Build Oxylabs payload for Wayfair product scraping. Uses 'universal' source with JS rendering and browser_instructions to wait for variant-specific pricing to load.
Instance Method Summary
- #extract(check, content) ⇒ Object
- #validate_product_identity(check, content, catalog_item) ⇒ Boolean
  Wayfair never shows our manufacturer SKU, so the base identity check (which looks for our SKU/UPC on the page) always fails for it.
Class Method Details
.browser_instructions ⇒ Array<Hash>
Browser instructions to wait for Wayfair's variant selection to complete.
Wayfair uses JavaScript to update pricing based on URL params (redir, piid).
We wait for the price element to stabilize after redirect/variant selection.
Reference: https://github.com/oxylabs/how-to-scrape-wayfair
    # File 'app/services/retailer/extractors/wayfair.rb', line 47

    def self.browser_instructions
      [
        # Wait for initial page load and price element to appear
        # Primary selector from data-test-id (most reliable)
        {
          type: 'wait_for_element',
          selector: {
            type: 'css',
            value: '[data-test-id="PriceDisplay"]'
          },
          timeout_s: 10
        },
        # Additional wait for variant selection JavaScript to complete
        # Wayfair's redirect/variant selection takes ~2-5 seconds
        {
          type: 'wait',
          wait_time_s: 5
        }
      ]
    end
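The two steps above add rendering time to every request. As a rough sketch, the upper bound can be computed from the instruction list itself. The helper `max_wait_seconds` is hypothetical, and the sketch assumes the element wait returns early once the selector matches, so the real delay would usually be shorter than the bound.

```ruby
# Hypothetical helper: upper bound, in seconds, that a list of browser
# instructions can block before the page is captured.
def max_wait_seconds(instructions)
  instructions.sum do |step|
    step.fetch(:timeout_s, 0) + step.fetch(:wait_time_s, 0)
  end
end

# Same shape as the array returned by .browser_instructions.
instructions = [
  {
    type: 'wait_for_element',
    selector: { type: 'css', value: '[data-test-id="PriceDisplay"]' },
    timeout_s: 10
  },
  { type: 'wait', wait_time_s: 5 }
]

puts max_wait_seconds(instructions) # => 15
```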
.build_payload(url:, geo_location: nil) ⇒ Hash
Build Oxylabs payload for Wayfair product scraping.
Uses 'universal' source with JS rendering and browser_instructions
to wait for variant-specific pricing to load.
Reference: https://github.com/oxylabs/how-to-scrape-wayfair
    # File 'app/services/retailer/extractors/wayfair.rb', line 27

    def self.build_payload(url:, geo_location: nil)
      {
        source: 'universal',
        url: url,
        render: render_value,
        user_agent_type: 'desktop_safari',
        geo_location: geo_location || 'United States',
        context: [
          { key: 'follow_redirects', value: true }
        ],
        browser_instructions: browser_instructions
      }.compact
    end
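The defaulting and `.compact` behavior can be illustrated in isolation. In this sketch `sketch_payload` is a hypothetical stand-in for the class method, and its `render` keyword replaces `render_value`, whose definition is not shown on this page; the `'html'` default is an assumption.

```ruby
# Hypothetical stand-in for .build_payload, reduced to the keys that show
# the defaulting and nil-dropping behavior.
def sketch_payload(url:, geo_location: nil, render: 'html')
  {
    source: 'universal',
    url: url,
    render: render,                                # assumed value; see lead-in
    geo_location: geo_location || 'United States'  # nil falls back to the default
  }.compact                                        # drops keys whose value is nil
end

default_geo = sketch_payload(url: 'https://www.wayfair.com/')
canada      = sketch_payload(url: 'https://www.wayfair.ca/', geo_location: 'Canada')
no_render   = sketch_payload(url: 'https://www.wayfair.com/', render: nil)

puts default_geo[:geo_location]  # United States
puts canada[:geo_location]       # Canada
puts no_render.key?(:render)     # false
```

Because `||` only replaces `nil` and `false`, an empty-string `geo_location` would pass through unchanged, so callers would need to pass `nil` to get the default.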
Instance Method Details
#extract(check, content) ⇒ Object
    # File 'app/services/retailer/extractors/wayfair.rb', line 65

    def extract(check, content)
      return unless valid_html?(content)

      check.scraper_source = source_name
      check.currency = catalog.id == WAYFAIR_CANADA ? 'CAD' : 'USD'

      doc = parse_html(content)

      # Check availability
      check.product_available = doc.at_css('[data-test-id="AddToCartButton"]').present? ||
                                content.exclude?('Out of Stock')

      # IMPORTANT: Scope price extraction to the main product pricing section only.
      # Wayfair's "Compare Similar Items" carousel + sponsored ads reuse the same
      # PriceDisplay markup, so anything outside this container risks a wrong price.
      pricing_section = find_main_pricing_section(doc)

      if pricing_section
        # Sale price: data-test-id="StandardPricingPrice-SALE" (when on sale)
        # Primary price: data-test-id="StandardPricingPrice-PRIMARY" (otherwise)
        extract_current_price(check, pricing_section)

        # Original/was price: data-test-id="StandardPricingPrice-PREVIOUS"
        extract_previous_price(check, pricing_section)

        # Fallback: first PriceDisplay WITHIN the main pricing section
        extract_fallback_prices(check, pricing_section) if check.price.blank?
      end

      # Last fallback: JSON-LD schema.org offers (structured, main-product scoped)
      extract_json_ld_price(check, doc) if check.price.blank?
    end
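The last fallback reads the price from JSON-LD. The body of `extract_json_ld_price` is not shown on this page, so the following is only a sketch of the general idea using the standard library: the names `LD_JSON_SCRIPT`, `parse_json_ld`, and `json_ld_price` are hypothetical, and the sample markup is invented for illustration. The real method receives a parsed document rather than a raw string.

```ruby
require 'json'

# Matches the body of each <script type="application/ld+json"> block.
LD_JSON_SCRIPT = %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi

# Parse one JSON-LD body into a list of nodes; malformed JSON yields [].
def parse_json_ld(body)
  data = JSON.parse(body)
  data.is_a?(Array) ? data : [data]
rescue JSON::ParserError
  []
end

# Return the first Product offer price found, or nil when there is none.
def json_ld_price(html)
  html.scan(LD_JSON_SCRIPT).flatten.each do |body|
    parse_json_ld(body).each do |node|
      next unless node.is_a?(Hash) && node['@type'] == 'Product'

      offers = node['offers']
      offer  = offers.is_a?(Array) ? offers.first : offers
      next unless offer.is_a?(Hash)

      price = Float(offer['price'].to_s, exception: false)
      return price if price
    end
  end
  nil
end

sample_html = <<~HTML
  <html><head>
  <script type="application/ld+json">
  {"@type":"Product","offers":{"@type":"Offer","price":"749.99","priceCurrency":"USD"}}
  </script>
  </head><body></body></html>
HTML

puts json_ld_price(sample_html) # => 749.99
```

Restricting the lookup to `Product` nodes mirrors the "main-product scoped" intent noted in the code comment, since other JSON-LD blocks on a page (breadcrumbs, organization data) carry no offer.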
#validate_product_identity(check, content, catalog_item) ⇒ Boolean
Wayfair never shows our manufacturer SKU, so the base identity check (which
looks for our SKU/UPC on the page) always fails for it. The stored URL is the
Wayfair Catalog API's canonical PDP, so identity is confirmed by the page
carrying the URL's Wayfair SKU — which also catches a redirect to a different
product. Falls back to the base check when the URL has no Wayfair SKU.
    # File 'app/services/retailer/extractors/wayfair.rb', line 107

    def validate_product_identity(check, content, catalog_item)
      wf_sku = wayfair_sku_from_url(check.url)
      return super if wf_sku.blank?

      # Match case-insensitively: the URL may carry a lowercase SKU (slugged
      # redirect target) while the page prints it upper-cased, or vice versa.
      return true if content.to_s.match?(/#{Regexp.escape(wf_sku)}/i)

      check.status = 'product_mismatch'
      check. = "Wayfair SKU #{wf_sku} (from URL) not found on page"
      Rails.logger.warn "[#{source_name}] Wayfair mismatch for catalog_item #{catalog_item.id}: #{check.}"
      false
    end
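The identity check can be tried outside Rails with a small sketch. `wayfair_sku_from_url` is not shown on this page, so `sketch_sku_from_url` below is a hypothetical stand-in that assumes PDP paths end in `-<sku>.html`; the example URL and SKU are invented. `sku_on_page?` mirrors the case-insensitive match from the method above.

```ruby
require 'uri'

# ASSUMPTION: the Wayfair SKU is the last dash-separated token before ".html".
# The real wayfair_sku_from_url may use a different rule.
def sketch_sku_from_url(url)
  path = URI.parse(url).path.to_s
  path[/-([a-z0-9]+)\.html\z/i, 1]
end

# Case-insensitive literal match; Regexp.escape keeps any regex
# metacharacters in the SKU from being interpreted as pattern syntax.
def sku_on_page?(content, sku)
  content.to_s.match?(/#{Regexp.escape(sku)}/i)
end

url = 'https://www.wayfair.com/furniture/pdp/example-chair-w001234567.html?piid=1'
sku = sketch_sku_from_url(url)

puts sku                                                # w001234567
puts sku_on_page?('<span>SKU: W001234567</span>', sku)  # true
puts sku_on_page?('<span>SKU: W009999999</span>', sku)  # false
```

This is a substring match, so a short SKU could also match inside a longer one on the page; anchoring on word boundaries would be stricter, at the cost of missing SKUs embedded in longer tokens.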