Class: Retailer::OxylabsApi
- Inherits:
-
Object
- Object
- Retailer::OxylabsApi
- Defined in:
- app/services/retailer/oxylabs_api.rb
Overview
Oxylabs Web Scraper API client - Pure HTTP client for Oxylabs API.
All retailer-specific logic lives in Retailer::Extractors::* classes.
This class handles:
- Realtime (synchronous) requests
- Push-Pull (async) batch processing
- Job polling and result retrieval
Defined Under Namespace
Constant Summary collapse
- REALTIME_ENDPOINT =
Realtime endpoint (synchronous - waits for result)
'https://realtime.oxylabs.io/v1/queries'- ASYNC_ENDPOINT =
Push-Pull endpoints (asynchronous - submit jobs, retrieve later)
'https://data.oxylabs.io/v1/queries'- ASYNC_RESULTS_ENDPOINT =
- /job_id/results
'https://data.oxylabs.io/v1/queries'- DEFAULT_TIMEOUT =
seconds
60- BATCH_TIMEOUT =
seconds for batch submissions
30- DEFAULT_SUBMIT_CONCURRENCY =
Max concurrent async submissions. Oxylabs throttles large bursts — a
50-wide fan-out had ~70% of submissions rejected in production — so we cap
in-flight submissions and send the rest in waves. Override per-env via
Heatwave::Configuration.fetch(:oxylabs, :submit_concurrency). 5- MAX_SUBMIT_ATTEMPTS =
Submission attempts (1 initial try + retries) for transient responses.
3- RETRYABLE_SUBMIT_STATUSES =
HTTP statuses worth retrying a submission on: rate-limit + transient 5xx.
[429, 500, 502, 503, 504].freeze
Class Method Summary collapse
-
.html_content(result) ⇒ String?
Extract the HTML content from a result.
-
.parsed_content(result) ⇒ Hash?
Extract parsed content from a result (when parse: true).
Instance Method Summary collapse
-
#amazon_product(asin:, domain:, geo_location: nil, parse: false) ⇒ Result
Scrape Amazon product page by ASIN.
-
#amazon_url(url:, render: true) ⇒ Result
Scrape any Amazon page by URL (search, category, bestsellers, product, etc.) Uses Oxylabs' dedicated Amazon infrastructure without requiring an ASIN.
-
#costco_product(url:, geo_location: nil) ⇒ Result
Scrape Costco product page.
-
#home_depot_product(url:, geo_location: nil) ⇒ Result
Scrape Home Depot product page.
-
#initialize(options = {}) ⇒ OxylabsApi
constructor
A new instance of OxylabsApi.
-
#job_results(job_id) ⇒ Result
Retrieve job results (when status is 'done').
-
#job_status(job_id) ⇒ JobResult
Check job status.
-
#poll_for_results(job_id, max_attempts: 30, interval: 2) ⇒ Result
Poll for job completion and retrieve results.
-
#request(payload) ⇒ Result
Make a synchronous (realtime) request to the Oxylabs API Blocks until result is returned.
-
#scrape_url(url:, render: true) ⇒ Result
Scrape any URL with universal source.
-
#submit_job(payload, callback_url: nil) ⇒ JobResult
Submit a job asynchronously (Push-Pull method) Returns immediately with a job_id for later retrieval.
-
#submit_jobs_batch(payloads) ⇒ Array<JobResult>
Submit multiple jobs in parallel Each payload may already contain its own callback_url for per-job tracking.
Constructor Details
#initialize(options = {}) ⇒ OxylabsApi
Returns a new instance of OxylabsApi.
60 61 62 63 64 65 |
# File 'app/services/retailer/oxylabs_api.rb', line 60 def initialize( = {}) @username = [:username] || Heatwave::Configuration.fetch(:oxylabs, :api_username) @password = [:password] || Heatwave::Configuration.fetch(:oxylabs, :api_password) @timeout = [:timeout] || DEFAULT_TIMEOUT @logger = [:logger] || Rails.logger end |
Class Method Details
.html_content(result) ⇒ String?
Extract the HTML content from a result
289 290 291 292 293 |
# File 'app/services/retailer/oxylabs_api.rb', line 289 def self.html_content(result) return nil unless result.success? result.data&.first&.dig('content').to_s end |
.parsed_content(result) ⇒ Hash?
Extract parsed content from a result (when parse: true)
298 299 300 301 302 303 |
# File 'app/services/retailer/oxylabs_api.rb', line 298 def self.parsed_content(result) return nil unless result.success? content = result.data&.first&.dig('content') content.is_a?(Hash) ? content : nil end |
Instance Method Details
#amazon_product(asin:, domain:, geo_location: nil, parse: false) ⇒ Result
Scrape Amazon product page by ASIN
240 241 242 243 244 245 246 247 248 249 |
# File 'app/services/retailer/oxylabs_api.rb', line 240 def amazon_product(asin:, domain:, geo_location: nil, parse: false) payload = { source: 'amazon_product', domain: domain, query: asin, parse: parse } payload[:geo_location] = geo_location if geo_location.present? request(payload) end |
#amazon_url(url:, render: true) ⇒ Result
Scrape any Amazon page by URL (search, category, bestsellers, product, etc.)
Uses Oxylabs' dedicated Amazon infrastructure without requiring an ASIN.
256 257 258 259 260 261 262 |
# File 'app/services/retailer/oxylabs_api.rb', line 256 def amazon_url(url:, render: true) request( source: 'amazon', url: url, render: render ? 'html' : nil ) end |
#costco_product(url:, geo_location: nil) ⇒ Result
Scrape Costco product page
277 278 279 280 |
# File 'app/services/retailer/oxylabs_api.rb', line 277 def costco_product(url:, geo_location: nil) payload = Retailer::Extractors::Costco.build_payload(url: url, geo_location: geo_location) request(payload) end |
#home_depot_product(url:, geo_location: nil) ⇒ Result
Scrape Home Depot product page
268 269 270 271 |
# File 'app/services/retailer/oxylabs_api.rb', line 268 def home_depot_product(url:, geo_location: nil) payload = Retailer::Extractors::HomeDepot.build_payload(url: url, geo_location: geo_location) request(payload) end |
#job_results(job_id) ⇒ Result
Retrieve job results (when status is 'done')
177 178 179 180 181 182 183 184 185 186 187 188 189 |
# File 'app/services/retailer/oxylabs_api.rb', line 177 def job_results(job_id) @logger.info "[Oxylabs] Retrieving results for job: #{job_id}" response = http_client .timeout(@timeout) .basic_auth(user: @username, pass: @password) .get("#{ASYNC_RESULTS_ENDPOINT}/#{job_id}/results") handle_response(response) rescue HTTP::Error => e @logger.error "[Oxylabs] Results retrieval error: #{e.}" Result.new(success: false, data: nil, error: e., status_code: nil, raw_response: nil) end |
#job_status(job_id) ⇒ JobResult
Check job status
162 163 164 165 166 167 168 169 170 171 172 |
# File 'app/services/retailer/oxylabs_api.rb', line 162 def job_status(job_id) response = http_client .timeout(BATCH_TIMEOUT) .basic_auth(user: @username, pass: @password) .get("#{ASYNC_ENDPOINT}/#{job_id}") handle_job_status(response, job_id) rescue HTTP::Error => e @logger.error "[Oxylabs] Job status error: #{e.}" JobResult.new(success: false, job_id: job_id, error: e., status: 'error') end |
#poll_for_results(job_id, max_attempts: 30, interval: 2) ⇒ Result
Poll for job completion and retrieve results
196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 |
# File 'app/services/retailer/oxylabs_api.rb', line 196 def poll_for_results(job_id, max_attempts: 30, interval: 2) attempts = 0 loop do attempts += 1 status = job_status(job_id) case status.status when 'done' return job_results(job_id) when 'faulted' return Result.new(success: false, data: nil, error: 'Job faulted', status_code: nil, raw_response: nil) when 'pending' return Result.new(success: false, data: nil, error: 'Polling timeout', status_code: nil, raw_response: nil) if attempts >= max_attempts sleep(interval) else return Result.new(success: false, data: nil, error: "Unknown status: #{status.status}", status_code: nil, raw_response: nil) end end end |
#request(payload) ⇒ Result
Make a synchronous (realtime) request to the Oxylabs API
Blocks until result is returned. Best for single "Probe Now" requests.
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 |
# File 'app/services/retailer/oxylabs_api.rb', line 75 def request(payload) # Realtime is a single fire-once query (manual "Probe Now"), so counting on # attempt is fine here. The async batch path instead counts only accepted # submissions (see #submit_job) because large bursts get throttled before # they ever run. Retailer::DailyCostGuard.check_and_increment! @logger.info "[Oxylabs] Realtime request: #{payload.except(:context).to_json}" response = http_client .timeout(@timeout) .basic_auth(user: @username, pass: @password) .post(REALTIME_ENDPOINT, json: payload) handle_response(response) rescue Retailer::DailyCostGuard::BudgetExceeded => e Result.new(success: false, data: nil, error: e., status_code: nil, raw_response: nil) rescue HTTP::Error => e @logger.error "[Oxylabs] HTTP Error: #{e.}" Result.new(success: false, data: nil, error: e., status_code: nil, raw_response: nil) rescue StandardError => e @logger.error "[Oxylabs] Error: #{e.class} - #{e.}" Result.new(success: false, data: nil, error: e., status_code: nil, raw_response: nil) end |
#scrape_url(url:, render: true) ⇒ Result
Scrape any URL with universal source
226 227 228 229 230 231 232 |
# File 'app/services/retailer/oxylabs_api.rb', line 226 def scrape_url(url:, render: true) request( source: 'universal', url: url, render: render ? 'html' : nil ) end |
#submit_job(payload, callback_url: nil) ⇒ JobResult
Submit a job asynchronously (Push-Pull method)
Returns immediately with a job_id for later retrieval.
Best for batch processing many items.
111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 |
# File 'app/services/retailer/oxylabs_api.rb', line 111 def submit_job(payload, callback_url: nil) # Pre-flight ceiling only — do NOT count yet. Spend is recorded once # Oxylabs accepts the job, so rate-limited/rejected submissions don't burn # the daily budget. A runaway loop is still blocked once the count hits the # ceiling. Retailer::DailyCostGuard.ensure_within_budget! # Add callback_url if provided (for webhook delivery instead of polling) payload = payload.merge(callback_url: callback_url) if callback_url.present? @logger.info "[Oxylabs] Submitting async job: #{payload.except(:context).to_json}" response = post_async_with_retry(payload) result = handle_job_submission(response) # Only accepted submissions cost money — count those, not the rejects. Retailer::DailyCostGuard.record! if result.success? result rescue Retailer::DailyCostGuard::BudgetExceeded => e JobResult.new(success: false, job_id: nil, error: e., status: 'error') rescue HTTP::Error => e @logger.error "[Oxylabs] Async submit error: #{e.}" JobResult.new(success: false, job_id: nil, error: e., status: 'error') rescue StandardError => e @logger.error "[Oxylabs] Async error: #{e.class} - #{e.}" JobResult.new(success: false, job_id: nil, error: e., status: 'error') end |
#submit_jobs_batch(payloads) ⇒ Array<JobResult>
Submit multiple jobs in parallel
Each payload may already contain its own callback_url for per-job tracking
144 145 146 147 148 149 150 151 152 153 154 155 156 157 |
# File 'app/services/retailer/oxylabs_api.rb', line 144 def submit_jobs_batch(payloads) concurrency = submit_concurrency @logger.info "[Oxylabs] Submitting #{payloads.size} async jobs (#{concurrency} at a time)" # Cap in-flight submissions: fan out one wave of `concurrency` at a time # rather than one thread per payload. The barrier between waves keeps the # burst under Oxylabs' submission rate limit (an unbounded fan-out had most # submissions throttled). Results stay in input order. payloads.each_slice(concurrency).flat_map do |wave| Heatwave::Parallel.map(wave) do |payload| submit_job(payload.except(:callback_url), callback_url: payload[:callback_url]) end end end |