Class: Retailer::OxylabsApi

Inherits:
Object
  • Object
show all
Defined in:
app/services/retailer/oxylabs_api.rb

Overview

Oxylabs Web Scraper API client - Pure HTTP client for Oxylabs API.
All retailer-specific logic lives in Retailer::Extractors::* classes.

This class handles:

  • Realtime (synchronous) requests
  • Push-Pull (async) batch processing
  • Job polling and result retrieval

Examples:

Basic synchronous request

api = Retailer::OxylabsApi.new
result = api.request(source: 'universal', url: 'https://example.com', render: 'html')

Async batch processing

api = Retailer::OxylabsApi.new
job = api.submit_job(payload)
result = api.poll_for_results(job.job_id)

Defined Under Namespace

Classes: JobResult, Result

Constant Summary collapse

REALTIME_ENDPOINT =

Realtime endpoint (synchronous - waits for result)

'https://realtime.oxylabs.io/v1/queries'
ASYNC_ENDPOINT =

Push-Pull endpoints (asynchronous - submit jobs, retrieve later)

'https://data.oxylabs.io/v1/queries'
ASYNC_RESULTS_ENDPOINT =
  • /job_id/results
'https://data.oxylabs.io/v1/queries'
DEFAULT_TIMEOUT =

seconds

60
BATCH_TIMEOUT =

seconds for batch submissions

30
DEFAULT_SUBMIT_CONCURRENCY =

Max concurrent async submissions. Oxylabs throttles large bursts — a
50-wide fan-out had ~70% of submissions rejected in production — so we cap
in-flight submissions and send the rest in waves. Override per-env via
Heatwave::Configuration.fetch(:oxylabs, :submit_concurrency).

5
MAX_SUBMIT_ATTEMPTS =

Submission attempts (1 initial try + retries) for transient responses.

3
RETRYABLE_SUBMIT_STATUSES =

HTTP statuses worth retrying a submission on: rate-limit + transient 5xx.

[429, 500, 502, 503, 504].freeze

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(options = {}) ⇒ OxylabsApi

Returns a new instance of OxylabsApi.



60
61
62
63
64
65
# File 'app/services/retailer/oxylabs_api.rb', line 60

def initialize(options = {})
  @username = options[:username] || Heatwave::Configuration.fetch(:oxylabs, :api_username)
  @password = options[:password] || Heatwave::Configuration.fetch(:oxylabs, :api_password)
  @timeout = options[:timeout] || DEFAULT_TIMEOUT
  @logger = options[:logger] || Rails.logger
end

Class Method Details

.html_content(result) ⇒ String?

Extract the HTML content from a result

Parameters:

  • result (Result)

    API result

Returns:

  • (String, nil)


289
290
291
292
293
# File 'app/services/retailer/oxylabs_api.rb', line 289

def self.html_content(result)
  return nil unless result.success?

  result.data&.first&.dig('content').to_s
end

.parsed_content(result) ⇒ Hash?

Extract parsed content from a result (when parse: true)

Parameters:

  • result (Result)

    API result

Returns:

  • (Hash, nil)


298
299
300
301
302
303
# File 'app/services/retailer/oxylabs_api.rb', line 298

def self.parsed_content(result)
  return nil unless result.success?

  content = result.data&.first&.dig('content')
  content.is_a?(Hash) ? content : nil
end

Instance Method Details

#amazon_product(asin:, domain:, geo_location: nil, parse: false) ⇒ Result

Scrape Amazon product page by ASIN

Parameters:

  • asin (String)

    Amazon ASIN

  • domain (String)

    Amazon domain (e.g., 'amazon.com', 'amazon.ca')

  • geo_location (String, nil) (defaults to: nil)

    ZIP/postal code

  • parse (Boolean) (defaults to: false)

    Whether to use Oxylabs' built-in parser

Returns:



240
241
242
243
244
245
246
247
248
249
# File 'app/services/retailer/oxylabs_api.rb', line 240

def amazon_product(asin:, domain:, geo_location: nil, parse: false)
  payload = {
    source: 'amazon_product',
    domain: domain,
    query: asin,
    parse: parse
  }
  payload[:geo_location] = geo_location if geo_location.present?
  request(payload)
end

#amazon_url(url:, render: true) ⇒ Result

Scrape any Amazon page by URL (search, category, bestsellers, product, etc.)
Uses Oxylabs' dedicated Amazon infrastructure without requiring an ASIN.

Parameters:

  • url (String)

    Full Amazon URL

  • render (Boolean) (defaults to: true)

    Whether to render JavaScript (default: true)

Returns:



256
257
258
259
260
261
262
# File 'app/services/retailer/oxylabs_api.rb', line 256

def amazon_url(url:, render: true)
  request(
    source: 'amazon',
    url: url,
    render: render ? 'html' : nil
  )
end

#costco_product(url:, geo_location: nil) ⇒ Result

Scrape Costco product page

Parameters:

  • url (String)

    Product URL

  • geo_location (Object, nil) (defaults to: nil)

Returns:



277
278
279
280
# File 'app/services/retailer/oxylabs_api.rb', line 277

def costco_product(url:, geo_location: nil)
  payload = Retailer::Extractors::Costco.build_payload(url: url, geo_location: geo_location)
  request(payload)
end

#home_depot_product(url:, geo_location: nil) ⇒ Result

Scrape Home Depot product page

Parameters:

  • url (String)

    Product URL

  • geo_location (String, nil) (defaults to: nil)

    ZIP code

Returns:



268
269
270
271
# File 'app/services/retailer/oxylabs_api.rb', line 268

def home_depot_product(url:, geo_location: nil)
  payload = Retailer::Extractors::HomeDepot.build_payload(url: url, geo_location: geo_location)
  request(payload)
end

#job_results(job_id) ⇒ Result

Retrieve job results (when status is 'done')

Parameters:

  • job_id (String)

    Job ID from submit_job

Returns:

  • (Result)

    Same format as realtime results



177
178
179
180
181
182
183
184
185
186
187
188
189
# File 'app/services/retailer/oxylabs_api.rb', line 177

def job_results(job_id)
  @logger.info "[Oxylabs] Retrieving results for job: #{job_id}"

  response = http_client
             .timeout(@timeout)
             .basic_auth(user: @username, pass: @password)
             .get("#{ASYNC_RESULTS_ENDPOINT}/#{job_id}/results")

  handle_response(response)
rescue HTTP::Error => e
  @logger.error "[Oxylabs] Results retrieval error: #{e.message}"
  Result.new(success: false, data: nil, error: e.message, status_code: nil, raw_response: nil)
end

#job_status(job_id) ⇒ JobResult

Check job status

Parameters:

  • job_id (String)

    Job ID from submit_job

Returns:



162
163
164
165
166
167
168
169
170
171
172
# File 'app/services/retailer/oxylabs_api.rb', line 162

def job_status(job_id)
  response = http_client
             .timeout(BATCH_TIMEOUT)
             .basic_auth(user: @username, pass: @password)
             .get("#{ASYNC_ENDPOINT}/#{job_id}")

  handle_job_status(response, job_id)
rescue HTTP::Error => e
  @logger.error "[Oxylabs] Job status error: #{e.message}"
  JobResult.new(success: false, job_id: job_id, error: e.message, status: 'error')
end

#poll_for_results(job_id, max_attempts: 30, interval: 2) ⇒ Result

Poll for job completion and retrieve results

Parameters:

  • job_id (String)

    Job ID

  • max_attempts (Integer) (defaults to: 30)

    Maximum polling attempts (default: 30)

  • interval (Integer) (defaults to: 2)

    Seconds between polls (default: 2)

Returns:

  • (Result)

    Final results or error



196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
# File 'app/services/retailer/oxylabs_api.rb', line 196

def poll_for_results(job_id, max_attempts: 30, interval: 2)
  attempts = 0

  loop do
    attempts += 1
    status = job_status(job_id)

    case status.status
    when 'done'
      return job_results(job_id)
    when 'faulted'
      return Result.new(success: false, data: nil, error: 'Job faulted', status_code: nil, raw_response: nil)
    when 'pending'
      return Result.new(success: false, data: nil, error: 'Polling timeout', status_code: nil, raw_response: nil) if attempts >= max_attempts

      sleep(interval)
    else
      return Result.new(success: false, data: nil, error: "Unknown status: #{status.status}", status_code: nil, raw_response: nil)
    end
  end
end

#request(payload) ⇒ Result

Make a synchronous (realtime) request to the Oxylabs API
Blocks until result is returned. Best for single "Probe Now" requests.

Parameters:

  • payload (Hash)

    Request payload

Returns:



75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
# File 'app/services/retailer/oxylabs_api.rb', line 75

def request(payload)
  # Realtime is a single fire-once query (manual "Probe Now"), so counting on
  # attempt is fine here. The async batch path instead counts only accepted
  # submissions (see #submit_job) because large bursts get throttled before
  # they ever run.
  Retailer::DailyCostGuard.check_and_increment!

  @logger.info "[Oxylabs] Realtime request: #{payload.except(:context).to_json}"

  response = http_client
             .timeout(@timeout)
             .basic_auth(user: @username, pass: @password)
             .post(REALTIME_ENDPOINT, json: payload)

  handle_response(response)
rescue Retailer::DailyCostGuard::BudgetExceeded => e
  Result.new(success: false, data: nil, error: e.message, status_code: nil, raw_response: nil)
rescue HTTP::Error => e
  @logger.error "[Oxylabs] HTTP Error: #{e.message}"
  Result.new(success: false, data: nil, error: e.message, status_code: nil, raw_response: nil)
rescue StandardError => e
  @logger.error "[Oxylabs] Error: #{e.class} - #{e.message}"
  Result.new(success: false, data: nil, error: e.message, status_code: nil, raw_response: nil)
end

#scrape_url(url:, render: true) ⇒ Result

Scrape any URL with universal source

Parameters:

  • url (String)

    URL to scrape

  • render (Boolean) (defaults to: true)

    Whether to render JavaScript

Returns:



226
227
228
229
230
231
232
# File 'app/services/retailer/oxylabs_api.rb', line 226

def scrape_url(url:, render: true)
  request(
    source: 'universal',
    url: url,
    render: render ? 'html' : nil
  )
end

#submit_job(payload, callback_url: nil) ⇒ JobResult

Submit a job asynchronously (Push-Pull method)
Returns immediately with a job_id for later retrieval.
Best for batch processing many items.

Parameters:

  • payload (Hash)

    Request payload (same as realtime)

  • callback_url (String, nil) (defaults to: nil)

    Optional webhook URL for result delivery

Returns:



111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
# File 'app/services/retailer/oxylabs_api.rb', line 111

def submit_job(payload, callback_url: nil)
  # Pre-flight ceiling only — do NOT count yet. Spend is recorded once
  # Oxylabs accepts the job, so rate-limited/rejected submissions don't burn
  # the daily budget. A runaway loop is still blocked once the count hits the
  # ceiling.
  Retailer::DailyCostGuard.ensure_within_budget!

  # Add callback_url if provided (for webhook delivery instead of polling)
  payload = payload.merge(callback_url: callback_url) if callback_url.present?

  @logger.info "[Oxylabs] Submitting async job: #{payload.except(:context).to_json}"

  response = post_async_with_retry(payload)
  result = handle_job_submission(response)

  # Only accepted submissions cost money — count those, not the rejects.
  Retailer::DailyCostGuard.record! if result.success?

  result
rescue Retailer::DailyCostGuard::BudgetExceeded => e
  JobResult.new(success: false, job_id: nil, error: e.message, status: 'error')
rescue HTTP::Error => e
  @logger.error "[Oxylabs] Async submit error: #{e.message}"
  JobResult.new(success: false, job_id: nil, error: e.message, status: 'error')
rescue StandardError => e
  @logger.error "[Oxylabs] Async error: #{e.class} - #{e.message}"
  JobResult.new(success: false, job_id: nil, error: e.message, status: 'error')
end

#submit_jobs_batch(payloads) ⇒ Array<JobResult>

Submit multiple jobs in parallel
Each payload may already contain its own callback_url for per-job tracking

Parameters:

  • payloads (Array<Hash>)

    Array of request payloads (may include callback_url)

Returns:

  • (Array<JobResult>)

    Array of job results with job_ids



144
145
146
147
148
149
150
151
152
153
154
155
156
157
# File 'app/services/retailer/oxylabs_api.rb', line 144

def submit_jobs_batch(payloads)
  concurrency = submit_concurrency
  @logger.info "[Oxylabs] Submitting #{payloads.size} async jobs (#{concurrency} at a time)"

  # Cap in-flight submissions: fan out one wave of `concurrency` at a time
  # rather than one thread per payload. The barrier between waves keeps the
  # burst under Oxylabs' submission rate limit (an unbounded fan-out had most
  # submissions throttled). Results stay in input order.
  payloads.each_slice(concurrency).flat_map do |wave|
    Heatwave::Parallel.map(wave) do |payload|
      submit_job(payload.except(:callback_url), callback_url: payload[:callback_url])
    end
  end
end