Class: SiteMapContentExtractionWorker

Inherits:
Object
Includes:
Sidekiq::IterableJob, Sidekiq::Worker
Defined in:
app/workers/site_map_content_extraction_worker.rb

Overview

Crawls all cacheable pages (all categories except publications and videos) to
refresh extracted content, rendered schema, and the internal link graph.

Uses Sidekiq::IterableJob so progress is saved after each page — a mid-run
deploy or worker restart resumes from the last successful page rather than
restarting from the top.
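The checkpoint-and-resume behavior can be sketched in plain Ruby. This is illustrative only: `run_batch`, `PROCESSED`, and the `state` hash are hypothetical stand-ins, and Sidekiq::IterableJob persists the real cursor internally.

```ruby
# Illustrative sketch only: run_batch, PROCESSED, and the state hash are
# hypothetical stand-ins; Sidekiq::IterableJob manages the real cursor itself.
PROCESSED = []

def run_batch(pages, state, crash_after: nil)
  start = state[:cursor] || 0
  pages.each_with_index do |page, i|
    next if i < start                       # skip pages finished before a restart
    raise 'deploy interrupted the run' if crash_after && i >= crash_after
    PROCESSED << page                       # the real worker crawls the page here
    state[:cursor] = i + 1                  # checkpoint after each page
  end
end
```

After a simulated mid-run crash, calling `run_batch` again with the same `state` resumes at the saved cursor instead of reprocessing earlier pages.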

AI-powered SEO analysis (SeoPageAnalysisWorker) is intentionally excluded here
due to cost; trigger it manually from the CRM, per page or in targeted batches.

Triggered by:

  • Nightly cron (config/sidekiq_production_schedule.yml)
  • SitemapRegeneratedHandler (via Events::SitemapRegenerated)

Examples:

Queue for all locales and categories

SiteMapContentExtractionWorker.perform_async

Queue for a single locale

SiteMapContentExtractionWorker.perform_async('locale' => 'en-US')

Queue for a specific category

SiteMapContentExtractionWorker.perform_async('category' => 'post')

Instance Method Summary

Instance Method Details

#build_enumerator(options = nil, cursor:) ⇒ Object

Parameters:

  • options (Hash, nil) (defaults to: nil)

    Optional scope filters

Options Hash (options):

  • 'locale' (String)

    Crawl only pages with this locale

  • 'category' (String)

    Crawl only pages with this category



# File 'app/workers/site_map_content_extraction_worker.rb', line 36

def build_enumerator(options = nil, cursor:)
  opts     = options.to_h.with_indifferent_access
  locale   = opts[:locale]
  category = opts[:category]

  pages = SiteMap.cacheable
  pages = pages.where(locale: locale)     if locale.present?
  pages = pages.where(category: category) if category.present?

  active_record_records_enumerator(pages, cursor: cursor)
end
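The scope chain above can be mirrored with a small in-memory sketch. `Page` and `filter_pages` are hypothetical illustrations, not the SiteMap model or its scopes:

```ruby
# Hypothetical in-memory stand-in for the ActiveRecord scope chain;
# Page is an illustrative struct, not the SiteMap model.
Page = Struct.new(:locale, :category, keyword_init: true)

def filter_pages(pages, locale: nil, category: nil)
  scoped = pages
  scoped = scoped.select { |p| p.locale == locale }     if locale
  scoped = scoped.select { |p| p.category == category } if category
  scoped
end
```

With no filters every page passes through, matching the worker's default full crawl; each present filter narrows the set independently.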

#each_iteration(site_map, *_args) ⇒ Object



# File 'app/workers/site_map_content_extraction_worker.rb', line 48

def each_iteration(site_map, *_args)
  results = Cache::SiteCrawler.new.process(
    pages: SiteMap.where(id: site_map.id),
    extract_content: true
  )
  status = results.values.first
  log_info "#{site_map.locale} #{site_map.path}#{status}"
rescue StandardError => e
  log_error "Failed #{site_map.path}: #{e.message}"
  ErrorReporting.error(e)
  raise
end
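The rescue/log/re-raise shape in each_iteration can be isolated in a small sketch. `work` and `LOG` are hypothetical stand-ins for each_iteration and the worker's logger; the final `raise` is what lets the job system still see the failure and retry from the saved cursor:

```ruby
# Hypothetical sketch of the rescue/log/re-raise pattern; work and LOG
# stand in for each_iteration and the worker's logger.
LOG = []

def work(path)
  raise ArgumentError, 'boom' if path == '/bad'
  LOG << "processed #{path}"
rescue StandardError => e
  LOG << "failed #{path}: #{e.message}"
  raise  # re-raise so the job system still sees the failure and can retry
end
```

Swallowing the error instead of re-raising would mark the iteration successful and silently skip the failed page.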