Class: EmbeddingRefreshWorker

Inherits:
Object
Includes:
Sidekiq::Worker
Defined in:
app/workers/embedding_refresh_worker.rb

Overview

Scheduled worker to detect and regenerate stale embeddings.
Runs weekly to catch content changes that weren't detected by model callbacks,
such as HABTM association changes (product_lines, categories, items).

For Images, the worker also checks for missing unified embeddings and missing Vision analysis.
Images go through the Gemini pipeline: pHash → Gemini Embedding 2 → Gemini Flash Vision.

Scheduled: Sundays at 3:30am (after sitemap generation at 1:10am)
Queue: low priority to avoid impacting user-facing operations
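
The schedule above ("Sundays at 3:30am") corresponds to the five-field cron expression `30 3 * * 0`. The sketch below is a plain-Ruby illustration of that mapping only. The scheduler this project actually uses is not stated here, and `REFRESH_CRON` and `refresh_due?` are hypothetical names introduced for the example.

```ruby
# Cron fields: minute hour day-of-month month day-of-week (0 = Sunday).
# Assumption: the scheduler accepts standard five-field cron syntax.
REFRESH_CRON = '30 3 * * 0'

# Hypothetical helper: true when `time` falls on the scheduled minute.
# Only the minute, hour and weekday fields are checked, because the
# other two fields are wildcards in this expression.
def refresh_due?(time)
  minute, hour, _day_of_month, _month, weekday = REFRESH_CRON.split
  time.min == minute.to_i && time.hour == hour.to_i && time.wday == weekday.to_i
end

refresh_due?(Time.new(2024, 1, 7, 3, 30)) # 2024-01-07 was a Sunday
```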

Examples:

Manual invocation

EmbeddingRefreshWorker.new.perform

Constant Summary

EMBEDDABLE_TYPES =

Content types to check for stale embeddings
Order matters: check smaller collections first to spread load

Note: CallRecord transcription is handled by DailyCallRecordTranscriptionWorker,
but we still check here for:

  • LeMUR re-runs that update content without regenerating embeddings
  • Failed embedding jobs (transcribed but no embedding)
  • Manual edits to call record metadata
%w[
  Post
  Article
  Showcase
  Video
  Image
  SiteMap
  ReviewsIo
  Item
  CallRecord
].freeze
MAX_RECORDS_PER_TYPE =

Maximum records to queue per run to avoid overwhelming the queue

500
MAX_IMAGES_FULL_ANALYSIS =

Maximum images to queue for full analysis per run

100
TYPE_CHECK_DELAY =

Delay between checking each type (seconds)

2
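
Taken together, these constants bound the work a single run can generate. The arithmetic below is a rough sketch that assumes each per-type check queues at most MAX_RECORDS_PER_TYPE jobs (the body of `check_and_queue_stale` is not shown on this page) and ignores static page extraction. The `MAX_*_PER_RUN` names are introduced here for illustration.

```ruby
EMBEDDABLE_TYPES = %w[
  Post Article Showcase Video Image SiteMap ReviewsIo Item CallRecord
].freeze
MAX_RECORDS_PER_TYPE = 500
MAX_IMAGES_FULL_ANALYSIS = 100
TYPE_CHECK_DELAY = 2

# Upper bound on stale-embedding jobs: 9 types * 500 records.
MAX_STALE_JOBS_PER_RUN = EMBEDDABLE_TYPES.size * MAX_RECORDS_PER_TYPE

# Adding the image full-analysis cap gives the overall ceiling.
MAX_TOTAL_JOBS_PER_RUN = MAX_STALE_JOBS_PER_RUN + MAX_IMAGES_FULL_ANALYSIS

# The delay can apply before every type except the first: 8 * 2 seconds.
MAX_SLEEP_SECONDS_PER_RUN = (EMBEDDABLE_TYPES.size - 1) * TYPE_CHECK_DELAY
```

Under those assumptions a run queues at most 4,600 jobs and spends at most 16 seconds sleeping between types.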

Instance Method Summary

Instance Method Details

#perform ⇒ Object



# File 'app/workers/embedding_refresh_worker.rb', line 50

def perform
  log_info 'Starting weekly embedding refresh check...'

  total_queued = 0
  stats = {}

  EMBEDDABLE_TYPES.each do |type|
    # Small delay between types to spread database load
    sleep(TYPE_CHECK_DELAY) unless total_queued.zero?

    count = check_and_queue_stale(type)
    stats[type] = count
    total_queued += count

    log_info "#{type}: #{count} stale embeddings queued" if count.positive?
  end

  # Queue images missing Vision analysis or unified embedding for full pipeline
  full_analysis_count = queue_images_for_full_analysis
  stats['Image_FullAnalysis'] = full_analysis_count
  total_queued += full_analysis_count

  # Also extract fresh content for static pages before embedding
  extract_static_pages_if_needed

  log_info "Embedding refresh complete. Total queued: #{total_queued}"
  log_info "Stats: #{stats.select { |_, v| v.positive? }}" if total_queued.positive?

  stats
rescue StandardError => e
  log_error "Embedding refresh failed: #{e.message}"
  ErrorReporting.error(e)
  raise
end
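
The per-type loop above can be exercised in isolation to see the shape of the returned stats hash and when the delay applies. The class below is a simplified, hypothetical stand-in rather than the worker itself: `RefreshLoopSketch`, its shortened type list, and the injected `stale_counts` hash replace the real `check_and_queue_stale` lookups, and the delay is set to 0 so the sketch runs instantly.

```ruby
# Hypothetical stand-in that mirrors only the loop structure of #perform.
class RefreshLoopSketch
  TYPES = %w[Post Article Image].freeze # shortened list for illustration
  DELAY = 0 # seconds; the real worker uses TYPE_CHECK_DELAY = 2

  attr_reader :sleeps

  # stale_counts: Hash of type name => number of stale records (stubbed data).
  def initialize(stale_counts)
    @stale_counts = stale_counts
    @sleeps = 0
  end

  def perform
    total_queued = 0
    stats = {}

    TYPES.each do |type|
      # Same guard as the worker: no delay until something has been queued.
      unless total_queued.zero?
        sleep(DELAY)
        @sleeps += 1
      end

      count = @stale_counts.fetch(type, 0)
      stats[type] = count
      total_queued += count
    end

    stats
  end
end

sketch = RefreshLoopSketch.new('Post' => 0, 'Article' => 3, 'Image' => 1)
sketch.perform
```

Because the guard tests `total_queued.zero?`, the delay is skipped for every type until an earlier type has queued at least one job. In the run above only the Image check is preceded by a delay.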