Class: TextUnifiedEmbeddingBackfillWorker

Inherits:
Object
Includes:
Sidekiq::IterableJob, Sidekiq::Job
Defined in:
app/workers/text_unified_embedding_backfill_worker.rb

Overview

Backfills Gemini Embedding 2 vectors into the unified vector space for TEXT
content, so text and images become cross-modally searchable from one query.

Dual-write, no delete: for each source content_type='primary' text row it
upserts a content_type='unified' sibling tagged gemini-embedding-2 (via
Embedding::TextUnifier), leaving the OpenAI embedding rows and the live
search path untouched. The read path is cut over to the unified space only
after coverage reaches 100% — a deliberate follow-up, not part of this worker.
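
The dual-write contract described above can be sketched with an in-memory stand-in for the embeddings table. Everything here is an assumption made for illustration: the hash-based rows, the `upsert_unified_sibling` helper, the `'openai-placeholder'` model tag, and the sample vectors are hypothetical, and the real logic lives in Embedding::TextUnifier, which is not shown on this page.

```ruby
# Hypothetical in-memory stand-in for the embeddings table; rows are plain hashes.
def upsert_unified_sibling(rows, primary, vector)
  sibling = rows.find do |r|
    r[:source_id] == primary[:source_id] &&
      r[:content_type] == 'unified' &&
      r[:model] == 'gemini-embedding-2'
  end

  if sibling
    sibling[:vector] = vector # update in place, so a re-run never duplicates the sibling
  else
    rows << { source_id: primary[:source_id], content_type: 'unified',
              model: 'gemini-embedding-2', vector: vector }
  end
  rows
end

rows = [{ source_id: 1, content_type: 'primary', model: 'openai-placeholder', vector: [0.1, 0.2] }]
primary = rows.first

upsert_unified_sibling(rows, primary, [0.9, 0.8])
upsert_unified_sibling(rows, primary, [0.9, 0.8]) # second call updates, does not append
```

The primary row is only ever read, which is what keeps the live search path unaffected while the unified space fills in.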

Self-throttling: Embedding::Gemini caps at 300 req/min, and each iteration is
one batchEmbedContents request (≤100 rows), so the corpus drains without any
mass enqueue. Resumable: Sidekiq::IterableJob checkpoints the PK cursor after
every batch, so a deploy or restart resumes mid-run.
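
The two limits stated above give an upper bound on throughput, which the following sketch works out. The method name and the one-million-row corpus are hypothetical; the estimate assumes every batch is full and that the rate cap is the only bottleneck, so real runs would likely be slower.

```ruby
# Lower bound on drain time in minutes, given the stated cap of 300 requests
# per minute and at most 100 rows per batchEmbedContents request.
def min_drain_minutes(pending_rows, requests_per_minute: 300, rows_per_request: 100)
  requests = (pending_rows.to_f / rows_per_request).ceil
  (requests.to_f / requests_per_minute).ceil
end

max_rows_per_minute = 300 * 100                 # 30_000 rows per minute at best
example_minutes = min_drain_minutes(1_000_000)  # hypothetical corpus size
```

For the hypothetical corpus this is 10,000 requests, or at least 34 minutes at the capped rate.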

Scheduled nightly; idempotent (skips rows that already have a Gemini unified
sibling), so it's safe to re-run and converges to full coverage.
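
A minimal sketch of why re-running converges: each run only selects primary rows that still lack a Gemini unified sibling, so a second run over a fully covered corpus has nothing to do. The `candidates` and `run_once` helpers and the sample rows are hypothetical; the worker's actual `candidate_scope` is a database query that is not shown on this page.

```ruby
# Primary rows that do not yet have a gemini-embedding-2 unified sibling.
def candidates(rows)
  covered_ids = rows.select { |r| r[:content_type] == 'unified' && r[:model] == 'gemini-embedding-2' }
                    .map { |r| r[:source_id] }
  rows.select { |r| r[:content_type] == 'primary' && !covered_ids.include?(r[:source_id]) }
end

# One simulated nightly run: give every candidate a unified sibling.
def run_once(rows)
  todo = candidates(rows)
  todo.each do |r|
    rows << { source_id: r[:source_id], content_type: 'unified', model: 'gemini-embedding-2' }
  end
  todo.size
end

rows = [
  { source_id: 1, content_type: 'primary', model: 'openai-placeholder' },
  { source_id: 2, content_type: 'primary', model: 'openai-placeholder' },
  { source_id: 2, content_type: 'unified', model: 'gemini-embedding-2' }
]

first_run  = run_once(rows) # only source 1 lacks a sibling
second_run = run_once(rows) # nothing left to do
```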

Examples:

Backfill all eligible text types (default)

TextUnifiedEmbeddingBackfillWorker.perform_async

Restrict to specific types

TextUnifiedEmbeddingBackfillWorker.perform_async('types' => %w[Post Article])
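
How the optional `'types'` argument resolves can be sketched in plain Ruby. This mirrors the `Array(opts[:types]).presence || TEXT_TYPES` line in `build_enumerator`, with stand-ins for the ActiveSupport helpers; `resolve_types` and the sample default list are hypothetical, since the real contents of Embedding::TextUnifier::TEXT_TYPES are not shown here.

```ruby
def resolve_types(options, default_types)
  opts = options.to_h                              # nil.to_h => {}
  requested = Array(opts['types'] || opts[:types]) # stand-in for with_indifferent_access
  requested.empty? ? default_types : requested     # stand-in for ActiveSupport's presence
end

default_types = %w[Post Article Comment] # hypothetical default list

no_args     = resolve_types(nil, default_types)
restricted  = resolve_types({ 'types' => %w[Post Article] }, default_types)
single      = resolve_types({ 'types' => 'Post' }, default_types)
empty_array = resolve_types({ 'types' => [] }, default_types)
```

An empty or missing `'types'` value falls back to the default list, and a single string is wrapped into a one-element array.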

See Also:

  • doc/tasks/202606051030_TEXT_EMBEDDING_UNIFICATION.md

Constant Summary

BATCH_SIZE = Embedding::Gemini::MAX_BATCH_SIZE

One Gemini batchEmbedContents request per iteration.

Instance Method Details

#build_enumerator(options = nil, cursor:) ⇒ Object



# File 'app/workers/text_unified_embedding_backfill_worker.rb', line 36

def build_enumerator(options = nil, cursor:)
  # Job arguments arrive JSON-deserialized (string keys); options is nil when none were passed.
  opts = options.to_h.with_indifferent_access
  @types = Array(opts[:types]).presence || Embedding::TextUnifier::TEXT_TYPES
  @counts = { processed: 0, skipped: 0, failed: 0 }

  scope = candidate_scope
  log_info "Starting: #{scope.count} primary rows need a Gemini unified sibling (types: #{@types.join(',')})"

  # Yield the scope in batches of BATCH_SIZE rows, resuming from the checkpointed cursor.
  active_record_batches_enumerator(scope, cursor: cursor, batch_size: BATCH_SIZE)
end
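
The resume behavior can be illustrated with a simplified model of primary-key cursor batching. This is a sketch of the idea only, not Sidekiq's implementation: `batches_enumerator` is a hypothetical helper over an array of ids, whereas `active_record_batches_enumerator` works against a database relation.

```ruby
# Yields [batch, new_cursor] pairs, starting after `cursor` (nil means start from the beginning).
def batches_enumerator(ids, cursor:, batch_size:)
  Enumerator.new do |yielder|
    pending = ids.sort.select { |id| cursor.nil? || id > cursor }
    pending.each_slice(batch_size) { |batch| yielder << [batch, batch.last] }
  end
end

ids = (1..7).to_a

fresh = batches_enumerator(ids, cursor: nil, batch_size: 3).to_a
# Suppose the job was interrupted after the first batch was checkpointed (cursor = 3).
resumed = batches_enumerator(ids, cursor: 3, batch_size: 3).to_a
```

The resumed run skips everything at or below the saved cursor, so an interrupted job picks up at the next unprocessed batch rather than starting over.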

#each_iteration(batch, *_args) ⇒ Object



# File 'app/workers/text_unified_embedding_backfill_worker.rb', line 47

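def each_iteration(batch, *_args)
  result = Embedding::TextUnifier.backfill(batch)
  before = @counts[:processed]
  @counts.each_key { |key| @counts[key] += result[key] }

  # Log whenever the processed total crosses a multiple of 1000. An exact-multiple
  # check misses the boundary when a batch is partly skipped or failed.
  log_info "Progress: #{@counts.inspect}" if @counts[:processed] / 1000 > before / 1000
end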
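
Because a batch can contribute fewer processed rows than it contains, the running total does not necessarily land exactly on a multiple of 1000. The sketch below compares an exact-multiple check with a boundary-crossing check; the helper names and the per-batch result of 97 processed and 3 skipped rows are made up for illustration.

```ruby
# Add one batch result into the running totals; keys mirror the worker's @counts.
def merge_counts(counts, result)
  counts.each_key { |key| counts[key] += result.fetch(key, 0) }
  counts
end

# True when the total moved past a multiple of `step` during this batch.
def crossed_boundary?(before, after, step: 1000)
  after / step > before / step
end

counts = { processed: 0, skipped: 0, failed: 0 }
exact_hits = 0
crossings = 0

# 11 batches of 100 rows in which 3 rows per batch are skipped.
11.times do
  before = counts[:processed]
  merge_counts(counts, { processed: 97, skipped: 3, failed: 0 })
  exact_hits += 1 if counts[:processed].positive? && (counts[:processed] % 1000).zero?
  crossings  += 1 if crossed_boundary?(before, counts[:processed])
end
```

In this run the total goes from 970 to 1067, so the exact-multiple check never fires while the crossing check fires once.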

#on_complete ⇒ Object



# File 'app/workers/text_unified_embedding_backfill_worker.rb', line 54

def on_complete
  log_info "Complete: #{@counts.inspect}"
end