Class: Embedding::TextUnifier

Inherits:
Object
Defined in:
app/services/embedding/text_unifier.rb

Overview

Backfills Gemini Embedding 2 vectors into content_embeddings.unified_embedding
for text content, so text and images share one multimodal vector space —
the headline benefit of Gemini Embedding 2 over the legacy OpenAI text path.

It reads each source content_type = 'primary' text row, re-embeds its
content with Gemini (batched), and upserts a sibling content_type = 'unified'
row tagged gemini-embedding-2. The OpenAI embedding column and
rows are left untouched, so the existing ContentEmbedding.semantic_search
keeps serving search until the read path is cut over to unified_search
(a deliberate follow-up once coverage reaches 100%).

INERT by default: nothing invokes this from a model callback or the live
search path. Run it explicitly via rake embeddings:backfill_unified_text
(count-first, gated) per the runbook.

See Also:

  • doc/tasks/202606051030_TEXT_EMBEDDING_UNIFICATION.md

Constant Summary

TEXT_TYPES =

Embeddable TEXT types eligible for unification — every embeddable type
except Image (images are embedded multimodally by the image pipeline).
Includes the sensitive internal types (CallRecord/Activity/Communication).

%w[
  Post Article Showcase Video Item ProductLine SiteMap ReviewsIo
  CallRecord Activity Communication AssistantBrainEntry
].freeze
MODEL =

GA multimodal model written into unified_embedding.

ContentEmbedding::UNIFIED_MODEL
DIMENSIONS =

MRL output width (HNSW-compatible; matches image unified embeddings).

1536
BATCH_SIZE =

Items per Gemini batchEmbedContents request.

Embedding::Gemini::MAX_BATCH_SIZE
MAX_CONTENT_LENGTH =

Truncate to stay within the model's ~8k-token text window.

Models::Embeddable::MAX_CONTENT_LENGTH
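
As a rough illustration of how MAX_CONTENT_LENGTH and BATCH_SIZE might work together, the sketch below caps each text and then groups the results into request-sized batches. The numeric values and the batches_for helper are hypothetical and chosen only to keep the example small; the real values live in Models::Embeddable and Embedding::Gemini and are not shown on this page. It also assumes the cap is measured in characters.

```ruby
# Hypothetical stand-ins for the real constants, whose values are not shown here.
MAX_CONTENT_LENGTH = 20 # assumed character cap, for illustration only
BATCH_SIZE = 2          # assumed items per request, for illustration only

# Cap each text at max_length characters, then group the texts into batches.
# batches_for is an illustrative helper, not part of TextUnifier.
def batches_for(texts, max_length: MAX_CONTENT_LENGTH, batch_size: BATCH_SIZE)
  texts.map { |text| text[0, max_length] }.each_slice(batch_size).to_a
end

batches = batches_for(["short", "a" * 50, "third"])
```

Truncating before batching keeps every request within the size limits, at the cost of dropping the tail of long documents.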

Constructor Details

#initialize(dimensions: DIMENSIONS) ⇒ TextUnifier

Returns a new instance of TextUnifier.



# File 'app/services/embedding/text_unifier.rb', line 48

def initialize(dimensions: DIMENSIONS)
  @dimensions = dimensions
end

Class Method Details

.backfill(primary_rows, dimensions: DIMENSIONS) ⇒ Hash

Backfill a set of source primary rows.

Parameters:

  • primary_rows (ActiveRecord::Relation, Array<ContentEmbedding>)

    content_type='primary' text rows to mirror into the unified space

  • dimensions (Integer) (defaults to: DIMENSIONS)

    output vector width

Returns:

  • (Hash)

    counts — :processed, :skipped, :failed



# File 'app/services/embedding/text_unifier.rb', line 44

def self.backfill(primary_rows, dimensions: DIMENSIONS)
  new(dimensions: dimensions).backfill(primary_rows)
end
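
The class method is a thin convenience wrapper around the instance method. The sketch below mirrors that shape with a hypothetical SketchUnifier class so the calling convention can be tried without Rails or Gemini. Its #backfill is a stub that embeds nothing, and the extra :dimensions key in the result exists only to make the delegation visible; the real method returns just the three counts.

```ruby
# Hypothetical stand-in that mirrors the class-method-delegates-to-instance shape.
class SketchUnifier
  DIMENSIONS = 1536

  # Same signature as the documented .backfill: rows plus an optional width.
  def self.backfill(rows, dimensions: DIMENSIONS)
    new(dimensions: dimensions).backfill(rows)
  end

  def initialize(dimensions: DIMENSIONS)
    @dimensions = dimensions
  end

  # Stub: reports what it was asked to do instead of embedding anything.
  def backfill(rows)
    { processed: rows.size, skipped: 0, failed: 0, dimensions: @dimensions }
  end
end

default_run = SketchUnifier.backfill(%w[a b c])
narrow_run = SketchUnifier.backfill(%w[a], dimensions: 768)
```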

Instance Method Details

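#backfill(primary_rows) ⇒ Hash

Embeds the given primary rows batch by batch, counting rows with blank
content as skipped, and returns the :processed, :skipped and :failed counts.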



# File 'app/services/embedding/text_unifier.rb', line 52

def backfill(primary_rows)
  counts = { processed: 0, skipped: 0, failed: 0 }

  each_batch(primary_rows) do |rows|
    prepared = rows.filter_map do |row|
      content = content_for(row)
      if content.blank?
        counts[:skipped] += 1
        nil
      else
        { row: row, content: content }
      end
    end
    next if prepared.empty?

    vectors = Embedding::Gemini.embed_texts(prepared.pluck(:content), dimensions: @dimensions)
    write_batch(prepared, vectors, counts)
  end

  counts
end
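
The loop above relies on private helpers (each_batch, content_for, write_batch) that are not shown on this page. The following is a minimal, self-contained sketch of the same skip, embed, write flow. It uses in-memory stand-ins throughout: Row, fake_embed_texts, sketch_backfill and the store hash are all hypothetical, and treating a missing or wrong-width vector as a failure is an assumption about what write_batch might do.

```ruby
# Hypothetical in-memory stand-ins; none of these names come from the real codebase.
Row = Struct.new(:id, :content)

# Fake embedder: returns one fixed-width vector per text instead of calling Gemini.
def fake_embed_texts(texts, dimensions:)
  texts.map { |text| Array.new(dimensions, text.length.to_f) }
end

def sketch_backfill(rows, batch_size: 2, dimensions: 4)
  counts = { processed: 0, skipped: 0, failed: 0 }
  store = {} # stands in for the content_type = 'unified' rows

  rows.each_slice(batch_size) do |batch|
    # Drop rows with nil or whitespace-only content, counting each as skipped.
    prepared = batch.filter_map do |row|
      if row.content.to_s.strip.empty?
        counts[:skipped] += 1
        nil
      else
        row
      end
    end
    next if prepared.empty?

    vectors = fake_embed_texts(prepared.map(&:content), dimensions: dimensions)
    prepared.zip(vectors).each do |row, vector|
      # Assumption: a missing or wrong-width vector counts as a failure.
      if vector.nil? || vector.size != dimensions
        counts[:failed] += 1
      else
        store[row.id] = vector
        counts[:processed] += 1
      end
    end
  end

  [counts, store]
end

rows = [Row.new(1, "hello"), Row.new(2, "  "), Row.new(3, nil), Row.new(4, "world!")]
counts, store = sketch_backfill(rows)
```

Skipped rows never reach the embedder, so a batch made up entirely of blank rows costs no request at all.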