Module: ContentEmbedding::UnifiedSearchable

Extended by:
ActiveSupport::Concern
Included in:
ContentEmbedding
Defined in:
app/models/concerns/content_embedding/unified_searchable.rb

Overview

Concern providing cross-modal "unified" search over Gemini Embedding 2 vectors
stored in the unified_embedding column. Text and images are embedded into the
same multimodal space, so a single query retrieves both.

These methods live alongside the OpenAI-space search (semantic_search /
hybrid_search) on the base model. SemanticSearchService routes here when the
unified-cutover flag is on. Sensitive types are excluded by default
(exclude_sensitive: true); this parity with the OpenAI path is required before
any public/MCP cutover.

Base-model constants are referenced fully qualified (e.g.
ContentEmbedding::UNIFIED_MODEL) because the compact definition form,
"module ContentEmbedding::UnifiedSearchable", does not place ContentEmbedding
in the lexical constant-lookup scope.
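The lookup rule can be seen in isolation with a minimal sketch. Outer, LIMIT, Compact and Nested are hypothetical names used only for illustration; they are not part of this codebase.

```ruby
class Outer
  LIMIT = 10
end

# Compact form: the lexical scope contains only Outer::Compact,
# so a bare LIMIT is not resolved through Outer.
module Outer::Compact
  def self.bare
    LIMIT
  rescue NameError
    :name_error
  end

  # Fully qualified reference works regardless of lexical scope.
  def self.qualified
    Outer::LIMIT
  end
end

# Nested form: Outer is part of the lexical scope, so bare LIMIT resolves.
class Outer
  module Nested
    def self.bare
      LIMIT
    end
  end
end

puts Outer::Compact.bare.inspect      # :name_error
puts Outer::Compact.qualified.inspect # 10
puts Outer::Nested.bare.inspect       # 10
```

This is why the concern spells out ContentEmbedding::UNIFIED_MODEL rather than UNIFIED_MODEL.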

Examples:

ContentEmbedding.unified_hybrid_search("bathroom with heated floors")

Class Method Summary

Class Method Details

.generate_unified_query_embedding(query, model: ContentEmbedding::UNIFIED_MODEL, dimensions: ContentEmbedding::UNIFIED_DIMENSIONS) ⇒ Array<Float>?

Generate (and cache) a query embedding via Gemini Embedding 2, the only
embedding model. The model argument is retained for cache-key namespacing and
caller back-compat; every query is embedded through Gemini regardless of its
value.

Parameters:

  • query (String)

    Text to embed

  • model (String) (defaults to: ContentEmbedding::UNIFIED_MODEL)

    Target embedding model (cache namespace)

  • dimensions (Integer) (defaults to: ContentEmbedding::UNIFIED_DIMENSIONS)

    Vector dimensions for the model

Returns:

  • (Array<Float>, nil)

    Embedding vector, or nil on error



# File 'app/models/concerns/content_embedding/unified_searchable.rb', line 157

def generate_unified_query_embedding(query, model: ContentEmbedding::UNIFIED_MODEL, dimensions: ContentEmbedding::UNIFIED_DIMENSIONS)
  cache_key = "unified_query_embedding:#{model}:#{Digest::SHA256.hexdigest(query.downcase.strip)[0..15]}"
  cached = Rails.cache.read(cache_key)
  return cached if cached.present?

  vector = Embedding::Gemini.embed_query(query, dimensions: dimensions)

  Rails.cache.write(cache_key, vector, expires_in: 24.hours) if vector.present?
  vector
rescue StandardError => e
  Rails.logger.error "Failed to generate unified query embedding (#{model}): #{e.message}"
  nil
end
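The cache-key normalization can be tried without Rails. The sketch below copies the key expression from the listing into a hypothetical helper, unified_query_cache_key; the model names are placeholders, not real model identifiers.

```ruby
require 'digest'

# Hypothetical helper that mirrors the cache-key expression in the listing.
def unified_query_cache_key(query, model)
  normalized = query.downcase.strip
  digest = Digest::SHA256.hexdigest(normalized)[0..15] # first 16 hex chars
  "unified_query_embedding:#{model}:#{digest}"
end

# Case and surrounding whitespace do not change the key.
puts unified_query_cache_key('  Heated Floors ', 'model-a') ==
     unified_query_cache_key('heated floors', 'model-a')

# A different model namespace yields a different key.
puts unified_query_cache_key('heated floors', 'model-a') ==
     unified_query_cache_key('heated floors', 'model-b')
```

As the key is written in the listing, dimensions is not part of it, so two calls that differ only in dimensions would read the same cache entry.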

.unified_hybrid_search(query, limit: 10, types: nil, locale: 'en', published_only: true, k: 60, min_similarity: ContentEmbedding::SEMANTIC_SIMILARITY_THRESHOLD, exclude_sensitive: true, model: ContentEmbedding::UNIFIED_MODEL) ⇒ Array<ContentEmbedding>

Hybrid (vector + keyword RRF) search over the unified space — parity with
hybrid_search, but on Gemini vectors and cross-modal.

Parameters:

  • query (String)

    Natural language search query

  • limit (Integer) (defaults to: 10)

    Maximum results

  • types (Array<String>, nil) (defaults to: nil)

    Filter by embeddable types

  • locale (String) (defaults to: 'en')

    Locale filter

  • published_only (Boolean) (defaults to: true)

    Only published/active content

  • k (Integer) (defaults to: 60)

    RRF constant (default 60)

  • min_similarity (Float) (defaults to: ContentEmbedding::SEMANTIC_SIMILARITY_THRESHOLD)

    Minimum cosine similarity for the vector half

  • exclude_sensitive (Boolean) (defaults to: true)

    Exclude SENSITIVE_TYPES (default: true)

  • model (String) (defaults to: ContentEmbedding::UNIFIED_MODEL)

    Embedding model

Returns:

  • (Array<ContentEmbedding>)

    Records ordered by descending RRF score



# File 'app/models/concerns/content_embedding/unified_searchable.rb', line 86

def unified_hybrid_search(query, limit: 10, types: nil, locale: 'en', published_only: true, k: 60,
                          min_similarity: ContentEmbedding::SEMANTIC_SIMILARITY_THRESHOLD,
                          exclude_sensitive: true, model: ContentEmbedding::UNIFIED_MODEL)
  return [] if query.blank?

  fetch_limit = [limit * 3, 30].max

  vector_results = unified_search(query, model: model, limit: fetch_limit, types: types, locale: locale,
                                  published_only: published_only, min_similarity: min_similarity,
                                  exclude_sensitive: exclude_sensitive).to_a

  # Keyword half must rank the SAME row set as the vector half (unified rows),
  # so RRF fuses by aligned content_embedding ids — mirror unified_search's
  # model selection + with_unified_embedding exactly.
  search_models = model == ContentEmbedding::UNIFIED_MODEL ? ContentEmbedding::UNIFIED_MODELS : model
  unified_scope = where(content_type: 'unified', embedding_model: search_models).with_unified_embedding
  keyword_results = keyword_search_for_rrf(query, fetch_limit, types, locale, published_only,
                                           exclude_sensitive: exclude_sensitive, base_scope: unified_scope)

  rrf_scores = calculate_rrf_scores(vector_results, keyword_results, k)
  sorted_entries = rrf_scores.sort_by { |_id, score| -score }.first(limit)
  return [] if sorted_entries.empty?

  score_map = sorted_entries.to_h
  sorted_ids = sorted_entries.map(&:first)
  records = where(id: sorted_ids).includes(:embeddable).index_by(&:id)
  sorted_ids.filter_map do |id|
    record = records[id]
    next unless record

    # Store RRF score as a virtual distance for similarity_score parity.
    record.define_singleton_method(:neighbor_distance) { 1.0 - score_map[id] }
    record
  end
end
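The fusion step can be illustrated with a small standalone sketch. calculate_rrf_scores is not shown on this page, so the function below is an assumption: it uses the common Reciprocal Rank Fusion form, score = sum of 1 / (k + rank) with 1-based ranks, and works on plain id lists rather than records. rrf_fetch_limit and rrf_scores are hypothetical names.

```ruby
# Over-fetch size used by the listing: three times the limit, at least 30.
def rrf_fetch_limit(limit)
  [limit * 3, 30].max
end

# Assumed RRF formula: each list contributes 1 / (k + rank) per id,
# where rank is the 1-based position in that list.
def rrf_scores(vector_ids, keyword_ids, k: 60)
  scores = Hash.new(0.0)
  [vector_ids, keyword_ids].each do |ids|
    ids.each_with_index do |id, index|
      scores[id] += 1.0 / (k + index + 1)
    end
  end
  scores
end

scores = rrf_scores([1, 2, 3], [3, 1, 4])
ranked = scores.sort_by { |_id, score| -score }.map(&:first)
puts ranked.inspect # [1, 3, 2, 4]
```

Ids that appear in both lists accumulate two contributions, so they outrank ids found by only one half. With k = 60 every score is far below 1, which means the virtual neighbor_distance of 1.0 - score stays close to 1.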

.unified_search(query, model: ContentEmbedding::UNIFIED_MODEL, limit: 10, types: nil, locale: 'en', published_only: true, min_similarity: ContentEmbedding::SEMANTIC_SIMILARITY_THRESHOLD, exclude_sensitive: true) ⇒ ActiveRecord::Relation

Vector search over the unified (Gemini) space. Spans every embeddable type —
images included — since they share one vector space.

Parameters:

  • query (String)

    Natural language search query

  • model (String) (defaults to: ContentEmbedding::UNIFIED_MODEL)

    Embedding model (selects the partial index)

  • limit (Integer) (defaults to: 10)

    Maximum results

  • types (Array<String>, nil) (defaults to: nil)

    Filter by embeddable types

  • locale (String) (defaults to: 'en')

    Locale filter

  • published_only (Boolean) (defaults to: true)

    Only published/active content

  • min_similarity (Float) (defaults to: ContentEmbedding::SEMANTIC_SIMILARITY_THRESHOLD)

    Drop results below this cosine similarity

  • exclude_sensitive (Boolean) (defaults to: true)

    Exclude SENSITIVE_TYPES (default: true)

Returns:

  • (ActiveRecord::Relation)

    Embeddings ordered by similarity

Raises:

  • (ArgumentError)

    If model is not a key of ContentEmbedding::EMBEDDING_MODELS


# File 'app/models/concerns/content_embedding/unified_searchable.rb', line 35

def unified_search(query, model: ContentEmbedding::UNIFIED_MODEL, limit: 10, types: nil, locale: 'en',
                   published_only: true, min_similarity: ContentEmbedding::SEMANTIC_SIMILARITY_THRESHOLD,
                   exclude_sensitive: true)
  return none if query.blank?

  model_config = ContentEmbedding::EMBEDDING_MODELS[model]
  raise ArgumentError, "Unknown embedding model: #{model}" unless model_config

  dimensions = model_config[:dimensions]
  query_embedding = generate_unified_query_embedding(query, model: model, dimensions: dimensions)
  return none unless query_embedding

  # Query vector is bound via sanitize_sql_array; dimensions is a trusted integer.
  vector_literal = "[#{query_embedding.join(',')}]"
  distance_sql = sanitize_sql_array(["unified_embedding::vector(#{dimensions.to_i}) <=> ?::vector", vector_literal])

  # A GA-model search also matches transitional preview rows pending re-embed.
  scope = by_model(model == ContentEmbedding::UNIFIED_MODEL ? ContentEmbedding::UNIFIED_MODELS : model)
          .with_unified_embedding
          .select("#{table_name}.*, #{distance_sql} AS neighbor_distance")
          .order(Arel.sql(distance_sql))

  scope = scope.mcp_safe if exclude_sensitive

  if min_similarity.positive?
    # Cosine distance is 0–2; convert the similarity floor with the same
    # convention as ContentEmbedding#similarity_score (similarity =
    # 1 - distance/2) so min_similarity matches the displayed score.
    max_distance = (2.0 * (1.0 - min_similarity)).round(6)
    scope = scope.where(sanitize_sql_array(["unified_embedding::vector(#{dimensions.to_i}) <=> ?::vector <= ?", vector_literal, max_distance]))
  end

  scope = scope.by_type(types) if types.present?
  scope = scope.for_locale(locale)
  scope = scope.published_only if published_only
  scope.limit(limit).includes(:embeddable)
end
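The similarity floor and the vector literal are plain arithmetic and string formatting, so they can be checked outside the database. The helper names below are hypothetical; the formulas are the ones in the listing and its comments (similarity = 1 - distance / 2).

```ruby
# Convert a minimum similarity into the maximum cosine distance (range 0 to 2).
def max_distance_for(min_similarity)
  (2.0 * (1.0 - min_similarity)).round(6)
end

# Inverse convention, as described for ContentEmbedding#similarity_score.
def similarity_for(distance)
  1.0 - distance / 2.0
end

# pgvector-style text literal built from a Ruby array of floats.
def vector_literal(embedding)
  "[#{embedding.join(',')}]"
end

puts max_distance_for(0.7)         # 0.6
puts similarity_for(1.0)           # 0.5
puts vector_literal([0.1, -0.25])  # [0.1,-0.25]
```

Because both directions use the same convention, a row sitting exactly at the distance cutoff displays a similarity equal to min_similarity.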

.unified_visual_search(query, model: ContentEmbedding::UNIFIED_MODEL, limit: 10) ⇒ ActiveRecord::Relation

Cross-modal visual search (text -> image) over the unified space.

Parameters:

  • query (String)

    Text description of desired images

  • model (String) (defaults to: ContentEmbedding::UNIFIED_MODEL)

    Embedding model (should be multimodal)

  • limit (Integer) (defaults to: 10)

    Maximum results

Returns:

  • (ActiveRecord::Relation)

    Image embeddings ordered by similarity

Raises:

  • (ArgumentError)

    If model is not a key of ContentEmbedding::EMBEDDING_MODELS


# File 'app/models/concerns/content_embedding/unified_searchable.rb', line 128

def unified_visual_search(query, model: ContentEmbedding::UNIFIED_MODEL, limit: 10)
  return none if query.blank?

  model_config = ContentEmbedding::EMBEDDING_MODELS[model]
  raise ArgumentError, "Unknown embedding model: #{model}" unless model_config

  dimensions = model_config[:dimensions]
  query_embedding = generate_unified_query_embedding(query, model: model, dimensions: dimensions)
  return none unless query_embedding

  distance_sql = sanitize_sql_array(["unified_embedding::vector(#{dimensions.to_i}) <=> ?::vector", "[#{query_embedding.join(',')}]"])

  by_model(model == ContentEmbedding::UNIFIED_MODEL ? ContentEmbedding::UNIFIED_MODELS : model)
    .where(embeddable_type: 'Image')
    .with_unified_embedding
    .select("#{table_name}.*, #{distance_sql} AS neighbor_distance")
    .order(Arel.sql(distance_sql))
    .limit(limit)
    .includes(:embeddable)
end
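As a rough in-memory analogue of the query above, the sketch below filters rows to images and orders them by cosine distance to the query vector. Row, cosine_distance, visual_search and the sample data are hypothetical. Treating with_unified_embedding as "skip rows without a vector" is an assumption, since that scope is not shown on this page.

```ruby
Row = Struct.new(:id, :embeddable_type, :vector)

# Cosine distance: 0 for the same direction, 1 for orthogonal, 2 for opposite.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norms = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / norms
end

# Keep image rows that have a vector, nearest first, capped at limit.
def visual_search(rows, query_vector, limit: 10)
  rows.select { |row| row.embeddable_type == 'Image' && row.vector }
      .sort_by { |row| cosine_distance(row.vector, query_vector) }
      .first(limit)
end

SAMPLE_ROWS = [
  Row.new(1, 'Image',   [1.0, 0.0]),
  Row.new(2, 'Article', [1.0, 0.0]),  # excluded: not an image
  Row.new(3, 'Image',   [0.0, 1.0]),
  Row.new(4, 'Image',   [-1.0, 0.0]),
  Row.new(5, 'Image',   nil)          # excluded: no vector stored
].freeze

puts visual_search(SAMPLE_ROWS, [1.0, 0.0], limit: 2).map(&:id).inspect # [1, 3]
```

In the real method the ordering is done by the database through the <=> operator; the sketch only shows the shape of the result.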