Semantic Search with Vector Embeddings

Overview

This feature adds semantic search across all content types using OpenAI embeddings and pgvector. Users can search by meaning rather than exact keywords, with queries such as "find showcases about snow melting under pavers" or "videos showing bathroom floor heating installation".

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Content Sources                               │
│  Posts, Showcases, Videos, Images, Pages, Products              │
│                           │                                      │
│              Models::Embeddable (Concern)                        │
│                           │                                      │
│                           ▼                                      │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │            content_embeddings (polymorphic)                 │ │
│  │   embedding: vector(1536) with HNSW cosine index           │ │
│  └────────────────────────────────────────────────────────────┘ │
│                           │                                      │
│           RubyLLM.embed (text-embedding-3-small)                │
│                           │                                      │
│              SemanticSearchService                               │
└─────────────────────────────────────────────────────────────────┘

Components

1. Database Tables

content_embeddings - Polymorphic table storing vector embeddings
page_contents - Stores extracted content from static ERB pages

2. Models

ContentEmbedding - Core model for storing and querying embeddings
PageContent - Model for static page content with extraction

3. Concern

Models::Embeddable - Include in any model to enable embedding generation

4. Services

SemanticSearchService - High-level search interface
EmbeddingWorker - Background job for embedding generation

Setup

1. Install Dependencies

bundle install  # Installs the 'neighbor' gem listed in the Gemfile

2. Run Migrations

bundle exec rails db:migrate

This will:

Enable the pgvector extension
Create the content_embeddings table with HNSW index
Create the page_contents table

3. Generate Initial Embeddings

# Generate all embeddings (may take 10-30 minutes depending on content volume)
bundle exec rake embeddings:all

# Or generate by content type
bundle exec rake embeddings:posts
bundle exec rake embeddings:showcases
bundle exec rake embeddings:videos
bundle exec rake embeddings:images
bundle exec rake embeddings:pages

Usage

Basic Search

# Search across all content types
results = SemanticSearchService.search("snow melting under pavers")

# Search specific types
results = SemanticSearchService.new("heated driveway", types: ['showcases', 'posts']).search

# Results include similarity scores
results.each do |r|
  puts "#{r[:type]}: #{r[:record].name} (#{(r[:similarity] * 100).round}% match)"
end
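
The result hashes above can be post-processed with plain Ruby. The sketch below assumes each result is a hash with :type, :record, and :similarity keys, as in the loop above. The strong_matches helper, the 0.5 cutoff, and the FakeRecord stand-in are hypothetical and not part of SemanticSearchService.

```ruby
# Hypothetical helper: keep results at or above a similarity cutoff,
# order them best-first, and group them by content type.
def strong_matches(results, min_similarity: 0.5)
  results
    .select { |r| r[:similarity] >= min_similarity }
    .sort_by { |r| -r[:similarity] }
    .group_by { |r| r[:type] }
end

# Stand-in for ActiveRecord records; only #name is needed here.
FakeRecord = Struct.new(:name)

SAMPLE_RESULTS = [
  { type: "Showcase", record: FakeRecord.new("Paver driveway"),   similarity: 0.82 },
  { type: "Post",     record: FakeRecord.new("Snow melting 101"), similarity: 0.61 },
  { type: "Showcase", record: FakeRecord.new("Heated patio"),     similarity: 0.34 },
  { type: "Showcase", record: FakeRecord.new("Heated walkway"),   similarity: 0.90 }
].freeze

strong_matches(SAMPLE_RESULTS).each do |type, rows|
  names = rows.map { |r| "#{r[:record].name} (#{(r[:similarity] * 100).round}%)" }
  puts "#{type}: #{names.join(', ')}"
end
```

A cutoff like this is a judgment call: similarity scores depend on the embedding model and the content, so any threshold should be tuned against real queries.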

Convenience Methods

# Find showcases
showcases = SemanticSearchService.find_showcases("bathroom floor heating")

# Find videos
videos = SemanticSearchService.find_videos("installation guide")

# Find blog posts
posts = SemanticSearchService.find_posts("heated driveway cost")

# Find products
products = SemanticSearchService.find_products("snow melting mat")

# Find pages
pages = SemanticSearchService.find_pages("warranty information")

Model-Level Search

# Search within a model
Showcase.semantic_search("modern bathroom radiant heat")

# Find similar content
showcase = Showcase.find(123)
similar = showcase.find_similar(limit: 5)

# Cross-type similarity
showcase.find_similar(same_type_only: false)

Manual Embedding Generation

# Generate embedding for a record
post.generate_embedding!(:primary)

# Force regeneration
post.generate_embedding!(:primary, force: true)

# Generate embeddings for every content type the model defines
video.generate_all_embeddings!

# Check if stale
post.embedding_stale?  # => true/false
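
This document does not say how embedding_stale? decides that an embedding is out of date. One common approach, sketched here purely as an assumption, is to store a digest of the embedded text and compare it with a digest of the current text. The content_digest and stale? names are hypothetical.

```ruby
require "digest"

# Hypothetical: fingerprint of the text that was sent for embedding.
def content_digest(text)
  Digest::SHA256.hexdigest(text.to_s.strip)
end

# Hypothetical: an embedding is stale when no digest was stored
# or the current content no longer matches the stored digest.
def stale?(current_text, stored_digest)
  stored_digest.nil? || content_digest(current_text) != stored_digest
end

stored = content_digest("Heated driveway installation guide")
puts stale?("Heated driveway installation guide", stored)
puts stale?("Heated driveway installation guide (updated)", stored)
puts stale?("Anything", nil)
```

Comparing digests avoids re-embedding records whose embedded text did not actually change, even if an unrelated column was updated.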

Adding Embeddable to New Models

class MyModel < ApplicationRecord
  include Models::Embeddable
  
  # Define content types to embed
  def self.embeddable_content_types
    [:primary, :summary]
  end
  
  # Provide content for embedding
  def content_for_embedding(content_type = :primary)
    case content_type.to_sym
    when :primary
      [title, description, body].compact.join("\n\n")
    when :summary
      short_description
    end
  end
  
  private
  
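  # Trigger re-embedding when any field used in content_for_embedding changes
  def embedding_content_changed?
    saved_change_to_title? || saved_change_to_description? ||
      saved_change_to_body? || saved_change_to_short_description?
  end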
end

Rake Tasks

# Generate embeddings
bundle exec rake embeddings:all          # All content types
bundle exec rake embeddings:posts        # Blog posts only
bundle exec rake embeddings:showcases    # Showcases only
bundle exec rake embeddings:videos       # Videos only
bundle exec rake embeddings:images       # Images only
bundle exec rake embeddings:pages        # Static pages only

# Maintenance
bundle exec rake embeddings:stats        # Show embedding counts
bundle exec rake embeddings:refresh_stale # Regenerate stale embeddings
bundle exec rake embeddings:clear        # Delete all embeddings (careful!)

# Testing
bundle exec rake "embeddings:search[snow melting heated driveway]"

Cost Estimation

Using OpenAI's text-embedding-3-small at $0.02 per 1M tokens:

Content Type	Est. Count	Avg Tokens	Est. Cost
Posts	~500	2,000	~$0.02
Showcases	~200	500	~$0.002
Videos	~300	1,500	~$0.009
Images	~2,000	200	~$0.008
Pages	~100	3,000	~$0.006
Total			~$0.05

Ongoing costs are minimal as embeddings only regenerate when content changes.
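
The table's arithmetic can be reproduced with a short script: cost = count × average tokens × price per token. The counts and token averages are the estimates from the table; the constant and method names are illustrative only.

```ruby
PRICE_PER_MILLION_TOKENS = 0.02 # USD, the text-embedding-3-small rate quoted above

# Estimated item counts and average tokens per item, copied from the table.
ESTIMATES = {
  posts:     { count: 500,   avg_tokens: 2_000 },
  showcases: { count: 200,   avg_tokens: 500 },
  videos:    { count: 300,   avg_tokens: 1_500 },
  images:    { count: 2_000, avg_tokens: 200 },
  pages:     { count: 100,   avg_tokens: 3_000 }
}.freeze

# Cost in USD for embedding `count` items of `avg_tokens` tokens each.
def embedding_cost(count:, avg_tokens:, price: PRICE_PER_MILLION_TOKENS)
  count * avg_tokens * price / 1_000_000.0
end

total = ESTIMATES.sum do |type, est|
  cost = embedding_cost(**est)
  puts format("%-10s %9d tokens  $%.3f", type, est[:count] * est[:avg_tokens], cost)
  cost
end
puts format("%-10s %9s         $%.3f", "total", "", total)
```

The per-type estimates add up to 2.25M tokens, or about $0.045, which the table rounds to ~$0.05.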

Technical Details

Embedding Model

Model: text-embedding-3-small
Dimensions: 1536
Max input: ~8,000 tokens (~30,000 characters)
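
Because the model caps input length, long content has to be shortened before it is embedded. The following is a minimal sketch that uses the ~30,000 character figure above as a rough budget. The truncate_for_embedding helper is hypothetical, and a character count is only a proxy for tokens, so a real implementation may prefer a tokenizer.

```ruby
MAX_EMBEDDING_CHARS = 30_000 # rough character budget quoted above, not an exact token limit

# Hypothetical helper: collapse runs of whitespace, then cut the text at the
# last word boundary that fits within the character budget.
def truncate_for_embedding(text, max_chars: MAX_EMBEDDING_CHARS)
  normalized = text.to_s.gsub(/\s+/, " ").strip
  return normalized if normalized.length <= max_chars

  clipped = normalized[0, max_chars]
  return clipped if normalized[max_chars] == " " # cut landed exactly on a word boundary

  last_space = clipped.rindex(" ")
  last_space ? clipped[0, last_space] : clipped
end

long_text = "radiant floor heating " * 2_000
puts truncate_for_embedding(long_text).length
```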

Index Strategy

Index type: HNSW (Hierarchical Navigable Small World)
Distance metric: Cosine distance (similarity = 1 - distance)
Benefits: Fast queries (~1-5ms), good recall
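
For reference, pgvector's <=> operator returns cosine distance, and similarity is 1 minus that distance. The sketch below computes both in plain Ruby on small vectors. It only illustrates the metric and is not how the index evaluates queries; returning 0.0 for a zero-length vector is a convention chosen here.

```ruby
# Cosine similarity of two equal-length numeric vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero? # convention for this sketch only

  dot / (norm_a * norm_b)
end

# Cosine distance: smaller means closer.
def cosine_distance(a, b)
  1.0 - cosine_similarity(a, b)
end

query   = [0.2, 0.7, 0.1]
similar = [0.25, 0.65, 0.05]
other   = [0.9, 0.0, 0.4]

puts cosine_similarity(query, similar).round(3)
puts cosine_similarity(query, other).round(3)
puts cosine_distance(query, similar).round(3)
```

Real embeddings from this model have 1536 dimensions, but the calculation is the same.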

Content Types

Type	Description
primary	Main text content (title, description, body)
visual	Image/video descriptions for visual search
transcript	Full video transcripts
specifications	Product specifications

Troubleshooting

No results returned

Check if embeddings exist: ContentEmbedding.count
Run embedding generation: bundle exec rake embeddings:all
Verify OpenAI API key is configured

Slow queries

Ensure HNSW index exists: Check idx_embeddings_hnsw_cosine
Consider reducing result limit
Check for missing indexes on polymorphic columns
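
One way to check for the index is to query PostgreSQL's pg_indexes view. The sketch below assumes the index lives on the content_embeddings table and that the connection object responds to select_value, as ActiveRecord::Base.connection does. The helper names and the StubConnection class are hypothetical.

```ruby
# Builds a catalog query that looks up an index definition by name.
# pg_indexes is a standard PostgreSQL system view.
def index_exists_sql(index_name, table_name)
  quote = ->(value) { "'#{value.to_s.gsub("'", "''")}'" }
  <<~SQL
    SELECT indexdef
    FROM pg_indexes
    WHERE tablename = #{quote.call(table_name)}
      AND indexname = #{quote.call(index_name)}
  SQL
end

# `connection` is assumed to respond to #select_value(sql) and return
# the first column of the first row, or nil when nothing matches.
def hnsw_index_present?(connection,
                        index_name: "idx_embeddings_hnsw_cosine",
                        table_name: "content_embeddings")
  indexdef = connection.select_value(index_exists_sql(index_name, table_name))
  !indexdef.nil? && indexdef.match?(/USING hnsw/i)
end

# Stub so the sketch runs without a database; it returns a canned definition.
class StubConnection
  def initialize(indexdef)
    @indexdef = indexdef
  end

  def select_value(_sql)
    @indexdef
  end
end

stub = StubConnection.new(
  "CREATE INDEX idx_embeddings_hnsw_cosine ON public.content_embeddings " \
  "USING hnsw (embedding vector_cosine_ops)"
)
puts hnsw_index_present?(stub)
```

In a Rails console the same check would be called with ActiveRecord::Base.connection in place of the stub.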

Stale embeddings

Run bundle exec rake embeddings:stats to check counts
Run bundle exec rake embeddings:refresh_stale to update

API errors

Check log/sidekiq.log for worker errors
Verify OpenAI API quota
EmbeddingWorker includes retry logic with exponential backoff
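
Exponential backoff means waiting longer after each failed attempt: base delay, then 2x, then 4x, and so on. The sketch below is a generic illustration of that pattern, not the EmbeddingWorker implementation. The with_backoff name and its parameters are hypothetical.

```ruby
# Hypothetical retry wrapper: re-runs the block when it raises, doubling the
# wait after each failure. `sleeper` is injectable so the behavior can be
# exercised without real waiting.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts # give up and surface the last error

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Simulate two failures followed by a success, recording waits instead of sleeping.
waits = []
result = with_backoff(base_delay: 0.5, sleeper: ->(s) { waits << s }) do |attempt|
  raise "rate limited" if attempt < 3

  "embedded on attempt #{attempt}"
end
puts result
puts waits.inspect
```

Capping the number of attempts keeps a persistent failure, such as an exhausted quota, from retrying forever.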

Related Files

app/models/content_embedding.rb
app/models/page_content.rb
app/concerns/models/embeddable.rb
app/workers/embedding_worker.rb
app/services/semantic_search_service.rb
lib/tasks/embeddings.rake
db/migrate/20251214200000_enable_pgvector_extension.rb
db/migrate/20251214200001_create_content_embeddings.rb
db/migrate/20251214200002_create_page_contents.rb