Semantic Search with Vector Embeddings

Overview

This feature enables AI-powered semantic search across all content types using OpenAI embeddings and pgvector. Users can search by meaning rather than exact keywords, enabling queries like "find showcases about snow melting under pavers" or "videos showing bathroom floor heating installation".

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Content Sources                               │
│  Posts, Showcases, Videos, Images, Pages, Products              │
│                           │                                      │
│              Models::Embeddable (Concern)                        │
│                           │                                      │
│                           ▼                                      │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │            content_embeddings (polymorphic)                 │ │
│  │   embedding: vector(1536) with HNSW cosine index           │ │
│  └────────────────────────────────────────────────────────────┘ │
│                           │                                      │
│           RubyLLM.embed (text-embedding-3-small)                │
│                           │                                      │
│              SemanticSearchService                               │
└─────────────────────────────────────────────────────────────────┘
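
To make the ranking step in the diagram concrete, here is a minimal sketch of cosine-similarity search in plain Ruby. It is a brute-force stand-in for what the HNSW index approximates, not the application's code: the helper names (cosine_similarity, rank_by_similarity), the toy 3-dimensional vectors, and the labels are all invented for illustration.

```ruby
# Brute-force cosine ranking over toy vectors. Real embeddings have 1536
# dimensions; these 3-dimensional vectors and labels are made up.

def cosine_similarity(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Scores every stored vector against the query and returns the best matches.
def rank_by_similarity(query_vector, stored, limit: 3)
  stored
    .map { |label, vector| { label: label, similarity: cosine_similarity(query_vector, vector) } }
    .sort_by { |row| -row[:similarity] }
    .first(limit)
end

stored_vectors = {
  "Showcase: paver driveway snow melting" => [0.9, 0.1, 0.0],
  "Video: bathroom floor heating install" => [0.1, 0.9, 0.1],
  "Page: warranty information"            => [0.0, 0.1, 0.9]
}

query = [0.8, 0.2, 0.0] # stand-in for the embedded query text
rank_by_similarity(query, stored_vectors).each do |row|
  puts format("%.3f  %s", row[:similarity], row[:label])
end
```

An HNSW index avoids scoring every row by walking a graph of near neighbours, trading a small amount of recall for much lower query latency.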

Components

1. Database Tables

  • content_embeddings - Polymorphic table storing vector embeddings
  • page_contents - Stores extracted content from static ERB pages

2. Models

  • ContentEmbedding - Core model for storing and querying embeddings
  • PageContent - Model for static page content with extraction

3. Concern

  • Models::Embeddable - Include in any model to enable embedding generation

4. Services

  • SemanticSearchService - High-level search interface
  • EmbeddingWorker - Background job for embedding generation

Setup

1. Install Dependencies

bundle install  # Installs the 'neighbor' gem (pgvector support for ActiveRecord)

2. Run Migrations

bundle exec rails db:migrate

This will:

  • Enable the pgvector extension
  • Create the content_embeddings table with HNSW index
  • Create the page_contents table
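
For orientation, the three steps above correspond roughly to the SQL printed by the sketch below. This is an approximation, not the contents of the actual migrations: the polymorphic and content_type column names are assumptions, and pgvector_setup_sql is a hypothetical helper. Only the vector(1536) column, the HNSW cosine index, and the index name idx_embeddings_hnsw_cosine come from this document.

```ruby
# Approximate SQL for the pgvector setup. Column names other than
# `embedding` are assumptions; the real migrations may differ.
def pgvector_setup_sql(dimensions: 1536, index_name: "idx_embeddings_hnsw_cosine")
  enable_extension = "CREATE EXTENSION IF NOT EXISTS vector;"

  create_table = <<~SQL
    CREATE TABLE content_embeddings (
      id              bigserial PRIMARY KEY,
      embeddable_type varchar NOT NULL,  -- assumed polymorphic column
      embeddable_id   bigint  NOT NULL,  -- assumed polymorphic column
      content_type    varchar NOT NULL,  -- assumed, e.g. 'primary'
      embedding       vector(#{dimensions})
    );
  SQL

  create_index = <<~SQL
    CREATE INDEX #{index_name}
      ON content_embeddings
      USING hnsw (embedding vector_cosine_ops);
  SQL

  [enable_extension, create_table, create_index].map(&:strip)
end

puts pgvector_setup_sql.join("\n\n")
```

The vector_cosine_ops operator class is what ties the index to cosine distance; an index built with a different operator class would not be used for cosine queries.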

3. Generate Initial Embeddings

# Generate all embeddings (may take 10-30 minutes depending on content volume)
bundle exec rake embeddings:all

# Or generate by content type
bundle exec rake embeddings:posts
bundle exec rake embeddings:showcases
bundle exec rake embeddings:videos
bundle exec rake embeddings:images
bundle exec rake embeddings:pages

Usage

Basic Search

# Search across all content types
results = SemanticSearchService.search("snow melting under pavers")

# Search specific types
results = SemanticSearchService.new("heated driveway", types: ['showcases', 'posts']).search

# Results include similarity scores
results.each do |r|
  puts "#{r[:type]}: #{r[:record].name} (#{(r[:similarity] * 100).round}% match)"
end
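
Since every result carries a similarity score, weak matches can be filtered out before display. The sketch below assumes results are hashes with :type, :record and :similarity keys, as in the loop above. The confident_results helper, the 0.5 threshold, the string type values, and the stub records are illustrative assumptions, not part of SemanticSearchService.

```ruby
# Keep only results at or above a minimum similarity, grouped by type.
# `confident_results` is a hypothetical helper, not part of the service.
def confident_results(results, min_similarity: 0.5)
  results
    .select { |r| r[:similarity] >= min_similarity }
    .group_by { |r| r[:type] }
end

stub_record = Struct.new(:name) # stand-in for an ActiveRecord model
sample_results = [
  { type: "showcases", record: stub_record.new("Heated paver driveway"),  similarity: 0.82 },
  { type: "posts",     record: stub_record.new("Driveway heating costs"), similarity: 0.64 },
  { type: "videos",    record: stub_record.new("Thermostat unboxing"),    similarity: 0.31 }
]

confident_results(sample_results).each do |type, rows|
  rows.each do |r|
    puts "#{type}: #{r[:record].name} (#{(r[:similarity] * 100).round}% match)"
  end
end
```

A sensible threshold depends on the content and the embedding model, so it is worth tuning against real queries rather than treating 0.5 as a recommendation.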

Convenience Methods

# Find showcases
showcases = SemanticSearchService.find_showcases("bathroom floor heating")

# Find videos
videos = SemanticSearchService.find_videos("installation guide")

# Find blog posts
posts = SemanticSearchService.find_posts("heated driveway cost")

# Find products
products = SemanticSearchService.find_products("snow melting mat")

# Find pages
pages = SemanticSearchService.find_pages("warranty information")

Model-Level Search

# Search within a model
Showcase.semantic_search("modern bathroom radiant heat")

# Find similar content
showcase = Showcase.find(123)
similar = showcase.find_similar(limit: 5)

# Cross-type similarity
showcase.find_similar(same_type_only: false)

Manual Embedding Generation

# Generate embedding for a record
post.generate_embedding!(:primary)

# Force regeneration
post.generate_embedding!(:primary, force: true)

# Generate all content types
video.generate_all_embeddings!

# Check if stale
post.embedding_stale?  # => true/false
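
This document does not say how embedding_stale? decides that an embedding is out of date. One common approach, sketched here purely as an assumption, is to store a digest of the text that was embedded and compare it with a digest of the current text. The helper names content_digest and stale_embedding? are hypothetical.

```ruby
require "digest"

# Digest of the text that would be sent to the embedding API.
def content_digest(text)
  Digest::SHA256.hexdigest(text.to_s.strip)
end

# An embedding is treated as stale when no digest was stored yet, or when
# the stored digest no longer matches the current content.
def stale_embedding?(current_text, stored_digest)
  stored_digest.nil? || stored_digest != content_digest(current_text)
end

original = "Heated driveway installation guide"
stored   = content_digest(original)

puts stale_embedding?(original, stored)                 # unchanged content
puts stale_embedding?("#{original} (updated)", stored)  # edited content
puts stale_embedding?(original, nil)                    # never embedded
```

Comparing digests avoids re-embedding records whose timestamps changed but whose embedded text did not.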

Adding Embeddable to New Models

class MyModel < ApplicationRecord
  include Models::Embeddable
  
  # Define content types to embed
  def self.embeddable_content_types
    [:primary, :summary]
  end
  
  # Provide content for embedding
  def content_for_embedding(content_type = :primary)
    case content_type.to_sym
    when :primary
      [title, description, body].compact.join("\n\n")
    when :summary
      short_description
    end
  end
  
  private
  
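  # Trigger re-embedding when any field used by content_for_embedding changes
  def embedding_content_changed?
    saved_change_to_title? || saved_change_to_description? ||
      saved_change_to_body? || saved_change_to_short_description?
  end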
end

Rake Tasks

# Generate embeddings
bundle exec rake embeddings:all          # All content types
bundle exec rake embeddings:posts        # Blog posts only
bundle exec rake embeddings:showcases    # Showcases only
bundle exec rake embeddings:videos       # Videos only
bundle exec rake embeddings:images       # Images only
bundle exec rake embeddings:pages        # Static pages only

# Maintenance
bundle exec rake embeddings:stats        # Show embedding counts
bundle exec rake embeddings:refresh_stale # Regenerate stale embeddings
bundle exec rake embeddings:clear        # Delete all embeddings (careful!)

# Testing
bundle exec rake "embeddings:search[snow melting heated driveway]"

Cost Estimation

Using OpenAI's text-embedding-3-small at $0.02 per 1M tokens:

Content Type   Est. Count   Avg Tokens   Est. Cost
Posts          ~500         2,000        ~$0.02
Showcases      ~200         500          ~$0.002
Videos         ~300         1,500        ~$0.009
Images         ~2,000       200          ~$0.008
Pages          ~100         3,000        ~$0.006
Total                                    ~$0.05

Ongoing costs are minimal as embeddings only regenerate when content changes.
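
The estimates above are simply count × average tokens × price per token. The following sketch reproduces the arithmetic so the figures can be recomputed when content volumes change. The counts and token averages are the estimates from the table, and embedding_cost is a hypothetical helper.

```ruby
# Cost in USD of embedding `count` documents of `avg_tokens` tokens each.
def embedding_cost(count:, avg_tokens:, price_per_million: 0.02)
  count * avg_tokens * price_per_million / 1_000_000.0
end

estimates = {
  "Posts"     => { count: 500,   avg_tokens: 2_000 },
  "Showcases" => { count: 200,   avg_tokens: 500 },
  "Videos"    => { count: 300,   avg_tokens: 1_500 },
  "Images"    => { count: 2_000, avg_tokens: 200 },
  "Pages"     => { count: 100,   avg_tokens: 3_000 }
}

estimates.each do |name, estimate|
  puts format("%-10s $%.3f", name, embedding_cost(**estimate))
end

total = estimates.values.sum { |estimate| embedding_cost(**estimate) }
puts format("%-10s $%.3f", "Total", total)
```

The exact sum is $0.045, which the table rounds to ~$0.05.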

Technical Details

Embedding Model

  • Model: text-embedding-3-small
  • Dimensions: 1536
  • Max input: 8,191 tokens (roughly 30,000 characters of English text)
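
Because over-long input exceeds the model's limit, text is typically trimmed before it is embedded. The sketch below shows one simple character-based guard using the ~30,000 character figure above. The truncate_for_embedding helper is hypothetical, and the document does not state how the application actually handles over-long content.

```ruby
# Collapse whitespace and cap the text at max_chars, preferring to cut at a
# word boundary. Character counts only approximate token counts.
def truncate_for_embedding(text, max_chars: 30_000)
  clean = text.to_s.gsub(/\s+/, " ").strip
  return clean if clean.length <= max_chars

  clipped = clean[0, max_chars]
  last_space = clipped.rindex(" ")
  last_space ? clipped[0, last_space] : clipped
end

long_text = "radiant floor heating " * 2_000
puts truncate_for_embedding(long_text).length
puts truncate_for_embedding("  short   text \n").inspect
```

A token-aware tokenizer gives a tighter bound than counting characters, at the cost of an extra dependency.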

Index Strategy

  • Index type: HNSW (Hierarchical Navigable Small World)
  • Distance metric: Cosine distance (similarity = 1 - distance)
  • Benefits: Fast approximate nearest-neighbor queries (typically ~1-5ms), good recall

Content Types

Type             Description
primary          Main text content (title, description, body)
visual           Image/video descriptions for visual search
transcript       Full video transcripts
specifications   Product specifications

Troubleshooting

No results returned

  1. Check if embeddings exist: ContentEmbedding.count
  2. Run embedding generation: bundle exec rake embeddings:all
  3. Verify OpenAI API key is configured

Slow queries

  1. Ensure the HNSW index exists: check for idx_embeddings_hnsw_cosine
  2. Consider reducing result limit
  3. Check for missing indexes on polymorphic columns

Stale embeddings

  1. Run bundle exec rake embeddings:stats to check counts
  2. Run bundle exec rake embeddings:refresh_stale to update

API errors

  1. Check log/sidekiq.log for worker errors
  2. Verify OpenAI API quota
  3. Allow time for automatic retries: EmbeddingWorker includes retry logic with exponential backoff
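
For readers unfamiliar with the pattern, exponential backoff doubles the wait between successive attempts. The following is a generic sketch of that idea and not the code in EmbeddingWorker: the with_backoff helper, its parameters, and the simulated failure are all assumptions made for illustration.

```ruby
# Retry the block up to max_attempts times, waiting base_delay, then
# 2 * base_delay, then 4 * base_delay, and so on between attempts.
# The sleeper is injectable so the example runs without real waiting.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts # give up and surface the last error

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

delays = []
result = with_backoff(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise "simulated API error" if attempt < 3

  "succeeded on attempt #{attempt}"
end

puts result
puts "waited: #{delays.inspect}"
```

Production implementations usually add random jitter to the delay so that many failing jobs do not retry at the same instant.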

Related Files

  • app/models/content_embedding.rb
  • app/models/page_content.rb
  • app/concerns/models/embeddable.rb
  • app/workers/embedding_worker.rb
  • app/services/semantic_search_service.rb
  • lib/tasks/embeddings.rake
  • db/migrate/20251214200000_enable_pgvector_extension.rb
  • db/migrate/20251214200001_create_content_embeddings.rb
  • db/migrate/20251214200002_create_page_contents.rb