SEO System Architecture

Comprehensive reference for the SEO analysis, metrics, and optimization features.

System Overview

The SEO system has four pillars:

  1. Data Collection — nightly syncs from first-party visits, GSC, GA4, and Ahrefs
  2. Content Crawling — nightly extraction of page content, JSON-LD schema, and internal links
  3. AI Analysis — real-time (on-demand) and batch (nightly) AI-powered SEO analysis
  4. Recommendations — extracted, trackable action items from AI analysis

Nightly Pipeline

All SEO jobs run on a staggered schedule to ensure data is fresh before analysis:

1:00 AM  SiteMapContentExtractionWorker    Crawl pages for content, schema, links
1:10 AM  Sitemap::SitemapGenerator          Generate public sitemap XML
2:00 AM  SeoVisitsSyncWorker                Sync first-party visit counts
2:00 AM  SeoGscSyncWorker                   Sync GSC clicks, impressions, CTR
2:00 AM  SeoGa4SyncWorker                   Sync GA4 page views, sessions, engagement
2:30 AM  SeoAhrefsSyncWorker                Sync Ahrefs keywords, positions, backlinks
2:50 AM  TextEmbeddingPopulation (SiteMap)   Backfill SiteMap embeddings
3:00 AM  SeoBatchCollectorWorker            Phase 1: build AI prompts (zero tokens)
         -> SeoBatchSubmitWorker            Phase 2a: submit to Gemini Batch API
         -> SeoBatchPollWorker              Phase 2b: poll until complete
         -> SeoBatchResultsWorker           Phase 2c: save results + extract recommendations

Data Collection

Data Sources

Source                 Service                  Worker               Data Captured
First-party visits     Seo::VisitsSyncService   SeoVisitsSyncWorker  30-day page view counts from Visits table
Google Search Console  Seo::GscSyncService      SeoGscSyncWorker     Clicks, impressions, CTR, avg position (28-day)
Google Analytics 4     Seo::Ga4SyncService      SeoGa4SyncWorker     Page views, sessions, users, bounce rate, engagement rate
Ahrefs                 Seo::AhrefsSyncService   SeoAhrefsSyncWorker  Organic traffic, keyword count, positions, traffic value
Ahrefs (per-page)      Seo::KeywordSyncService  -                    Organic keywords + Google Keyword Planner volume

API Clients

Client                           Auth                    Purpose
Seo::GscApiClient                Google service account  GSC Search Analytics API
Seo::Ga4ApiClient                Google service account  GA4 Data API
Seo::AhrefsApiClient             API key (v3 REST)       Ahrefs organic keywords, top pages
Seo::AhrefsMcpClient             MCP JSON-RPC            Ahrefs via MCP server
Seo::GoogleKeywordPlannerClient  Google Ads API          Search volume for keywords
Seo::KeywordsPeopleUseClient     API key                 People Also Ask, autocomplete

Storage

Metrics are stored in two places:

  1. site_map_data_points — time-series fact table (source of truth for trends)
  2. Cached columns on site_maps: visit_count_30d, seo_clicks, seo_traffic, etc. (for SQL sorting/filtering)

See SEO Metrics Data Model for schema details.

Content Crawling

Worker: SiteMapContentExtractionWorker (1 AM nightly)

Crawls cacheable pages via Cache::SiteCrawler and stores:

Column             Content
extracted_content  Sanitized page text (headings, paragraphs, lists)
extracted_title    Page <title> tag
rendered_schema    JSON-LD structured data found on the page
extracted_at       Timestamp of last crawl

Also discovers and stores the internal link graph in SiteMapLink records (outbound/inbound editorial links between pages).
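As a rough illustration of what an outbound/inbound link graph gives you, the sketch below indexes a list of page-to-page edges in both directions. It uses plain arrays and hashes as a stand-in for SiteMapLink records; the method name `link_index` and the edge format are assumptions made for this example, not the application's API.

```ruby
# Hypothetical sketch: edges are [source_path, target_path] pairs standing in
# for SiteMapLink rows. Builds lookup tables for both link directions.
def link_index(edges)
  outbound = {}
  inbound  = {}
  edges.each do |from, to|
    (outbound[from] ||= []) << to
    (inbound[to] ||= []) << from
  end
  { outbound: outbound, inbound: inbound }
end

edges = [["/a", "/b"], ["/a", "/c"], ["/b", "/c"]]
index = link_index(edges)
index[:outbound]["/a"]          # pages that /a links to
index[:inbound].fetch("/a", []) # pages linking to /a (none here)
```

Indexing both directions up front makes questions such as "which pages have no inbound links" a simple lookup.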

AI Analysis

Shared Configuration

All analysis paths (real-time, Gemini batch, Anthropic batch, sequential fallback) derive their config from constants on Seo::PageAnalysisService:

MAX_OUTPUT_TOKENS = 16_384
TEMPERATURE       = 0.3
THINKING_BUDGET   = 10_000    # premium only (Claude Opus)
ANALYSIS_MODEL    = AiModelConstants.id(:seo_analysis)          # gemini-3.5-flash
ANALYSIS_MODEL_PREMIUM = AiModelConstants.id(:anthropic_opus)   # claude-opus-4-8

Real-Time Analysis (On-Demand)

Triggered from the CRM via SeoPageAnalysisWorker:

CRM "Analyze" button
  -> SeoPageAnalysisWorker.perform_async(site_map_id)
     -> Crawl page (if stale >24h)
     -> Sync visits, GSC, GA4, keywords
     -> PageAnalysisService#generate_analysis
        -> RubyLLM.chat(model: gemini-3.5-flash)
        -> Provider-aware params (Gemini: generationConfig.maxOutputTokens, Anthropic: max_tokens)
        -> Structured JSON output via ANALYSIS_SCHEMA
     -> Save to SiteMap#seo_report
     -> Extract recommendations via RecommendationExtractorService

Full mode runs 8 steps (crawl + 4 syncs + gather context + AI + save).
Analysis-only mode (skip_syncs: true) skips the crawl and the 4 syncs, leaving 3 steps (gather context + AI + save).
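The "provider-aware params" step in the flow above could look roughly like the following. This is a minimal sketch, not the actual PageAnalysisService code: the method name `provider_params` is hypothetical, only the token limit is shown, and raising on an unrecognized model is a choice made for this example.

```ruby
# Value taken from the shared configuration constants documented above.
MAX_OUTPUT_TOKENS = 16_384

# Hypothetical helper: maps a model id to the request shape each provider
# expects for the output-token limit.
def provider_params(model_id)
  if model_id.start_with?("gemini-")
    { generationConfig: { maxOutputTokens: MAX_OUTPUT_TOKENS } }
  elsif model_id.start_with?("claude-")
    { max_tokens: MAX_OUTPUT_TOKENS }
  else
    # Assumption: fail loudly rather than guess a parameter shape.
    raise ArgumentError, "no parameter mapping for model #{model_id}"
  end
end

provider_params("gemini-3.5-flash") # nested under generationConfig
provider_params("claude-opus-4-8")  # flat max_tokens key
```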

Nightly Batch Analysis (50% Cost)

A 4-worker pipeline using the Gemini Batch API:

Phase 1 — Collect (SeoBatchCollectorWorker, 3 AM):

  • Finds up to 500 pages needing analysis using tiered freshness:
    • High traffic (100+): every 7 days
    • Moderate (10-99): every 14 days
    • Low (1-9): every 30 days
  • Syncs Ahrefs keywords for each page
  • Builds prompts via PageAnalysisService#build_prompts (zero AI tokens)
  • Stores prompts in SeoBatchItem records
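The tiered freshness rule can be sketched as a small lookup. The tier thresholds and intervals come from the list above; the method names, the use of a plain visit count, and the handling of pages with zero visits or no prior analysis are assumptions for illustration.

```ruby
require "date"

# [minimum visits, re-analysis interval in days], highest tier first.
FRESHNESS_TIERS = [[100, 7], [10, 14], [1, 30]].freeze

# Returns the interval for a page, or nil when it has no visits
# (the tiers above do not cover zero-traffic pages).
def analysis_interval_days(visit_count)
  tier = FRESHNESS_TIERS.find { |min_visits, _days| visit_count >= min_visits }
  tier && tier.last
end

# Assumption: a page that has never been analyzed is always due.
def needs_analysis?(visit_count, last_analyzed_on, today: Date.today)
  interval = analysis_interval_days(visit_count)
  return false if interval.nil?
  return true if last_analyzed_on.nil?

  (today - last_analyzed_on).to_i >= interval
end

needs_analysis?(150, Date.new(2024, 1, 1), today: Date.new(2024, 1, 8))
```

In the real collector this check would presumably be expressed as a SQL query so that the 500-page limit can be applied in the database.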

Phase 2a — Submit (SeoBatchSubmitWorker):

  • Routes by model prefix:
    • gemini-* → Gemini Batch API (Seo::GeminiBatchClient)
    • claude-* → Anthropic Message Batches API (Seo::AnthropicBatchClient)
    • Other → sequential RubyLLM fallback
  • Creates a Gemini context cache for the shared system prompt (75% savings on cached tokens)
  • Submits all requests in a single HTTP POST
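Routing by model prefix amounts to a three-way branch. The sketch below returns symbols rather than calling the real clients; `batch_route_for` and the symbol names are hypothetical.

```ruby
# Hypothetical router: picks a submission path from the model id prefix.
def batch_route_for(model_id)
  case model_id
  when /\Agemini-/ then :gemini_batch        # Seo::GeminiBatchClient
  when /\Aclaude-/ then :anthropic_batch     # Seo::AnthropicBatchClient
  else                  :sequential_fallback # one RubyLLM call per item
  end
end

batch_route_for("gemini-3.5-flash")
batch_route_for("claude-opus-4-8")
```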

Phase 2b — Poll (SeoBatchPollWorker):

  • Self-re-enqueues with exponential backoff: 2 min → 4 min → 8 min → 15 min (capped)
  • Max polling: 24 hours (120 attempts)
  • Terminal states: JOB_STATE_SUCCEEDED, JOB_STATE_FAILED, JOB_STATE_CANCELLED, JOB_STATE_EXPIRED
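The backoff schedule and terminal-state check can be written in a few lines. This is a sketch under the assumption that attempts are counted from zero; the constant and method names are illustrative, and the worker's actual re-enqueue call is not shown.

```ruby
TERMINAL_STATES = %w[
  JOB_STATE_SUCCEEDED JOB_STATE_FAILED JOB_STATE_CANCELLED JOB_STATE_EXPIRED
].freeze

BASE_DELAY_MINUTES = 2
MAX_DELAY_MINUTES  = 15

# Doubles the delay on each attempt (0-based) and caps it at 15 minutes.
def poll_delay_minutes(attempt)
  [BASE_DELAY_MINUTES * (2**attempt), MAX_DELAY_MINUTES].min
end

# True once the batch job can no longer change state.
def terminal?(state)
  TERMINAL_STATES.include?(state)
end

(0..4).map { |attempt| poll_delay_minutes(attempt) } # => [2, 4, 8, 15, 15]
```

Without the cap the third doubling would give 16 minutes, which is why the sequence above ends at 15.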

Phase 2c — Results (SeoBatchResultsWorker):

  • Parses each response (inline or file-based for large batches)
  • Saves analysis to SeoBatchItem#result and SiteMap#seo_report
  • Runs RecommendationExtractorService for each page
  • Deletes Gemini context cache to stop storage charges

Analysis Output

The AI returns a structured JSON object (ANALYSIS_SCHEMA) containing:

Section                          Content
current_state_analysis           Title, content, keyword, and schema analysis (chain-of-thought)
overall_score                    0-100 score
summary                          2-3 sentence overview
competitive_position             leading / competitive / average / struggling
strengths                        Findings with evidence
opportunities                    Gaps with evidence and recommendations
keyword_strategy                 Primary/secondary keywords, title action
internal_linking                 Recommended links with anchor text and placement
faq_recommendations              FAQs to add (with FAQ IDs)
people_also_ask_content          PAA questions with suggested answers
content_recommendations          Content improvements with AI search benefit
technical_recommendations        Technical fixes (array of strings)
structured_data_recommendations  Schema.org improvements with AIO benefit
aio_recommendations              AI Overview / GEO optimization recommendations
priority_actions                 Top actions ranked by impact and effort

Results are stored in SiteMap#seo_report (JSONB).
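A few of the constraints listed above (the 0-100 score, the four competitive positions, and the array of strings) can be checked before a report is saved. The following is a hedged sketch of such a check; `report_errors` is a hypothetical helper and does not reflect how ANALYSIS_SCHEMA is actually enforced.

```ruby
COMPETITIVE_POSITIONS = %w[leading competitive average struggling].freeze

# Hypothetical sanity check on a parsed report (string keys, as from JSON).
# Returns a list of human-readable problems; empty means the checks passed.
def report_errors(report)
  errors = []

  score = report["overall_score"]
  unless score.is_a?(Numeric) && (0..100).cover?(score)
    errors << "overall_score must be a number from 0 to 100"
  end

  unless COMPETITIVE_POSITIONS.include?(report["competitive_position"])
    errors << "competitive_position is not a recognized value"
  end

  technical = report["technical_recommendations"]
  unless technical.is_a?(Array) && technical.all?(String)
    errors << "technical_recommendations must be an array of strings"
  end

  errors
end
```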

Recommendations

Service: Seo::RecommendationExtractorService
Model: SiteMapRecommendation

After each analysis, recommendations are extracted from the seo_report into individual SiteMapRecommendation records for tracking:

Field            Purpose
category         priority_action, internal_linking, faq_recommendation, content_recommendation, technical_recommendation, structured_data, aio_recommendation, people_also_ask
status           pending → accepted → in_progress → completed (or ignored, stale)
fingerprint      Deduplication key: the same recommendation from a re-analysis merges rather than duplicates
impact / effort  Priority matrix (high/medium/low)

Managed via Crm::SiteMapRecommendationsController with bulk update support.
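To show how a fingerprint lets a re-analysis merge into an existing record, here is a minimal sketch. The hashing inputs (page id, category, normalized text), the in-memory hash used as a store, and both method names are assumptions; the real RecommendationExtractorService may compute fingerprints differently.

```ruby
require "digest"

# Hypothetical fingerprint: stable across case, punctuation, and spacing.
def recommendation_fingerprint(site_map_id, category, text)
  normalized = text.downcase.gsub(/[^a-z0-9]+/, " ").strip
  Digest::SHA256.hexdigest([site_map_id, category, normalized].join("|"))
end

# Inserts a new pending record, or returns the existing one untouched so
# that its status survives re-analysis.
def upsert_recommendation(store, site_map_id, category, text)
  key = recommendation_fingerprint(site_map_id, category, text)
  store[key] ||= { category: category, text: text, status: "pending" }
end

store = {}
rec = upsert_recommendation(store, 1, "technical_recommendation", "Add canonical tag")
rec[:status] = "accepted"
upsert_recommendation(store, 1, "technical_recommendation", "  add canonical tag. ")
store.size # still one record, status still "accepted"
```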

Additional SEO Services

Cannibalization Detection

Seo::CannibalizationService detects when multiple pages compete for the same keyword, causing Google to switch ranking URLs.
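The core of cannibalization detection is grouping pages by keyword and keeping the keywords that more than one page ranks for. The sketch below works on plain hashes as a stand-in for SeoPageKeyword records; the method name and row shape are hypothetical, and the real service may also weigh positions or ranking history.

```ruby
# rows: array of { keyword:, url: } hashes (assumed shape for this example).
# Returns a hash of keyword => sorted URLs for keywords with 2+ pages.
def cannibalized_keywords(rows)
  rows.group_by { |row| row[:keyword].downcase.strip }
      .transform_values { |group| group.map { |row| row[:url] }.uniq.sort }
      .select { |_keyword, urls| urls.size > 1 }
end

rows = [
  { keyword: "crm software", url: "/features" },
  { keyword: "CRM Software", url: "/pricing" },
  { keyword: "crm pricing",  url: "/pricing" }
]
cannibalized_keywords(rows)
# => { "crm software" => ["/features", "/pricing"] }
```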

Link Auditing

Service                     Purpose
Seo::ArticleLinkAuditor     Audits both internal and external links in articles
Seo::InternalLinkValidator  Validates internal links, upserts editorial links into SiteMapLink
Seo::LinkAnalyzer           Checks external URLs for status codes and redirects

Content Sanitization

Service                      Purpose
Seo::HtmlContentSanitizer    Cleans empty elements, inline styles, table classes
Seo::HtmlLinkSanitizer       Normalizes link URLs (locale prefixes, hostnames)
Seo::HtmlHeadingSanitizer    Normalizes heading tag hierarchy (h1-h6)
Seo::DeparameterizeLinks     Strips query parameters from internal links
Seo::HtmlPrettyPrinter       Formats HTML with HtmlBeautifier
Seo::ImageOptimizer          Adds loading="lazy" to images
Seo::ImageMissingSizeFiller  Fills missing width/height attributes
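As one concrete example from the table, stripping query parameters from a link can be done with Ruby's standard URI library. This sketch is not the Seo::DeparameterizeLinks implementation: `strip_query` is a hypothetical helper, it treats every URL it receives as internal, and keeping the fragment is an assumption.

```ruby
require "uri"

# Removes the query string from a URL, leaving path and fragment intact.
# URLs that cannot be parsed are returned unchanged.
def strip_query(url)
  uri = URI.parse(url)
  uri.query = nil
  uri.to_s
rescue URI::InvalidURIError
  url
end

strip_query("/pricing?utm_source=newsletter#plans") # => "/pricing#plans"
strip_query("https://example.com/blog?page=2")      # => "https://example.com/blog"
```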

CRM Interface

Pages

Route                        View            Purpose
/crm/site_maps               Index           Filterable list of all site maps with scores
/crm/site_maps/:id           Show            Full SEO report with recommendations
/crm/site_maps/action_items  Action Items    All pending recommendations across pages
/crm/seo_keywords            Keywords        Overview of all tracked keywords
/crm/seo_keywords/:id        Keyword Detail  Pages ranking for a specific keyword
/crm/metrics_analysis        Metrics         Time-series charts for SEO metrics

CRM Actions

Action             Endpoint                                 Effect
Analyze (full)     POST /crm/site_maps/:id/analyze          Queues SeoPageAnalysisWorker with all syncs
Analyze (AI only)  POST /crm/site_maps/:id/analyze_only     Queues with skip_syncs: true
Analyze (premium)  POST /crm/site_maps/:id/analyze_premium  Uses Claude Opus with extended thinking
Sync Keywords      POST /crm/site_maps/:id/sync_keywords    Runs KeywordSyncService inline
Sync Visits        POST /crm/site_maps/:id/sync_visits      Runs VisitsSyncService inline
Sync GSC           POST /crm/site_maps/:id/sync_gsc         Runs GSC sync inline
Sync GA4           POST /crm/site_maps/:id/sync_ga4         Runs GA4 sync inline
Recrawl            POST /crm/site_maps/:id/recrawl          Re-crawls page content and schema

Key Models

Model                  Table                     Purpose
SiteMap                site_maps                 Pages tracked for SEO (URL, content, report, metrics)
SiteMapDataPoint       site_map_data_points      Time-series metrics (fact table)
SiteMapRecommendation  site_map_recommendations  Extracted action items with status tracking
SiteMapLink            site_map_links            Internal link graph (outbound/inbound editorial)
SeoPageKeyword         seo_page_keywords         Keywords tracked per page (position, volume, source)
SeoBatchJob            seo_batch_jobs            Batch analysis job (status, provider, metadata)
SeoBatchItem           seo_batch_items           Individual page prompt + result within a batch

File Index

Workers

File                                               Purpose
app/workers/seo_batch_collector_worker.rb          Phase 1: collect prompts for batch API
app/workers/seo_batch_submit_worker.rb             Phase 2a: submit to Gemini/Anthropic Batch API
app/workers/seo_batch_poll_worker.rb               Phase 2b: poll batch API until completion
app/workers/seo_batch_results_worker.rb            Phase 2c: process results, save reports
app/workers/seo_page_analysis_worker.rb            On-demand full SEO analysis (crawl + sync + AI)
app/workers/seo_metrics_sync_worker.rb             Orchestrates nightly metric syncs
app/workers/seo_visits_sync_worker.rb              Sync first-party visit counts
app/workers/seo_gsc_sync_worker.rb                 Sync Google Search Console data
app/workers/seo_ga4_sync_worker.rb                 Sync Google Analytics 4 data
app/workers/seo_ahrefs_sync_worker.rb              Sync Ahrefs keyword and traffic data
app/workers/site_map_content_extraction_worker.rb  Nightly page crawl for content and schema

Services

File                                                  Purpose
app/services/seo/page_analysis_service.rb             AI analysis (prompts, config, real-time execution)
app/services/seo/recommendation_extractor_service.rb  Extract recommendations from seo_report
app/services/seo/gemini_batch_client.rb               Gemini Batch API client (Faraday)
app/services/seo/anthropic_batch_client.rb            Anthropic Message Batches API client (Faraday)
app/services/seo/visits_sync_service.rb               First-party visit sync
app/services/seo/gsc_sync_service.rb                  GSC metrics sync
app/services/seo/ga4_sync_service.rb                  GA4 metrics sync
app/services/seo/ahrefs_sync_service.rb               Ahrefs metrics sync
app/services/seo/keyword_sync_service.rb              Per-page keyword sync (GSC + Ahrefs + Planner)
app/services/seo/gsc_keyword_sync_service.rb          GSC keyword rankings for a single page
app/services/seo/cannibalization_service.rb           Keyword cannibalization detection
app/services/seo/mcp_clients.rb                       Factory for Ahrefs and GSC API clients

Controllers

File                                                        Purpose
app/controllers/crm/site_maps_controller.rb                 SiteMap CRUD, analysis, sync actions
app/controllers/crm/site_map_recommendations_controller.rb  Recommendation status management
app/controllers/crm/seo_keywords_controller.rb              Keyword overview and detail
app/controllers/crm/metrics_analysis_controller.rb          Time-series metric charts