SEO System Architecture
Comprehensive reference for the SEO analysis, metrics, and optimization features.
Related docs:
- SEO Batch API Pipeline — detailed batch API notes, cost analysis, caching
- SEO Metrics Data Model — time-series data model for metrics
- SEO Tasks — actionable SEO tasks from keyword analysis
System Overview
The SEO system has four pillars:
- Data Collection — nightly syncs from first-party visits, GSC, GA4, and Ahrefs
- Content Crawling — nightly extraction of page content, JSON-LD schema, and internal links
- AI Analysis — real-time (on-demand) and batch (nightly) AI-powered SEO analysis
- Recommendations — extracted, trackable action items from AI analysis
Nightly Pipeline
All SEO jobs run on a staggered schedule to ensure data is fresh before analysis:

    1:00 AM  SiteMapContentExtractionWorker     Crawl pages for content, schema, links
    1:10 AM  Sitemap::SitemapGenerator          Generate public sitemap XML
    2:00 AM  SeoVisitsSyncWorker                Sync first-party visit counts
    2:00 AM  SeoGscSyncWorker                   Sync GSC clicks, impressions, CTR
    2:00 AM  SeoGa4SyncWorker                   Sync GA4 page views, sessions, engagement
    2:30 AM  SeoAhrefsSyncWorker                Sync Ahrefs keywords, positions, backlinks
    2:50 AM  TextEmbeddingPopulation (SiteMap)  Backfill SiteMap embeddings
    3:00 AM  SeoBatchCollectorWorker            Phase 1: build AI prompts (zero tokens)
             -> SeoBatchSubmitWorker            Phase 2a: submit to Gemini Batch API
             -> SeoBatchPollWorker              Phase 2b: poll until complete
             -> SeoBatchResultsWorker           Phase 2c: save results + extract recommendations

Data Collection
Data Sources
| Source | Service | Worker | Data Captured |
|---|---|---|---|
| First-party visits | Seo::VisitsSyncService | SeoVisitsSyncWorker | 30-day page view counts from Visits table |
| Google Search Console | Seo::GscSyncService | SeoGscSyncWorker | Clicks, impressions, CTR, avg position (28-day) |
| Google Analytics 4 | Seo::Ga4SyncService | SeoGa4SyncWorker | Page views, sessions, users, bounce rate, engagement rate |
| Ahrefs | Seo::AhrefsSyncService | SeoAhrefsSyncWorker | Organic traffic, keyword count, positions, traffic value |
| Ahrefs (per-page) | Seo::KeywordSyncService | — | Organic keywords + Google Keyword Planner volume |
API Clients
| Client | Auth | Purpose |
|---|---|---|
| Seo::GscApiClient | Google service account | GSC Search Analytics API |
| Seo::Ga4ApiClient | Google service account | GA4 Data API |
| Seo::AhrefsApiClient | API key (v3 REST) | Ahrefs organic keywords, top pages |
| Seo::AhrefsMcpClient | MCP JSON-RPC | Ahrefs via MCP server |
| Seo::GoogleKeywordPlannerClient | Google Ads API | Search volume for keywords |
| Seo::KeywordsPeopleUseClient | API key | People Also Ask, autocomplete |
Storage
Metrics are stored in two places:
- site_map_data_points: time-series fact table (source of truth for trends)
- Cached columns on site_maps: visit_count_30d, seo_clicks, seo_traffic, etc. (for SQL sorting/filtering)
See SEO Metrics Data Model for schema details.
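The two-place storage pattern can be illustrated with a minimal in-memory sketch. Everything here (the `DataPoint` struct, the `MetricsStore` class, and the sample values) is hypothetical and stands in for the real tables; it only shows the idea of appending to the time series while keeping a cached "latest value" for sorting and filtering.

```ruby
require "date"

# Hypothetical stand-in for a row in site_map_data_points.
DataPoint = Struct.new(:site_map_id, :metric, :value, :recorded_on)

# Hypothetical store: an append-only series plus a per-page cache of the
# most recently written value (standing in for cached columns on site_maps).
class MetricsStore
  attr_reader :data_points, :cached_columns

  def initialize
    @data_points = []
    @cached_columns = Hash.new { |hash, key| hash[key] = {} }
  end

  # Append to the time series (source of truth), then refresh the cache.
  def record(site_map_id:, metric:, value:, recorded_on: Date.today)
    @data_points << DataPoint.new(site_map_id, metric, value, recorded_on)
    @cached_columns[site_map_id][metric] = value
  end

  # Trend queries read the time series, oldest first.
  def trend(site_map_id, metric)
    @data_points
      .select { |point| point.site_map_id == site_map_id && point.metric == metric }
      .sort_by(&:recorded_on)
      .map(&:value)
  end
end

store = MetricsStore.new
store.record(site_map_id: 1, metric: :seo_clicks, value: 40, recorded_on: Date.new(2025, 1, 1))
store.record(site_map_id: 1, metric: :seo_clicks, value: 55, recorded_on: Date.new(2025, 1, 2))
```

In this sketch the cache always reflects the last write, so list views can sort on a single column while charts read the full series.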
Content Crawling
Worker: SiteMapContentExtractionWorker (1 AM nightly)
Crawls cacheable pages via Cache::SiteCrawler and stores:
| Column | Content |
|---|---|
| extracted_content | Sanitized page text (headings, paragraphs, lists) |
| extracted_title | Page <title> tag |
| rendered_schema | JSON-LD structured data found on the page |
| extracted_at | Timestamp of last crawl |
Also discovers and stores the internal link graph in SiteMapLink records (outbound/inbound editorial links between pages).
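As an illustration of the JSON-LD part of the crawl, here is a dependency-free sketch that pulls structured data out of an HTML string. The `extract_json_ld` helper and the regex approach are assumptions made for the example; the actual crawler may well use an HTML parser instead.

```ruby
require "json"

# Matches <script type="application/ld+json"> ... </script> blocks.
# A regex keeps the sketch stdlib-only; it is not a general HTML parser.
JSON_LD_PATTERN = %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi

# Hypothetical helper: returns every parseable JSON-LD payload on the page.
def extract_json_ld(html)
  html.scan(JSON_LD_PATTERN).filter_map do |(body)|
    JSON.parse(body)
  rescue JSON::ParserError
    nil # skip malformed blocks rather than failing the whole crawl
  end
end

html = <<~HTML
  <html><head>
    <title>Example</title>
    <script type="application/ld+json">{"@type": "FAQPage", "name": "Example"}</script>
    <script type="application/ld+json">{not valid json}</script>
  </head></html>
HTML

schemas = extract_json_ld(html)
```

Malformed blocks are dropped here so that one bad script tag does not discard the rest of the page's schema.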
AI Analysis
Shared Configuration
All analysis paths (real-time, Gemini batch, Anthropic batch, sequential fallback) derive their config from constants on Seo::PageAnalysisService:

    MAX_OUTPUT_TOKENS = 16_384
    TEMPERATURE = 0.3
    THINKING_BUDGET = 10_000 # premium only (Claude Opus)
    ANALYSIS_MODEL = AiModelConstants.id(:seo_analysis) # gemini-3.5-flash
    ANALYSIS_MODEL_PREMIUM = AiModelConstants.id(:anthropic_opus) # claude-opus-4-8

Real-Time Analysis (On-Demand)
Triggered from the CRM via SeoPageAnalysisWorker:

    CRM "Analyze" button
      -> SeoPageAnalysisWorker.perform_async(site_map_id)
      -> Crawl page (if stale >24h)
      -> Sync visits, GSC, GA4, keywords
      -> PageAnalysisService#generate_analysis
      -> RubyLLM.chat(model: gemini-3.5-flash)
      -> Provider-aware params (Gemini: generationConfig.maxOutputTokens, Anthropic: max_tokens)
      -> Structured JSON output via ANALYSIS_SCHEMA
      -> Save to SiteMap#seo_report
      -> Extract recommendations via RecommendationExtractorService

Full mode runs 8 steps (crawl + 4 syncs + gather context + AI + save).
Analysis-only mode (skip_syncs: true) runs 3 steps.
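The "provider-aware params" step could look roughly like the sketch below. The `provider_params` helper is hypothetical and is not taken from Seo::PageAnalysisService; it only shows how the same shared constants might be mapped onto each provider's field names (generationConfig.maxOutputTokens for Gemini, max_tokens for Anthropic).

```ruby
# Values copied from the shared configuration above.
MAX_OUTPUT_TOKENS = 16_384
TEMPERATURE = 0.3

# Hypothetical helper: pick the request-body shape from the model id prefix.
def provider_params(model_id)
  if model_id.start_with?("gemini-")
    { generationConfig: { maxOutputTokens: MAX_OUTPUT_TOKENS, temperature: TEMPERATURE } }
  elsif model_id.start_with?("claude-")
    { max_tokens: MAX_OUTPUT_TOKENS, temperature: TEMPERATURE }
  else
    raise ArgumentError, "no parameter mapping for #{model_id}"
  end
end

gemini_params = provider_params("gemini-3.5-flash")
claude_params = provider_params("claude-opus-4-8")
```

Keeping the mapping in one place means the real-time and batch paths cannot drift apart on token limits.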
Nightly Batch Analysis (50% Cost)
A 4-worker pipeline using the Gemini Batch API:
Phase 1 — Collect (SeoBatchCollectorWorker, 3 AM):
- Finds up to 500 pages needing analysis using tiered freshness:
  - High traffic (100+): every 7 days
  - Moderate (10-99): every 14 days
  - Low (1-9): every 30 days
- Syncs Ahrefs keywords for each page
- Builds prompts via PageAnalysisService#build_prompts (zero AI tokens)
- Stores prompts in SeoBatchItem records
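The tiered freshness rule above can be sketched as a small predicate. The method names are hypothetical, and the sketch assumes the traffic thresholds refer to a per-page visit count; pages with zero traffic are treated as not eligible because the tiers above do not cover them.

```ruby
require "date"

# Re-analysis interval in days for a given traffic level, or nil when the
# page falls outside the listed tiers (assumption: zero-traffic pages).
def analysis_interval_days(visits)
  if visits >= 100
    7
  elsif visits >= 10
    14
  elsif visits >= 1
    30
  end
end

# Hypothetical check: is the page due for a fresh analysis?
def needs_analysis?(visits:, last_analyzed_on:, today: Date.today)
  interval = analysis_interval_days(visits)
  return false if interval.nil?
  return true if last_analyzed_on.nil? # never analyzed

  (today - last_analyzed_on) >= interval
end

today = Date.new(2025, 1, 8)
week_ago = Date.new(2025, 1, 1)
high_due = needs_analysis?(visits: 150, last_analyzed_on: week_ago, today: today)
moderate_due = needs_analysis?(visits: 50, last_analyzed_on: week_ago, today: today)
```

With these inputs a high-traffic page analyzed seven days ago is due again, while a moderate-traffic page with the same history is not.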
Phase 2a — Submit (SeoBatchSubmitWorker):
- Routes by model prefix:
  - gemini-* → Gemini Batch API (Seo::GeminiBatchClient)
  - claude-* → Anthropic Message Batches API (Seo::AnthropicBatchClient)
  - Other → sequential RubyLLM fallback
- Creates a Gemini context cache for the shared system prompt (75% savings on cached tokens)
- Submits all requests in a single HTTP POST
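Prefix-based routing like the one described above can be expressed as a small dispatcher. The `batch_route_for` and `group_by_route` helpers and the route symbols are hypothetical names for illustration, not the worker's actual internals.

```ruby
# Hypothetical dispatcher: map a model id to the submission path it would take.
def batch_route_for(model_id)
  case model_id
  when /\Agemini-/ then :gemini_batch
  when /\Aclaude-/ then :anthropic_batch
  else :sequential_fallback
  end
end

# Group model ids so that each provider receives a single submission.
def group_by_route(model_ids)
  model_ids.group_by { |model_id| batch_route_for(model_id) }
end

routes = group_by_route(%w[gemini-3.5-flash claude-opus-4-8 some-other-model])
```

Anchoring the patterns with `\A` avoids misrouting a model whose name merely contains "gemini-" or "claude-" somewhere in the middle.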
Phase 2b — Poll (SeoBatchPollWorker):
- Self-re-enqueues with exponential backoff: 2 min → 4 min → 8 min → 15 min (capped)
- Max polling: 24 hours (120 attempts)
- Terminal states: JOB_STATE_SUCCEEDED, JOB_STATE_FAILED, JOB_STATE_CANCELLED, JOB_STATE_EXPIRED
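A capped exponential backoff of this shape can be sketched as follows. The constant and method names are hypothetical, and the sketch assumes the delays are exactly the ones listed (2, 4, 8, then 15 minutes for every later attempt).

```ruby
TERMINAL_STATES = %w[
  JOB_STATE_SUCCEEDED JOB_STATE_FAILED JOB_STATE_CANCELLED JOB_STATE_EXPIRED
].freeze

BASE_DELAY_MINUTES = 2
MAX_DELAY_MINUTES = 15

# Delay before poll number `attempt` (1-based): doubles each time, capped at 15.
def poll_delay_minutes(attempt)
  [BASE_DELAY_MINUTES * (2**(attempt - 1)), MAX_DELAY_MINUTES].min
end

# Polling stops as soon as the job reports one of the terminal states.
def terminal?(state)
  TERMINAL_STATES.include?(state)
end

first_delays = (1..5).map { |attempt| poll_delay_minutes(attempt) }
total_minutes_for_120 = (1..120).sum { |attempt| poll_delay_minutes(attempt) }
```

Under these assumptions, 120 polls would add up to 1,769 minutes (about 29.5 hours), so the 24-hour limit would be reached first; how the real worker reconciles the two limits is not covered here.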
Phase 2c — Results (SeoBatchResultsWorker):
- Parses each response (inline or file-based for large batches)
- Saves analysis to SeoBatchItem#result and SiteMap#seo_report
- Runs RecommendationExtractorService for each page
- Deletes Gemini context cache to stop storage charges
Analysis Output
The AI returns a structured JSON object (ANALYSIS_SCHEMA) containing:
| Section | Content |
|---|---|
| current_state_analysis | Title, content, keyword, and schema analysis (chain-of-thought) |
| overall_score | 0-100 score |
| summary | 2-3 sentence overview |
| competitive_position | leading / competitive / average / struggling |
| strengths | Findings with evidence |
| opportunities | Gaps with evidence and recommendations |
| keyword_strategy | Primary/secondary keywords, title action |
| internal_linking | Recommended links with anchor text and placement |
| faq_recommendations | FAQs to add (with FAQ IDs) |
| people_also_ask_content | PAA questions with suggested answers |
| content_recommendations | Content improvements with AI search benefit |
| technical_recommendations | Technical fixes (array of strings) |
| structured_data_recommendations | Schema.org improvements with AIO benefit |
| aio_recommendations | AI Overview / GEO optimization recommendations |
| priority_actions | Top actions ranked by impact and effort |
Results are stored in SiteMap#seo_report (JSONB).
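To make "ranked by impact and effort" concrete, the sketch below orders a list of actions so that higher impact comes first and, within the same impact, lower effort comes first. The hash keys (`"action"`, `"impact"`, `"effort"`) and the ranking rule are assumptions for illustration; the real ANALYSIS_SCHEMA shape and ordering may differ.

```ruby
# Numeric weights for the high/medium/low levels.
LEVEL = { "high" => 3, "medium" => 2, "low" => 1 }.freeze

# Hypothetical ranking: highest impact first, then lowest effort.
def rank_actions(actions)
  actions.sort_by do |action|
    [-LEVEL.fetch(action["impact"]), LEVEL.fetch(action["effort"])]
  end
end

actions = [
  { "action" => "Rewrite intro", "impact" => "medium", "effort" => "low" },
  { "action" => "Add FAQ schema", "impact" => "high", "effort" => "medium" },
  { "action" => "Fix title tag", "impact" => "high", "effort" => "low" }
]

ranked = rank_actions(actions).map { |action| action["action"] }
```

Using `fetch` makes an unexpected level value fail loudly instead of silently sorting to an arbitrary position.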
Recommendations
Service: Seo::RecommendationExtractorService
Model: SiteMapRecommendation
After each analysis, recommendations are extracted from the seo_report into individual SiteMapRecommendation records for tracking:
| Field | Purpose |
|---|---|
| category | priority_action, internal_linking, faq_recommendation, content_recommendation, technical_recommendation, structured_data, aio_recommendation, people_also_ask |
| status | pending → accepted → in_progress → completed (or ignored, stale) |
| fingerprint | Deduplication key: the same recommendation from a re-analysis merges rather than duplicates |
| impact / effort | Priority matrix (high/medium/low) |
Managed via Crm::SiteMapRecommendationsController with bulk update support.
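Fingerprint-based deduplication can be sketched like this. Which fields feed the fingerprint is not stated above, so the choice of page id, category, and normalized text is an assumption, as are the helper names; the point is only that a re-analysis producing the same recommendation lands on the existing record and keeps its status.

```ruby
require "digest"

# Hypothetical fingerprint: hash the identifying fields after normalizing
# case and whitespace so trivial wording differences do not create duplicates.
def recommendation_fingerprint(site_map_id:, category:, text:)
  normalized = text.downcase.gsub(/\s+/, " ").strip
  Digest::SHA256.hexdigest([site_map_id, category, normalized].join("|"))
end

# Merge freshly extracted recommendations into a hash keyed by fingerprint.
# Existing entries are kept as they are, including their current status.
def merge_recommendations(existing, extracted)
  extracted.each do |rec|
    key = recommendation_fingerprint(**rec.slice(:site_map_id, :category, :text))
    existing[key] ||= rec.merge(status: "pending")
  end
  existing
end

records = {}
merge_recommendations(records, [
  { site_map_id: 1, category: "technical_recommendation", text: "Add canonical tag" }
])
records.values.first[:status] = "accepted" # a user acts on it in the CRM

# The next analysis emits the same advice with different spacing and case.
merge_recommendations(records, [
  { site_map_id: 1, category: "technical_recommendation", text: "  add   canonical TAG " }
])
```

After the second merge there is still one record, and it remains accepted rather than being reset to pending.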
Additional SEO Services
Cannibalization Detection
Seo::CannibalizationService detects when multiple pages compete for the same keyword, causing Google to switch ranking URLs.
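The core idea, finding keywords for which more than one page ranks, can be sketched with a simple grouping. The row shape (`:url`, `:keyword`, `:position`) and the `cannibalized_keywords` helper are hypothetical; the actual service may use additional signals such as ranking-URL changes over time.

```ruby
# Hypothetical input: one hash per (page, keyword) ranking.
# Returns keyword => competing URLs, best position first.
def cannibalized_keywords(rankings)
  rankings
    .group_by { |row| row[:keyword].downcase }
    .select { |_keyword, rows| rows.map { |row| row[:url] }.uniq.size > 1 }
    .transform_values do |rows|
      rows.sort_by { |row| row[:position] }.map { |row| row[:url] }.uniq
    end
end

rankings = [
  { url: "/pricing", keyword: "crm pricing", position: 4 },
  { url: "/blog/crm-costs", keyword: "CRM pricing", position: 9 },
  { url: "/features", keyword: "crm features", position: 3 }
]

conflicts = cannibalized_keywords(rankings)
```

Keywords are lowercased before grouping so that case variants of the same query are treated as one keyword.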
Link Auditing
| Service | Purpose |
|---|---|
| Seo::ArticleLinkAuditor | Audits both internal and external links in articles |
| Seo::InternalLinkValidator | Validates internal links, upserts editorial links into SiteMapLink |
| Seo::LinkAnalyzer | Checks external URLs for status codes and redirects |
Content Sanitization
| Service | Purpose |
|---|---|
| Seo::HtmlContentSanitizer | Cleans empty elements, inline styles, table classes |
| Seo::HtmlLinkSanitizer | Normalizes link URLs (locale prefixes, hostnames) |
| Seo::HtmlHeadingSanitizer | Normalizes heading tag hierarchy (h1-h6) |
| Seo::DeparameterizeLinks | Strips query parameters from internal links |
| Seo::HtmlPrettyPrinter | Formats HTML with HtmlBeautifier |
| Seo::ImageOptimizer | Adds loading="lazy" to images |
| Seo::ImageMissingSizeFiller | Fills missing width/height attributes |
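As an example of what one of these sanitizers does, the sketch below strips query parameters from internal links using Ruby's standard `URI` library. The `deparameterize_link` helper, its `internal_host:` argument, and the example hostnames are hypothetical; this is not the code of Seo::DeparameterizeLinks.

```ruby
require "uri"

# Hypothetical helper: drop the query string from links that are relative or
# point at our own host, and leave external links untouched.
def deparameterize_link(href, internal_host:)
  uri = URI.parse(href)
  internal = uri.host.nil? || uri.host == internal_host
  return href unless internal && uri.query

  uri.query = nil
  uri.to_s
rescue URI::InvalidURIError
  href # leave unparseable links as they are
end

relative = deparameterize_link("/pricing?utm_source=news#plans", internal_host: "example.com")
absolute = deparameterize_link("https://example.com/a?b=1", internal_host: "example.com")
external = deparameterize_link("https://other.test/a?b=1", internal_host: "example.com")
```

Fragments such as `#plans` are kept in this sketch, since only the query string is removed.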
CRM Interface
| Route | View | Purpose |
|---|---|---|
| /crm/site_maps | Index | Filterable list of all site maps with scores |
| /crm/site_maps/:id | Show | Full SEO report with recommendations |
| /crm/site_maps/action_items | Action Items | All pending recommendations across pages |
| /crm/seo_keywords | Keywords | Overview of all tracked keywords |
| /crm/seo_keywords/:id | Keyword Detail | Pages ranking for a specific keyword |
| /crm/metrics_analysis | Metrics | Time-series charts for SEO metrics |
CRM Actions
| Action | Endpoint | Effect |
|---|---|---|
| Analyze (full) | POST /crm/site_maps/:id/analyze | Queues SeoPageAnalysisWorker with all syncs |
| Analyze (AI only) | POST /crm/site_maps/:id/analyze_only | Queues with skip_syncs: true |
| Analyze (premium) | POST /crm/site_maps/:id/analyze_premium | Uses Claude Opus with extended thinking |
| Sync Keywords | POST /crm/site_maps/:id/sync_keywords | Runs KeywordSyncService inline |
| Sync Visits | POST /crm/site_maps/:id/sync_visits | Runs VisitsSyncService inline |
| Sync GSC | POST /crm/site_maps/:id/sync_gsc | Runs GSC sync inline |
| Sync GA4 | POST /crm/site_maps/:id/sync_ga4 | Runs GA4 sync inline |
| Recrawl | POST /crm/site_maps/:id/recrawl | Re-crawls page content and schema |
Key Models
| Model | Table | Purpose |
|---|---|---|
| SiteMap | site_maps | Pages tracked for SEO (URL, content, report, metrics) |
| SiteMapDataPoint | site_map_data_points | Time-series metrics (fact table) |
| SiteMapRecommendation | site_map_recommendations | Extracted action items with status tracking |
| SiteMapLink | site_map_links | Internal link graph (outbound/inbound editorial) |
| SeoPageKeyword | seo_page_keywords | Keywords tracked per page (position, volume, source) |
| SeoBatchJob | seo_batch_jobs | Batch analysis job (status, provider, metadata) |
| SeoBatchItem | seo_batch_items | Individual page prompt + result within a batch |
File Index
Workers
| File | Purpose |
|---|---|
| app/workers/seo_batch_collector_worker.rb | Phase 1: collect prompts for batch API |
| app/workers/seo_batch_submit_worker.rb | Phase 2a: submit to Gemini/Anthropic Batch API |
| app/workers/seo_batch_poll_worker.rb | Phase 2b: poll batch API until completion |
| app/workers/seo_batch_results_worker.rb | Phase 2c: process results, save reports |
| app/workers/seo_page_analysis_worker.rb | On-demand full SEO analysis (crawl + sync + AI) |
| app/workers/seo_metrics_sync_worker.rb | Orchestrates nightly metric syncs |
| app/workers/seo_visits_sync_worker.rb | Sync first-party visit counts |
| app/workers/seo_gsc_sync_worker.rb | Sync Google Search Console data |
| app/workers/seo_ga4_sync_worker.rb | Sync Google Analytics 4 data |
| app/workers/seo_ahrefs_sync_worker.rb | Sync Ahrefs keyword and traffic data |
| app/workers/site_map_content_extraction_worker.rb | Nightly page crawl for content and schema |
Services
| File | Purpose |
|---|---|
| app/services/seo/page_analysis_service.rb | AI analysis (prompts, config, real-time execution) |
| app/services/seo/recommendation_extractor_service.rb | Extract recommendations from seo_report |
| app/services/seo/gemini_batch_client.rb | Gemini Batch API client (Faraday) |
| app/services/seo/anthropic_batch_client.rb | Anthropic Message Batches API client (Faraday) |
| app/services/seo/visits_sync_service.rb | First-party visit sync |
| app/services/seo/gsc_sync_service.rb | GSC metrics sync |
| app/services/seo/ga4_sync_service.rb | GA4 metrics sync |
| app/services/seo/ahrefs_sync_service.rb | Ahrefs metrics sync |
| app/services/seo/keyword_sync_service.rb | Per-page keyword sync (GSC + Ahrefs + Planner) |
| app/services/seo/gsc_keyword_sync_service.rb | GSC keyword rankings for a single page |
| app/services/seo/cannibalization_service.rb | Keyword cannibalization detection |
| app/services/seo/mcp_clients.rb | Factory for Ahrefs and GSC API clients |
Controllers
| File | Purpose |
|---|---|
| app/controllers/crm/site_maps_controller.rb | SiteMap CRUD, analysis, sync actions |
| app/controllers/crm/site_map_recommendations_controller.rb | Recommendation status management |
| app/controllers/crm/seo_keywords_controller.rb | Keyword overview and detail |
| app/controllers/crm/metrics_analysis_controller.rb | Time-series metric charts |