SEO System Architecture
Comprehensive reference for the SEO analysis, metrics, and optimization features.
Related docs:
- SEO Batch API Pipeline — detailed batch API notes, cost analysis, caching
- SEO Metrics Data Model — time-series data model for metrics
- SEO Tasks — actionable SEO tasks from keyword analysis
System Overview
The SEO system has four pillars:
- Data Collection — nightly syncs from first-party visits, GSC, GA4, and Ahrefs
- Content Crawling — nightly extraction of page content, JSON-LD schema, and internal links
- AI Analysis — real-time (on-demand) and batch (nightly) AI-powered SEO analysis
- Recommendations — extracted, trackable action items from AI analysis
Nightly Pipeline
All SEO jobs run on a staggered schedule to ensure data is fresh before analysis:
| Time | Job | Purpose |
|---|---|---|
| 1:00 AM | SiteMapContentExtractionWorker | Crawl pages for content, schema, links |
| 1:10 AM | Sitemap::SitemapGenerator | Generate public sitemap XML |
| 2:00 AM | SeoVisitsSyncWorker | Sync first-party visit counts |
| 2:00 AM | SeoGscSyncWorker | Sync GSC clicks, impressions, CTR |
| 2:00 AM | SeoGa4SyncWorker | Sync GA4 page views, sessions, engagement |
| 2:30 AM | SeoAhrefsSyncWorker | Sync Ahrefs keywords, positions, backlinks |
| 2:50 AM | TextEmbeddingPopulation (SiteMap) | Backfill SiteMap embeddings |
| 3:00 AM | SeoBatchCollectorWorker | Phase 1: build AI prompts (zero tokens) |
| -> | SeoBatchSubmitWorker | Phase 2a: submit to Gemini Batch API |
| -> | SeoBatchPollWorker | Phase 2b: poll until complete |
| -> | SeoBatchResultsWorker | Phase 2c: save results + extract recommendations |
Data Collection
Data Sources
| Source | Service | Worker | Data Captured |
|---|---|---|---|
| First-party visits | Seo::VisitsSyncService | SeoVisitsSyncWorker | 30-day page view counts from Visits table |
| Google Search Console | Seo::GscSyncService | SeoGscSyncWorker | Clicks, impressions, CTR, avg position (28-day) |
| Google Analytics 4 | Seo::Ga4SyncService | SeoGa4SyncWorker | Page views, sessions, users, bounce rate, engagement rate |
| Ahrefs | Seo::AhrefsSyncService | SeoAhrefsSyncWorker | Organic traffic, keyword count, positions, traffic value |
| Ahrefs (per-page) | Seo::KeywordSyncService | — | Organic keywords + Google Keyword Planner volume |
API Clients
| Client | Auth | Purpose |
|---|---|---|
| Seo::GscApiClient | Google service account | GSC Search Analytics API |
| Seo::Ga4ApiClient | Google service account | GA4 Data API |
| Seo::AhrefsApiClient | API key (v3 REST) | Ahrefs organic keywords, top pages |
| Seo::AhrefsMcpClient | MCP JSON-RPC | Ahrefs via MCP server |
| Seo::GoogleKeywordPlannerClient | Google Ads API | Search volume for keywords |
| Seo::KeywordsPeopleUseClient | API key | People Also Ask, autocomplete |
Storage
Metrics are stored in two places:
- site_map_data_points — time-series fact table (source of truth for trends)
- Cached columns on site_maps — visit_count_30d, seo_clicks, seo_traffic, etc. (for SQL sorting/filtering)
See SEO Metrics Data Model for schema details.
Content Crawling
Worker: SiteMapContentExtractionWorker (1 AM nightly)
Crawls cacheable pages via Cache::SiteCrawler and stores:
| Column | Content |
|---|---|
| extracted_content | Sanitized page text (headings, paragraphs, lists) |
| extracted_title | Page <title> tag |
| rendered_schema | JSON-LD structured data found on the page |
| extracted_at | Timestamp of last crawl |
Also discovers and stores the internal link graph in SiteMapLink records (outbound/inbound editorial links between pages).
AI Analysis
Shared Configuration
All analysis paths (real-time, Gemini batch, Anthropic batch, sequential fallback) derive their config from constants on Seo::PageAnalysisService:
MAX_OUTPUT_TOKENS = 16_384
TEMPERATURE = 0.3
THINKING_BUDGET = 10_000 # premium only (Claude Opus)
ANALYSIS_MODEL = AiModelConstants.id(:seo_analysis) # gemini-3.5-flash
ANALYSIS_MODEL_PREMIUM = AiModelConstants.id(:anthropic_opus) # claude-opus-4-8
Real-Time Analysis (On-Demand)
Triggered from the CRM via SeoPageAnalysisWorker:
CRM "Analyze" button
-> SeoPageAnalysisWorker.perform_async(site_map_id)
-> Crawl page (if stale >24h)
-> Sync visits, GSC, GA4, keywords
-> PageAnalysisService#generate_analysis
-> RubyLLM.chat(model: gemini-3.5-flash)
-> Provider-aware params (Gemini: generationConfig.maxOutputTokens, Anthropic: max_tokens)
-> Structured JSON output via ANALYSIS_SCHEMA
-> Save to SiteMap#seo_report
-> Extract recommendations via RecommendationExtractorService
Full mode runs 8 steps (crawl + 4 syncs + gather context + AI + save).
Analysis-only mode (skip_syncs: true) runs 3 steps.
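The provider-aware parameter step in the flow above could look roughly like the sketch below. It is illustrative only: provider_params is a hypothetical helper, the defaults copy MAX_OUTPUT_TOKENS and TEMPERATURE from Shared Configuration, and how the real service hands these values to RubyLLM is not shown in this document.

```ruby
# Sketch only: a hypothetical helper, not the actual PageAnalysisService code.
# Defaults mirror MAX_OUTPUT_TOKENS and TEMPERATURE from Shared Configuration.
def provider_params(model_id, max_output_tokens: 16_384, temperature: 0.3)
  if model_id.start_with?("gemini-")
    # Gemini nests generation settings under generationConfig.
    { generationConfig: { maxOutputTokens: max_output_tokens, temperature: temperature } }
  elsif model_id.start_with?("claude-")
    # Anthropic takes max_tokens at the top level of the request body.
    { max_tokens: max_output_tokens, temperature: temperature }
  else
    raise ArgumentError, "no parameter mapping for #{model_id}"
  end
end
```

Keeping the mapping in one place means the real-time and batch paths cannot drift apart on token limits.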
Nightly Batch Analysis (50% Cost)
A 4-worker pipeline using the Gemini Batch API:
Phase 1 — Collect (SeoBatchCollectorWorker, 3 AM):
- Finds up to 500 pages needing analysis using tiered freshness:
  - High traffic (100+): every 7 days
  - Moderate (10-99): every 14 days
  - Low (1-9): every 30 days
- Syncs Ahrefs keywords for each page
- Builds prompts via PageAnalysisService#build_prompts (zero AI tokens)
- Stores prompts in SeoBatchItem records
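The tiered freshness rule can be sketched in plain Ruby as follows. This is a minimal illustration, not the collector's code: the method names, the hash keys (visits_30d, last_analyzed_on), the highest-traffic-first ordering, and the choice to skip pages with zero visits are all assumptions.

```ruby
require "date"

# Hypothetical helper: maps a 30-day visit count to a re-analysis interval in days.
# Returns nil for zero visits; how the real collector treats such pages is not documented.
def freshness_interval_days(visits_30d)
  if visits_30d >= 100 then 7
  elsif visits_30d >= 10 then 14
  elsif visits_30d >= 1 then 30
  end
end

# A page is due when it was never analyzed or its tier interval has elapsed.
def analysis_due?(visits_30d, last_analyzed_on, today: Date.today)
  interval = freshness_interval_days(visits_30d)
  return false if interval.nil?
  return true if last_analyzed_on.nil?

  (today - last_analyzed_on).to_i >= interval
end

# Picks at most `limit` due pages, highest traffic first (the ordering is an assumption).
def pages_to_analyze(pages, today: Date.today, limit: 500)
  pages
    .select { |page| analysis_due?(page[:visits_30d], page[:last_analyzed_on], today: today) }
    .sort_by { |page| -page[:visits_30d] }
    .first(limit)
end
```

In the real worker this selection would more likely be a database query against site_maps; the in-memory version only makes the tier boundaries explicit.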
Phase 2a — Submit (SeoBatchSubmitWorker):
- Routes by model prefix:
  - gemini-* → Gemini Batch API (Seo::GeminiBatchClient)
  - claude-* → Anthropic Message Batches API (Seo::AnthropicBatchClient)
  - Other → sequential RubyLLM fallback
- Creates a Gemini context cache for the shared system prompt (75% savings on cached tokens)
- Submits all requests in a single HTTP POST
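Routing by model prefix can be pictured with a small dispatcher like the one below. It is a sketch under assumptions: batch_route_for and the returned symbols are hypothetical names, and the real worker presumably calls the client classes directly instead of returning a symbol.

```ruby
# Hypothetical dispatcher: returns a symbol naming the submission path for a model id.
def batch_route_for(model_id)
  case model_id
  when /\Agemini-/ then :gemini_batch        # handled by Seo::GeminiBatchClient
  when /\Aclaude-/ then :anthropic_batch     # handled by Seo::AnthropicBatchClient
  else :sequential_fallback                  # one RubyLLM call per item
  end
end
```

Anchoring the patterns with \A keeps a model id that merely contains "gemini-" somewhere in the middle from being routed to the Gemini path.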
Phase 2b — Poll (SeoBatchPollWorker):
- Self-re-enqueues with exponential backoff: 2 min → 4 min → 8 min → 15 min (capped)
- Max polling: 24 hours (120 attempts)
- Terminal states: JOB_STATE_SUCCEEDED, JOB_STATE_FAILED, JOB_STATE_CANCELLED, JOB_STATE_EXPIRED
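The capped backoff and terminal-state check can be sketched as below. This is an illustration, not the worker's code: the helper names are hypothetical, and it assumes attempt 1 maps to the first 2-minute delay.

```ruby
# Terminal job states as listed above.
TERMINAL_STATES = %w[
  JOB_STATE_SUCCEEDED JOB_STATE_FAILED JOB_STATE_CANCELLED JOB_STATE_EXPIRED
].freeze

# Delay before the next poll, in minutes: doubles from 2 and is capped at 15.
# Assumes attempts are numbered from 1.
def poll_delay_minutes(attempt)
  [2 * (2**(attempt - 1)), 15].min
end

# True once the batch job can no longer change state.
def terminal_state?(state)
  TERMINAL_STATES.include?(state)
end

# Hypothetical stop condition: stop polling on a terminal state or when attempts run out.
def keep_polling?(state, attempt, max_attempts: 120)
  !terminal_state?(state) && attempt < max_attempts
end
```

If the workers are Sidekiq jobs (perform_async elsewhere in this document suggests so, but that is an assumption), the re-enqueue step could be a perform_in call with poll_delay_minutes(attempt) * 60 seconds and the incremented attempt number.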
Phase 2c — Results (SeoBatchResultsWorker):
- Parses each response (inline or file-based for large batches)
- Saves analysis to SeoBatchItem#result and SiteMap#seo_report
- Runs RecommendationExtractorService for each page
- Deletes Gemini context cache to stop storage charges
Analysis Output
The AI returns a structured JSON object (ANALYSIS_SCHEMA) containing:
| Section | Content |
|---|---|
| current_state_analysis | Title, content, keyword, and schema analysis (chain-of-thought) |
| overall_score | 0-100 score |
| summary | 2-3 sentence overview |
| competitive_position | leading / competitive / average / struggling |
| strengths | Findings with evidence |
| opportunities | Gaps with evidence and recommendations |
| keyword_strategy | Primary/secondary keywords, title action |
| internal_linking | Recommended links with anchor text and placement |
| faq_recommendations | FAQs to add (with FAQ IDs) |
| people_also_ask_content | PAA questions with suggested answers |
| content_recommendations | Content improvements with AI search benefit |
| technical_recommendations | Technical fixes (array of strings) |
| structured_data_recommendations | Schema.org improvements with AIO benefit |
| aio_recommendations | AI Overview / GEO optimization recommendations |
| priority_actions | Top actions ranked by impact and effort |
Results are stored in SiteMap#seo_report (JSONB).
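A parsed report can be sanity-checked against the constraints the table states (score range, the four competitive positions, technical fixes as an array of strings). The sketch below is hypothetical: seo_report_problems is not part of the codebase, it assumes string keys as they would come back from JSONB, and it covers only three of the fields.

```ruby
COMPETITIVE_POSITIONS = %w[leading competitive average struggling].freeze

# Hypothetical sanity check: returns a list of problems, empty when the report looks valid.
def seo_report_problems(report)
  problems = []

  score = report["overall_score"]
  unless score.is_a?(Numeric) && score.between?(0, 100)
    problems << "overall_score must be a number from 0 to 100"
  end

  unless COMPETITIVE_POSITIONS.include?(report["competitive_position"])
    problems << "competitive_position must be one of: #{COMPETITIVE_POSITIONS.join(', ')}"
  end

  technical = report["technical_recommendations"]
  unless technical.is_a?(Array) && technical.all?(String)
    problems << "technical_recommendations must be an array of strings"
  end

  problems
end
```

A check like this is most useful on batch results, where a malformed response would otherwise only surface when the report is rendered.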
Recommendations
Service: Seo::RecommendationExtractorService
Model: SiteMapRecommendation
After each analysis, recommendations are extracted from the seo_report into individual SiteMapRecommendation records for tracking:
| Field | Purpose |
|---|---|
| category | priority_action, internal_linking, faq_recommendation, content_recommendation, technical_recommendation, structured_data, aio_recommendation, people_also_ask |
| status | pending → accepted → in_progress → completed (or ignored, stale) |
| fingerprint | Deduplication key — same recommendation from re-analysis merges rather than duplicates |
| impact / effort | Priority matrix (high/medium/low) |
Managed via Crm::SiteMapRecommendationsController with bulk update support.
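Fingerprint-based merging might work along the lines below. This is a sketch under stated assumptions: the document does not say how the fingerprint is computed, so hashing the category plus a normalized title is a guess, and the title key and both method names are hypothetical.

```ruby
require "digest"

# Hypothetical fingerprint: category plus a case- and punctuation-normalized title, hashed.
def recommendation_fingerprint(category, title)
  normalized = title.downcase.gsub(/[^a-z0-9]+/, " ").strip
  Digest::SHA256.hexdigest("#{category}:#{normalized}")
end

# Merges freshly extracted recommendations into the existing set, keyed by fingerprint.
# A matching record keeps its status and only refreshes impact/effort;
# an unseen fingerprint is added as a new pending record.
def merge_recommendations(existing, extracted)
  by_fingerprint = existing.to_h { |rec| [rec[:fingerprint], rec] }

  extracted.each do |rec|
    fingerprint = recommendation_fingerprint(rec[:category], rec[:title])
    by_fingerprint[fingerprint] =
      if by_fingerprint.key?(fingerprint)
        by_fingerprint[fingerprint].merge(impact: rec[:impact], effort: rec[:effort])
      else
        rec.merge(fingerprint: fingerprint, status: "pending")
      end
  end

  by_fingerprint.values
end
```

Preserving status on a match is the point of the fingerprint: a recommendation someone already accepted should not fall back to pending because the nightly batch reworded it slightly.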
Additional SEO Services
Cannibalization Detection
Seo::CannibalizationService detects when multiple pages compete for the same keyword, causing Google to switch ranking URLs.
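The core idea, several URLs ranking for one keyword, can be sketched as a grouping over keyword rows. This is only an illustration: the row shape (keyword, url, position) and the method name are assumptions, and the actual service may use further signals such as ranking-URL changes over time.

```ruby
# rows: array of { keyword:, url:, position: } hashes (a hypothetical shape).
# Returns { keyword => [urls ordered by best position] } for keywords with 2+ ranking URLs.
def cannibalized_keywords(rows)
  rows
    .group_by { |row| row[:keyword].downcase.strip }
    .select { |_keyword, group| group.map { |row| row[:url] }.uniq.size > 1 }
    .transform_values { |group| group.sort_by { |row| row[:position] }.map { |row| row[:url] }.uniq }
end
```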
Link Auditing
| Service | Purpose |
|---|---|
| Seo::ArticleLinkAuditor | Audits both internal and external links in articles |
| Seo::InternalLinkValidator | Validates internal links, upserts editorial links into SiteMapLink |
| Seo::LinkAnalyzer | Checks external URLs for status codes and redirects |
Content Sanitization
| Service | Purpose |
|---|---|
| Seo::HtmlContentSanitizer | Cleans empty elements, inline styles, table classes |
| Seo::HtmlLinkSanitizer | Normalizes link URLs (locale prefixes, hostnames) |
| Seo::HtmlHeadingSanitizer | Normalizes heading tag hierarchy (h1-h6) |
| Seo::DeparameterizeLinks | Strips query parameters from internal links |
| Seo::HtmlPrettyPrinter | Formats HTML with HtmlBeautifier |
| Seo::ImageOptimizer | Adds loading="lazy" to images |
| Seo::ImageMissingSizeFiller | Fills missing width/height attributes |
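As an example of what one of these sanitizers does, stripping query parameters from internal links (the job of Seo::DeparameterizeLinks) could be approximated with the standard URI library. This is a sketch, not the service's code: strip_internal_query and the internal_hosts parameter are hypothetical, and treating host-less links as internal is an assumption.

```ruby
require "uri"

# Hypothetical helper: drops the query string from links that point at our own site.
# Links without a host (relative paths) are treated as internal.
def strip_internal_query(href, internal_hosts:)
  uri = URI.parse(href)
  internal = uri.host.nil? || internal_hosts.include?(uri.host)
  return href unless internal && uri.query

  uri.query = nil
  uri.to_s
rescue URI::InvalidURIError
  # Leave unparseable hrefs untouched instead of guessing.
  href
end
```

External links are returned unchanged, since their query strings may be part of the destination's addressing.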
CRM Interface
Pages
| Route | View | Purpose |
|---|---|---|
| /crm/site_maps | Index | Filterable list of all site maps with scores |
| /crm/site_maps/:id | Show | Full SEO report with recommendations |
| /crm/site_maps/action_items | Action Items | All pending recommendations across pages |
| /crm/seo_keywords | Keywords | Overview of all tracked keywords |
| /crm/seo_keywords/:id | Keyword Detail | Pages ranking for a specific keyword |
| /crm/metrics_analysis | Metrics | Time-series charts for SEO metrics |
CRM Actions
| Action | Endpoint | Effect |
|---|---|---|
| Analyze (full) | POST /crm/site_maps/:id/analyze | Queues SeoPageAnalysisWorker with all syncs |
| Analyze (AI only) | POST /crm/site_maps/:id/analyze_only | Queues with skip_syncs: true |
| Analyze (premium) | POST /crm/site_maps/:id/analyze_premium | Uses Claude Opus with extended thinking |
| Sync Keywords | POST /crm/site_maps/:id/sync_keywords | Runs KeywordSyncService inline |
| Sync Visits | POST /crm/site_maps/:id/sync_visits | Runs VisitsSyncService inline |
| Sync GSC | POST /crm/site_maps/:id/sync_gsc | Runs GSC sync inline |
| Sync GA4 | POST /crm/site_maps/:id/sync_ga4 | Runs GA4 sync inline |
| Recrawl | POST /crm/site_maps/:id/recrawl | Re-crawls page content and schema |
Key Models
| Model | Table | Purpose |
|---|---|---|
| SiteMap | site_maps | Pages tracked for SEO (URL, content, report, metrics) |
| SiteMapDataPoint | site_map_data_points | Time-series metrics (fact table) |
| SiteMapRecommendation | site_map_recommendations | Extracted action items with status tracking |
| SiteMapLink | site_map_links | Internal link graph (outbound/inbound editorial) |
| SeoPageKeyword | seo_page_keywords | Keywords tracked per page (position, volume, source) |
| SeoBatchJob | seo_batch_jobs | Batch analysis job (status, provider, metadata) |
| SeoBatchItem | seo_batch_items | Individual page prompt + result within a batch |
File Index
Workers
| File | Purpose |
|---|---|
| app/workers/seo_batch_collector_worker.rb | Phase 1: collect prompts for batch API |
| app/workers/seo_batch_submit_worker.rb | Phase 2a: submit to Gemini/Anthropic Batch API |
| app/workers/seo_batch_poll_worker.rb | Phase 2b: poll batch API until completion |
| app/workers/seo_batch_results_worker.rb | Phase 2c: process results, save reports |
| app/workers/seo_page_analysis_worker.rb | On-demand full SEO analysis (crawl + sync + AI) |
| app/workers/seo_metrics_sync_worker.rb | Orchestrates nightly metric syncs |
| app/workers/seo_visits_sync_worker.rb | Sync first-party visit counts |
| app/workers/seo_gsc_sync_worker.rb | Sync Google Search Console data |
| app/workers/seo_ga4_sync_worker.rb | Sync Google Analytics 4 data |
| app/workers/seo_ahrefs_sync_worker.rb | Sync Ahrefs keyword and traffic data |
| app/workers/site_map_content_extraction_worker.rb | Nightly page crawl for content and schema |
Services
| File | Purpose |
|---|---|
| app/services/seo/page_analysis_service.rb | AI analysis (prompts, config, real-time execution) |
| app/services/seo/recommendation_extractor_service.rb | Extract recommendations from seo_report |
| app/services/seo/gemini_batch_client.rb | Gemini Batch API client (Faraday) |
| app/services/seo/anthropic_batch_client.rb | Anthropic Message Batches API client (Faraday) |
| app/services/seo/visits_sync_service.rb | First-party visit sync |
| app/services/seo/gsc_sync_service.rb | GSC metrics sync |
| app/services/seo/ga4_sync_service.rb | GA4 metrics sync |
| app/services/seo/ahrefs_sync_service.rb | Ahrefs metrics sync |
| app/services/seo/keyword_sync_service.rb | Per-page keyword sync (GSC + Ahrefs + Planner) |
| app/services/seo/gsc_keyword_sync_service.rb | GSC keyword rankings for a single page |
| app/services/seo/cannibalization_service.rb | Keyword cannibalization detection |
| app/services/seo/mcp_clients.rb | Factory for Ahrefs and GSC API clients |
Controllers
| File | Purpose |
|---|---|
| app/controllers/crm/site_maps_controller.rb | SiteMap CRUD, analysis, sync actions |
| app/controllers/crm/site_map_recommendations_controller.rb | Recommendation status management |
| app/controllers/crm/seo_keywords_controller.rb | Keyword overview and detail |
| app/controllers/crm/metrics_analysis_controller.rb | Time-series metric charts |