SEO System Architecture

Comprehensive reference for the SEO analysis, metrics, and optimization features.

Related docs:

  • SEO Batch API Pipeline — detailed batch API notes, cost analysis, caching
  • SEO Metrics Data Model — time-series data model for metrics
  • SEO Tasks — actionable SEO tasks from keyword analysis

The SEO system has four pillars:

  1. Data Collection — nightly syncs from first-party visits, GSC, GA4, and Ahrefs
  2. Content Crawling — nightly extraction of page content, JSON-LD schema, and internal links
  3. AI Analysis — real-time (on-demand) and batch (nightly) AI-powered SEO analysis
  4. Recommendations — extracted, trackable action items from AI analysis

All SEO jobs run on a staggered schedule to ensure data is fresh before analysis:

1:00 AM  SiteMapContentExtractionWorker     Crawl pages for content, schema, links
1:10 AM  Sitemap::SitemapGenerator          Generate public sitemap XML
2:00 AM  SeoVisitsSyncWorker                Sync first-party visit counts
2:00 AM  SeoGscSyncWorker                   Sync GSC clicks, impressions, CTR
2:00 AM  SeoGa4SyncWorker                   Sync GA4 page views, sessions, engagement
2:30 AM  SeoAhrefsSyncWorker                Sync Ahrefs keywords, positions, backlinks
2:50 AM  TextEmbeddingPopulation (SiteMap)  Backfill SiteMap embeddings
3:00 AM  SeoBatchCollectorWorker            Phase 1: build AI prompts (zero tokens)
   ->    SeoBatchSubmitWorker               Phase 2a: submit to Gemini Batch API
   ->    SeoBatchPollWorker                 Phase 2b: poll until complete
   ->    SeoBatchResultsWorker              Phase 2c: save results + extract recommendations

| Source | Service | Worker | Data Captured |
| --- | --- | --- | --- |
| First-party visits | Seo::VisitsSyncService | SeoVisitsSyncWorker | 30-day page view counts from Visits table |
| Google Search Console | Seo::GscSyncService | SeoGscSyncWorker | Clicks, impressions, CTR, avg position (28-day) |
| Google Analytics 4 | Seo::Ga4SyncService | SeoGa4SyncWorker | Page views, sessions, users, bounce rate, engagement rate |
| Ahrefs | Seo::AhrefsSyncService | SeoAhrefsSyncWorker | Organic traffic, keyword count, positions, traffic value |
| Ahrefs (per-page) | Seo::KeywordSyncService | | Organic keywords + Google Keyword Planner volume |

| Client | Auth | Purpose |
| --- | --- | --- |
| Seo::GscApiClient | Google service account | GSC Search Analytics API |
| Seo::Ga4ApiClient | Google service account | GA4 Data API |
| Seo::AhrefsApiClient | API key (v3 REST) | Ahrefs organic keywords, top pages |
| Seo::AhrefsMcpClient | MCP JSON-RPC | Ahrefs via MCP server |
| Seo::GoogleKeywordPlannerClient | Google Ads API | Search volume for keywords |
| Seo::KeywordsPeopleUseClient | API key | People Also Ask, autocomplete |

Metrics are stored in two places:

  1. site_map_data_points — time-series fact table (source of truth for trends)
  2. Cached columns on site_maps: visit_count_30d, seo_clicks, seo_traffic, etc. (for SQL sorting/filtering)

See SEO Metrics Data Model for schema details.

Worker: SiteMapContentExtractionWorker (1 AM nightly)

Crawls cacheable pages via Cache::SiteCrawler and stores:

| Column | Content |
| --- | --- |
| extracted_content | Sanitized page text (headings, paragraphs, lists) |
| extracted_title | Page <title> tag |
| rendered_schema | JSON-LD structured data found on the page |
| extracted_at | Timestamp of last crawl |

Also discovers and stores the internal link graph in SiteMapLink records (outbound/inbound editorial links between pages).
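
To make the inbound/outbound distinction concrete, here is a minimal sketch that tallies link counts per page from a list of edges. The `EditorialLink` struct and `link_counts` helper are hypothetical stand-ins for illustration; the real SiteMapLink model and its columns are not shown in this document.

```ruby
# Hypothetical stand-in for a SiteMapLink record: one directed editorial link.
EditorialLink = Struct.new(:source_url, :target_url)

# Tallies outbound and inbound link counts per URL from a list of edges.
def link_counts(links)
  counts = Hash.new { |hash, url| hash[url] = { outbound: 0, inbound: 0 } }
  links.each do |link|
    counts[link.source_url][:outbound] += 1
    counts[link.target_url][:inbound] += 1
  end
  counts
end

links = [
  EditorialLink.new("/a", "/b"),
  EditorialLink.new("/a", "/c"),
  EditorialLink.new("/b", "/c")
]
counts = link_counts(links)
```

Pages with few inbound links in such a tally are the natural candidates for the internal linking recommendations produced by the analysis.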

All analysis paths (real-time, Gemini batch, Anthropic batch, sequential fallback) derive their config from constants on Seo::PageAnalysisService:

MAX_OUTPUT_TOKENS = 16_384
TEMPERATURE = 0.3
THINKING_BUDGET = 10_000 # premium only (Claude Opus)
ANALYSIS_MODEL = AiModelConstants.id(:seo_analysis) # gemini-3.5-flash
ANALYSIS_MODEL_PREMIUM = AiModelConstants.id(:anthropic_opus) # claude-opus-4-8
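
As a rough illustration of how shared constants like these might be mapped onto provider-specific request parameters, the sketch below builds a params hash from a model ID. The `ProviderParams` module, the `for_model` method, and the placement of `temperature` are assumptions made for this example; only the `generationConfig.maxOutputTokens` and `max_tokens` key names come from this document.

```ruby
# Hypothetical sketch, not the actual Seo::PageAnalysisService code.
module ProviderParams
  MAX_OUTPUT_TOKENS = 16_384
  TEMPERATURE = 0.3

  module_function

  # Returns a params hash shaped for the provider implied by the model ID.
  def for_model(model_id)
    if model_id.start_with?("gemini-")
      # Gemini nests generation settings under generationConfig.
      { generationConfig: { maxOutputTokens: MAX_OUTPUT_TOKENS, temperature: TEMPERATURE } }
    elsif model_id.start_with?("claude-")
      # Anthropic takes max_tokens at the top level.
      { max_tokens: MAX_OUTPUT_TOKENS, temperature: TEMPERATURE }
    else
      raise ArgumentError, "no provider mapping for model #{model_id}"
    end
  end
end

gemini_params = ProviderParams.for_model("gemini-3.5-flash")
claude_params = ProviderParams.for_model("claude-opus-4-8")
```

Keeping the limits in one place means every analysis path sends the same budget regardless of which provider handles the request.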

Triggered from the CRM via SeoPageAnalysisWorker:

CRM "Analyze" button
-> SeoPageAnalysisWorker.perform_async(site_map_id)
-> Crawl page (if stale >24h)
-> Sync visits, GSC, GA4, keywords
-> PageAnalysisService#generate_analysis
-> RubyLLM.chat(model: gemini-3.5-flash)
-> Provider-aware params (Gemini: generationConfig.maxOutputTokens, Anthropic: max_tokens)
-> Structured JSON output via ANALYSIS_SCHEMA
-> Save to SiteMap#seo_report
-> Extract recommendations via RecommendationExtractorService

Full mode runs 8 steps (crawl + 4 syncs + gather context + AI + save). Analysis-only mode (skip_syncs: true) runs 3 steps.
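
The step counts can be sketched as two lists, assuming the four syncs are the ones named in the flow above (visits, GSC, GA4, keywords). The step names and the `analysis_steps` helper are illustrative only and do not reflect the worker's real method names.

```ruby
# Hypothetical step names; the real worker's internals may differ.
SYNC_STEPS = %i[crawl sync_visits sync_gsc sync_ga4 sync_keywords].freeze
ANALYSIS_STEPS = %i[gather_context run_ai save_report].freeze

# Full mode runs every step; analysis-only mode skips the crawl and syncs.
def analysis_steps(skip_syncs: false)
  skip_syncs ? ANALYSIS_STEPS : SYNC_STEPS + ANALYSIS_STEPS
end

full_steps = analysis_steps
ai_only_steps = analysis_steps(skip_syncs: true)
```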

A 4-worker pipeline using the Gemini Batch API:

Phase 1 — Collect (SeoBatchCollectorWorker, 3 AM):

  • Finds up to 500 pages needing analysis using tiered freshness:
    • High traffic (100+): every 7 days
    • Moderate (10-99): every 14 days
    • Low (1-9): every 30 days
  • Syncs Ahrefs keywords for each page
  • Builds prompts via PageAnalysisService#build_prompts (zero AI tokens)
  • Stores prompts in SeoBatchItem records
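
The tiered freshness rule can be sketched as a small lookup. This is a minimal illustration, assuming the thresholds are visit counts, that a never-analyzed page always qualifies, and that pages with zero traffic are skipped; none of those three points is stated explicitly above, and the helper names are hypothetical.

```ruby
# Tiers from the list above, ordered from highest traffic to lowest.
FRESHNESS_TIERS = [
  { min_traffic: 100, max_age_days: 7 },
  { min_traffic: 10,  max_age_days: 14 },
  { min_traffic: 1,   max_age_days: 30 }
].freeze

SECONDS_PER_DAY = 86_400

# Returns the re-analysis interval in days, or nil when no tier applies.
def max_age_days_for(traffic)
  tier = FRESHNESS_TIERS.find { |t| traffic >= t[:min_traffic] }
  tier && tier[:max_age_days]
end

# True when the page's last analysis is older than its tier allows.
def needs_analysis?(traffic:, analyzed_at:, now: Time.now)
  max_age = max_age_days_for(traffic)
  return false if max_age.nil?     # assumption: zero-traffic pages are skipped
  return true if analyzed_at.nil?  # assumption: never analyzed means due now
  (now - analyzed_at) >= max_age * SECONDS_PER_DAY
end
```

For example, a page with 150 visits analyzed 11 days ago is due again, while a page with 50 visits analyzed on the same day is not.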

Phase 2a — Submit (SeoBatchSubmitWorker):

  • Routes by model prefix:
    • gemini-* → Gemini Batch API (Seo::GeminiBatchClient)
    • claude-* → Anthropic Message Batches API (Seo::AnthropicBatchClient)
    • Other → sequential RubyLLM fallback
  • Creates a Gemini context cache for the shared system prompt (75% savings on cached tokens)
  • Submits all requests in a single HTTP POST
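
Routing by model prefix could look roughly like the following. The `batch_route_for` helper and the symbols it returns are hypothetical; the submit worker's real dispatch code is not shown in this document.

```ruby
# Picks a submission path from the model ID prefix.
def batch_route_for(model_id)
  case model_id
  when /\Agemini-/ then :gemini_batch      # Seo::GeminiBatchClient
  when /\Aclaude-/ then :anthropic_batch   # Seo::AnthropicBatchClient
  else :sequential                         # sequential RubyLLM fallback
  end
end

route = batch_route_for("gemini-3.5-flash")
```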

Phase 2b — Poll (SeoBatchPollWorker):

  • Self-re-enqueues with exponential backoff: 2 min → 4 min → 8 min → 15 min (capped)
  • Max polling: 24 hours (120 attempts)
  • Terminal states: JOB_STATE_SUCCEEDED, JOB_STATE_FAILED, JOB_STATE_CANCELLED, JOB_STATE_EXPIRED
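
The backoff schedule can be approximated as a doubling delay capped at 15 minutes. The sketch below is an assumption about how the delays are derived, since only the resulting sequence is documented; the helper names, the 1-based attempt counter, and the return values of `next_poll` are hypothetical.

```ruby
TERMINAL_STATES = %w[
  JOB_STATE_SUCCEEDED JOB_STATE_FAILED JOB_STATE_CANCELLED JOB_STATE_EXPIRED
].freeze

BASE_DELAY_MINUTES = 2
MAX_DELAY_MINUTES = 15
MAX_ATTEMPTS = 120

# Delay before the next poll, doubling from 2 minutes and capped at 15.
# attempt is 1-based.
def poll_delay_minutes(attempt)
  [BASE_DELAY_MINUTES * (2**(attempt - 1)), MAX_DELAY_MINUTES].min
end

# Decides what the poller does after checking the job state:
# :done on a terminal state, :give_up at the attempt limit,
# otherwise the number of minutes to wait before re-enqueueing.
def next_poll(state:, attempt:)
  return :done if TERMINAL_STATES.include?(state)
  return :give_up if attempt >= MAX_ATTEMPTS
  poll_delay_minutes(attempt)
end

delays = (1..5).map { |attempt| poll_delay_minutes(attempt) }
```

Here `delays` holds the first five waits, which follow the documented sequence and then stay at the cap. The 24-hour wall-clock limit would be a separate check against the job's start time and is left out of this sketch.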

Phase 2c — Results (SeoBatchResultsWorker):

  • Parses each response (inline or file-based for large batches)
  • Saves analysis to SeoBatchItem#result and SiteMap#seo_report
  • Runs RecommendationExtractorService for each page
  • Deletes Gemini context cache to stop storage charges

The AI returns a structured JSON object (ANALYSIS_SCHEMA) containing:

| Section | Content |
| --- | --- |
| current_state_analysis | Title, content, keyword, and schema analysis (chain-of-thought) |
| overall_score | 0-100 score |
| summary | 2-3 sentence overview |
| competitive_position | leading / competitive / average / struggling |
| strengths | Findings with evidence |
| opportunities | Gaps with evidence and recommendations |
| keyword_strategy | Primary/secondary keywords, title action |
| internal_linking | Recommended links with anchor text and placement |
| faq_recommendations | FAQs to add (with FAQ IDs) |
| people_also_ask_content | PAA questions with suggested answers |
| content_recommendations | Content improvements with AI search benefit |
| technical_recommendations | Technical fixes (array of strings) |
| structured_data_recommendations | Schema.org improvements with AIO benefit |
| aio_recommendations | AI Overview / GEO optimization recommendations |
| priority_actions | Top actions ranked by impact and effort |

Results are stored in SiteMap#seo_report (JSONB).
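
A consumer of the stored report might sanity-check a couple of the fields before using them. The following is a minimal sketch, assuming the report is available as a parsed JSON hash with string keys and that `overall_score` is numeric; the `report_errors` helper is hypothetical and covers only two of the fields listed above.

```ruby
require "json"

COMPETITIVE_POSITIONS = %w[leading competitive average struggling].freeze

# Returns a list of human-readable problems; an empty list means both checks passed.
def report_errors(report)
  errors = []
  score = report["overall_score"]
  unless score.is_a?(Numeric) && (0..100).cover?(score)
    errors << "overall_score must be a number between 0 and 100"
  end
  unless COMPETITIVE_POSITIONS.include?(report["competitive_position"])
    errors << "competitive_position is not a recognized value"
  end
  errors
end

# Illustrative payload, not real analysis output.
sample = JSON.parse('{"overall_score": 72, "competitive_position": "competitive"}')
problems = report_errors(sample)
```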

Service: Seo::RecommendationExtractorService
Model: SiteMapRecommendation

After each analysis, recommendations are extracted from the seo_report into individual SiteMapRecommendation records for tracking:

| Field | Purpose |
| --- | --- |
| category | priority_action, internal_linking, faq_recommendation, content_recommendation, technical_recommendation, structured_data, aio_recommendation, people_also_ask |
| status | pending → accepted → in_progress → completed (or ignored, stale) |
| fingerprint | Deduplication key: the same recommendation from a re-analysis merges rather than duplicates |
| impact / effort | Priority matrix (high/medium/low) |

Managed via Crm::SiteMapRecommendationsController with bulk update support.
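
Fingerprint-based deduplication could work along these lines. This is a sketch under stated assumptions: the inputs to the fingerprint (page ID, category, normalized title), the use of SHA-256, and the merge behavior (refresh impact and effort, keep the existing status) are all illustrative guesses, not the extractor's documented algorithm.

```ruby
require "digest"

# Builds a stable key so that cosmetic wording changes map to the same record.
def recommendation_fingerprint(site_map_id:, category:, title:)
  normalized = title.downcase.gsub(/[^a-z0-9]+/, " ").strip
  Digest::SHA256.hexdigest([site_map_id, category, normalized].join("|"))
end

# Merges freshly extracted recommendations into existing ones by fingerprint.
# Existing records keep their status; new records start as "pending".
def merge_recommendations(existing, incoming)
  by_fingerprint = existing.to_h { |rec| [rec[:fingerprint], rec] }
  incoming.each do |rec|
    if (current = by_fingerprint[rec[:fingerprint]])
      current[:impact] = rec[:impact]
      current[:effort] = rec[:effort]
    else
      by_fingerprint[rec[:fingerprint]] = rec.merge(status: "pending")
    end
  end
  by_fingerprint.values
end
```

With this approach, a recommendation that was already accepted keeps its status when a later analysis produces the same item again.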

Seo::CannibalizationService detects when multiple pages compete for the same keyword, causing Google to switch ranking URLs.
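
At its simplest, detecting cannibalization means grouping ranking rows by keyword and flagging keywords with more than one URL. The sketch below assumes rows shaped as hashes with `:keyword` and `:url` keys; the `cannibalized_keywords` helper is hypothetical and is not the service's actual interface.

```ruby
# Returns a hash of keyword => URLs for keywords that more than one page ranks for.
def cannibalized_keywords(rankings)
  rankings
    .group_by { |row| row[:keyword].downcase }
    .transform_values { |rows| rows.map { |row| row[:url] }.uniq }
    .select { |_keyword, urls| urls.size > 1 }
end

# Illustrative rows, not real ranking data.
rankings = [
  { keyword: "CRM software", url: "/a" },
  { keyword: "crm software", url: "/b" },
  { keyword: "seo", url: "/c" },
  { keyword: "seo", url: "/c" }
]
conflicts = cannibalized_keywords(rankings)
```

A fuller version would likely also weigh positions and how often the ranking URL changes over time, which this sketch ignores.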

| Service | Purpose |
| --- | --- |
| Seo::ArticleLinkAuditor | Audits both internal and external links in articles |
| Seo::InternalLinkValidator | Validates internal links, upserts editorial links into SiteMapLink |
| Seo::LinkAnalyzer | Checks external URLs for status codes and redirects |

| Service | Purpose |
| --- | --- |
| Seo::HtmlContentSanitizer | Cleans empty elements, inline styles, table classes |
| Seo::HtmlLinkSanitizer | Normalizes link URLs (locale prefixes, hostnames) |
| Seo::HtmlHeadingSanitizer | Normalizes heading tag hierarchy (h1-h6) |
| Seo::DeparameterizeLinks | Strips query parameters from internal links |
| Seo::HtmlPrettyPrinter | Formats HTML with HtmlBeautifier |
| Seo::ImageOptimizer | Adds loading="lazy" to images |
| Seo::ImageMissingSizeFiller | Fills missing width/height attributes |
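
As one example of what these sanitizers do, stripping query parameters from internal links can be sketched with Ruby's standard `URI` library. The `deparameterize` helper, its `internal_host:` argument, and the rule that relative URLs count as internal are assumptions for illustration; Seo::DeparameterizeLinks itself may work differently.

```ruby
require "uri"

# Removes the query string from internal URLs and leaves external URLs untouched.
def deparameterize(url, internal_host:)
  uri = URI.parse(url)
  # Assumption: relative URLs (no host) are internal.
  internal = uri.host.nil? || uri.host == internal_host
  return url unless internal

  uri.query = nil
  uri.to_s
rescue URI::InvalidURIError
  url # leave unparseable URLs as they are
end

cleaned = deparameterize("/pricing?ref=nav", internal_host: "example.com")
```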

| Route | View | Purpose |
| --- | --- | --- |
| /crm/site_maps | Index | Filterable list of all site maps with scores |
| /crm/site_maps/:id | Show | Full SEO report with recommendations |
| /crm/site_maps/action_items | Action Items | All pending recommendations across pages |
| /crm/seo_keywords | Keywords | Overview of all tracked keywords |
| /crm/seo_keywords/:id | Keyword Detail | Pages ranking for a specific keyword |
| /crm/metrics_analysis | Metrics | Time-series charts for SEO metrics |

| Action | Endpoint | Effect |
| --- | --- | --- |
| Analyze (full) | POST /crm/site_maps/:id/analyze | Queues SeoPageAnalysisWorker with all syncs |
| Analyze (AI only) | POST /crm/site_maps/:id/analyze_only | Queues with skip_syncs: true |
| Analyze (premium) | POST /crm/site_maps/:id/analyze_premium | Uses Claude Opus with extended thinking |
| Sync Keywords | POST /crm/site_maps/:id/sync_keywords | Runs KeywordSyncService inline |
| Sync Visits | POST /crm/site_maps/:id/sync_visits | Runs VisitsSyncService inline |
| Sync GSC | POST /crm/site_maps/:id/sync_gsc | Runs GSC sync inline |
| Sync GA4 | POST /crm/site_maps/:id/sync_ga4 | Runs GA4 sync inline |
| Recrawl | POST /crm/site_maps/:id/recrawl | Re-crawls page content and schema |

| Model | Table | Purpose |
| --- | --- | --- |
| SiteMap | site_maps | Pages tracked for SEO (URL, content, report, metrics) |
| SiteMapDataPoint | site_map_data_points | Time-series metrics (fact table) |
| SiteMapRecommendation | site_map_recommendations | Extracted action items with status tracking |
| SiteMapLink | site_map_links | Internal link graph (outbound/inbound editorial) |
| SeoPageKeyword | seo_page_keywords | Keywords tracked per page (position, volume, source) |
| SeoBatchJob | seo_batch_jobs | Batch analysis job (status, provider, metadata) |
| SeoBatchItem | seo_batch_items | Individual page prompt + result within a batch |

| File | Purpose |
| --- | --- |
| app/workers/seo_batch_collector_worker.rb | Phase 1: collect prompts for batch API |
| app/workers/seo_batch_submit_worker.rb | Phase 2a: submit to Gemini/Anthropic Batch API |
| app/workers/seo_batch_poll_worker.rb | Phase 2b: poll batch API until completion |
| app/workers/seo_batch_results_worker.rb | Phase 2c: process results, save reports |
| app/workers/seo_page_analysis_worker.rb | On-demand full SEO analysis (crawl + sync + AI) |
| app/workers/seo_metrics_sync_worker.rb | Orchestrates nightly metric syncs |
| app/workers/seo_visits_sync_worker.rb | Sync first-party visit counts |
| app/workers/seo_gsc_sync_worker.rb | Sync Google Search Console data |
| app/workers/seo_ga4_sync_worker.rb | Sync Google Analytics 4 data |
| app/workers/seo_ahrefs_sync_worker.rb | Sync Ahrefs keyword and traffic data |
| app/workers/site_map_content_extraction_worker.rb | Nightly page crawl for content and schema |

| File | Purpose |
| --- | --- |
| app/services/seo/page_analysis_service.rb | AI analysis (prompts, config, real-time execution) |
| app/services/seo/recommendation_extractor_service.rb | Extract recommendations from seo_report |
| app/services/seo/gemini_batch_client.rb | Gemini Batch API client (Faraday) |
| app/services/seo/anthropic_batch_client.rb | Anthropic Message Batches API client (Faraday) |
| app/services/seo/visits_sync_service.rb | First-party visit sync |
| app/services/seo/gsc_sync_service.rb | GSC metrics sync |
| app/services/seo/ga4_sync_service.rb | GA4 metrics sync |
| app/services/seo/ahrefs_sync_service.rb | Ahrefs metrics sync |
| app/services/seo/keyword_sync_service.rb | Per-page keyword sync (GSC + Ahrefs + Planner) |
| app/services/seo/gsc_keyword_sync_service.rb | GSC keyword rankings for a single page |
| app/services/seo/cannibalization_service.rb | Keyword cannibalization detection |
| app/services/seo/mcp_clients.rb | Factory for Ahrefs and GSC API clients |

| File | Purpose |
| --- | --- |
| app/controllers/crm/site_maps_controller.rb | SiteMap CRUD, analysis, sync actions |
| app/controllers/crm/site_map_recommendations_controller.rb | Recommendation status management |
| app/controllers/crm/seo_keywords_controller.rb | Keyword overview and detail |
| app/controllers/crm/metrics_analysis_controller.rb | Time-series metric charts |