Class: Seo::InternalLinkValidator
- Inherits:
-
BaseService
- Object
- BaseService
- Seo::InternalLinkValidator
- Defined in:
- app/services/seo/internal_link_validator.rb
Overview
Validates that internal WarmlyYours links in HTML content point to real pages.
Three-tier validation strategy:
Tier 1 (DB lookup, fast): SiteMap path lookup (active OR archived-but-200) + Post/slug check
Tier 2 (HTTP ping, resilient): HEAD request with retry for pages not yet in the sitemap
Tier 3 (Suggestion): Legacy URL pattern resolution via BrokenLinkRedirectMap (DB-backed)
Also provides editorial link extraction for populating the SiteMapLink graph
at save time rather than waiting for the nightly crawler.
Usage:
result = Seo::InternalLinkValidator.new.process(html)
result.valid? # => true/false
result.broken_links # => [{ href: "...", path: "...", suggestion: "..." }, ...]
Populate link graph after a successful save:
Seo::InternalLinkValidator.upsert_editorial_links!(article)
Defined Under Namespace
Classes: BrokenLink, Result
Constant Summary collapse
- WY_HOSTNAME_PATTERN =
/\Awww\.warmlyyours\./i- LOCALE_PATTERN =
%r{^/([a-z]{2}-[A-Z]{2}|[a-z]{2}(?=/)|%7B%7B\s*locale\s*%7D%7D|\{\{[\s]*locale[\s]*\}\})}- POST_PATH_PATTERN =
%r{\A/posts/([^/?#]+)}- WEB_BASE =
'https://www.warmlyyours.com'- MAX_SEMANTIC_FALLBACKS_PER_PROCESS =
Cap: validating an HTML body can surface several broken links at once.
Embedding lookups call OpenAI per query (~200ms), so only resolve the
first few via semantic fallback — the cheap tiers still cover the rest. 3- CANDIDATE_LIMIT =
3
Class Method Summary collapse
-
.upsert_editorial_links!(article) ⇒ Object
Upsert editorial link graph entries for an article's content.
Instance Method Summary collapse
-
#extract_editorial_link_data(html) ⇒ Array<Hash>
Extract editorial link data from HTML in the format SiteMapLink.upsert_for_page! expects.
-
#process(html, skip_http_ping: false) ⇒ Result
Validate all internal WarmlyYours links in the given HTML.
-
#suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true) ⇒ Object
Public so callers (e.g. content validators, ad-hoc rake tasks) can resolve a single broken path without going through the full HTML scan.
Methods inherited from BaseService
#initialize, #log_debug, #log_error, #log_info, #log_warning, #logger, #options, #tagged_logger
Constructor Details
This class inherits a constructor from BaseService
Class Method Details
.upsert_editorial_links!(article) ⇒ Object
Upsert editorial link graph entries for an article's content.
Extracts internal links from the article's HTML and writes them to SiteMapLink,
making the link graph immediately accurate without waiting for the nightly crawler.
53 54 55 56 57 58 59 60 61 62 63 64 |
# File 'app/services/seo/internal_link_validator.rb', line 53 def self.upsert_editorial_links!(article) html = article.solution return if html.blank? from_site_map = article.site_maps.find_by(locale: 'en-US') || article.site_maps.first return unless from_site_map links = new.extract_editorial_link_data(html) return if links.empty? SiteMapLink.upsert_for_page!(from_site_map, links) end |
Instance Method Details
#extract_editorial_link_data(html) ⇒ Array<Hash>
Extract editorial link data from HTML in the format SiteMapLink.upsert_for_page! expects.
70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 |
# File 'app/services/seo/internal_link_validator.rb', line 70 def extract_editorial_link_data(html) return [] if html.blank? resolved = normalize_liquid_locale(html).gsub(/\{\{[\s]*locale[\s]*\}\}/, 'en-US') doc = Nokogiri::HTML::DocumentFragment.parse(resolved) links = [] seen = Set.new doc.css('a[href]').each do |anchor| href = anchor['href'].to_s.strip next if href.blank? path = extract_path(href) next if path.blank? || path == '/' uri = Addressable::URI.parse(href) rescue nil next unless uri next unless uri.host.nil? ? href.start_with?('/') : uri.host =~ WY_HOSTNAME_PATTERN key = "#{path}|editorial" next if seen.include?(key) seen << key links << { to_path: path, anchor_text: anchor.text.strip.truncate(255), link_type: 'editorial', context_snippet: anchor.ancestors('p, li, td, div').first&.text.to_s.squish.truncate(200) } end links end |
#process(html, skip_http_ping: false) ⇒ Result
Validate all internal WarmlyYours links in the given HTML.
110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 |
# File 'app/services/seo/internal_link_validator.rb', line 110 def process(html, skip_http_ping: false) return Result.new if html.blank? hrefs = extract_internal_links(html) return Result.new if hrefs.empty? broken = [] semantic_budget = MAX_SEMANTIC_FALLBACKS_PER_PROCESS hrefs.each do |href| path = extract_path(href) next if path.blank? || path == '/' next if resolve_via_db(path) next if !skip_http_ping && resolve_via_http(path) candidates = suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: semantic_budget.positive?) semantic_budget -= 1 if candidates.any? && semantic_budget.positive? broken << BrokenLink.new( href: href, path: path, suggestion: candidates.first, did_you_mean: candidates ) end Result.new(broken_links: broken, checked_count: hrefs.size) end |
#suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true) ⇒ Object
Public so callers (e.g. content validators, ad-hoc rake tasks) can resolve
a single broken path without going through the full HTML scan.
140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 |
# File 'app/services/seo/internal_link_validator.rb', line 140 def suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true) candidates = [] legacy = Seo::BrokenLinkRedirectMap.lookup(path) candidates << legacy if legacy if (match = path.match(POST_PATH_PATTERN)) post_match = suggest_post_correction(match[1]) candidates << post_match if post_match end trigram_match = SiteMap.active.similar_path(path).limit(1).pick(:path) if trigram_match guess_words = significant_words(path.split('/').last.to_s.tr('-', ' ')) candidates << trigram_match if sufficient_overlap?(guess_words, trigram_match.split('/').last.to_s.tr('-', ' ')) end candidates.concat(semantic_candidates(path, limit: limit)) if allow_semantic && candidates.size < limit candidates.compact.uniq.first(limit) end |