Class: Seo::InternalLinkValidator
- Inherits:
-
BaseService
- Object
- BaseService
- Seo::InternalLinkValidator
- Defined in:
- app/services/seo/internal_link_validator.rb
Overview
Validates that internal WarmlyYours links in HTML content point to real pages.
Three-tier validation strategy:
Tier 1 (DB lookup, fast): SiteMap path lookup (active OR archived-but-200) + Post/slug check
Tier 2 (HTTP ping, resilient): HEAD request with retry for pages not yet in the sitemap
Tier 3 (Suggestion): Legacy URL pattern resolution via BrokenLinkRedirectMap (DB-backed)
Also provides editorial link extraction for populating the SiteMapLink graph
at save time rather than waiting for the nightly crawler.
Usage:
result = Seo::InternalLinkValidator.new.process(html)
result.valid? # => true/false
result.broken_links # => [{ href: "...", path: "...", suggestion: "..." }, ...]
Populate link graph after a successful save:
Seo::InternalLinkValidator.upsert_editorial_links!(article)
Defined Under Namespace
Classes: BrokenLink, Result
Constant Summary collapse
- WY_HOSTNAME_PATTERN =
Regex pattern matching wy hostname.
/\Awww\.warmlyyours\./i- LOCALE_PATTERN =
Regex pattern matching locale.
%r{^/([a-z]{2}-[A-Z]{2}|[a-z]{2}(?=/)|%7B%7B\s*locale\s*%7D%7D|\{\{\s*locale\s*\}\})}- POST_PATH_PATTERN =
Regex pattern matching post path.
%r{\A/posts/([^/?#]+)}- WEB_BASE =
Web base.
'https://www.warmlyyours.com'- MAX_SEMANTIC_FALLBACKS_PER_PROCESS =
Cap: validating an HTML body can surface several broken links at once.
Embedding lookups call OpenAI per query (~200ms), so only resolve the
first few via semantic fallback — the cheap tiers still cover the rest. 3- CANDIDATE_LIMIT =
Limit for candidate.
3
Instance Attribute Summary
Attributes inherited from BaseService
Class Method Summary collapse
-
.upsert_editorial_links!(article) ⇒ Object
Upsert editorial link graph entries for an article's content.
Instance Method Summary collapse
-
#extract_editorial_link_data(html) ⇒ Array<Hash>
Extract editorial link data from HTML in the format SiteMapLink.upsert_for_page! expects.
-
#internal_link_paths(html) ⇒ Set<String>
Set of locale-stripped internal link paths present in HTML.
-
#process(html, skip_http_ping: false) ⇒ Result
Validate all internal WarmlyYours links in the given HTML.
-
#suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true) ⇒ Object
Public so callers (e.g. content validators, ad-hoc rake tasks) can resolve a single broken path without going through the full HTML scan.
Methods inherited from BaseService
#initialize, #log_debug, #log_error, #log_info, #log_warning, #logger, #tagged_logger
Constructor Details
This class inherits a constructor from BaseService
Class Method Details
.upsert_editorial_links!(article) ⇒ Object
Upsert editorial link graph entries for an article's content.
Extracts internal links from the article's HTML and writes them to SiteMapLink,
making the link graph immediately accurate without waiting for the nightly crawler.
59 60 61 62 63 64 65 66 67 68 69 70 |
# File 'app/services/seo/internal_link_validator.rb', line 59 def self.upsert_editorial_links!(article) html = article.solution return if html.blank? from_site_map = article.site_maps.find_by(locale: 'en-US') || article.site_maps.first return unless from_site_map links = new.extract_editorial_link_data(html) return if links.empty? SiteMapLink.upsert_for_page!(from_site_map, links) end |
Instance Method Details
#extract_editorial_link_data(html) ⇒ Array<Hash>
Extract editorial link data from HTML in the format SiteMapLink.upsert_for_page! expects.
76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 |
# File 'app/services/seo/internal_link_validator.rb', line 76 def extract_editorial_link_data(html) return [] if html.blank? resolved = normalize_liquid_locale(html).gsub(/\{\{\s*locale\s*\}\}/, 'en-US') doc = Nokogiri::HTML::DocumentFragment.parse(resolved) links = [] seen = Set.new doc.css('a[href]').each do |anchor| href = anchor['href'].to_s.strip next if href.blank? path = extract_path(href) next if path.blank? || path == '/' uri = begin Addressable::URI.parse(href) rescue StandardError nil end next unless uri next unless uri.host.nil? ? href.start_with?('/') : uri.host =~ WY_HOSTNAME_PATTERN key = "#{path}|editorial" next if seen.include?(key) seen << key links << { to_path: path, anchor_text: anchor.text.strip.truncate(255), link_type: 'editorial', context_snippet: anchor.ancestors('p, li, td, div').first&.text.to_s.squish.truncate(200) } end links end |
#internal_link_paths(html) ⇒ Set<String>
Set of locale-stripped internal link paths present in HTML. Callers use this
to tell whether a broken link was introduced by an edit or carried over
unchanged from the prior body — pre-existing broken links should not block
an otherwise-valid save.
184 185 186 187 188 189 190 191 |
# File 'app/services/seo/internal_link_validator.rb', line 184 def internal_link_paths(html) return Set.new if html.blank? extract_internal_links(html) .filter_map { |href| extract_path(href) } .reject { |path| path.blank? || path == '/' } .to_set end |
#process(html, skip_http_ping: false) ⇒ Result
Validate all internal WarmlyYours links in the given HTML.
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 |
# File 'app/services/seo/internal_link_validator.rb', line 120 def process(html, skip_http_ping: false) return Result.new if html.blank? hrefs = extract_internal_links(html) return Result.new if hrefs.empty? broken = [] semantic_budget = MAX_SEMANTIC_FALLBACKS_PER_PROCESS hrefs.each do |href| path = extract_path(href) next if path.blank? || path == '/' next if resolve_via_db(path) next if !skip_http_ping && resolve_via_http(path) candidates = suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: semantic_budget.positive?) semantic_budget -= 1 if candidates.any? && semantic_budget.positive? broken << BrokenLink.new( href: href, path: path, suggestion: candidates.first, did_you_mean: candidates ) end Result.new(broken_links: broken, checked_count: hrefs.size) end |
#suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true) ⇒ Object
Public so callers (e.g. content validators, ad-hoc rake tasks) can resolve
a single broken path without going through the full HTML scan.
150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 |
# File 'app/services/seo/internal_link_validator.rb', line 150 def suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true) candidates = [] legacy = Seo::BrokenLinkRedirectMap.lookup(path) candidates << legacy if legacy if (match = path.match(POST_PATH_PATTERN)) post_match = suggest_post_correction(match[1]) candidates << post_match if post_match end trigram_match = SiteMap.active.similar_path(path).limit(1).pick(:path) if trigram_match guess_words = significant_words(path.split('/').last.to_s.tr('-', ' ')) candidates << trigram_match if sufficient_overlap?(guess_words, trigram_match.split('/').last.to_s.tr('-', ' ')) end candidates.concat(semantic_candidates(path, limit: limit)) if allow_semantic && candidates.size < limit # Never echo the broken path back as its own suggestion. Semantic search # can surface an archived page's own embedding (e.g. /trade/instant-quote, # archived with a 301), producing a useless "did you mean /trade/instant-quote?" # that stalls the model in a retry loop (conv 3413). Drop self-references. normalized_input = path.to_s.chomp('/').downcase candidates.compact.uniq.reject { |c| c.to_s.chomp('/').downcase == normalized_input }.first(limit) end |