Class: Seo::InternalLinkValidator

Inherits:
BaseService show all
Defined in:
app/services/seo/internal_link_validator.rb

Overview

Validates that internal WarmlyYours links in HTML content point to real pages.

Three-tier validation strategy:
Tier 1 (DB lookup, fast): SiteMap path lookup (active OR archived-but-200) + Post/slug check
Tier 2 (HTTP ping, resilient): HEAD request with retry for pages not yet in the sitemap
Tier 3 (Suggestion): Legacy URL pattern resolution via BrokenLinkRedirectMap (DB-backed)

Also provides editorial link extraction for populating the SiteMapLink graph
at save time rather than waiting for the nightly crawler.

Usage:
result = Seo::InternalLinkValidator.new.process(html)
result.valid? # => true/false
result.broken_links # => [{ href: "...", path: "...", suggestion: "..." }, ...]

Populate link graph after a successful save:

Seo::InternalLinkValidator.upsert_editorial_links!(article)

Defined Under Namespace

Classes: BrokenLink, Result

Constant Summary collapse

WY_HOSTNAME_PATTERN =
/\Awww\.warmlyyours\./i
LOCALE_PATTERN =
%r{^/([a-z]{2}-[A-Z]{2}|[a-z]{2}(?=/)|%7B%7B\s*locale\s*%7D%7D|\{\{[\s]*locale[\s]*\}\})}
POST_PATH_PATTERN =
%r{\A/posts/([^/?#]+)}
WEB_BASE =
'https://www.warmlyyours.com'
MAX_SEMANTIC_FALLBACKS_PER_PROCESS =

Cap: validating an HTML body can surface several broken links at once.
Embedding lookups call OpenAI per query (~200ms), so only resolve the
first few via semantic fallback — the cheap tiers still cover the rest.

3
CANDIDATE_LIMIT =
3

Class Method Summary collapse

Instance Method Summary collapse

Methods inherited from BaseService

#initialize, #log_debug, #log_error, #log_info, #log_warning, #logger, #options, #tagged_logger

Constructor Details

This class inherits a constructor from BaseService

Class Method Details

.upsert_editorial_links!(article) ⇒ Object

Upsert editorial link graph entries for an article's content.
Extracts internal links from the article's HTML and writes them to SiteMapLink,
making the link graph immediately accurate without waiting for the nightly crawler.

Parameters:

  • article (Article)

    the article whose content to extract links from



53
54
55
56
57
58
59
60
61
62
63
64
# File 'app/services/seo/internal_link_validator.rb', line 53

def self.upsert_editorial_links!(article)
  html = article.solution
  return if html.blank?

  from_site_map = article.site_maps.find_by(locale: 'en-US') || article.site_maps.first
  return unless from_site_map

  links = new.extract_editorial_link_data(html)
  return if links.empty?

  SiteMapLink.upsert_for_page!(from_site_map, links)
end

Instance Method Details

Extract editorial link data from HTML in the format SiteMapLink.upsert_for_page! expects.

Parameters:

  • html (String)

    HTML content (may contain {locale} placeholders)

Returns:

  • (Array<Hash>)

    each hash: { to_path:, anchor_text:, link_type:, context_snippet: }



70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
# File 'app/services/seo/internal_link_validator.rb', line 70

def extract_editorial_link_data(html)
  return [] if html.blank?

  resolved = normalize_liquid_locale(html).gsub(/\{\{[\s]*locale[\s]*\}\}/, 'en-US')
  doc = Nokogiri::HTML::DocumentFragment.parse(resolved)
  links = []
  seen = Set.new

  doc.css('a[href]').each do |anchor|
    href = anchor['href'].to_s.strip
    next if href.blank?

    path = extract_path(href)
    next if path.blank? || path == '/'

    uri = Addressable::URI.parse(href) rescue nil
    next unless uri
    next unless uri.host.nil? ? href.start_with?('/') : uri.host =~ WY_HOSTNAME_PATTERN

    key = "#{path}|editorial"
    next if seen.include?(key)

    seen << key

    links << {
      to_path: path,
      anchor_text: anchor.text.strip.truncate(255),
      link_type: 'editorial',
      context_snippet: anchor.ancestors('p, li, td, div').first&.text.to_s.squish.truncate(200)
    }
  end

  links
end

#process(html, skip_http_ping: false) ⇒ Result

Validate all internal WarmlyYours links in the given HTML.

Parameters:

  • html (String)

    HTML content (may contain {locale} placeholders)

  • skip_http_ping (Boolean) (defaults to: false)

    Skip Tier 2 HTTP fallback (for tests/bulk scans)

Returns:



110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
# File 'app/services/seo/internal_link_validator.rb', line 110

def process(html, skip_http_ping: false)
  return Result.new if html.blank?

  hrefs = extract_internal_links(html)
  return Result.new if hrefs.empty?

  broken = []
  semantic_budget = MAX_SEMANTIC_FALLBACKS_PER_PROCESS
  hrefs.each do |href|
    path = extract_path(href)
    next if path.blank? || path == '/'

    next if resolve_via_db(path)
    next if !skip_http_ping && resolve_via_http(path)

    candidates = suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: semantic_budget.positive?)
    semantic_budget -= 1 if candidates.any? && semantic_budget.positive?
    broken << BrokenLink.new(
      href: href,
      path: path,
      suggestion: candidates.first,
      did_you_mean: candidates
    )
  end

  Result.new(broken_links: broken, checked_count: hrefs.size)
end

#suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true) ⇒ Object

Public so callers (e.g. content validators, ad-hoc rake tasks) can resolve
a single broken path without going through the full HTML scan.



140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
# File 'app/services/seo/internal_link_validator.rb', line 140

def suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true)
  candidates = []

  legacy = Seo::BrokenLinkRedirectMap.lookup(path)
  candidates << legacy if legacy

  if (match = path.match(POST_PATH_PATTERN))
    post_match = suggest_post_correction(match[1])
    candidates << post_match if post_match
  end

  trigram_match = SiteMap.active.similar_path(path).limit(1).pick(:path)
  if trigram_match
    guess_words = significant_words(path.split('/').last.to_s.tr('-', ' '))
    candidates << trigram_match if sufficient_overlap?(guess_words, trigram_match.split('/').last.to_s.tr('-', ' '))
  end

  candidates.concat(semantic_candidates(path, limit: limit)) if allow_semantic && candidates.size < limit

  candidates.compact.uniq.first(limit)
end