Class: Seo::InternalLinkValidator

Inherits:
BaseService
Defined in:
app/services/seo/internal_link_validator.rb

Overview

Validates that internal WarmlyYours links in HTML content point to real pages.

Three-tier validation strategy:
Tier 1 (DB lookup, fast): SiteMap path lookup (active OR archived-but-200) + Post/slug check
Tier 2 (HTTP ping, resilient): HEAD request with retry for pages not yet in the sitemap
Tier 3 (Suggestion): Legacy URL pattern resolution via BrokenLinkRedirectMap (DB-backed)

Also provides editorial link extraction for populating the SiteMapLink graph
at save time rather than waiting for the nightly crawler.

Usage:
result = Seo::InternalLinkValidator.new.process(html)
result.valid? # => true/false
result.broken_links # => [{ href: "...", path: "...", suggestion: "..." }, ...]

Populate link graph after a successful save:

Seo::InternalLinkValidator.upsert_editorial_links!(article)
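
The tier ordering described in the overview can be pictured as a chain of checks that stops at the first success. The following is a simplified sketch, not the class's implementation: KNOWN_PATHS, LIVE_PATHS, LEGACY_REDIRECTS and check_path are hypothetical in-memory stand-ins for the SiteMap lookup, the HEAD request and BrokenLinkRedirectMap.

```ruby
require 'set'

# Hypothetical stand-ins for the real data sources (assumptions for illustration).
KNOWN_PATHS      = Set['/floor-heating', '/posts/heated-floors'].freeze  # Tier 1: DB lookup
LIVE_PATHS       = Set['/new-landing-page'].freeze                        # Tier 2: answers an HTTP ping
LEGACY_REDIRECTS = { '/old-floor-heating' => '/floor-heating' }.freeze    # Tier 3: suggestion source

# Returns [:ok, nil] when the path resolves, or [:broken, suggestion_or_nil].
def check_path(path, skip_http_ping: false)
  return [:ok, nil] if KNOWN_PATHS.include?(path)                    # Tier 1
  return [:ok, nil] if !skip_http_ping && LIVE_PATHS.include?(path)  # Tier 2
  [:broken, LEGACY_REDIRECTS[path]]                                  # Tier 3
end

p check_path('/floor-heating')                           # => [:ok, nil]
p check_path('/new-landing-page')                        # => [:ok, nil]
p check_path('/new-landing-page', skip_http_ping: true)  # => [:broken, nil]
p check_path('/old-floor-heating')                       # => [:broken, "/floor-heating"]
```

Ordering the cheap lookup first means the network call is only paid for paths the database does not already know about.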

Defined Under Namespace

Classes: BrokenLink, Result

Constant Summary

WY_HOSTNAME_PATTERN =

Matches hostnames that begin with www.warmlyyours. (case-insensitive), under any top-level domain.

/\Awww\.warmlyyours\./i
LOCALE_PATTERN =

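Matches a leading locale path segment: a full locale such as /en-US, a two-letter language code followed by a slash, or a {{ locale }} Liquid placeholder (raw or URL-encoded).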

%r{^/([a-z]{2}-[A-Z]{2}|[a-z]{2}(?=/)|%7B%7B\s*locale\s*%7D%7D|\{\{\s*locale\s*\}\})}
POST_PATH_PATTERN =

Matches a path that starts with /posts/ and captures the slug (everything up to the next /, ? or #).

%r{\A/posts/([^/?#]+)}
WEB_BASE =

Base URL of the WarmlyYours website.

'https://www.warmlyyours.com'
MAX_SEMANTIC_FALLBACKS_PER_PROCESS =

Cap: validating an HTML body can surface several broken links at once.
Embedding lookups call OpenAI per query (~200ms), so only resolve the
first few via semantic fallback — the cheap tiers still cover the rest.

3
CANDIDATE_LIMIT =

Default maximum number of suggestions returned per broken link.

3
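
To make the three patterns concrete, the sketch below copies them verbatim and runs them against a few sample inputs. The strip_locale helper and the sample hostnames and paths are hypothetical; how the class itself applies these constants is not shown on this page.

```ruby
# Patterns copied from the constant summary above.
WY_HOSTNAME_PATTERN = /\Awww\.warmlyyours\./i
LOCALE_PATTERN = %r{^/([a-z]{2}-[A-Z]{2}|[a-z]{2}(?=/)|%7B%7B\s*locale\s*%7D%7D|\{\{\s*locale\s*\}\})}
POST_PATH_PATTERN = %r{\A/posts/([^/?#]+)}

# Hypothetical helper (not part of the class): drop a leading locale segment.
def strip_locale(path)
  path.sub(LOCALE_PATTERN, '')
end

puts strip_locale('/en-US/floor-heating')         # /floor-heating
puts strip_locale('/fr/posts/x')                  # /posts/x
puts strip_locale('/{{ locale }}/posts/x')        # /posts/x
puts strip_locale('/%7B%7Blocale%7D%7D/posts/x')  # /posts/x
puts strip_locale('/floor-heating')               # /floor-heating (no locale, unchanged)

p WY_HOSTNAME_PATTERN.match?('WWW.WarmlyYours.ca')    # => true
p WY_HOSTNAME_PATTERN.match?('blog.warmlyyours.com')  # => false

# The slug capture stops at the query string.
p '/posts/heated-floors?ref=nav'[POST_PATH_PATTERN, 1]  # => "heated-floors"
# Anchored with \A, so a locale-prefixed path does not match.
p '/en-US/posts/heated-floors'[POST_PATH_PATTERN, 1]    # => nil
```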

Instance Attribute Summary

Attributes inherited from BaseService

#options

Class Method Summary

Instance Method Summary

Methods inherited from BaseService

#initialize, #log_debug, #log_error, #log_info, #log_warning, #logger, #tagged_logger

Constructor Details

This class inherits a constructor from BaseService

Class Method Details

.upsert_editorial_links!(article) ⇒ Object

Upsert editorial link graph entries for an article's content.
Extracts internal links from the article's HTML and writes them to SiteMapLink,
making the link graph immediately accurate without waiting for the nightly crawler.

Parameters:

  • article (Article)

    the article whose content to extract links from



# File 'app/services/seo/internal_link_validator.rb', line 59

def self.upsert_editorial_links!(article)
  html = article.solution
  return if html.blank?

  from_site_map = article.site_maps.find_by(locale: 'en-US') || article.site_maps.first
  return unless from_site_map

  links = new.extract_editorial_link_data(html)
  return if links.empty?

  SiteMapLink.upsert_for_page!(from_site_map, links)
end

Instance Method Details

#extract_editorial_link_data(html) ⇒ Array<Hash>

Extract editorial link data from HTML in the format SiteMapLink.upsert_for_page! expects.

Parameters:

  • html (String)

    HTML content (may contain {{ locale }} placeholders)

Returns:

  • (Array<Hash>)

    each hash: { to_path:, anchor_text:, link_type:, context_snippet: }



# File 'app/services/seo/internal_link_validator.rb', line 76

def extract_editorial_link_data(html)
  return [] if html.blank?

  resolved = normalize_liquid_locale(html).gsub(/\{\{\s*locale\s*\}\}/, 'en-US')
  doc = Nokogiri::HTML::DocumentFragment.parse(resolved)
  links = []
  seen = Set.new

  doc.css('a[href]').each do |anchor|
    href = anchor['href'].to_s.strip
    next if href.blank?

    path = extract_path(href)
    next if path.blank? || path == '/'

    uri = begin
      Addressable::URI.parse(href)
    rescue StandardError
      nil
    end
    next unless uri
    next unless uri.host.nil? ? href.start_with?('/') : uri.host =~ WY_HOSTNAME_PATTERN

    key = "#{path}|editorial"
    next if seen.include?(key)

    seen << key

    links << {
      to_path: path,
      anchor_text: anchor.text.strip.truncate(255),
      link_type: 'editorial',
      context_snippet: anchor.ancestors('p, li, td, div').first&.text.to_s.squish.truncate(200)
    }
  end

  links
end
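
The host check in the method above treats an href as internal when it is either a root-relative path or an absolute URL on a www.warmlyyours.* host. A minimal sketch of that rule follows, assuming the standard library's URI in place of Addressable so that it runs without extra gems (the two parsers may differ on malformed input). The internal_href? helper is hypothetical and not part of the class.

```ruby
require 'uri'

WY_HOSTNAME_PATTERN = /\Awww\.warmlyyours\./i # copied from the constants above

# Hypothetical helper mirroring the host check in extract_editorial_link_data.
def internal_href?(href)
  uri = URI.parse(href)
  # No host: only root-relative paths count (not "#top", "mailto:", etc.).
  return href.start_with?('/') if uri.host.nil?

  WY_HOSTNAME_PATTERN.match?(uri.host)
rescue URI::InvalidURIError
  false # unparseable hrefs are skipped
end

p internal_href?('/en-US/floor-heating')                      # => true
p internal_href?('https://www.warmlyyours.com/floor-heating') # => true
p internal_href?('https://example.com/floor-heating')         # => false
p internal_href?('#top')                                      # => false
p internal_href?('not a url')                                 # => false
```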

#internal_link_paths(html) ⇒ Set<String>

Set of locale-stripped internal link paths present in HTML. Callers use this
to tell whether a broken link was introduced by an edit or carried over
unchanged from the prior body — pre-existing broken links should not block
an otherwise-valid save.

Parameters:

  • html (String)

    HTML content (may contain {{ locale }} placeholders)

Returns:

  • (Set<String>)

    distinct internal paths (locale + host stripped)



# File 'app/services/seo/internal_link_validator.rb', line 184

def internal_link_paths(html)
  return Set.new if html.blank?

  extract_internal_links(html)
    .filter_map { |href| extract_path(href) }
    .reject { |path| path.blank? || path == '/' }
    .to_set
end
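
The "introduced by an edit versus carried over" distinction described for this method comes down to set arithmetic on the returned paths. The sketch below shows one way a caller might do this. The partition_broken helper and the sample paths are hypothetical; the real callers are not shown on this page.

```ruby
require 'set'

# Hypothetical caller-side helper. Each argument is a Set of paths:
#   previous_paths - internal_link_paths(old_html)
#   updated_paths  - internal_link_paths(new_html)
#   broken_paths   - paths reported broken for the new body
def partition_broken(previous_paths, updated_paths, broken_paths)
  introduced = updated_paths - previous_paths # links added by this edit
  {
    blocking: broken_paths & introduced,      # newly added and broken
    pre_existing: broken_paths - introduced   # carried over, should not block the save
  }
end

result = partition_broken(
  Set['/floor-heating', '/old-broken-page'],
  Set['/floor-heating', '/old-broken-page', '/typo-pgae'],
  Set['/old-broken-page', '/typo-pgae']
)

p result[:blocking].to_a      # => ["/typo-pgae"]
p result[:pre_existing].to_a  # => ["/old-broken-page"]
```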

#process(html, skip_http_ping: false) ⇒ Result

Validate all internal WarmlyYours links in the given HTML.

Parameters:

  • html (String)

    HTML content (may contain {{ locale }} placeholders)

  • skip_http_ping (Boolean) (defaults to: false)

    Skip Tier 2 HTTP fallback (for tests/bulk scans)

Returns:

  • (Result)

    the validation outcome, carrying broken_links and checked_count



# File 'app/services/seo/internal_link_validator.rb', line 120

def process(html, skip_http_ping: false)
  return Result.new if html.blank?

  hrefs = extract_internal_links(html)
  return Result.new if hrefs.empty?

  broken = []
  semantic_budget = MAX_SEMANTIC_FALLBACKS_PER_PROCESS
  hrefs.each do |href|
    path = extract_path(href)
    next if path.blank? || path == '/'

    next if resolve_via_db(path)
    next if !skip_http_ping && resolve_via_http(path)

    candidates = suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: semantic_budget.positive?)
    semantic_budget -= 1 if candidates.any? && semantic_budget.positive?
    broken << BrokenLink.new(
      href: href,
      path: path,
      suggestion: candidates.first,
      did_you_mean: candidates
    )
  end

  Result.new(broken_links: broken, checked_count: hrefs.size)
end
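
One detail of the loop above is easy to miss: the semantic budget is only decremented when a broken link actually receives candidates, so a link with no suggestions does not use up a slot. The sketch below isolates that bookkeeping. The plan_semantic_usage helper is hypothetical, and the block passed to it stands in for suggest_candidates.

```ruby
MAX_SEMANTIC_FALLBACKS_PER_PROCESS = 3 # copied from the constants above

# Hypothetical helper: for each broken path, report whether semantic lookup
# would have been allowed. The block plays the role of suggest_candidates.
def plan_semantic_usage(broken_paths, budget: MAX_SEMANTIC_FALLBACKS_PER_PROCESS)
  broken_paths.map do |path|
    allow_semantic = budget.positive?
    candidates = yield(path, allow_semantic)
    # Same rule as in #process: spend budget only when candidates came back.
    budget -= 1 if candidates.any? && budget.positive?
    [path, allow_semantic]
  end
end

# '/b' yields no candidates, so it does not consume a slot.
plan = plan_semantic_usage(%w[/a /b /c /d /e]) do |path, _allow_semantic|
  path == '/b' ? [] : ["#{path}-fixed"]
end

p plan
# => [["/a", true], ["/b", true], ["/c", true], ["/d", true], ["/e", false]]
```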

#suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true) ⇒ Object

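Returns up to limit suggested replacements for a broken path, drawing in order on the
legacy redirect map, post slug correction, trigram path similarity and, when
allow_semantic is true and fewer than limit candidates were found, semantic search.
Public so callers (e.g. content validators, ad-hoc rake tasks) can resolve
a single broken path without going through the full HTML scan.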



# File 'app/services/seo/internal_link_validator.rb', line 150

def suggest_candidates(path, limit: CANDIDATE_LIMIT, allow_semantic: true)
  candidates = []

  legacy = Seo::BrokenLinkRedirectMap.lookup(path)
  candidates << legacy if legacy

  if (match = path.match(POST_PATH_PATTERN))
    post_match = suggest_post_correction(match[1])
    candidates << post_match if post_match
  end

  trigram_match = SiteMap.active.similar_path(path).limit(1).pick(:path)
  if trigram_match
    guess_words = significant_words(path.split('/').last.to_s.tr('-', ' '))
    candidates << trigram_match if sufficient_overlap?(guess_words, trigram_match.split('/').last.to_s.tr('-', ' '))
  end

  candidates.concat(semantic_candidates(path, limit: limit)) if allow_semantic && candidates.size < limit

  # Never echo the broken path back as its own suggestion. Semantic search
  # can surface an archived page's own embedding (e.g. /trade/instant-quote,
  # archived with a 301), producing a useless "did you mean /trade/instant-quote?"
  # that stalls the model in a retry loop (conv 3413). Drop self-references.
  normalized_input = path.to_s.chomp('/').downcase
  candidates.compact.uniq.reject { |c| c.to_s.chomp('/').downcase == normalized_input }.first(limit)
end
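
The final two lines of the method (normalise, de-duplicate, drop self-references, cap at limit) can be exercised on their own. The sketch below lifts that logic into a hypothetical drop_self_references helper. The sample candidates are made up for illustration.

```ruby
# Hypothetical standalone version of the last step of suggest_candidates.
def drop_self_references(path, candidates, limit: 3)
  # Compare case-insensitively and ignore a single trailing slash.
  normalized_input = path.to_s.chomp('/').downcase
  candidates
    .compact
    .uniq
    .reject { |c| c.to_s.chomp('/').downcase == normalized_input }
    .first(limit)
end

candidates = ['/trade/instant-quote', nil, '/trade/quote', '/trade/quote', '/a', '/b', '/c']

p drop_self_references('/Trade/Instant-Quote/', candidates)
# => ["/trade/quote", "/a", "/b"]
```

Because the comparison is made on a normalised copy, the surviving candidates keep their original casing and trailing slashes.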