Blog Schema Extraction

This system extracts structured data (schema markup) from blog articles and saves it to the database. It supports FAQ and HowTo schema types, and provides both manual and automated extraction capabilities with comprehensive reporting.

The blog schema extraction system consists of:

  • AutoBlogSchemaExtractionWorker: Core worker that processes articles directly and provides detailed reporting
  • BlogSchemaExtractor: Service that performs the actual AI-powered schema extraction
  • rake blog:schema:extract: Single rake task for all extraction operations
  • Scheduled Jobs: Daily automated extraction of eligible articles

All rake tasks use environment variables for parameters. Do not use positional arguments.

The main extraction task processes articles directly and provides comprehensive terminal reporting:

# Extract schema for all eligible articles (auto mode)
rake blog:schema:extract
# Extract schema for up to 10 eligible articles
LIMIT=10 rake blog:schema:extract
# Extract schema for specific articles by ID
ARTICLE_IDS=1,2,3 rake blog:schema:extract
# Extract schema for up to 5 specific articles, forcing update even if schema exists
ARTICLE_IDS=1,2,3 LIMIT=5 FORCE_UPDATE=true rake blog:schema:extract
# Process articles and send email report
LIMIT=10 SEND_EMAIL=true rake blog:schema:extract

Variable       Description                                      Default
ARTICLE_IDS    Comma-separated list of article IDs to process   All eligible articles
LIMIT          Maximum number of articles to process            50 (auto mode), no limit (specific IDs)
FORCE_UPDATE   Re-extract schemas even if they exist            false
SEND_EMAIL     Send email report when complete                  false
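
As a rough illustration of how these variables and their defaults might be interpreted, the sketch below parses them into an options hash. The method name `parse_extraction_options`, the constant `DEFAULT_AUTO_LIMIT`, and the hash keys are assumptions made for this example; the rake task's actual internals are not shown in this document. Treating only the literal string "true" as enabling a flag is also an assumption.

```ruby
# Hypothetical helper, not the real rake task code.
DEFAULT_AUTO_LIMIT = 50 # documented default for auto mode

def parse_extraction_options(env)
  # "1, 2,3" => [1, 2, 3]; a missing or blank value means auto mode.
  ids = env.fetch("ARTICLE_IDS", "")
           .split(",")
           .map(&:strip)
           .reject(&:empty?)
           .map { |id| Integer(id) }

  raw_limit = env["LIMIT"].to_s.strip
  limit = raw_limit.empty? ? nil : Integer(raw_limit)
  # Auto mode falls back to 50; specific IDs have no default limit.
  limit = DEFAULT_AUTO_LIMIT if limit.nil? && ids.empty?

  {
    article_ids: ids,
    limit: limit,
    # Assumption: only the literal string "true" turns a flag on.
    force_update: env["FORCE_UPDATE"] == "true",
    send_email: env["SEND_EMAIL"] == "true"
  }
end
```

In a rake task this could be called as `parse_extraction_options(ENV)`, since `ENV` responds to both `fetch` and `[]`.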

The rake task provides real-time feedback:

Starting comprehensive schema extraction...
Article IDs: All eligible articles
Limit: 2
Force Update: false
Send Email Report: false
Processing 2 articles...
[1/2] Processing article 88: Radiant Floor Heating Transforms A Cold Chicago Basement... ✅ Extracted 1 schemas (HowTo) - high confidence
[2/2] Processing article 86: Commercial Snow Melting For A Local Business In Lincolnshire, IL... ✅ Extracted 0 schemas () - none confidence
================================================================================
EXTRACTION SUMMARY
================================================================================
Total articles processed: 2
Successfully processed: 2
Errors: 0
Success rate: 100.0%
SUCCESSFULLY PROCESSED ARTICLES:
--------------------------------------------------
88: Radiant Floor Heating Transforms A Cold Chicago Basement
Status: Extracted 1 schemas
Confidence: high
Schema Types: HowTo
CRM: https://crm.warmlyyours.me:3000/posts/88
86: Commercial Snow Melting For A Local Business In Lincolnshire, IL
Status: Extracted 0 schemas
Confidence: none
Schema Types:
CRM: https://crm.warmlyyours.me:3000/posts/86
================================================================================
Extraction completed!
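
The statistics in the summary block above can be derived from a list of per-article results. The sketch below is one way to do that; the `Result` struct and both method names are hypothetical, since the worker's real result shape is not documented here. As in the sample output, an article that yields zero schemas still counts as successfully processed.

```ruby
# Hypothetical per-article result record (an assumption for illustration).
Result = Struct.new(:article_id, :title, :schema_types, :confidence, :error,
                    keyword_init: true)

# Compute the counts and success rate shown in the summary.
def summarize(results)
  total = results.size
  errors = results.count { |r| r.error }
  successes = total - errors
  rate = total.zero? ? 0.0 : (successes * 100.0 / total).round(1)
  { total: total, successes: successes, errors: errors, success_rate: rate }
end

# Render the four statistic lines in the same wording as the sample output.
def summary_lines(results)
  s = summarize(results)
  [
    "Total articles processed: #{s[:total]}",
    "Successfully processed: #{s[:successes]}",
    "Errors: #{s[:errors]}",
    "Success rate: #{s[:success_rate]}%"
  ]
end
```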

When SEND_EMAIL=true is set, the system sends a detailed email report to the heatwave team including:

  • Processing statistics (total, successful, errors)
  • Detailed results for each processed article
  • Confidence levels and reasoning
  • Schema types extracted
  • Direct CRM links to view articles
  • Error details for failed extractions

The system includes a daily scheduled job that automatically processes eligible articles:

  • Schedule: Daily at 2:00 AM Chicago time
  • Worker: AutoBlogSchemaExtractionWorker
  • Queue: api_heavy
  • Default Limit: 50 articles per run
  • Email Reports: Automatically sent after each run

Articles are considered eligible for automatic extraction if they:

  1. Are blog posts (type = Post)
  2. Are published or scheduled
  3. Have no schema markup (or empty schema markup)
  4. Have never been checked for schema extraction (schema_extracted_at is nil)
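
The eligibility rules above could be expressed as a single predicate. This is a minimal sketch using a plain struct rather than the real model: the `ArticleStub` name, the `status` attribute, and its "published" and "scheduled" values are assumptions; only `type = Post`, `schema_markup`, and `schema_extracted_at` come from this document.

```ruby
# Hypothetical stand-in for the article model.
ArticleStub = Struct.new(:type, :status, :schema_markup, :schema_extracted_at,
                         keyword_init: true)

# Assumed status values; the real column and values may differ.
ELIGIBLE_STATUSES = %w[published scheduled].freeze

def eligible_for_extraction?(article)
  markup_missing = article.schema_markup.nil? || article.schema_markup.empty?

  article.type == "Post" &&
    ELIGIBLE_STATUSES.include?(article.status) &&
    markup_missing &&
    article.schema_extracted_at.nil?
end
```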

The system currently supports extraction of:

  • FAQ (Frequently Asked Questions): For content with questions and answers
  • HowTo (Step-by-step instructions): For procedures, tutorials, and guides

Note: Article and BlogPosting schemas are handled separately by the system.
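
For orientation, the two supported types correspond to the schema.org `FAQPage` and `HowTo` vocabularies. The sketch below builds minimal JSON-LD for each; the builder method names are hypothetical, and the exact shape the extractor stores in the database is not specified in this document, so treat this as an illustration of the vocabulary only.

```ruby
require "json"

# Build a minimal schema.org FAQPage from [question, answer] pairs.
def faq_page_schema(pairs)
  {
    "@context" => "https://schema.org",
    "@type" => "FAQPage",
    "mainEntity" => pairs.map { |question, answer|
      {
        "@type" => "Question",
        "name" => question,
        "acceptedAnswer" => { "@type" => "Answer", "text" => answer }
      }
    }
  }
end

# Build a minimal schema.org HowTo from a title and ordered step texts.
def how_to_schema(name, steps)
  {
    "@context" => "https://schema.org",
    "@type" => "HowTo",
    "name" => name,
    "step" => steps.map { |text| { "@type" => "HowToStep", "text" => text } }
  }
end

# Placeholder content, not taken from any real article.
puts JSON.pretty_generate(how_to_schema("Example procedure", ["First step", "Second step"]))
```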

Key features:

  • Direct Processing: Articles are processed synchronously for immediate feedback
  • AI-Powered: Uses OpenAI GPT-4 for intelligent schema identification and extraction
  • Cache Management: Automatically purges edge cache after schema updates
  • Error Handling: Comprehensive error handling with detailed reporting
  • Rate Limiting: Built-in delays between API calls to respect rate limits

Extraction state is stored in three database columns:

  • schema_markup: JSONB column storing extracted schema markup
  • schema_extracted_at: Timestamp of last extraction attempt
  • schema_extraction_attempts: Number of extraction attempts

Performance considerations:

  • Batch Processing: Configurable limits to control processing load
  • Selective Updates: Only processes articles that need extraction (unless FORCE_UPDATE=true)
  • Efficient Queries: Optimized database queries for finding eligible articles
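
To make the column semantics and the selective-update behavior above concrete, here is a small in-memory sketch. `ArticleRow`, `select_batch`, and `record_extraction!` are hypothetical names, and a real implementation would presumably use database queries rather than array filtering. Keeping existing markup when the extractor returns nothing is an assumption, not documented behavior.

```ruby
# Hypothetical in-memory stand-in for an article row; only the three
# documented columns plus an id are modeled.
ArticleRow = Struct.new(:id, :schema_markup, :schema_extracted_at,
                        :schema_extraction_attempts, keyword_init: true)

# Selective updates: skip rows that already have markup unless forced,
# then apply the batch limit (nil means no limit).
def select_batch(rows, limit: nil, force_update: false)
  pending = rows.select do |row|
    force_update || row.schema_markup.nil? || row.schema_markup.empty?
  end
  limit ? pending.first(limit) : pending
end

# Record one extraction attempt: bump the counter and timestamp every time.
# Assumption: existing markup is kept when the extractor returns nothing.
def record_extraction!(row, schemas, now: Time.now)
  row.schema_extraction_attempts = row.schema_extraction_attempts.to_i + 1
  row.schema_extracted_at = now
  row.schema_markup = schemas unless schemas.empty?
  row
end
```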

Common issues and what to check:

  1. No articles found: Check if articles meet eligibility criteria
  2. API rate limits: System includes automatic retry logic with exponential backoff
  3. Extraction failures: Check article content quality and AI service availability
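
The retry behavior mentioned for API rate limits follows the general exponential backoff pattern, sketched below. This is not the system's actual retry code: `RateLimitError`, `with_backoff`, the attempt count, and the base delay are all assumptions chosen for illustration. The sleep call is injectable so the logic can be exercised without waiting.

```ruby
# Hypothetical error class standing in for a rate-limit response.
class RateLimitError < StandardError; end

# Retry the block on RateLimitError, doubling the delay after each failure
# (base_delay, 2x, 4x, ...). Re-raises once max_attempts is reached.
def with_backoff(max_attempts: 4, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue RateLimitError
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A caller would wrap the API request in the block, for example `with_backoff { call_extraction_api }`, where `call_extraction_api` is a placeholder for whatever performs the request.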

To monitor extraction runs:

  • Check Sidekiq dashboard for job status
  • Review email reports for processing results
  • Monitor logs for detailed error information

Notes:

  • All tasks use environment variables for input. Do not use positional arguments.
  • The system automatically handles edge cache purging after schema updates.
  • Email reports are sent to heatwaveteam@warmlyyours.com.
  • The schema_extracted_at column is automatically timestamped whenever schema extraction occurs.
  • CRM links in reports point to /posts/<id> for blog post management.