Blog Schema Extraction

This system extracts structured data (schema markup) from blog articles and saves it to the database. It supports the FAQ and HowTo schema types and can be run manually or on a daily schedule, with a report generated for each run.

Overview

The blog schema extraction system consists of:

  • AutoBlogSchemaExtractionWorker: Core worker that processes articles directly and provides detailed reporting
  • BlogSchemaExtractor: Service that performs the actual AI-powered schema extraction
  • rake blog:schema:extract: Single rake task for all extraction operations
  • Scheduled Jobs: Daily automated extraction of eligible articles

Usage

All rake tasks use environment variables for parameters. Do not use positional arguments.

Extract schema for articles

The extraction task processes articles directly and prints a per-article report to the terminal:

# Extract schema for all eligible articles (auto mode)
rake blog:schema:extract

# Extract schema for up to 10 eligible articles
LIMIT=10 rake blog:schema:extract

# Extract schema for specific articles by ID
ARTICLE_IDS=1,2,3 rake blog:schema:extract

# Extract schema for up to 5 specific articles, forcing update even if schema exists
ARTICLE_IDS=1,2,3 LIMIT=5 FORCE_UPDATE=true rake blog:schema:extract

# Process articles and send email report
LIMIT=10 SEND_EMAIL=true rake blog:schema:extract

Environment Variables

Variable     | Description                                    | Default
ARTICLE_IDS  | Comma-separated list of article IDs to process | All eligible articles
LIMIT        | Maximum number of articles to process          | 50 (auto mode), no limit (specific IDs)
FORCE_UPDATE | Re-extract schemas even if they exist          | false
SEND_EMAIL   | Send email report when complete                | false
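
As a rough illustration of how these variables and their defaults could be interpreted, here is a minimal sketch in plain Ruby. The method name `parse_extraction_options`, the constant, and the returned hash keys are hypothetical and are not taken from the actual rake task; only the variable names and defaults come from the list above.

```ruby
# Hypothetical helper: converts ENV-style string values into typed options.
# Defaults mirror the documented ones; all names here are illustrative.
DEFAULT_AUTO_LIMIT = 50

def parse_extraction_options(env)
  # "1, 2,3" -> [1, 2, 3]; a missing or blank value means "all eligible articles"
  ids = env["ARTICLE_IDS"].to_s.split(",").map(&:strip).reject(&:empty?).map { |id| Integer(id) }

  limit_raw = env["LIMIT"].to_s.strip
  limit =
    if !limit_raw.empty?
      Integer(limit_raw)      # an explicit limit always wins
    elsif ids.empty?
      DEFAULT_AUTO_LIMIT      # auto mode default
    end                       # nil means "no limit" when specific IDs are given

  {
    article_ids: ids,
    limit: limit,
    force_update: env["FORCE_UPDATE"].to_s.strip.downcase == "true",
    send_email: env["SEND_EMAIL"].to_s.strip.downcase == "true"
  }
end
```

Calling `parse_extraction_options(ENV)` inside a task would work the same way as passing a plain hash, which keeps the parsing easy to test in isolation.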

Output

The rake task prints progress as each article is processed, followed by a summary:

Starting comprehensive schema extraction...
  Article IDs: All eligible articles
  Limit: 2
  Force Update: false
  Send Email Report: false

Processing 2 articles...

  [1/2] Processing article 88: Radiant Floor Heating Transforms A Cold Chicago Basement... ✅ Extracted 1 schemas (HowTo) - high confidence
  [2/2] Processing article 86: Commercial Snow Melting For A Local Business In Lincolnshire, IL... ✅ Extracted 0 schemas () - none confidence

================================================================================
EXTRACTION SUMMARY
================================================================================
Total articles processed: 2
Successfully processed: 2
Errors: 0
Success rate: 100.0%

SUCCESSFULLY PROCESSED ARTICLES:
--------------------------------------------------
  88: Radiant Floor Heating Transforms A Cold Chicago Basement
    Status: Extracted 1 schemas
    Confidence: high
    Schema Types: HowTo
    CRM: https://crm.warmlyyours.me:3000/posts/88

  86: Commercial Snow Melting For A Local Business In Lincolnshire, IL
    Status: Extracted 0 schemas
    Confidence: none
    Schema Types: 
    CRM: https://crm.warmlyyours.me:3000/posts/86

================================================================================
Extraction completed!
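
The figures in the summary block can be derived from a list of per-article results. The sketch below assumes each result is a hash with an optional `:error` key. That shape and the `summarize` method are hypothetical and only illustrate the arithmetic behind the totals and the success rate.

```ruby
# Hypothetical summary helper; the result hash shape is an assumption.
def summarize(results)
  total = results.size
  errors = results.count { |r| r[:error] }
  successful = total - errors
  # Guard against division by zero when nothing was processed
  rate = total.zero? ? 0.0 : (successful * 100.0 / total).round(1)
  { total: total, successful: successful, errors: errors, success_rate: rate }
end
```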

Email Reports

When SEND_EMAIL=true is set, the system sends a detailed email report to the heatwave team including:

  • Processing statistics (total, successful, errors)
  • Detailed results for each processed article
  • Confidence levels and reasoning
  • Schema types extracted
  • Direct CRM links to view articles
  • Error details for failed extractions

Automated Processing

Scheduled Jobs

The system includes a daily scheduled job that automatically processes eligible articles:

  • Schedule: Daily at 2:00 AM Chicago time
  • Worker: AutoBlogSchemaExtractionWorker
  • Queue: api_heavy
  • Default Limit: 50 articles per run
  • Email Reports: Automatically sent after each run

Eligible Articles

Articles are considered eligible for automatic extraction if they:

  1. Are published or scheduled
  2. Have no schema markup (or empty schema markup)
  3. Have never been checked for schema extraction (schema_extracted_at is nil)
  4. Are blog posts (type = Post)
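
The criteria above can be expressed as a single predicate. This is a minimal sketch in plain Ruby using a `Struct` as a stand-in for the real model. The `state` field and its `published`/`scheduled` values are assumptions; only `schema_markup`, `schema_extracted_at`, and `type = Post` come from this document.

```ruby
# Hypothetical stand-in for the article record (not the real model).
Article = Struct.new(:type, :state, :schema_markup, :schema_extracted_at, keyword_init: true)

def eligible_for_extraction?(article)
  blog_post     = article.type == "Post"
  live          = %w[published scheduled].include?(article.state)  # assumed state names
  no_markup     = article.schema_markup.nil? || article.schema_markup.empty?
  never_checked = article.schema_extracted_at.nil?

  blog_post && live && no_markup && never_checked
end
```

In the real application the same conditions would more likely live in a database query so that ineligible rows are never loaded.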

Schema Types

The system currently supports extraction of:

  • FAQ (Frequently Asked Questions): For content with questions and answers
  • HowTo (Step-by-step instructions): For procedures, tutorials, and guides

Note: Article and BlogPosting schemas are handled separately by the system.
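
For orientation, the sketch below builds the two markup shapes using the schema.org vocabulary (`FAQPage` with `Question`/`Answer` entries, and `HowTo` with `HowToStep` entries). The helper names are hypothetical, and the exact structure the extractor stores in `schema_markup` is an assumption.

```ruby
require "json"

# Hypothetical builders; they only illustrate the schema.org shapes.
def faq_schema(pairs)
  questions = pairs.map do |question, answer|
    {
      "@type" => "Question",
      "name" => question,
      "acceptedAnswer" => { "@type" => "Answer", "text" => answer }
    }
  end
  { "@context" => "https://schema.org", "@type" => "FAQPage", "mainEntity" => questions }
end

def howto_schema(name, steps)
  how_to_steps = steps.map { |text| { "@type" => "HowToStep", "text" => text } }
  { "@context" => "https://schema.org", "@type" => "HowTo", "name" => name, "step" => how_to_steps }
end

# Serialize to JSON, e.g. for embedding in a JSON-LD script tag
puts JSON.pretty_generate(faq_schema([["Example question?", "Example answer."]]))
```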

Technical Details

Architecture

  • Direct Processing: Articles are processed synchronously for immediate feedback
  • AI-Powered: Uses OpenAI GPT-4 for intelligent schema identification and extraction
  • Cache Management: Automatically purges edge cache after schema updates
  • Error Handling: Failed extractions are counted in the run summary and reported with error details
  • Rate Limiting: Built-in delays between API calls to respect rate limits

Database Schema

  • schema_markup: JSONB column storing extracted schema markup
  • schema_extracted_at: Timestamp of last extraction attempt
  • schema_extraction_attempts: Number of extraction attempts

Performance

  • Batch Processing: Configurable limits to control processing load
  • Selective Updates: Only processes articles that need extraction (unless FORCE_UPDATE=true)
  • Efficient Queries: Optimized database queries for finding eligible articles

Troubleshooting

Common Issues

  1. No articles found: Check whether the articles meet the eligibility criteria
  2. API rate limits: System includes automatic retry logic with exponential backoff
  3. Extraction failures: Check article content quality and AI service availability
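
To show what retry with exponential backoff means in practice, here is a minimal generic sketch. It is not the system's actual retry code. `RateLimitError`, `with_backoff`, and the default attempt count and delay are all hypothetical.

```ruby
# Hypothetical error class standing in for whatever the AI client raises
# when it is rate limited.
class RateLimitError < StandardError; end

# Retries the block on RateLimitError, doubling the wait each time:
# base_delay, 2 * base_delay, 4 * base_delay, ...
# The sleeper is injectable so the logic can be tested without real waiting.
def with_backoff(max_attempts: 4, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue RateLimitError
    raise if attempt >= max_attempts   # give up and re-raise the last error
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A call such as `with_backoff { client.extract(article) }` would then wrap a single API request, where `client.extract` is again a placeholder for the real call.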

Monitoring

  • Check Sidekiq dashboard for job status
  • Review email reports for processing results
  • Monitor logs for detailed error information

Notes

  • All tasks use environment variables for input. Do not use positional arguments.
  • The system automatically handles edge cache purging after schema updates.
  • Email reports are sent to heatwaveteam@warmlyyours.com.
  • The schema_extracted_at column is set automatically on every extraction attempt, so an article that was checked but yielded no schema is not picked up again by automatic runs.
  • CRM links in reports point to /posts/<id> for blog post management.