Blog Schema Extraction

This system extracts structured data (schema markup) from blog articles and saves it to the database. It supports FAQ and HowTo schema types, and provides both manual and automated extraction capabilities with comprehensive reporting.

Overview

The blog schema extraction system consists of:

AutoBlogSchemaExtractionWorker: Core worker that processes articles directly and provides detailed reporting
BlogSchemaExtractor: Service that performs the actual AI-powered schema extraction
rake blog:schema:extract: Single rake task for all extraction operations
Scheduled Jobs: Daily automated extraction of eligible articles

Usage

The rake task takes all of its parameters from environment variables. Do not pass positional arguments.

Extract schema for articles

The main extraction task processes articles directly and provides comprehensive terminal reporting:

# Extract schema for all eligible articles (auto mode)
rake blog:schema:extract

# Extract schema for up to 10 eligible articles
LIMIT=10 rake blog:schema:extract

# Extract schema for specific articles by ID
ARTICLE_IDS=1,2,3 rake blog:schema:extract

# Re-extract schema for specific articles (at most 5), even if schema markup already exists
ARTICLE_IDS=1,2,3 LIMIT=5 FORCE_UPDATE=true rake blog:schema:extract

# Process articles and send email report
LIMIT=10 SEND_EMAIL=true rake blog:schema:extract

Environment Variables

Variable	Description	Default
`ARTICLE_IDS`	Comma-separated list of article IDs to process	All eligible articles
`LIMIT`	Maximum number of articles to process	50 (auto mode), no limit (specific IDs)
`FORCE_UPDATE`	Re-extract schemas even if they exist	`false`
`SEND_EMAIL`	Send email report when complete	`false`
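
The defaults in the table can be illustrated with a small parsing helper. This is only a sketch: `extraction_options` and `DEFAULT_AUTO_LIMIT` are hypothetical names, and the real rake task may read its variables differently.

```ruby
# Hypothetical helper (not taken from the actual rake task): converts the
# documented environment variables into an options hash.
DEFAULT_AUTO_LIMIT = 50 # documented default for auto mode

def extraction_options(env = ENV)
  # "1, 2,3" -> [1, 2, 3]; an unset or blank variable yields an empty list
  ids = env.fetch("ARTICLE_IDS", "")
           .split(",")
           .map(&:strip)
           .reject(&:empty?)
           .map { |id| Integer(id, 10) }

  limit_raw = env["LIMIT"].to_s.strip
  limit = if !limit_raw.empty?
            Integer(limit_raw, 10)
          elsif ids.empty?
            DEFAULT_AUTO_LIMIT # auto mode: cap at 50
          end                  # specific IDs and no LIMIT: nil means "no limit"

  {
    article_ids: ids.empty? ? nil : ids, # nil means "all eligible articles"
    limit: limit,
    force_update: env["FORCE_UPDATE"].to_s.strip.casecmp?("true"),
    send_email: env["SEND_EMAIL"].to_s.strip.casecmp?("true")
  }
end
```

Under these assumptions, calling `extraction_options({ "ARTICLE_IDS" => "1,2,3" })` leaves `limit` as `nil`, while calling it with an empty hash falls back to the auto-mode limit of 50.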

Output

The rake task provides real-time feedback:

Starting comprehensive schema extraction...
  Article IDs: All eligible articles
  Limit: 2
  Force Update: false
  Send Email Report: false

Processing 2 articles...

  [1/2] Processing article 88: Radiant Floor Heating Transforms A Cold Chicago Basement... ✅ Extracted 1 schemas (HowTo) - high confidence
  [2/2] Processing article 86: Commercial Snow Melting For A Local Business In Lincolnshire, IL... ✅ Extracted 0 schemas () - none confidence

================================================================================
EXTRACTION SUMMARY
================================================================================
Total articles processed: 2
Successfully processed: 2
Errors: 0
Success rate: 100.0%

SUCCESSFULLY PROCESSED ARTICLES:
--------------------------------------------------
  88: Radiant Floor Heating Transforms A Cold Chicago Basement
    Status: Extracted 1 schemas
    Confidence: high
    Schema Types: HowTo
    CRM: https://crm.warmlyyours.me:3000/posts/88

  86: Commercial Snow Melting For A Local Business In Lincolnshire, IL
    Status: Extracted 0 schemas
    Confidence: none
    Schema Types:
    CRM: https://crm.warmlyyours.me:3000/posts/86

================================================================================
Extraction completed!

Email Reports

When SEND_EMAIL=true is set, the system sends a detailed email report to the heatwave team including:

Processing statistics (total, successful, errors)
Detailed results for each processed article
Confidence levels and reasoning
Schema types extracted
Direct CRM links to view articles
Error details for failed extractions

Automated Processing

Scheduled Jobs

The system includes a daily scheduled job that automatically processes eligible articles:

Schedule: Daily at 2:00 AM Chicago time
Worker: AutoBlogSchemaExtractionWorker
Queue: api_heavy
Default Limit: 50 articles per run
Email Reports: Automatically sent after each run

Eligible Articles

Articles are considered eligible for automatic extraction if they:

Are blog posts (type = Post)
Are published or scheduled
Have no schema markup (or empty schema markup)
Have never been checked for schema extraction (schema_extracted_at is nil)
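
The criteria above can be sketched as a plain Ruby predicate. The `Article` struct below is a hypothetical stand-in for the real model, and the `state` field with its `published`/`scheduled` values is an assumption; only `schema_markup` and `schema_extracted_at` are column names confirmed by this document.

```ruby
# Hypothetical stand-in for the real model, used only for illustration.
Article = Struct.new(:id, :type, :state, :schema_markup, :schema_extracted_at,
                     keyword_init: true)

# True when an article meets every documented eligibility criterion.
def eligible_for_extraction?(article)
  article.type == "Post" &&
    %w[published scheduled].include?(article.state) &&               # assumed state names
    (article.schema_markup.nil? || article.schema_markup.empty?) &&  # no or empty markup
    article.schema_extracted_at.nil?                                 # never checked
end

# Returns at most `limit` eligible articles, preserving input order.
def select_eligible(articles, limit: 50)
  articles.select { |article| eligible_for_extraction?(article) }.first(limit)
end
```

In the real system this filter would more likely be a database query than an in-memory scan; the predicate form is used here only to make the four conditions explicit.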

Schema Types

The system currently supports extraction of:

FAQ (Frequently Asked Questions): For content with questions and answers
HowTo (Step-by-step instructions): For procedures, tutorials, and guides

Note: Article and BlogPosting schemas are handled separately by the system.
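
For orientation, the two supported types correspond to the schema.org `FAQPage` and `HowTo` vocabularies. The sketch below builds minimal JSON-LD objects for each. The helper names `faq_schema` and `howto_schema` are hypothetical, and the exact shape the extractor stores in schema_markup is not specified here, so treat this as an illustration of the vocabulary rather than of the stored format.

```ruby
require "json"

# Builds a minimal schema.org FAQPage from [question, answer] pairs.
def faq_schema(pairs)
  questions = pairs.map do |question, answer|
    {
      "@type" => "Question",
      "name" => question,
      "acceptedAnswer" => { "@type" => "Answer", "text" => answer }
    }
  end

  { "@context" => "https://schema.org", "@type" => "FAQPage", "mainEntity" => questions }
end

# Builds a minimal schema.org HowTo from a title and an ordered list of step texts.
def howto_schema(name, steps)
  how_to_steps = steps.each_with_index.map do |text, index|
    { "@type" => "HowToStep", "position" => index + 1, "text" => text }
  end

  { "@context" => "https://schema.org", "@type" => "HowTo", "name" => name, "step" => how_to_steps }
end
```

Either hash can be serialized with `JSON.pretty_generate` to inspect the resulting JSON-LD.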

Technical Details

Architecture

Direct Processing: Articles are processed synchronously for immediate feedback
AI-Powered: Uses OpenAI GPT-4 to identify and extract FAQ and HowTo schema from article content
Cache Management: Automatically purges edge cache after schema updates
Error Handling: Failed extractions are counted and listed with error details in the summary and email report
Rate Limiting: Built-in delays between API calls to respect rate limits

Database Schema

schema_markup: JSONB column storing extracted schema markup
schema_extracted_at: Timestamp of last extraction attempt
schema_extraction_attempts: Number of extraction attempts
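
How the three columns might change after one extraction attempt can be sketched as follows. `ArticleRecord` and `record_extraction` are hypothetical names, and keeping the existing markup when nothing is extracted is an assumption, not documented behavior.

```ruby
# Hypothetical in-memory stand-in for a row with the three documented columns.
ArticleRecord = Struct.new(:schema_markup, :schema_extracted_at,
                           :schema_extraction_attempts, keyword_init: true)

# Records the outcome of one extraction attempt on the given record.
def record_extraction(article, schemas, now: Time.now)
  # Assumption: an empty result leaves any existing markup untouched.
  article.schema_markup = schemas unless schemas.empty?
  # The timestamp marks the attempt, whether or not anything was extracted.
  article.schema_extracted_at = now
  # nil.to_i is 0, so the first attempt sets the counter to 1.
  article.schema_extraction_attempts = article.schema_extraction_attempts.to_i + 1
  article
end
```

Because the timestamp is set even when zero schemas are found, an article like number 86 in the sample output would no longer count as "never checked" on the next automatic run.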

Performance

Batch Processing: Configurable limits to control processing load
Selective Updates: Only processes articles that need extraction (unless FORCE_UPDATE=true)
Efficient Queries: Optimized database queries for finding eligible articles

Troubleshooting

Common Issues

No articles found: Check whether the articles meet the eligibility criteria
API rate limits: System includes automatic retry logic with exponential backoff
Extraction failures: Check article content quality and AI service availability
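
As background for the rate-limit entry, retry with exponential backoff generally looks like the sketch below. It is a generic illustration, not the system's actual retry code: `RateLimitError`, `with_backoff`, and the default attempt count and delay are all assumptions.

```ruby
# Hypothetical error class standing in for whatever the API client raises
# when a rate limit is hit.
class RateLimitError < StandardError; end

# Runs the block, retrying on RateLimitError with delays of
# base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
# `sleeper` is injectable so the delays can be observed without waiting.
def with_backoff(max_attempts: 4, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue RateLimitError
    raise if attempt >= max_attempts # give up after the final attempt
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A call such as `with_backoff { extract(article) }` (where `extract` is a placeholder for the API call) would attempt the block up to four times, waiting 1, 2, and then 4 seconds between attempts.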

Monitoring

Check Sidekiq dashboard for job status
Review email reports for processing results
Monitor logs for detailed error information

Notes

The rake task takes its input from environment variables. Do not pass positional arguments.
The system automatically handles edge cache purging after schema updates.
Email reports are sent to heatwaveteam@warmlyyours.com.
The schema_extracted_at column is automatically timestamped whenever schema extraction occurs.
CRM links in reports point to /posts/<id> for blog post management.