Blog Schema Extraction
This system extracts structured data (schema markup) from blog articles and saves it to the database. It supports FAQ and HowTo schema types, and provides both manual and automated extraction capabilities with comprehensive reporting.
Overview
The blog schema extraction system consists of:
- AutoBlogSchemaExtractionWorker: Core worker that processes articles directly and provides detailed reporting
- BlogSchemaExtractor: Service that performs the actual AI-powered schema extraction
- rake blog:schema:extract: Single rake task for all extraction operations
- Scheduled Jobs: Daily automated extraction of eligible articles
All rake tasks use environment variables for parameters. Do not use positional arguments.
Extract schema for articles
The main extraction task processes articles directly and provides comprehensive terminal reporting:

    # Extract schema for all eligible articles (auto mode)
    rake blog:schema:extract

    # Extract schema for up to 10 eligible articles
    LIMIT=10 rake blog:schema:extract

    # Extract schema for specific articles by ID
    ARTICLE_IDS=1,2,3 rake blog:schema:extract

    # Extract schema for up to 5 specific articles, forcing update even if schema exists
    ARTICLE_IDS=1,2,3 LIMIT=5 FORCE_UPDATE=true rake blog:schema:extract

    # Process articles and send email report
    LIMIT=10 SEND_EMAIL=true rake blog:schema:extract

Environment Variables
| Variable | Description | Default |
|---|---|---|
| ARTICLE_IDS | Comma-separated list of article IDs to process | All eligible articles |
| LIMIT | Maximum number of articles to process | 50 (auto mode), no limit (specific IDs) |
| FORCE_UPDATE | Re-extract schemas even if they exist | false |
| SEND_EMAIL | Send email report when complete | false |
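
The defaults in the table can be illustrated with a small parsing helper. This is only a sketch of how such variables might be read: the method name `parse_extraction_options`, the constant, and the hash keys are hypothetical and are not taken from the actual rake task.

```ruby
# Hypothetical helper (not the real task code): turns the documented
# environment variables into an options hash.
DEFAULT_AUTO_LIMIT = 50 # documented default for auto mode

def parse_extraction_options(env)
  # "1, 2,3" -> [1, 2, 3]; a missing or blank variable means auto mode
  article_ids = env.fetch("ARTICLE_IDS", "")
                   .split(",")
                   .map(&:strip)
                   .reject(&:empty?)
                   .map { |id| Integer(id, 10) }

  limit =
    if env["LIMIT"] && !env["LIMIT"].strip.empty?
      Integer(env["LIMIT"], 10)
    elsif article_ids.empty?
      DEFAULT_AUTO_LIMIT
    end # specific IDs without LIMIT: nil, meaning "no limit"

  {
    article_ids: article_ids,
    limit: limit,
    force_update: env["FORCE_UPDATE"].to_s.strip.downcase == "true",
    send_email: env["SEND_EMAIL"].to_s.strip.downcase == "true"
  }
end
```

Inside a rake task this could be called as `parse_extraction_options(ENV)`, since `ENV` supports both `fetch` and `[]`.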
Output
The rake task provides real-time feedback:

    Starting comprehensive schema extraction...
      Article IDs: All eligible articles
      Limit: 2
      Force Update: false
      Send Email Report: false

    Processing 2 articles...

    [1/2] Processing article 88: Radiant Floor Heating Transforms A Cold Chicago Basement...
      ✅ Extracted 1 schemas (HowTo) - high confidence
    [2/2] Processing article 86: Commercial Snow Melting For A Local Business In Lincolnshire, IL...
      ✅ Extracted 0 schemas () - none confidence

    ================================================================================
    EXTRACTION SUMMARY
    ================================================================================
    Total articles processed: 2
    Successfully processed: 2
    Errors: 0
    Success rate: 100.0%

    SUCCESSFULLY PROCESSED ARTICLES:
    --------------------------------------------------
      88: Radiant Floor Heating Transforms A Cold Chicago Basement
        Status: Extracted 1 schemas
        Confidence: high
        Schema Types: HowTo
        CRM: https://crm.warmlyyours.me:3000/posts/88

      86: Commercial Snow Melting For A Local Business In Lincolnshire, IL
        Status: Extracted 0 schemas
        Confidence: none
        Schema Types:
        CRM: https://crm.warmlyyours.me:3000/posts/86

    ================================================================================
    Extraction completed!

Email Reports
When SEND_EMAIL=true is set, the system sends a detailed email report to the heatwave team including:
- Processing statistics (total, successful, errors)
- Detailed results for each processed article
- Confidence levels and reasoning
- Schema types extracted
- Direct CRM links to view articles
- Error details for failed extractions
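
The processing statistics in the report (total, successful, errors, success rate) can be derived from per-article results. The sketch below assumes a hypothetical `Result` record and `summarize` method; the real worker's data structures are not shown in this document and may differ.

```ruby
# Hypothetical per-article result; field names are assumptions.
Result = Struct.new(:article_id, :success, :schema_types, :confidence, :error,
                    keyword_init: true)

# Builds the summary numbers shown in the terminal and email reports.
def summarize(results)
  total = results.size
  succeeded = results.count(&:success)
  {
    total: total,
    succeeded: succeeded,
    errors: total - succeeded,
    # Guard against division by zero when nothing was processed
    success_rate: total.zero? ? 0.0 : (succeeded * 100.0 / total).round(1)
  }
end

results = [
  Result.new(article_id: 88, success: true, schema_types: ["HowTo"], confidence: "high"),
  Result.new(article_id: 86, success: true, schema_types: [], confidence: "none")
]
summary = summarize(results)
```

For the two articles in the sample run, this yields 2 processed, 2 successful, 0 errors, and a success rate of 100.0.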
Automated Processing
Scheduled Jobs
The system includes a daily scheduled job that automatically processes eligible articles:
- Schedule: Daily at 2:00 AM Chicago time
- Worker: AutoBlogSchemaExtractionWorker
- Queue: api_heavy
- Default Limit: 50 articles per run
- Email Reports: Automatically sent after each run
Eligible Articles
Articles are considered eligible for automatic extraction if they:
- Are published or scheduled blog posts
- Have no schema markup (or empty schema markup)
- Have never been checked for schema extraction (schema_extracted_at is nil)
- Are blog posts (type = Post)
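
The eligibility rules above can be expressed as a single predicate. This is a plain-Ruby sketch rather than the system's actual query: the `Article` struct, the `state` field, and its `"published"`/`"scheduled"` values are assumptions made for illustration. Only `schema_markup`, `schema_extracted_at`, and the `Post` type are named in this document.

```ruby
# Hypothetical in-memory stand-in for an article row.
Article = Struct.new(:type, :state, :schema_markup, :schema_extracted_at,
                     keyword_init: true)

ELIGIBLE_STATES = %w[published scheduled].freeze # assumed state names

def eligible_for_extraction?(article)
  article.type == "Post" &&
    ELIGIBLE_STATES.include?(article.state) &&
    # nil or empty markup both count as "no schema markup"
    (article.schema_markup.nil? || article.schema_markup.empty?) &&
    # never checked before
    article.schema_extracted_at.nil?
end
```

In a Rails application the same conditions would more likely live in a database scope so that ineligible rows are never loaded.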
Schema Types
The system currently supports extraction of:
- FAQ (Frequently Asked Questions): For content with questions and answers
- HowTo (Step-by-step instructions): For procedures, tutorials, and guides
Note: Article and BlogPosting schemas are handled separately by the system.
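
For reference, the two supported types roughly correspond to the schema.org `FAQPage` and `HowTo` vocabularies. The sketch below builds minimal JSON-LD for each. The helper names are hypothetical, and whether the extractor stores exactly these shapes in schema_markup is an assumption.

```ruby
require "json"

# Builds a schema.org FAQPage from [question, answer] pairs.
def faq_schema(pairs)
  {
    "@context" => "https://schema.org",
    "@type" => "FAQPage",
    "mainEntity" => pairs.map { |question, answer|
      {
        "@type" => "Question",
        "name" => question,
        "acceptedAnswer" => { "@type" => "Answer", "text" => answer }
      }
    }
  }
end

# Builds a schema.org HowTo from a title and an ordered list of step texts.
def how_to_schema(name, steps)
  {
    "@context" => "https://schema.org",
    "@type" => "HowTo",
    "name" => name,
    "step" => steps.each_with_index.map { |text, index|
      { "@type" => "HowToStep", "position" => index + 1, "text" => text }
    }
  }
end

faq_json = JSON.generate(faq_schema([["Example question?", "Example answer."]]))
how_to_json = JSON.generate(how_to_schema("Example procedure", ["First step", "Second step"]))
```

The question, answer, and step strings here are placeholders, not content from a real article.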
Technical Details
Architecture
- Direct Processing: Articles are processed synchronously for immediate feedback
- AI-Powered: Uses OpenAI GPT-4 for intelligent schema identification and extraction
- Cache Management: Automatically purges edge cache after schema updates
- Error Handling: Comprehensive error handling with detailed reporting
- Rate Limiting: Built-in delays between API calls to respect rate limits
Database Schema
- schema_markup: JSONB column storing extracted schema markup
- schema_extracted_at: Timestamp of last extraction attempt
- schema_extraction_attempts: Number of extraction attempts
Performance
- Batch Processing: Configurable limits to control processing load
- Selective Updates: Only processes articles that need extraction (unless FORCE_UPDATE=true)
- Efficient Queries: Optimized database queries for finding eligible articles
Troubleshooting
Common Issues
- No articles found: Check if articles meet eligibility criteria
- API rate limits: System includes automatic retry logic with exponential backoff
- Extraction failures: Check article content quality and AI service availability
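
Retry with exponential backoff, as mentioned for API rate limits, generally follows the pattern below. This is a generic sketch, not the system's actual retry code: `RateLimitError`, `with_backoff`, and the default attempt count and delays are all hypothetical.

```ruby
# Hypothetical error class standing in for whatever the AI client raises
# when it is rate limited.
class RateLimitError < StandardError; end

# Runs the block, retrying on RateLimitError with delays of
# base_delay, 2 * base_delay, 4 * base_delay, and so on.
# `sleeper` is injectable so the delays can be observed without waiting.
def with_backoff(max_attempts: 4, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue RateLimitError
    raise if attempt >= max_attempts # give up and surface the error
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A caller would wrap the rate-limited request in the block, for example `with_backoff { client.extract(article) }`, where `client.extract` is a placeholder for the real API call.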
Monitoring
- Check Sidekiq dashboard for job status
- Review email reports for processing results
- Monitor logs for detailed error information
- All tasks use environment variables for input. Do not use positional arguments.
- The system automatically handles edge cache purging after schema updates.
- Email reports are sent to heatwaveteam@warmlyyours.com.
- The schema_extracted_at column is automatically timestamped whenever schema extraction occurs.
- CRM links in reports point to /posts/<id> for blog post management.