Video System Documentation
This document provides comprehensive documentation for the video system, including upload, transcription, processing, and management features.
Table of Contents
- Overview
- Video Upload Process
- Video Transcription System
- VTT Generation
- Rake Tasks
- API Integration
- UI Components
- Troubleshooting
Overview
The video system provides comprehensive video management capabilities including:
- Video Upload: Direct creator uploads via Cloudflare Stream
- Transcription: High-quality transcription with AssemblyAI
- VTT Generation: Dynamic caption generation from structured transcripts
- SEO Optimization: Automated metadata generation
- Background Processing: Scalable job processing with Sidekiq
Key Components
- Video Model: Core data model with structured transcript JSON storage
- VideoProcessing::TranscriptionService: Main transcription orchestration
- VideoProcessing::VideoTranslationService: Caption translation to FR/ES/PL
- TranscriptionPolisherService: Fallback regexp-based text corrections
- VideoProcessing::SeoService: AI-powered SEO content generation
- AssemblyaiClient: AssemblyAI API integration (transcription + LLM Gateway)
- VideoTranscriptionWorker: Background job processing
AI Processing via AssemblyAI LLM Gateway
As of December 2025, AI text processing (polishing, paragraphs, translation) uses AssemblyAI's LLM Gateway; SEO generation remains on OpenAI:
| Task | Previously | Now |
|---|---|---|
| Caption Polishing | Regex only | LeMUR (Claude) + regex fallback |
| Paragraph Generation | OpenAI GPT-4 | LLM Gateway (Claude) |
| Translation | DeepL API | LLM Gateway (Claude) |
| SEO Generation | OpenAI GPT-4 | OpenAI GPT-4o (still on OpenAI, not moved to the LLM Gateway) |
This consolidation provides:
- Consistent quality: Same AI model for polishing, paragraph generation, and translation
- Context awareness: LLM understands caption timing and flow
- Better translations: Context-aware, preserves brand names
- Simpler architecture: Single API for most AI tasks
Video Upload Process
Overview
The video upload process uses Cloudflare Stream's direct creator upload feature. An upload involves three parties: Uppy in the browser, Heatwave on the server, and Cloudflare for storage and streaming.
Sequence Diagram
The upload process follows this sequence:
- Uppy Initialization: Client-side uploader setup
- Heatwave Processing: Server-side video processing
- Cloudflare Storage: Final video storage and streaming
Video Transcription System
Overview
The video transcription system provides high-quality transcription with speaker diarization, timestamps, and SEO content generation using AssemblyAI.
Architecture
Services
- VideoProcessing::TranscriptionService - Core transcription service with granular methods
- TranscriptionPolisherService - Regexp-based text corrections and company terminology
- VideoProcessing::SeoService - SEO content generation using RubyLLM (OpenAI)
- AssemblyaiClient - Client for interacting with AssemblyAI API
- VideoTranscriptionWorker - Background job for comprehensive transcription workflow
Service Responsibilities
- AudioExtractionService: Pure audio extraction from file paths (reusable, testable)
- VideoProcessing::AudioExtractionService: Video-specific audio extraction with upload storage
- VideoProcessing::TranscriptionService: Core transcription logic with granular methods
- TranscriptionPolisherService: Fast, reliable regexp-based text corrections
- VideoProcessing::SeoService: Generates SEO content using RubyLLM (OpenAI GPT-4o with structured JSON output)
- AssemblyaiClient: Handles all AssemblyAI API interactions
- VideoTranscriptionWorker: Background job orchestrator with progress tracking
Three-Step Workflow
Step 1: Retrieve Original VTT and Sentences from AssemblyAI
- Downloads raw VTT captions from AssemblyAI's /v2/transcript/:transcript_id/vtt endpoint
- Retrieves semantically segmented sentences from the /v2/transcript/:transcript_id/sentences endpoint
- Stores data as vtt_original and sentences in structured_transcript_json
- Ensures transcription status is completed before proceeding
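A minimal sketch of Step 1, under stated assumptions: the helper names (transcript_endpoints, store_step_one) are hypothetical, the HTTP calls themselves are left out, and the stored value shapes are not confirmed by this document. Only the endpoint paths, the vtt_original and sentences keys, and the completed status check come from the description above.

```ruby
ASSEMBLYAI_BASE = 'https://api.assemblyai.com'

# Hypothetical helper: build the two Step 1 endpoint URLs for a transcript.
def transcript_endpoints(transcript_id)
  {
    vtt: "#{ASSEMBLYAI_BASE}/v2/transcript/#{transcript_id}/vtt",
    sentences: "#{ASSEMBLYAI_BASE}/v2/transcript/#{transcript_id}/sentences"
  }
end

# Hypothetical helper: merge fetched data into the structured transcript,
# refusing to proceed unless the transcript status is "completed".
def store_step_one(structured, vtt_body, sentences)
  unless structured['status'] == 'completed'
    raise ArgumentError, "transcript not completed (status: #{structured['status'].inspect})"
  end

  structured.merge('vtt_original' => vtt_body, 'sentences' => sentences)
end
```

Keeping the status check next to the merge means a partially processed transcript can never be written into structured_transcript_json by this path.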
Step 2: Polish Transcript and Generate Paragraphs
- Uses AssemblyAI LLM Gateway (Claude) for AI-powered polishing:
  - Company terminology corrections (e.g., "Warmly Yours" → "WarmlyYours")
  - Grammar, punctuation, and typo fixes
  - Context-aware corrections that understand caption flow
- Falls back to TranscriptionPolisherService (regex) if LLM fails
- Stores polished data as vtt_polished in structured_transcript_json
- Uses LLM Gateway to generate natural paragraphs from polished text
- Creates HTML transcript for video page display
- Saves HTML transcript to video.transcript field
Prompts are configurable via Settings:
- video_processing_polish_system_prompt
- video_processing_polish_user_prompt
- video_processing_paragraph_system_prompt
- video_processing_paragraph_user_prompt
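The LLM-first, regex-fallback behaviour of Step 2 can be sketched roughly as follows. This is an illustration, not the actual service code: FALLBACK_CORRECTIONS, regex_polish, and polish_with_fallback are hypothetical names, the single correction is the example used in this document, and the LLM call is represented by any callable that takes text and returns text.

```ruby
# Illustrative correction only (hypothetical constant).
FALLBACK_CORRECTIONS = {
  /\bWarmly\s+Yours\b/i => 'WarmlyYours'
}.freeze

# Regex-only pass, standing in for TranscriptionPolisherService.
def regex_polish(text)
  FALLBACK_CORRECTIONS.reduce(text) { |acc, (pattern, fix)| acc.gsub(pattern, fix) }
end

# Try the LLM first; on any error or blank result, fall back to regex.
# `llm` is any callable that takes text and returns polished text.
def polish_with_fallback(text, llm:)
  polished = llm.call(text)
  return regex_polish(text) if polished.nil? || polished.strip.empty?

  polished
rescue StandardError
  regex_polish(text)
end
```

Passing the LLM as a callable keeps the fallback path easy to exercise in tests, for example with a lambda that always raises.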
Step 3: Generate SEO Metadata
- Uses AI to create SEO-friendly content from transcript:
  - meta_title (50-60 characters)
  - meta_description (150-160 characters)
  - sub_header (100-150 characters)
  - expanded_description (200-300 words)
- Updates video model fields directly
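The length targets listed for Step 3 lend themselves to a simple check. The sketch below is one possible way to validate generated fields against those ranges; the constant and method names are hypothetical, and the real service may validate differently.

```ruby
# Target ranges from Step 3 (characters, except expanded_description in words).
SEO_CHAR_LIMITS = {
  'meta_title' => 50..60,
  'meta_description' => 150..160,
  'sub_header' => 100..150
}.freeze
SEO_WORD_LIMITS = { 'expanded_description' => 200..300 }.freeze

# Returns a hash of field => message for every field outside its target range.
def seo_length_errors(fields)
  errors = {}
  SEO_CHAR_LIMITS.each do |name, range|
    length = fields[name].to_s.length
    errors[name] = "#{length} chars, expected #{range}" unless range.cover?(length)
  end
  SEO_WORD_LIMITS.each do |name, range|
    words = fields[name].to_s.split.size
    errors[name] = "#{words} words, expected #{range}" unless range.cover?(words)
  end
  errors
end
```

An empty result means every field is within range; a non-empty one could be logged or used to trigger a regeneration.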
Features
Transcription Options Interface
The system provides a granular transcription options interface that allows users to:
- Select specific steps: Choose which parts of the transcription workflow to execute
- Configure speaker detection: Set the expected number of speakers (1-10) for improved accuracy, or use "Auto Detect" for automatic speaker detection
- Conditional execution: Skip steps that have already been completed
- Progress tracking: Monitor job progress with detailed status updates
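Step selection combined with conditional execution could look roughly like the sketch below. The step names, the fields used to decide whether a step is already done, and the force flag are all assumptions made for illustration; the actual options interface may track completion differently.

```ruby
# Step names and completion checks are assumptions for illustration.
STEPS = {
  retrieve: ->(video) { !video[:vtt_original].nil? },
  polish:   ->(video) { !video[:vtt_polished].nil? },
  seo:      ->(video) { !video[:meta_title].nil? }
}.freeze

# Return the selected steps that still need to run, in workflow order.
# With force: true, completed steps are rerun as well.
def steps_to_run(video, selected:, force: false)
  STEPS.keys.select do |step|
    selected.include?(step) && (force || !STEPS[step].call(video))
  end
end
```

For a video that already has vtt_original, selecting all three steps would yield only polish and seo unless force is set.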
Speaker Diarization
- Automatic speaker detection: Identifies different speakers in the audio
- Speaker labeling: Labels speakers as "Speaker A", "Speaker B", etc.
- Configurable speaker count: Users can specify expected number of speakers (1-10) for improved accuracy
- Speaker statistics: Calculates talk time and word count for each speaker
Structured Data
The service retrieves and stores complete transcript data from AssemblyAI, including:
{
  "id": "transcript_id",
  "status": "completed",
  "confidence": 0.946,
  "audio_duration": 483.2,
  "utterances": [
    {
      "confidence": 0.98,
      "end": 5000,
      "speaker": "A",
      "start": 0,
      "text": "Hello, welcome to our video."
    }
  ]
}
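Speaker statistics (talk time and word count per speaker) can be derived from the utterances array shown above. The following is a minimal sketch, assuming start and end are in milliseconds as in the sample; the method name is hypothetical and the real service may compute these values differently.

```ruby
# Aggregate talk time (seconds) and word count per speaker.
# Assumes `start` and `end` are milliseconds, as in the sample above.
def speaker_statistics(utterances)
  stats = Hash.new { |hash, speaker| hash[speaker] = { talk_time_seconds: 0.0, word_count: 0 } }
  utterances.each do |utterance|
    entry = stats[utterance['speaker']]
    entry[:talk_time_seconds] += (utterance['end'] - utterance['start']) / 1000.0
    entry[:word_count] += utterance['text'].split.size
  end
  stats
end

sample = [
  { 'speaker' => 'A', 'start' => 0, 'end' => 5000, 'text' => 'Hello, welcome to our video.' }
]
p speaker_statistics(sample)
```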
Usage Examples
Basic Transcription
# Initialize service
transcription_service = VideoProcessing::TranscriptionService.new(video)
# Extract audio and submit for transcription
transcription_service.extract_audio
transcription_service.submit_transcription
# Retrieve and process transcript
transcription_service.retrieve_and_overwrite_structured_transcript
transcription_service.polish_transcript_with_company_terminology
transcription_service.
Background Processing
# Queue transcription job
VideoTranscriptionWorker.perform_async(video.id, )
# Run the job inline in the current process (useful for debugging; this does not monitor a queued job)
VideoTranscriptionWorker.new.perform(video.id, )
VTT Generation
Overview
The system generates VTT (WebVTT) caption files dynamically from the polished structured transcript JSON instead of storing them as uploads. This ensures that captions contain the same corrections and improvements applied to the transcript text.
Problem Solved
Previously, VTT files were retrieved directly from AssemblyAI using raw transcript data and stored as uploads. However, the structured transcript JSON goes through a polishing process that:
- Fixes grammar and spelling mistakes
- Corrects company terminology (e.g., "Warmly Yours" → "WarmlyYours")
- Improves sentence structure and readability
The raw VTT file didn't include these corrections, creating a mismatch between transcript and captions.
Solution
The new system generates VTT captions dynamically on-demand from the structured transcript JSON, ensuring that:
- Captions match the polished transcript exactly
- Timing information is preserved from the original structured data
- Company terminology corrections are applied consistently
- VTT files are always current and don't require regeneration
Implementation
Key Methods
- VideoProcessing::TranscriptionService#generate_vtt_content_from_structured_transcript: Generates VTT content from structured transcript JSON
- VideoProcessing::TranscriptionService#generate_vtt_content_from_polished_vtt: Creates VTT content from polished VTT data
- VideosController#download_vtt: Controller action that generates and serves VTT files
Caption Formatting
The system creates captions with:
- Timing: Preserves original start/end timestamps from polished VTT data
- Text: Uses polished text with company terminology corrections
- VTT format: Standard WebVTT format with proper timestamps
Example VTT Output
WEBVTT

1
00:00:00.000 --> 00:00:05.000
Hello, welcome to our video about floor heating systems.

2
00:00:05.000 --> 00:00:10.000
Today we will discuss the benefits of radiant floor heating.
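To illustrate how output like the example above can be produced from timed cues, here is a small sketch. The cue shape (a hash with 'start' and 'end' in milliseconds plus 'text') is an assumption, and vtt_timestamp and build_vtt are hypothetical names, not the service's own methods.

```ruby
# Format milliseconds as a WebVTT timestamp (HH:MM:SS.mmm).
def vtt_timestamp(ms)
  seconds, millis = ms.divmod(1000)
  minutes, seconds = seconds.divmod(60)
  hours, minutes = minutes.divmod(60)
  format('%02d:%02d:%02d.%03d', hours, minutes, seconds, millis)
end

# Build a WebVTT document from cues. The cue shape (hash with 'start',
# 'end' in milliseconds and 'text') is an assumption for this sketch.
def build_vtt(cues)
  blocks = cues.each_with_index.map do |cue, index|
    "#{index + 1}\n#{vtt_timestamp(cue['start'])} --> #{vtt_timestamp(cue['end'])}\n#{cue['text']}"
  end
  (['WEBVTT'] + blocks).join("\n\n") + "\n"
end
```

Because the text comes from the polished cues while the timestamps are carried over unchanged, captions built this way stay aligned with the original audio.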
Usage
For New Transcriptions
VTT captions are generated dynamically from polished data when requested.
For Existing Videos
VTT captions are automatically available for any video with structured transcript JSON data containing polished VTT.
Download VTT Files
- Navigate to the video show page
- Go to the "Transcript" tab
- Click "Download Original VTT" or "Download Polished VTT" in the respective panels
Programmatically
# Generate VTT content for a specific video
video = Video.find(video_id)
service = VideoProcessing::TranscriptionService.new(video)
vtt_content = service.generate_vtt_content_from_structured_transcript
# Download VTT file via controller action
# GET /videos/:id/download_vtt?type=original
# GET /videos/:id/download_vtt?type=polished
Rake Tasks
Overview
All video-related rake tasks are consolidated in lib/tasks/video.rake for easy management and organization.
Available Tasks
Transcription Tasks
- video:transcription:process[VIDEO_ID] - Process transcription for specific video
- video:transcription:process_all - Process all videos without transcripts
- video:transcription:process_by_category[CAT] - Process videos by category
- video:transcription:process_with_limit[LIMIT] - Process videos with limit
- video:transcription:stats - Show transcription statistics
VTT Processing Tasks
- video:vtt:retrieve_transcript[VIDEO_ID] - Step 1: Retrieve from AssemblyAI
- video:vtt:polish_transcript[VIDEO_ID] - Step 2: Polish with terminology
- video:vtt:summarize_video[VIDEO_ID] - Step 3: Generate metadata
- video:vtt:test_processing[VIDEO_ID] - Test full workflow
- video:vtt:process_all - Process all VTT
- video:vtt:extract_and_transcribe - Extract audio & submit for transcription
- video:vtt:list_available - List videos with structured data
- video:vtt:test_generation[VIDEO_ID] - Test VTT generation
General Tasks
- video:stats - Show comprehensive statistics
- video:help - Show help message with all available tasks
Usage Examples
# Show all available tasks
bundle exec rake video:help
# Process transcription for specific video
bundle exec rake video:transcription:process[12345]
# Extract audio and submit for transcription
bundle exec rake video:vtt:extract_and_transcribe
# Show video statistics
bundle exec rake video:stats
API Integration
AssemblyAI Integration
The system integrates with AssemblyAI for high-quality transcription services.
Key Features
- High Accuracy: Advanced speech recognition with 95%+ accuracy
- Speaker Diarization: Automatic speaker identification and labeling
- Timestamps: Precise word-level and segment-level timing
- Multiple Formats: Support for various audio and video formats
API Endpoints Used
- /v2/transcript - Submit transcription jobs
- /v2/transcript/:id - Get transcription status and results
- /v2/transcript/:id/vtt - Get VTT captions
- /v2/transcript/:id/sentences - Get semantically segmented sentences
Configuration
# AssemblyAI client configuration
AssemblyaiClient.new(
  api_key: ENV['ASSEMBLYAI_API_KEY'],
  base_url: 'https://api.assemblyai.com/v2'
)
AssemblyAI LLM Gateway Integration
The system uses AssemblyAI's LLM Gateway for caption polishing, paragraph generation, and translations.
Configuration
All prompts are stored in Setting and editable via the CRM settings page:
| Setting | Purpose |
|---|---|
| video_processing_llm_model | LLM model (default: claude-sonnet-4-5-20250929) |
| video_processing_llm_max_tokens | Max tokens (default: 8000) |
| video_processing_llm_temperature | Temperature (default: 0.2) |
| video_processing_polish_system_prompt | Caption polishing system prompt |
| video_processing_polish_user_prompt | Caption polishing user prompt |
| video_processing_paragraph_system_prompt | Paragraph generation system prompt |
| video_processing_paragraph_user_prompt | Paragraph generation user prompt |
| video_processing_translate_system_prompt | Translation system prompt |
| video_processing_translate_user_prompt | Translation user prompt |
| transcription_spelling_corrections | Shared terminology corrections |
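As an illustration of how these settings might be combined into a request, the sketch below reads each value with a fallback to the documented default. A plain hash stands in for Setting, the helper names are hypothetical, and the chat-style payload shape is an assumption rather than a confirmed description of the gateway's API.

```ruby
# Defaults taken from the table above; the lookup hash stands in for Setting.
LLM_DEFAULTS = {
  'video_processing_llm_model' => 'claude-sonnet-4-5-20250929',
  'video_processing_llm_max_tokens' => 8000,
  'video_processing_llm_temperature' => 0.2
}.freeze

# Read a setting, falling back to the documented default when unset or blank.
def llm_setting(settings, key)
  value = settings[key]
  return LLM_DEFAULTS.fetch(key) if value.nil? || value.to_s.strip.empty?

  value
end

# Assemble a chat-style request body. The payload shape is an assumption,
# not a confirmed description of the gateway's API.
def build_llm_request(settings, system_prompt:, user_prompt:)
  {
    model: llm_setting(settings, 'video_processing_llm_model'),
    max_tokens: llm_setting(settings, 'video_processing_llm_max_tokens'),
    temperature: llm_setting(settings, 'video_processing_llm_temperature'),
    messages: [
      { role: 'system', content: system_prompt },
      { role: 'user', content: user_prompt }
    ]
  }
end
```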
Usage
# Caption polishing via LLM Gateway
transcription_service = VideoProcessing::TranscriptionService.new(video)
transcription_service.polish_vtt_text(vtt_original) # Returns polished VTT array
# Translation via LLM Gateway
translation_service = VideoProcessing::VideoTranslationService.new(video)
translation_service.translate_vtt_to_locale('fr-CA', 1, 3) # French-Canadian
# Paragraph generation via LLM Gateway
transcription_service.generate_paragraphs_from_polished_text(vtt_polished)
OpenAI Integration
The system uses RubyLLM (configured for OpenAI GPT-4o) for SEO content generation only.
Configuration
The API key is retrieved from Heatwave::Configuration.fetch(:openai, :api_key) and the SEO prompt template is stored in the database via Setting.video_processing_seo_prompt, making it editable through the admin UI.
Usage
# SEO content generation
seo_service = VideoProcessing::SeoService.new(video)
seo_content = seo_service.generate_seo_content
# Returns: { 'status' => 'success', 'sub_header' => '...', 'meta_title' => '...',
# 'meta_description' => '...', 'expanded_description' => '...' }
# Called automatically during transcription workflow
transcription_service = VideoProcessing::TranscriptionService.new(video)
transcription_service.
Features
- Structured JSON Output: Uses GPT-4o with response_format: { type: 'json_object' } for reliable parsing
- Database-Driven Prompts: Editable prompt templates stored in settings
- Character Limit Validation: Automatic validation against SEO best practices
- Context Preservation: Incorporates existing metadata and video title for consistency
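Even with JSON-object output, the reply still needs to be parsed and checked before fields are written to the video. The sketch below shows one way to do that. The success shape mirrors the return value shown in the usage example; the error shape and the names SEO_KEYS and parse_seo_response are assumptions for illustration.

```ruby
require 'json'

SEO_KEYS = %w[sub_header meta_title meta_description expanded_description].freeze

# Parse the model's JSON reply and check that every expected key is present.
# Returns a hash with 'status' => 'success' or 'status' => 'error'.
def parse_seo_response(raw)
  data = JSON.parse(raw)
  return { 'status' => 'error', 'error' => 'expected a JSON object' } unless data.is_a?(Hash)

  missing = SEO_KEYS.reject { |key| data[key].is_a?(String) && !data[key].strip.empty? }
  return { 'status' => 'error', 'error' => "missing keys: #{missing.join(', ')}" } unless missing.empty?

  data.slice(*SEO_KEYS).merge('status' => 'success')
rescue JSON::ParserError => e
  { 'status' => 'error', 'error' => "invalid JSON: #{e.message}" }
end
```

Returning a status hash instead of raising lets a caller decide whether a bad reply should fail the job or simply skip the SEO update.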
UI Components
Video Player Component
The system includes a reusable video player component for consistent video playback across the application.
Transcript Display
The transcript interface provides:
- Structured Data Panels: Separate panels for original VTT, polished VTT, sentences, and paragraphs
- Download Options: Direct download links for VTT files and structured data
- HTML Preview: Formatted transcript display for video pages
- Status Indicators: Real-time status updates for transcription progress
Transcription Options Interface
The transcription options page allows users to:
- Select Processing Steps: Choose which transcription steps to execute
- Configure Settings: Set speaker detection and other parameters
- Monitor Progress: Track job status and completion
- View Results: Access generated transcripts and metadata
Troubleshooting
Common Issues
Transcription Failures
- Check AssemblyAI Status: Verify transcription is completed in AssemblyAI dashboard
- Audio Extraction: Ensure video has audio track and extraction was successful
- API Limits: Check AssemblyAI API usage and limits
- File Format: Verify video format is supported by AssemblyAI
VTT Generation Issues
- Structured Data: Ensure video has structured transcript JSON data
- Polished VTT: Check that polished VTT data exists for generation
- Timing Data: Verify timing information is preserved in structured data
Background Job Issues
- Sidekiq Status: Check Sidekiq worker status and queue
- Job Logs: Review worker logs for error details
- Memory Usage: Monitor system resources during processing
Debug Commands
# Check video transcription status
bundle exec rake video:transcription:stats
# Test VTT generation for specific video
bundle exec rake video:vtt:test_generation[VIDEO_ID]
# Process specific video step by step
bundle exec rake video:vtt:retrieve_transcript[VIDEO_ID]
bundle exec rake video:vtt:polish_transcript[VIDEO_ID]
bundle exec rake video:vtt:summarize_video[VIDEO_ID]
Log Analysis
Key log entries to monitor:
- VideoProcessing::TranscriptionService - Transcription service operations
- VideoTranscriptionWorker - Background job processing
- AssemblyaiClient - API interaction logs
- TranscriptionPolisherService - Text correction operations
Best Practices
Performance Optimization
- Batch Processing: Use background jobs for large-scale transcription
- Caching: Cache generated VTT content for frequently accessed videos
- Resource Management: Monitor API usage and system resources
- Error Handling: Implement robust error handling and retry logic
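For the retry logic mentioned above, a generic helper with exponential backoff might look like the sketch below. It is not taken from the codebase: with_retries and its parameters are hypothetical, and the sleeper argument exists only so the delays can be replaced in tests.

```ruby
# Retry a block with exponential backoff. `sleeper` is injectable so the
# delays can be skipped in tests; all names here are hypothetical.
def with_retries(attempts: 3, base_delay: 0.5, sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    raise if tries >= attempts

    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end
```

A helper like this suits short-lived failures inside a single API call. Sidekiq also retries failed jobs by default, so the two levels should be tuned together to avoid multiplying attempts.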
Data Management
- Structured Storage: Use JSONB for flexible structured transcript storage
- Backup Strategy: Regular backups of transcription data
- Cleanup: Remove temporary files and unused uploads
- Validation: Validate transcription quality and completeness
Security Considerations
- API Keys: Secure storage of AssemblyAI and OpenAI API keys
- Access Control: Proper authorization for transcription operations
- Data Privacy: Ensure compliance with data protection regulations
- Audit Logging: Track transcription operations for security monitoring
Future Enhancements
Completed Features (December 2025)
- Multi-language Support: Caption translation to French (Quebec), Spanish (Mexico), Polish via LLM Gateway
- AI-Powered Polishing: LeMUR-based caption polishing with context awareness
- Unified AI Pipeline: Single AssemblyAI integration for transcription + AI processing
Planned Features
- Advanced Analytics: Detailed transcription analytics and insights
- Custom Models: Training custom transcription models for domain-specific content
- Real-time Processing: Live transcription for streaming content
Integration Opportunities
- Content Management: Integration with CMS for automated content updates
- Search Optimization: Enhanced search capabilities using transcript data
- Accessibility: Improved accessibility features using transcript data
- Analytics: Advanced analytics and reporting capabilities