Video System Documentation
This document describes the video system, including upload, transcription, processing, and management features.
Table of Contents
- Overview
- Video Upload Process
- Video Transcription System
- VTT Generation
- Rake Tasks
- API Integration
- UI Components
- Troubleshooting
Overview
The video system provides comprehensive video management capabilities including:
- Video Upload: Direct creator uploads via Cloudflare Stream
- Transcription: High-quality transcription with AssemblyAI
- VTT Generation: Dynamic caption generation from structured transcripts
- SEO Optimization: Automated metadata generation
- Background Processing: Scalable job processing with Sidekiq
Key Components
- Video Model: Core data model with structured transcript JSON storage
- VideoProcessing::TranscriptionService: Main transcription orchestration
- VideoProcessing::VideoTranslationService: Caption translation to FR/ES/PL
- TranscriptionPolisherService: Fallback regexp-based text corrections
- VideoProcessing::SeoService: AI-powered SEO content generation
- AssemblyaiClient: AssemblyAI API integration (transcription + LLM Gateway)
- VideoTranscriptionWorker: Background job processing
AI Processing via AssemblyAI LLM Gateway
As of December 2025, AI text processing (caption polishing, paragraph generation, and translation) uses AssemblyAI’s LLM Gateway, while SEO generation remains on OpenAI:
| Task | Previously | Now |
|---|---|---|
| Caption Polishing | Regex only | LeMUR (Claude) + regex fallback |
| Paragraph Generation | OpenAI GPT-4 | LLM Gateway (Claude) |
| Translation | DeepL API | LLM Gateway (Claude) |
| SEO Generation | OpenAI GPT-4 | OpenAI GPT-4o (unchanged) |
This consolidation provides:
- Consistent quality: Same AI model for all text processing
- Context awareness: LLM understands caption timing and flow
- Better translations: Context-aware, preserves brand names
- Simpler architecture: Single API for most AI tasks
Video Upload Process
Overview
The video upload process uses Cloudflare Stream’s direct creator upload feature, providing a seamless experience from Uppy to Heatwave to Cloudflare.
Sequence Diagram
The upload process follows this sequence:
- Uppy Initialization: Client-side uploader setup
- Heatwave Processing: Server-side video processing
- Cloudflare Storage: Final video storage and streaming
Useful Links
Video Transcription System
Overview
The video transcription system provides high-quality transcription with speaker diarization, timestamps, and SEO content generation using AssemblyAI.
Architecture
Services
- VideoProcessing::TranscriptionService - Core transcription service with granular methods
- TranscriptionPolisherService - Regexp-based text corrections and company terminology
- VideoProcessing::SeoService - SEO content generation using RubyLLM (OpenAI)
- AssemblyaiClient - Client for interacting with AssemblyAI API
- VideoTranscriptionWorker - Background job for comprehensive transcription workflow
Service Responsibilities
- AudioExtractionService: Pure audio extraction from file paths (reusable, testable)
- VideoProcessing::AudioExtractionService: Video-specific audio extraction with upload storage
- VideoProcessing::TranscriptionService: Core transcription logic with granular methods
- TranscriptionPolisherService: Fast, reliable regexp-based text corrections
- VideoProcessing::SeoService: Generates SEO content using RubyLLM (OpenAI GPT-4o with structured JSON output)
- AssemblyaiClient: Handles all AssemblyAI API interactions
- VideoTranscriptionWorker: Background job orchestrator with progress tracking
Three-Step Workflow
Step 1: Retrieve Original VTT and Sentences from AssemblyAI
- Downloads raw VTT captions from AssemblyAI’s /v2/transcript/:transcript_id/vtt endpoint
- Retrieves semantically segmented sentences from the /v2/transcript/:transcript_id/sentences endpoint
- Stores data as vtt_original and sentences in structured_transcript_json
- Ensures transcription status is completed before proceeding
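The retrieval step above can be sketched with Ruby's standard library. This is a minimal illustration, not the AssemblyaiClient implementation: the helper names (build_transcript_request, perform_request, ready_for_retrieval?) are hypothetical, and the only details taken from this document are the two endpoint paths and the completed status check.

```ruby
require 'net/http'
require 'uri'

ASSEMBLYAI_BASE_URL = 'https://api.assemblyai.com/v2'

# Builds (but does not send) a GET request for a transcript or one of its
# sub-resources. resource is nil for the transcript itself, or 'vtt' / 'sentences'.
def build_transcript_request(transcript_id, api_key:, resource: nil)
  url = [ASSEMBLYAI_BASE_URL, 'transcript', transcript_id, resource].compact.join('/')
  request = Net::HTTP::Get.new(URI(url))
  request['authorization'] = api_key
  request
end

# Sends a request built above and returns the raw response body.
def perform_request(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }.body
end

# Step 1 should only proceed once the transcript has finished processing.
def ready_for_retrieval?(transcript)
  transcript['status'] == 'completed'
end
```

In this sketch the VTT body and the sentences JSON would each be fetched with one request and then stored under vtt_original and sentences.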
Step 2: Polish Transcript and Generate Paragraphs
- Uses AssemblyAI LLM Gateway (Claude) for AI-powered polishing:
  - Company terminology corrections (e.g., “Warmly Yours” → “WarmlyYours”)
  - Grammar, punctuation, and typo fixes
  - Context-aware corrections that understand caption flow
- Falls back to TranscriptionPolisherService (regex) if LLM fails
- Stores polished data as vtt_polished in structured_transcript_json
- Uses LLM Gateway to generate natural paragraphs from polished text
- Creates HTML transcript for video page display
- Saves HTML transcript to video.transcript field
Prompts are configurable via Settings:
- video_processing_polish_system_prompt
- video_processing_polish_user_prompt
- video_processing_paragraph_system_prompt
- video_processing_paragraph_user_prompt
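The LLM-first, regex-fallback behavior in Step 2 can be sketched as follows. This is an illustration under assumptions, not the actual service code: polish_cues, regex_polish, and the CORRECTIONS table are hypothetical names, cues are assumed to be hashes with :start, :end, and :text, and the LLM is represented by any callable so the example runs without network access.

```ruby
# Hypothetical correction table; per this document the real corrections
# live in the transcription_spelling_corrections setting.
CORRECTIONS = { /\bwarmly\s+yours\b/i => 'WarmlyYours' }.freeze

# Applies every regex correction to a single string.
def regex_polish(text)
  CORRECTIONS.reduce(text) { |acc, (pattern, replacement)| acc.gsub(pattern, replacement) }
end

# cues: array of hashes with :start, :end, :text
# llm:  any callable that takes the cue array and returns polished cues
def polish_cues(cues, llm:)
  polished = llm.call(cues)
  # Reject LLM output that dropped or merged cues, since timing would be lost.
  raise ArgumentError, 'cue count changed' unless polished.size == cues.size
  polished
rescue StandardError
  # Fallback: keep the original timing and apply regex corrections only.
  cues.map { |cue| cue.merge(text: regex_polish(cue[:text])) }
end
```

Keeping the fallback inside the same method means callers always receive a cue array with the original timestamps, whichever path produced the text.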
Step 3: Generate SEO Metadata
- Uses AI to create SEO-friendly content from transcript:
  - meta_title (50-60 characters)
  - meta_description (150-160 characters)
  - sub_header (100-150 characters)
  - expanded_description (200-300 words)
- Updates video model fields directly
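The length targets listed in Step 3 lend themselves to a simple check before the fields are saved. The sketch below is hypothetical (SEO_LIMITS and seo_limit_violations are not names from the codebase); it only encodes the ranges stated above and reports which fields fall outside them.

```ruby
# Target ranges from Step 3; characters for short fields, words for the long one.
SEO_LIMITS = {
  'meta_title'           => { unit: :chars, range: 50..60 },
  'meta_description'     => { unit: :chars, range: 150..160 },
  'sub_header'           => { unit: :chars, range: 100..150 },
  'expanded_description' => { unit: :words, range: 200..300 }
}.freeze

# Returns a hash of field name => measured size for every field out of range.
# An empty hash means all fields meet their targets.
def seo_limit_violations(content)
  SEO_LIMITS.each_with_object({}) do |(field, rule), violations|
    value = content[field].to_s
    size = rule[:unit] == :words ? value.split.size : value.length
    violations[field] = size unless rule[:range].cover?(size)
  end
end
```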
Features
Transcription Options Interface
The system provides a granular transcription options interface that allows users to:
- Select specific steps: Choose which parts of the transcription workflow to execute
- Configure speaker detection: Set the expected number of speakers (1-10) for improved accuracy, or use “Auto Detect” for automatic speaker detection
- Conditional execution: Skip steps that have already been completed
- Progress tracking: Monitor job progress with detailed status updates
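As a rough sketch of how these options might translate into code: the example below builds AssemblyAI request parameters from a speaker setting and filters the requested steps against those already completed. The function names and the step symbols (:retrieve, :polish, :seo) are hypothetical. speaker_labels and speakers_expected are AssemblyAI request parameters; how the worker actually passes them is an assumption.

```ruby
VALID_SPEAKER_COUNTS = (1..10)
WORKFLOW_STEPS = %i[retrieve polish seo].freeze

# Builds transcription request parameters.
# speakers: :auto lets AssemblyAI detect the count; an Integer from 1 to 10 sets it.
def transcription_params(audio_url, speakers: :auto)
  params = { audio_url: audio_url, speaker_labels: true }
  return params if speakers == :auto

  unless speakers.is_a?(Integer) && VALID_SPEAKER_COUNTS.cover?(speakers)
    raise ArgumentError, "speakers must be :auto or an integer from 1 to 10, got #{speakers.inspect}"
  end
  params.merge(speakers_expected: speakers)
end

# Returns the requested steps in workflow order, skipping completed ones.
def steps_to_run(requested, completed)
  (WORKFLOW_STEPS & requested) - completed
end
```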
Speaker Diarization
- Automatic speaker detection: Identifies different speakers in the audio
- Speaker labeling: Labels speakers as “Speaker A”, “Speaker B”, etc.
- Configurable speaker count: Users can specify expected number of speakers (1-10) for improved accuracy
- Speaker statistics: Calculates talk time and word count for each speaker
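The speaker statistics can be derived directly from the utterance list. The sketch below assumes utterances shaped like AssemblyAI's (string keys speaker, start, end, and text, with times in milliseconds); speaker_stats is a hypothetical helper name, not a method from the service.

```ruby
# Sums talk time (ms) and word count per speaker label.
def speaker_stats(utterances)
  totals = Hash.new { |hash, speaker| hash[speaker] = { talk_time_ms: 0, word_count: 0 } }
  utterances.each_with_object(totals) do |utterance, stats|
    entry = stats[utterance['speaker']]
    entry[:talk_time_ms] += utterance['end'] - utterance['start']
    entry[:word_count] += utterance['text'].split.size
  end
end
```

Word count here is a simple whitespace split, so punctuation attached to a word does not add to the total.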
Structured Data
The service retrieves and stores complete transcript data from AssemblyAI, including:

    {
      "id": "transcript_id",
      "status": "completed",
      "confidence": 0.946,
      "audio_duration": 483.2,
      "utterances": [
        {
          "confidence": 0.98,
          "end": 5000,
          "speaker": "A",
          "start": 0,
          "text": "Hello, welcome to our video."
        }
      ]
    }

Usage Examples
Basic Transcription

    # Initialize service
    transcription_service = VideoProcessing::TranscriptionService.new(video)

    # Extract audio and submit for transcription
    transcription_service.extract_audio
    transcription_service.submit_transcription

    # Retrieve and process transcript
    transcription_service.retrieve_and_overwrite_structured_transcript
    transcription_service.polish_transcript_with_company_terminology
    transcription_service.summarize_video_and_update_metadata

Background Processing

    # Queue transcription job
    VideoTranscriptionWorker.perform_async(video.id, options)

    # Run the job synchronously in the current process (useful for debugging)
    VideoTranscriptionWorker.new.perform(video.id, options)

VTT Generation
Overview
The system generates VTT (WebVTT) caption files dynamically from the polished structured transcript JSON instead of storing them as uploads. This ensures that captions contain the same corrections and improvements applied to the transcript text.
Problem Solved
Previously, VTT files were retrieved directly from AssemblyAI using raw transcript data and stored as uploads. However, the structured transcript JSON goes through a polishing process that:
- Fixes grammar and spelling mistakes
- Corrects company terminology (e.g., “Warmly Yours” → “WarmlyYours”)
- Improves sentence structure and readability
The raw VTT file didn’t include these corrections, creating a mismatch between transcript and captions.
Solution
The new system generates VTT captions dynamically on-demand from the structured transcript JSON, ensuring that:
- Captions match the polished transcript exactly
- Timing information is preserved from the original structured data
- Company terminology corrections are applied consistently
- VTT files are always current and don’t require regeneration
Implementation
Key Methods
- VideoProcessing::TranscriptionService#generate_vtt_content_from_structured_transcript: Generates VTT content from structured transcript JSON
- VideoProcessing::TranscriptionService#generate_vtt_content_from_polished_vtt: Creates VTT content from polished VTT data
- VideosController#download_vtt: Controller action that generates and serves VTT files
Caption Formatting
The system creates captions with:
- Timing: Preserves original start/end timestamps from polished VTT data
- Text: Uses polished text with company terminology corrections
- VTT format: Standard WebVTT format with proper timestamps
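The caption formatting rules above can be sketched in a few lines. This is not the body of generate_vtt_content_from_polished_vtt; build_vtt and vtt_timestamp are hypothetical helpers, and the cue shape (hashes with :start and :end in integer milliseconds plus :text) is an assumption about the polished VTT data.

```ruby
# Converts integer milliseconds to a WebVTT timestamp (HH:MM:SS.mmm).
def vtt_timestamp(milliseconds)
  hours, remainder = milliseconds.divmod(3_600_000)
  minutes, remainder = remainder.divmod(60_000)
  seconds, millis = remainder.divmod(1000)
  format('%02d:%02d:%02d.%03d', hours, minutes, seconds, millis)
end

# Builds a WebVTT document: header, then numbered cues separated by blank lines.
def build_vtt(cues)
  blocks = cues.each_with_index.map do |cue, index|
    "#{index + 1}\n#{vtt_timestamp(cue[:start])} --> #{vtt_timestamp(cue[:end])}\n#{cue[:text]}"
  end
  (['WEBVTT'] + blocks).join("\n\n") + "\n"
end
```

Because the cue text is taken as-is, any terminology corrections applied during polishing carry through to the captions unchanged.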
Example VTT Output

    WEBVTT

    1
    00:00:00.000 --> 00:00:05.000
    Hello, welcome to our video about floor heating systems.

    2
    00:00:05.000 --> 00:00:10.000
    Today we will discuss the benefits of radiant floor heating.

For New Transcriptions
VTT captions are generated dynamically from polished data when requested.
For Existing Videos
VTT captions are automatically available for any video with structured transcript JSON data containing polished VTT.
Download VTT Files
- Navigate to the video show page
- Go to the “Transcript” tab
- Click “Download Original VTT” or “Download Polished VTT” in the respective panels
Programmatically

    # Generate VTT content for a specific video
    video = Video.find(video_id)
    service = VideoProcessing::TranscriptionService.new(video)
    vtt_content = service.generate_vtt_content_from_structured_transcript

    # Download VTT file via controller action
    # GET /videos/:id/download_vtt?type=original
    # GET /videos/:id/download_vtt?type=polished

Rake Tasks
Overview
All video-related rake tasks are consolidated in lib/tasks/video.rake for easy management and organization.
Available Tasks
Transcription Tasks
- video:transcription:process[VIDEO_ID] - Process transcription for a specific video
- video:transcription:process_all - Process all videos without transcripts
- video:transcription:process_by_category[CAT] - Process videos by category
- video:transcription:process_with_limit[LIMIT] - Process videos with limit
- video:transcription:stats - Show transcription statistics
VTT Processing Tasks
- video:vtt:retrieve_transcript[VIDEO_ID] - Step 1: Retrieve from AssemblyAI
- video:vtt:polish_transcript[VIDEO_ID] - Step 2: Polish with terminology
- video:vtt:summarize_video[VIDEO_ID] - Step 3: Generate metadata
- video:vtt:test_processing[VIDEO_ID] - Test full workflow
- video:vtt:process_all - Process all VTT
- video:vtt:extract_and_transcribe - Extract audio & submit for transcription
- video:vtt:list_available - List videos with structured data
- video:vtt:test_generation[VIDEO_ID] - Test VTT generation
General Tasks
- video:stats - Show comprehensive statistics
- video:help - Show help message with all available tasks
Usage Examples

    # Show all available tasks
    bundle exec rake video:help

    # Process transcription for specific video
    bundle exec rake video:transcription:process[12345]

    # Extract audio and submit for transcription
    bundle exec rake video:vtt:extract_and_transcribe

    # Show video statistics
    bundle exec rake video:stats

API Integration
AssemblyAI Integration
The system integrates with AssemblyAI for high-quality transcription services.
Key Features
- High Accuracy: Advanced speech recognition with 95%+ accuracy
- Speaker Diarization: Automatic speaker identification and labeling
- Timestamps: Precise word-level and segment-level timing
- Multiple Formats: Support for various audio and video formats
API Endpoints Used
- /v2/transcript - Submit transcription jobs
- /v2/transcript/:id - Get transcription status and results
- /v2/transcript/:id/vtt - Get VTT captions
- /v2/transcript/:id/sentences - Get semantically segmented sentences
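Since transcription is asynchronous, a caller typically submits a job and then polls /v2/transcript/:id until the status settles. The sketch below shows one way to structure that loop. wait_for_transcript and TranscriptionFailed are hypothetical names, and the status fetch is injected as a callable so the loop can be exercised without network access. The completed and error values are AssemblyAI transcript statuses; the defaults for interval and max_attempts are arbitrary.

```ruby
require 'timeout'

class TranscriptionFailed < StandardError; end

# Polls until the transcript is completed, failed, or the attempt budget runs out.
# fetch_status: callable returning the parsed transcript hash
# sleeper:      callable used to wait between polls (injectable for tests)
def wait_for_transcript(fetch_status:, interval: 5, max_attempts: 120, sleeper: ->(s) { sleep(s) })
  max_attempts.times do
    transcript = fetch_status.call
    case transcript['status']
    when 'completed' then return transcript
    when 'error' then raise TranscriptionFailed, transcript['error'].to_s
    end
    sleeper.call(interval)
  end
  raise Timeout::Error, "transcript not ready after #{max_attempts} attempts"
end
```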
Configuration

    # AssemblyAI client configuration
    AssemblyaiClient.new(
      api_key: ENV['ASSEMBLYAI_API_KEY'],
      base_url: 'https://api.assemblyai.com/v2'
    )

AssemblyAI LLM Gateway Integration
The system uses AssemblyAI’s LLM Gateway for caption polishing, paragraph generation, and translations.
Configuration
All prompts are stored in Setting and editable via the CRM settings page:
| Setting | Purpose |
|---|---|
| video_processing_llm_model | LLM model (default: claude-sonnet-4-5-20250929) |
| video_processing_llm_max_tokens | Max tokens (default: 8000) |
| video_processing_llm_temperature | Temperature (default: 0.2) |
| video_processing_polish_system_prompt | Caption polishing system prompt |
| video_processing_polish_user_prompt | Caption polishing user prompt |
| video_processing_paragraph_system_prompt | Paragraph generation system prompt |
| video_processing_paragraph_user_prompt | Paragraph generation user prompt |
| video_processing_translate_system_prompt | Translation system prompt |
| video_processing_translate_user_prompt | Translation user prompt |
| transcription_spelling_corrections | Shared terminology corrections |
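To illustrate how these settings could combine into a single request, the sketch below merges the documented defaults with stored values and produces a chat-style payload. Everything about the payload shape is an assumption (a messages array with system and user roles, and the input text appended to the user prompt); build_chat_payload is a hypothetical helper, and only the setting names and default values come from the table above.

```ruby
DEFAULT_LLM_SETTINGS = {
  'video_processing_llm_model' => 'claude-sonnet-4-5-20250929',
  'video_processing_llm_max_tokens' => 8000,
  'video_processing_llm_temperature' => 0.2
}.freeze

# settings: hash of setting name => value (stored values override defaults)
# task:     'polish', 'paragraph', or 'translate'
# input:    the caption or transcript text to process
def build_chat_payload(settings, task:, input:)
  merged = DEFAULT_LLM_SETTINGS.merge(settings)
  # fetch raises KeyError if a prompt for the task has not been configured.
  system_prompt = merged.fetch("video_processing_#{task}_system_prompt")
  user_prompt = merged.fetch("video_processing_#{task}_user_prompt")
  {
    model: merged['video_processing_llm_model'],
    max_tokens: merged['video_processing_llm_max_tokens'],
    temperature: merged['video_processing_llm_temperature'],
    messages: [
      { role: 'system', content: system_prompt },
      { role: 'user', content: "#{user_prompt}\n\n#{input}" }
    ]
  }
end
```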

    # Caption polishing via LLM Gateway
    transcription_service = VideoProcessing::TranscriptionService.new(video)
    transcription_service.polish_vtt_text(vtt_original) # Returns polished VTT array

    # Translation via LLM Gateway
    translation_service = VideoProcessing::VideoTranslationService.new(video)
    translation_service.translate_vtt_to_locale('fr-CA', 1, 3) # French-Canadian

    # Paragraph generation via LLM Gateway
    transcription_service.generate_paragraphs_from_polished_text(vtt_polished)

OpenAI Integration
The system uses RubyLLM (configured for OpenAI GPT-4o) for SEO content generation only.
Configuration
The API key is retrieved from Heatwave::Configuration.fetch(:openai, :api_key) and the SEO prompt template is stored in the database via Setting.video_processing_seo_prompt, making it editable through the admin UI.

    # SEO content generation
    seo_service = VideoProcessing::SeoService.new(video)
    seo_content = seo_service.generate_seo_content
    # Returns: { 'status' => 'success', 'sub_header' => '...', 'meta_title' => '...',
    #            'meta_description' => '...', 'expanded_description' => '...' }

    # Called automatically during transcription workflow
    transcription_service = VideoProcessing::TranscriptionService.new(video)
    transcription_service.summarize_video_and_update_metadata

Features
- Structured JSON Output: Uses GPT-4o with response_format: { type: 'json_object' } for reliable parsing
- Database-Driven Prompts: Editable prompt templates stored in settings
- Character Limit Validation: Automatic validation against SEO best practices
- Context Preservation: Incorporates existing metadata and video title for consistency
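Even with JSON-mode output, it is worth parsing the model response defensively. The sketch below is a hypothetical helper (parse_seo_response is not part of SeoService) that turns a raw JSON string into the result hash shown earlier. The 'success' status and the four field names come from this document; the 'error' status and 'message' key are assumptions made for illustration.

```ruby
require 'json'

SEO_FIELDS = %w[sub_header meta_title meta_description expanded_description].freeze

# Parses the raw LLM response and checks that every SEO field is a non-blank string.
def parse_seo_response(raw)
  data = JSON.parse(raw)
  return { 'status' => 'error', 'message' => 'expected a JSON object' } unless data.is_a?(Hash)

  missing = SEO_FIELDS.reject { |field| data[field].is_a?(String) && !data[field].strip.empty? }
  unless missing.empty?
    return { 'status' => 'error', 'message' => "missing fields: #{missing.join(', ')}" }
  end

  data.slice(*SEO_FIELDS).merge('status' => 'success')
rescue JSON::ParserError => e
  { 'status' => 'error', 'message' => "invalid JSON: #{e.message}" }
end
```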
UI Components
Video Player Component
The system includes a reusable video player component for consistent video playback across the application.
Transcript Display
The transcript interface provides:
- Structured Data Panels: Separate panels for original VTT, polished VTT, sentences, and paragraphs
- Download Options: Direct download links for VTT files and structured data
- HTML Preview: Formatted transcript display for video pages
- Status Indicators: Real-time status updates for transcription progress
Transcription Options Interface
The transcription options page allows users to:
- Select Processing Steps: Choose which transcription steps to execute
- Configure Settings: Set speaker detection and other parameters
- Monitor Progress: Track job status and completion
- View Results: Access generated transcripts and metadata
Troubleshooting
Common Issues
Transcription Failures
- Check AssemblyAI Status: Verify transcription is completed in AssemblyAI dashboard
- Audio Extraction: Ensure video has audio track and extraction was successful
- API Limits: Check AssemblyAI API usage and limits
- File Format: Verify video format is supported by AssemblyAI
VTT Generation Issues
- Structured Data: Ensure video has structured transcript JSON data
- Polished VTT: Check that polished VTT data exists for generation
- Timing Data: Verify timing information is preserved in structured data
Background Job Issues
- Sidekiq Status: Check Sidekiq worker status and queue
- Job Logs: Review worker logs for error details
- Memory Usage: Monitor system resources during processing
Debug Commands

    # Check video transcription status
    bundle exec rake video:transcription:stats

    # Test VTT generation for specific video
    bundle exec rake video:vtt:test_generation[VIDEO_ID]

    # Process specific video step by step
    bundle exec rake video:vtt:retrieve_transcript[VIDEO_ID]
    bundle exec rake video:vtt:polish_transcript[VIDEO_ID]
    bundle exec rake video:vtt:summarize_video[VIDEO_ID]

Log Analysis
Key log entries to monitor:
- VideoProcessing::TranscriptionService - Transcription service operations
- VideoTranscriptionWorker - Background job processing
- AssemblyaiClient - API interaction logs
- TranscriptionPolisherService - Text correction operations
Best Practices
Performance Optimization
- Batch Processing: Use background jobs for large-scale transcription
- Caching: Cache generated VTT content for frequently accessed videos
- Resource Management: Monitor API usage and system resources
- Error Handling: Implement robust error handling and retry logic
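For the retry logic mentioned above, one common approach is exponential backoff around individual API calls. The helper below is a generic sketch (with_retries is a hypothetical name, and the defaults are arbitrary). Sidekiq already retries failed jobs with its own backoff, so a wrapper like this is mainly useful for transient errors inside a single job step.

```ruby
# Runs the block, retrying on the listed errors with exponentially growing delays.
# The block receives the current attempt number (starting at 1).
def with_retries(max_attempts: 3, base_delay: 1.0, retry_on: [StandardError], sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield(attempt)
  rescue *retry_on
    # Re-raise the original error once the attempt budget is exhausted.
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A call such as with_retries(max_attempts: 5) { fetch_transcript } would wait 1, 2, 4, and 8 seconds between attempts, where fetch_transcript stands in for any operation that may fail transiently.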
Data Management
- Structured Storage: Use JSONB for flexible structured transcript storage
- Backup Strategy: Regular backups of transcription data
- Cleanup: Remove temporary files and unused uploads
- Validation: Validate transcription quality and completeness
Security Considerations
- API Keys: Secure storage of AssemblyAI and OpenAI API keys
- Access Control: Proper authorization for transcription operations
- Data Privacy: Ensure compliance with data protection regulations
- Audit Logging: Track transcription operations for security monitoring
Future Enhancements
Completed Features (December 2025)
- Multi-language Support: Caption translation to French (Quebec), Spanish (Mexico), Polish via LLM Gateway
- AI-Powered Polishing: LeMUR-based caption polishing with context awareness
- Unified AI Pipeline: Single AssemblyAI integration for transcription + AI processing
Planned Features
- Advanced Analytics: Detailed transcription analytics and insights
- Custom Models: Training custom transcription models for domain-specific content
- Real-time Processing: Live transcription for streaming content
Integration Opportunities
- Content Management: Integration with CMS for automated content updates
- Search Optimization: Enhanced search capabilities using transcript data
- Accessibility: Improved accessibility features using transcript data
- Analytics: Advanced analytics and reporting capabilities