Video System Documentation

This document describes the video system: upload, transcription, processing, and management.

The video system provides the following capabilities:

  • Video Upload: Direct creator uploads via Cloudflare Stream
  • Transcription: High-quality transcription with AssemblyAI
  • VTT Generation: Dynamic caption generation from structured transcripts
  • SEO Optimization: Automated metadata generation
  • Background Processing: Scalable job processing with Sidekiq

Core components:

  • Video Model: Core data model with structured transcript JSON storage
  • VideoProcessing::TranscriptionService: Main transcription orchestration
  • VideoProcessing::VideoTranslationService: Caption translation to FR/ES/PL
  • TranscriptionPolisherService: Fallback regexp-based text corrections
  • VideoProcessing::SeoService: AI-powered SEO content generation
  • AssemblyaiClient: AssemblyAI API integration (transcription + LLM Gateway)
  • VideoTranscriptionWorker: Background job processing

As of December 2025, all AI text processing except SEO generation uses AssemblyAI’s LLM Gateway:

| Task | Previously | Now |
| --- | --- | --- |
| Caption Polishing | Regex only | LeMUR (Claude) + regex fallback |
| Paragraph Generation | OpenAI GPT-4 | LLM Gateway (Claude) |
| Translation | DeepL API | LLM Gateway (Claude) |
| SEO Generation | OpenAI GPT-4 | OpenAI GPT-4o (unchanged) |

This consolidation provides:

  • Consistent quality: Same AI model for all text processing
  • Context awareness: LLM understands caption timing and flow
  • Better translations: Context-aware, preserves brand names
  • Simpler architecture: Single API for most AI tasks

The video upload process uses Cloudflare Stream’s direct creator upload feature. Files move from the Uppy client, through Heatwave, to Cloudflare.

The upload process follows this sequence:

  1. Uppy Initialization: Client-side uploader setup
  2. Heatwave Processing: Server-side video processing
  3. Cloudflare Storage: Final video storage and streaming
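
As a rough illustration of the server-side step, the sketch below builds the request that asks Cloudflare Stream for a one-time direct creator upload URL. It only constructs the request and does not send it. The method name, `account_id`, `api_token`, and the one-hour duration cap are placeholders for this example, not values taken from Heatwave, and the actual integration may use a different upload flow.

```ruby
require "json"
require "net/http"
require "uri"

# Builds (but does not send) a Cloudflare Stream direct creator upload request.
# All argument values are placeholders for this sketch.
def build_direct_upload_request(account_id:, api_token:, max_duration_seconds: 3600)
  uri = URI("https://api.cloudflare.com/client/v4/accounts/#{account_id}/stream/direct_upload")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{api_token}"
  request["Content-Type"] = "application/json"
  # Cloudflare caps the upload at this many seconds of video.
  request.body = JSON.generate(maxDurationSeconds: max_duration_seconds)
  request
end

request = build_direct_upload_request(account_id: "ACCOUNT_ID", api_token: "TOKEN")
puts request.path
puts request.body
```

The response to such a request carries an upload URL that the browser-side uploader can post the file to, so the video bytes need not pass through the application server.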

Video Upload Process


The video transcription system provides high-quality transcription with speaker diarization, timestamps, and SEO content generation using AssemblyAI.

  1. VideoProcessing::TranscriptionService - Core transcription service with granular methods
  2. TranscriptionPolisherService - Regexp-based text corrections and company terminology
  3. VideoProcessing::SeoService - SEO content generation using RubyLLM (OpenAI)
  4. AssemblyaiClient - Client for interacting with AssemblyAI API
  5. VideoTranscriptionWorker - Background job for comprehensive transcription workflow

Each service has a focused responsibility:

  • AudioExtractionService: Pure audio extraction from file paths (reusable, testable)
  • VideoProcessing::AudioExtractionService: Video-specific audio extraction with upload storage
  • VideoProcessing::TranscriptionService: Core transcription logic with granular methods
  • TranscriptionPolisherService: Fast, reliable regexp-based text corrections
  • VideoProcessing::SeoService: Generates SEO content using RubyLLM (OpenAI GPT-4o with structured JSON output)
  • AssemblyaiClient: Handles all AssemblyAI API interactions
  • VideoTranscriptionWorker: Background job orchestrator with progress tracking

Step 1: Retrieve Original VTT and Sentences from AssemblyAI

  • Downloads raw VTT captions from AssemblyAI’s /v2/transcript/:transcript_id/vtt endpoint
  • Retrieves semantically segmented sentences from /v2/transcript/:transcript_id/sentences endpoint
  • Stores data as vtt_original and sentences in structured_transcript_json
  • Ensures transcription status is completed before proceeding
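
A minimal sketch of Step 1, assuming the transcript ID and status are already known. The helper names are hypothetical and the HTTP calls are left out. It shows only how the two endpoint URLs are formed and how the completed-status check gates storage.

```ruby
require "uri"

ASSEMBLYAI_BASE_URL = "https://api.assemblyai.com/v2"

# Builds the URI for a transcript sub-resource such as "vtt" or "sentences".
def transcript_resource_uri(transcript_id, resource)
  URI("#{ASSEMBLYAI_BASE_URL}/transcript/#{transcript_id}/#{resource}")
end

# Merges the Step 1 payloads into the structured transcript hash, refusing to
# proceed unless AssemblyAI reports the transcript as completed.
def store_original_transcript(structured, status:, vtt:, sentences:)
  raise ArgumentError, "transcript is #{status}, expected completed" unless status == "completed"

  structured.merge("vtt_original" => vtt, "sentences" => sentences)
end

puts transcript_resource_uri("abc123", "vtt")
puts transcript_resource_uri("abc123", "sentences")
```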

Step 2: Polish Transcript and Generate Paragraphs

  • Uses AssemblyAI LLM Gateway (Claude) for AI-powered polishing:
    • Company terminology corrections (e.g., “Warmly Yours” → “WarmlyYours”)
    • Grammar, punctuation, and typo fixes
    • Context-aware corrections that understand caption flow
    • Falls back to TranscriptionPolisherService (regex) if LLM fails
  • Stores polished data as vtt_polished in structured_transcript_json
  • Uses LLM Gateway to generate natural paragraphs from polished text
  • Creates HTML transcript for video page display
  • Saves HTML transcript to video.transcript field
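
The LLM-first, regex-fallback behavior can be sketched as follows. This is an illustration only: `polish_with_fallback`, `regex_polish`, and the one-entry correction table are hypothetical stand-ins, and the LLM call is modeled as any callable so the example runs without network access.

```ruby
# Hypothetical correction table for this sketch; the document stores shared
# corrections in the transcription_spelling_corrections setting.
TERMINOLOGY_CORRECTIONS = {
  /\bWarmly\s+Yours\b/i => "WarmlyYours"
}.freeze

# Applies every regex correction in turn.
def regex_polish(text)
  TERMINOLOGY_CORRECTIONS.reduce(text) do |polished, (pattern, replacement)|
    polished.gsub(pattern, replacement)
  end
end

# Tries the LLM first; falls back to regex corrections if the call raises
# or returns a blank result.
def polish_with_fallback(text, llm:)
  polished = llm.call(text)
  polished.to_s.strip.empty? ? regex_polish(text) : polished
rescue StandardError
  regex_polish(text)
end

failing_llm = ->(_text) { raise IOError, "gateway timeout" }
puts polish_with_fallback("Welcome to Warmly Yours.", llm: failing_llm)
```

Treating a blank response the same as an exception keeps an empty LLM reply from overwriting usable captions.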

Prompts are configurable via Settings:

  • video_processing_polish_system_prompt
  • video_processing_polish_user_prompt
  • video_processing_paragraph_system_prompt
  • video_processing_paragraph_user_prompt

Step 3: Generate SEO Metadata

  • Uses AI to create SEO-friendly content from transcript:
    • meta_title (50-60 characters)
    • meta_description (150-160 characters)
    • sub_header (100-150 characters)
    • expanded_description (200-300 words)
  • Updates video model fields directly
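
The length targets above lend themselves to a simple check. The sketch below is one possible validation, not the service's actual code: `seo_limit_violations` is a hypothetical helper that returns the names of fields falling outside the ranges listed in this step.

```ruby
SEO_CHARACTER_LIMITS = {
  "meta_title" => 50..60,
  "meta_description" => 150..160,
  "sub_header" => 100..150
}.freeze
EXPANDED_DESCRIPTION_WORDS = 200..300

# Returns the names of fields whose length falls outside the target range.
# Character counts apply to the short fields, word count to the description.
def seo_limit_violations(content)
  violations = SEO_CHARACTER_LIMITS.reject do |field, range|
    range.cover?(content.fetch(field, "").length)
  end.keys
  words = content.fetch("expanded_description", "").split.size
  violations << "expanded_description" unless EXPANDED_DESCRIPTION_WORDS.cover?(words)
  violations
end

draft = { "meta_title" => "Too short" }
p seo_limit_violations(draft)
```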

The system provides a granular transcription options interface that allows users to:

  • Select specific steps: Choose which parts of the transcription workflow to execute
  • Configure speaker detection: Set the expected number of speakers (1-10) for improved accuracy, or use “Auto Detect” for automatic speaker detection
  • Conditional execution: Skip steps that have already been completed
  • Progress tracking: Monitor job progress with detailed status updates

Speaker diarization provides:

  • Automatic speaker detection: Identifies different speakers in the audio
  • Speaker labeling: Labels speakers as “Speaker A”, “Speaker B”, etc.
  • Configurable speaker count: Users can specify expected number of speakers (1-10) for improved accuracy
  • Speaker statistics: Calculates talk time and word count for each speaker
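
Per-speaker statistics can be derived directly from diarized utterances. The sketch below assumes each utterance is a hash with `speaker`, `start`, `end` (milliseconds), and `text` keys, as in AssemblyAI's response format. The method name is hypothetical.

```ruby
# Sums talk time (ms) and word count per speaker label.
def speaker_statistics(utterances)
  stats = Hash.new { |hash, speaker| hash[speaker] = { talk_time_ms: 0, word_count: 0 } }
  utterances.each do |utterance|
    entry = stats[utterance["speaker"]]
    entry[:talk_time_ms] += utterance["end"] - utterance["start"]
    entry[:word_count] += utterance["text"].split.size
  end
  stats
end

utterances = [
  { "speaker" => "A", "start" => 0, "end" => 5000, "text" => "Hello, welcome to our video." },
  { "speaker" => "B", "start" => 5000, "end" => 8000, "text" => "Thanks for having me." }
]
p speaker_statistics(utterances)
```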

The service retrieves and stores complete transcript data from AssemblyAI, including:

{
  "id": "transcript_id",
  "status": "completed",
  "confidence": 0.946,
  "audio_duration": 483.2,
  "utterances": [
    {
      "confidence": 0.98,
      "end": 5000,
      "speaker": "A",
      "start": 0,
      "text": "Hello, welcome to our video."
    }
  ]
}

# Initialize the service for a Video record
transcription_service = VideoProcessing::TranscriptionService.new(video)

# Extract audio and submit it for transcription
transcription_service.extract_audio
transcription_service.submit_transcription

# Retrieve and process the transcript
transcription_service.retrieve_and_overwrite_structured_transcript
transcription_service.polish_transcript_with_company_terminology
transcription_service.summarize_video_and_update_metadata

# Queue the full workflow as a background job
VideoTranscriptionWorker.perform_async(video.id, options)

# Or run the job inline (for example, from a console) instead of queueing it
VideoTranscriptionWorker.new.perform(video.id, options)

The system generates VTT (WebVTT) caption files dynamically from the polished structured transcript JSON instead of storing them as uploads. This ensures that captions contain the same corrections and improvements applied to the transcript text.

Previously, VTT files were retrieved directly from AssemblyAI using raw transcript data and stored as uploads. However, the structured transcript JSON goes through a polishing process that:

  1. Fixes grammar and spelling mistakes
  2. Corrects company terminology (e.g., “Warmly Yours” → “WarmlyYours”)
  3. Improves sentence structure and readability

The raw VTT file didn’t include these corrections, creating a mismatch between transcript and captions.

The new system generates VTT captions dynamically on-demand from the structured transcript JSON, ensuring that:

  • Captions match the polished transcript exactly
  • Timing information is preserved from the original structured data
  • Company terminology corrections are applied consistently
  • VTT files are always current and don’t require regeneration

The relevant entry points are:

  • VideoProcessing::TranscriptionService#generate_vtt_content_from_structured_transcript: Generates VTT content from structured transcript JSON
  • VideoProcessing::TranscriptionService#generate_vtt_content_from_polished_vtt: Creates VTT content from polished VTT data
  • VideosController#download_vtt: Controller action that generates and serves VTT files

The system creates captions with:

  • Timing: Preserves original start/end timestamps from polished VTT data
  • Text: Uses polished text with company terminology corrections
  • VTT format: Standard WebVTT format with proper timestamps

WEBVTT

1
00:00:00.000 --> 00:00:05.000
Hello, welcome to our video about floor heating systems.

2
00:00:05.000 --> 00:00:10.000
Today we will discuss the benefits of radiant floor heating.
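
Generating WebVTT from timed cues comes down to formatting timestamps and joining cue blocks with blank lines. The sketch below assumes each cue is a hash with `:start` and `:end` in milliseconds plus `:text`. That shape and the method names are assumptions for illustration, not the actual layout of the polished VTT data.

```ruby
# Formats milliseconds as a WebVTT timestamp (HH:MM:SS.mmm).
def vtt_timestamp(milliseconds)
  hours, remainder = milliseconds.divmod(3_600_000)
  minutes, remainder = remainder.divmod(60_000)
  seconds, millis = remainder.divmod(1000)
  format("%02d:%02d:%02d.%03d", hours, minutes, seconds, millis)
end

# Builds a WebVTT document: header, then numbered cues separated by blank lines.
def build_vtt(cues)
  blocks = cues.each_with_index.map do |cue, index|
    "#{index + 1}\n#{vtt_timestamp(cue[:start])} --> #{vtt_timestamp(cue[:end])}\n#{cue[:text]}"
  end
  (["WEBVTT"] + blocks).join("\n\n") + "\n"
end

puts build_vtt([
  { start: 0, end: 5000, text: "Hello, welcome to our video about floor heating systems." },
  { start: 5000, end: 10_000, text: "Today we will discuss the benefits of radiant floor heating." }
])
```

Because the text comes from the polished cues while the timestamps pass through untouched, captions built this way keep the original timing.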

VTT captions are generated dynamically from polished data when requested.

VTT captions are automatically available for any video with structured transcript JSON data containing polished VTT.

  1. Navigate to the video show page
  2. Go to the “Transcript” tab
  3. Click “Download Original VTT” or “Download Polished VTT” in the respective panels

# Generate VTT content for a specific video
video = Video.find(video_id)
service = VideoProcessing::TranscriptionService.new(video)
vtt_content = service.generate_vtt_content_from_structured_transcript

# Download VTT file via controller action
# GET /videos/:id/download_vtt?type=original
# GET /videos/:id/download_vtt?type=polished

All video-related rake tasks are consolidated in lib/tasks/video.rake for easy management and organization.

  • video:transcription:process[VIDEO_ID] - Process transcription for specific video
  • video:transcription:process_all - Process all videos without transcripts
  • video:transcription:process_by_category[CAT] - Process videos by category
  • video:transcription:process_with_limit[LIMIT] - Process videos with limit
  • video:transcription:stats - Show transcription statistics
  • video:vtt:retrieve_transcript[VIDEO_ID] - Step 1: Retrieve from AssemblyAI
  • video:vtt:polish_transcript[VIDEO_ID] - Step 2: Polish with terminology
  • video:vtt:summarize_video[VIDEO_ID] - Step 3: Generate metadata
  • video:vtt:test_processing[VIDEO_ID] - Test full workflow
  • video:vtt:process_all - Process all VTT
  • video:vtt:extract_and_transcribe - Extract audio & submit for transcription
  • video:vtt:list_available - List videos with structured data
  • video:vtt:test_generation[VIDEO_ID] - Test VTT generation
  • video:stats - Show comprehensive statistics
  • video:help - Show help message with all available tasks
# Show all available tasks
bundle exec rake video:help
# Process transcription for specific video
bundle exec rake video:transcription:process[12345]
# Extract audio and submit for transcription
bundle exec rake video:vtt:extract_and_transcribe
# Show video statistics
bundle exec rake video:stats

The system integrates with AssemblyAI for high-quality transcription services.

  • High Accuracy: Advanced speech recognition with 95%+ accuracy
  • Speaker Diarization: Automatic speaker identification and labeling
  • Timestamps: Precise word-level and segment-level timing
  • Multiple Formats: Support for various audio and video formats

The client uses these endpoints:

  • /v2/transcript - Submit transcription jobs
  • /v2/transcript/:id - Get transcription status and results
  • /v2/transcript/:id/vtt - Get VTT captions
  • /v2/transcript/:id/sentences - Get semantically segmented sentences

# AssemblyAI client configuration
AssemblyaiClient.new(
  api_key: ENV['ASSEMBLYAI_API_KEY'],
  base_url: 'https://api.assemblyai.com/v2'
)
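
For orientation, a transcription submission to `/v2/transcript` might be built as in the sketch below. This is not the `AssemblyaiClient` implementation: the method name is hypothetical, the request is constructed but never sent, and the example assumes that "Auto Detect" corresponds to omitting `speakers_expected`.

```ruby
require "json"
require "net/http"
require "uri"

# Builds (but does not send) a transcription job request with speaker
# diarization enabled. Passing speakers_expected: nil leaves detection automatic.
def build_transcript_request(api_key:, audio_url:, speakers_expected: nil)
  uri = URI("https://api.assemblyai.com/v2/transcript")
  payload = { audio_url: audio_url, speaker_labels: true }
  payload[:speakers_expected] = speakers_expected if speakers_expected

  request = Net::HTTP::Post.new(uri)
  request["authorization"] = api_key
  request["content-type"] = "application/json"
  request.body = JSON.generate(payload)
  request
end

request = build_transcript_request(
  api_key: "YOUR_API_KEY",
  audio_url: "https://example.com/audio.mp3",
  speakers_expected: 2
)
puts request.body
```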

The system uses AssemblyAI’s LLM Gateway for caption polishing, paragraph generation, and translations.

All prompts and model parameters are stored in Setting and are editable via the CRM settings page:

| Setting | Purpose |
| --- | --- |
| video_processing_llm_model | LLM model (default: claude-sonnet-4-5-20250929) |
| video_processing_llm_max_tokens | Max tokens (default: 8000) |
| video_processing_llm_temperature | Temperature (default: 0.2) |
| video_processing_polish_system_prompt | Caption polishing system prompt |
| video_processing_polish_user_prompt | Caption polishing user prompt |
| video_processing_paragraph_system_prompt | Paragraph generation system prompt |
| video_processing_paragraph_user_prompt | Paragraph generation user prompt |
| video_processing_translate_system_prompt | Translation system prompt |
| video_processing_translate_user_prompt | Translation user prompt |
| transcription_spelling_corrections | Shared terminology corrections |

# Caption polishing via LLM Gateway
transcription_service = VideoProcessing::TranscriptionService.new(video)
transcription_service.polish_vtt_text(vtt_original) # Returns polished VTT array

# Translation via LLM Gateway
translation_service = VideoProcessing::VideoTranslationService.new(video)
translation_service.translate_vtt_to_locale('fr-CA', 1, 3) # French-Canadian

# Paragraph generation via LLM Gateway
transcription_service.generate_paragraphs_from_polished_text(vtt_polished)
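
To show how the settings above could feed a request, the sketch below merges stored values over the documented defaults and assembles a chat-style payload. The plain hash stands in for `Setting`, `build_llm_payload` is a hypothetical name, and the system/user message layout is an assumption about the gateway's request format rather than something this document specifies.

```ruby
# Defaults as listed in the settings table.
LLM_DEFAULTS = {
  "video_processing_llm_model" => "claude-sonnet-4-5-20250929",
  "video_processing_llm_max_tokens" => 8000,
  "video_processing_llm_temperature" => 0.2
}.freeze

# Builds a chat-style payload; stored settings override the defaults.
def build_llm_payload(settings, system_prompt:, user_prompt:)
  config = LLM_DEFAULTS.merge(settings)
  {
    model: config["video_processing_llm_model"],
    max_tokens: config["video_processing_llm_max_tokens"],
    temperature: config["video_processing_llm_temperature"],
    messages: [
      { role: "system", content: system_prompt },
      { role: "user", content: user_prompt }
    ]
  }
end

payload = build_llm_payload(
  { "video_processing_llm_temperature" => 0.0 },
  system_prompt: "You polish video captions.",
  user_prompt: "Polish these captions: ..."
)
p payload
```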

The system uses RubyLLM (configured for OpenAI GPT-4o) for SEO content generation only.

The API key is retrieved from Heatwave::Configuration.fetch(:openai, :api_key) and the SEO prompt template is stored in the database via Setting.video_processing_seo_prompt, making it editable through the admin UI.

# SEO content generation
seo_service = VideoProcessing::SeoService.new(video)
seo_content = seo_service.generate_seo_content
# Returns: { 'status' => 'success', 'sub_header' => '...', 'meta_title' => '...',
# 'meta_description' => '...', 'expanded_description' => '...' }
# Called automatically during transcription workflow
transcription_service = VideoProcessing::TranscriptionService.new(video)
transcription_service.summarize_video_and_update_metadata

Key characteristics of the SEO service:

  • Structured JSON Output: Uses GPT-4o with response_format: { type: 'json_object' } for reliable parsing
  • Database-Driven Prompts: Editable prompt templates stored in settings
  • Character Limit Validation: Automatic validation against SEO best practices
  • Context Preservation: Incorporates existing metadata and video title for consistency

The system includes a reusable video player component for consistent video playback across the application.

The transcript interface provides:

  • Structured Data Panels: Separate panels for original VTT, polished VTT, sentences, and paragraphs
  • Download Options: Direct download links for VTT files and structured data
  • HTML Preview: Formatted transcript display for video pages
  • Status Indicators: Real-time status updates for transcription progress

The transcription options page allows users to:

  • Select Processing Steps: Choose which transcription steps to execute
  • Configure Settings: Set speaker detection and other parameters
  • Monitor Progress: Track job status and completion
  • View Results: Access generated transcripts and metadata

Transcription failures:

  1. Check AssemblyAI Status: Verify transcription is completed in AssemblyAI dashboard
  2. Audio Extraction: Ensure video has audio track and extraction was successful
  3. API Limits: Check AssemblyAI API usage and limits
  4. File Format: Verify video format is supported by AssemblyAI

VTT generation issues:

  1. Structured Data: Ensure video has structured transcript JSON data
  2. Polished VTT: Check that polished VTT data exists for generation
  3. Timing Data: Verify timing information is preserved in structured data

Background job issues:

  1. Sidekiq Status: Check Sidekiq worker status and queue
  2. Job Logs: Review worker logs for error details
  3. Memory Usage: Monitor system resources during processing

# Check video transcription status
bundle exec rake video:transcription:stats
# Test VTT generation for specific video
bundle exec rake video:vtt:test_generation[VIDEO_ID]
# Process specific video step by step
bundle exec rake video:vtt:retrieve_transcript[VIDEO_ID]
bundle exec rake video:vtt:polish_transcript[VIDEO_ID]
bundle exec rake video:vtt:summarize_video[VIDEO_ID]

Key log entries to monitor:

  • VideoProcessing::TranscriptionService - Transcription service operations
  • VideoTranscriptionWorker - Background job processing
  • AssemblyaiClient - API interaction logs
  • TranscriptionPolisherService - Text correction operations

Performance:

  1. Batch Processing: Use background jobs for large-scale transcription
  2. Caching: Cache generated VTT content for frequently accessed videos
  3. Resource Management: Monitor API usage and system resources
  4. Error Handling: Implement robust error handling and retry logic

Data management:

  1. Structured Storage: Use JSONB for flexible structured transcript storage
  2. Backup Strategy: Regular backups of transcription data
  3. Cleanup: Remove temporary files and unused uploads
  4. Validation: Validate transcription quality and completeness

Security:

  1. API Keys: Secure storage of AssemblyAI and OpenAI API keys
  2. Access Control: Proper authorization for transcription operations
  3. Data Privacy: Ensure compliance with data protection regulations
  4. Audit Logging: Track transcription operations for security monitoring
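
The retry advice under Error Handling can be illustrated with a small exponential-backoff wrapper for individual API calls. This is a generic sketch, not code from the system: `with_retries` and its parameters are hypothetical, and Sidekiq already retries failed jobs on its own, so a wrapper like this would only matter for retrying a single call within a job.

```ruby
# Runs the block, retrying on StandardError with exponentially growing delays.
# The sleeper is injectable so the wait can be skipped or recorded in tests.
def with_retries(max_attempts: 3, base_delay: 1.0, sleeper: Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    # Give up and re-raise the last error once attempts are exhausted.
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

result = with_retries(base_delay: 0.01) do |attempt|
  raise "transient failure" if attempt < 2

  "succeeded on attempt #{attempt}"
end
puts result
```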

Recently completed:

  1. Multi-language Support: Caption translation to French (Quebec), Spanish (Mexico), Polish via LLM Gateway
  2. AI-Powered Polishing: LeMUR-based caption polishing with context awareness
  3. Unified AI Pipeline: Single AssemblyAI integration for transcription + AI processing

Planned enhancements:

  1. Advanced Analytics: Detailed transcription analytics and insights
  2. Custom Models: Training custom transcription models for domain-specific content
  3. Real-time Processing: Live transcription for streaming content

Integration opportunities:

  1. Content Management: Integration with CMS for automated content updates
  2. Search Optimization: Enhanced search capabilities using transcript data
  3. Accessibility: Improved accessibility features using transcript data
  4. Analytics: Advanced analytics and reporting capabilities