Video System Documentation

This document describes the video system: upload, transcription, processing, and management features.

Overview

The video system provides the following video management capabilities:

  • Video Upload: Direct creator uploads via Cloudflare Stream
  • Transcription: High-quality transcription with AssemblyAI
  • VTT Generation: Dynamic caption generation from structured transcripts
  • SEO Optimization: Automated metadata generation
  • Background Processing: Scalable job processing with Sidekiq

Key Components

  • Video Model: Core data model with structured transcript JSON storage
  • VideoProcessing::TranscriptionService: Main transcription orchestration
  • VideoProcessing::VideoTranslationService: Caption translation to FR/ES/PL
  • TranscriptionPolisherService: Fallback regexp-based text corrections
  • VideoProcessing::SeoService: AI-powered SEO content generation
  • AssemblyaiClient: AssemblyAI API integration (transcription + LLM Gateway)
  • VideoTranscriptionWorker: Background job processing

AI Processing via AssemblyAI LLM Gateway

As of December 2025, all AI text processing except SEO generation uses AssemblyAI's LLM Gateway (SEO generation remains on OpenAI):

Task                  Previously     Now
Caption Polishing     Regex only     LLM Gateway (Claude) + regex fallback
Paragraph Generation  OpenAI GPT-4   LLM Gateway (Claude)
Translation           DeepL API      LLM Gateway (Claude)
SEO Generation        OpenAI GPT-4   OpenAI GPT-4o (provider unchanged)

This consolidation provides:

  • Consistent quality: Same AI model for polishing, paragraph generation, and translation
  • Context awareness: LLM understands caption timing and flow
  • Better translations: Context-aware, preserves brand names
  • Simpler architecture: Single API for most AI tasks

Video Upload Process

Overview

The video upload process uses Cloudflare Stream's direct creator upload feature. An upload flows from the Uppy client, through Heatwave, to Cloudflare.

Sequence Diagram

The upload process follows this sequence:

  1. Uppy Initialization: Client-side uploader setup
  2. Heatwave Processing: Server-side video processing
  3. Cloudflare Storage: Final video storage and streaming

Video Transcription System

Overview

The video transcription system uses AssemblyAI for high-quality transcription with speaker diarization and timestamps, and generates SEO content from the resulting transcript.

Architecture

Services

  1. VideoProcessing::TranscriptionService - Core transcription service with granular methods
  2. TranscriptionPolisherService - Regexp-based text corrections and company terminology
  3. VideoProcessing::SeoService - SEO content generation using RubyLLM (OpenAI)
  4. AssemblyaiClient - Client for interacting with AssemblyAI API
  5. VideoTranscriptionWorker - Background job for comprehensive transcription workflow

Service Responsibilities

  • AudioExtractionService: Pure audio extraction from file paths (reusable, testable)
  • VideoProcessing::AudioExtractionService: Video-specific audio extraction with upload storage
  • VideoProcessing::TranscriptionService: Core transcription logic with granular methods
  • TranscriptionPolisherService: Fast, reliable regexp-based text corrections
  • VideoProcessing::SeoService: Generates SEO content using RubyLLM (OpenAI GPT-4o with structured JSON output)
  • AssemblyaiClient: Handles all AssemblyAI API interactions
  • VideoTranscriptionWorker: Background job orchestrator with progress tracking

Three-Step Workflow

Step 1: Retrieve Original VTT and Sentences from AssemblyAI

  • Downloads raw VTT captions from AssemblyAI's /v2/transcript/:transcript_id/vtt endpoint
  • Retrieves semantically segmented sentences from /v2/transcript/:transcript_id/sentences endpoint
  • Stores data as vtt_original and sentences in structured_transcript_json
  • Ensures transcription status is completed before proceeding
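
As a rough illustration of Step 1, the sketch below builds the two GET requests with Ruby's standard Net::HTTP and guards on the transcript status. The helper names (transcript_request, ensure_completed!, perform_request) are hypothetical and are not taken from AssemblyaiClient; the endpoint paths are the ones listed above, and the API key is assumed to be sent in the Authorization header.

```ruby
require "net/http"
require "uri"

ASSEMBLYAI_HOST = "https://api.assemblyai.com"

# Build an authenticated GET request for a transcript sub-resource
# ("vtt" or "sentences"). Nothing is sent over the network here.
# Hypothetical helper, not part of AssemblyaiClient.
def transcript_request(transcript_id, resource, api_key)
  uri = URI.join(ASSEMBLYAI_HOST, "/v2/transcript/#{transcript_id}/#{resource}")
  request = Net::HTTP::Get.new(uri)
  request["Authorization"] = api_key
  [uri, request]
end

# Guard for Step 1: only a completed transcript should be processed.
def ensure_completed!(transcript)
  status = transcript["status"]
  unless status == "completed"
    raise ArgumentError, "transcript not ready (status: #{status.inspect})"
  end
  transcript
end

# Send a prepared request over HTTPS (real network call; needs a valid key).
def perform_request(uri, request)
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
end
```

Keeping request construction separate from the network call makes the URL and header logic easy to test without contacting AssemblyAI.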

Step 2: Polish Transcript and Generate Paragraphs

  • Uses AssemblyAI LLM Gateway (Claude) for AI-powered polishing:
    • Company terminology corrections (e.g., "Warmly Yours" → "WarmlyYours")
    • Grammar, punctuation, and typo fixes
    • Context-aware corrections that understand caption flow
    • Falls back to TranscriptionPolisherService (regex) if LLM fails
  • Stores polished data as vtt_polished in structured_transcript_json
  • Uses LLM Gateway to generate natural paragraphs from polished text
  • Creates HTML transcript for video page display
  • Saves HTML transcript to video.transcript field
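
The LLM-first, regex-fallback behaviour described in Step 2 can be sketched roughly as follows. This is a minimal illustration, not the actual service code: the cue structure (hashes with :start, :end, :text), the CORRECTIONS table, and the llm callable are all assumptions made for the example.

```ruby
# Hypothetical correction table; the real entries are assumed to come from
# the transcription_spelling_corrections setting.
CORRECTIONS = {
  /\bwarmly\s+yours\b/i => "WarmlyYours"
}.freeze

# Regex fallback: apply every correction to one caption text.
def regex_polish(text)
  CORRECTIONS.reduce(text) do |result, (pattern, replacement)|
    result.gsub(pattern, replacement)
  end
end

# Polish cue texts with the LLM callable. If the call raises, or returns a
# different number of texts than cues, fall back to the regex corrections.
# Timing fields are never modified.
def polish_cues(cues, llm:)
  texts = llm.call(cues.map { |cue| cue[:text] })
  unless texts.is_a?(Array) && texts.size == cues.size
    raise ArgumentError, "cue count mismatch"
  end
  cues.zip(texts).map { |cue, text| cue.merge(text: text) }
rescue StandardError
  cues.map { |cue| cue.merge(text: regex_polish(cue[:text])) }
end
```

Checking that the LLM returns exactly one text per cue is one way to keep captions aligned with their original timestamps.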

Prompts are configurable via Settings:

  • video_processing_polish_system_prompt
  • video_processing_polish_user_prompt
  • video_processing_paragraph_system_prompt
  • video_processing_paragraph_user_prompt

Step 3: Generate SEO Metadata

  • Uses OpenAI GPT-4o (via RubyLLM) to create SEO-friendly content from the transcript:
    • meta_title (50-60 characters)
    • meta_description (150-160 characters)
    • sub_header (100-150 characters)
    • expanded_description (200-300 words)
  • Updates video model fields directly
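
The length targets above lend themselves to a simple check before the fields are saved. The following is a minimal sketch under the assumption that the generated content arrives as a string-keyed hash; SEO_LIMITS and seo_violations are hypothetical names, not part of VideoProcessing::SeoService.

```ruby
# Target ranges taken from the list above; word counts apply only to
# expanded_description.
SEO_LIMITS = {
  "meta_title"           => { unit: :chars, range: 50..60 },
  "meta_description"     => { unit: :chars, range: 150..160 },
  "sub_header"           => { unit: :chars, range: 100..150 },
  "expanded_description" => { unit: :words, range: 200..300 }
}.freeze

# Return a human-readable message for every field outside its target range.
# An empty array means all fields are within limits.
def seo_violations(content)
  SEO_LIMITS.filter_map do |field, rule|
    value = content[field].to_s
    size = rule[:unit] == :words ? value.split.size : value.length
    next if rule[:range].cover?(size)

    "#{field}: #{size} #{rule[:unit]} (expected #{rule[:range]})"
  end
end
```

A caller could log the violations, or ask the model to regenerate only the offending fields.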

Features

Transcription Options Interface

The system provides a granular transcription options interface that allows users to:

  • Select specific steps: Choose which parts of the transcription workflow to execute
  • Configure speaker detection: Set the expected number of speakers (1-10) for improved accuracy, or use "Auto Detect" for automatic speaker detection
  • Conditional execution: Skip steps that have already been completed
  • Progress tracking: Monitor job progress with detailed status updates
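
Step selection with conditional execution could look roughly like this. The step names (:retrieve, :polish, :seo) and the force flag are assumptions made for the example; the actual option keys accepted by the worker are not documented here.

```ruby
# Hypothetical step identifiers, in workflow order.
WORKFLOW_STEPS = %i[retrieve polish seo].freeze

# Return the steps to execute, in workflow order: only steps the user
# selected, minus those already completed unless force is set.
def steps_to_run(selected, completed, force: false)
  WORKFLOW_STEPS.select do |step|
    selected.include?(step) && (force || !completed.include?(step))
  end
end
```

Filtering against WORKFLOW_STEPS keeps the execution order fixed regardless of the order in which the user ticked the options.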

Speaker Diarization

  • Automatic speaker detection: Identifies different speakers in the audio
  • Speaker labeling: Labels speakers as "Speaker A", "Speaker B", etc.
  • Configurable speaker count: Users can specify expected number of speakers (1-10) for improved accuracy
  • Speaker statistics: Calculates talk time and word count for each speaker
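
One way to map the speaker setting onto the transcription request is sketched below. speaker_labels and speakers_expected are AssemblyAI request parameters; the "auto" value standing in for the "Auto Detect" option and the speaker_params helper are assumptions made for the example.

```ruby
# Translate the UI choice into AssemblyAI request parameters.
# `choice` is either "auto" (hypothetical value for "Auto Detect") or an
# integer-like value from 1 to 10.
def speaker_params(choice)
  return { speaker_labels: true } if choice.to_s == "auto"

  count = Integer(choice) # raises ArgumentError for non-numeric input
  unless (1..10).cover?(count)
    raise ArgumentError, "speaker count must be between 1 and 10, got #{count}"
  end

  { speaker_labels: true, speakers_expected: count }
end
```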

Structured Data

The service retrieves and stores complete transcript data from AssemblyAI, including:

{
  "id": "transcript_id",
  "status": "completed",
  "confidence": 0.946,
  "audio_duration": 483.2,
  "utterances": [
    {
      "confidence": 0.98,
      "end": 5000,
      "speaker": "A",
      "start": 0,
      "text": "Hello, welcome to our video."
    }
  ]
}
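
Given utterances in the shape shown above, per-speaker talk time and word count can be derived with a single pass. This is a minimal sketch that assumes start and end are millisecond offsets and that each utterance belongs to one speaker; speaker_stats is a hypothetical helper name.

```ruby
# Aggregate talk time (ms) and word count per speaker label.
def speaker_stats(utterances)
  stats = Hash.new { |hash, key| hash[key] = { talk_time_ms: 0, word_count: 0 } }
  utterances.each do |utterance|
    entry = stats[utterance["speaker"]]
    entry[:talk_time_ms] += utterance["end"] - utterance["start"]
    entry[:word_count] += utterance["text"].split.size
  end
  stats
end
```

Words are counted by splitting on whitespace, which is an approximation; punctuation attached to a word does not affect the count.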

Usage Examples

Basic Transcription

# Initialize service
transcription_service = VideoProcessing::TranscriptionService.new(video)

# Extract audio and submit for transcription
transcription_service.extract_audio
transcription_service.submit_transcription

# Retrieve and process transcript
transcription_service.retrieve_and_overwrite_structured_transcript
transcription_service.polish_transcript_with_company_terminology
transcription_service.

Background Processing

# Queue transcription job
VideoTranscriptionWorker.perform_async(video.id, options)

# Run the job inline (synchronously), for example from a console while debugging
VideoTranscriptionWorker.new.perform(video.id, options)

VTT Generation

Overview

The system generates VTT (WebVTT) caption files dynamically from the polished structured transcript JSON instead of storing them as uploads. This ensures that captions contain the same corrections and improvements applied to the transcript text.

Problem Solved

Previously, VTT files were retrieved directly from AssemblyAI using raw transcript data and stored as uploads. However, the structured transcript JSON goes through a polishing process that:

  1. Fixes grammar and spelling mistakes
  2. Corrects company terminology (e.g., "Warmly Yours" → "WarmlyYours")
  3. Improves sentence structure and readability

The raw VTT file didn't include these corrections, creating a mismatch between transcript and captions.

Solution

The new system generates VTT captions dynamically on-demand from the structured transcript JSON, ensuring that:

  • Captions match the polished transcript exactly
  • Timing information is preserved from the original structured data
  • Company terminology corrections are applied consistently
  • VTT files are always current and don't require regeneration

Implementation

Key Methods

  • VideoProcessing::TranscriptionService#generate_vtt_content_from_structured_transcript: Generates VTT content from structured transcript JSON
  • VideoProcessing::TranscriptionService#generate_vtt_content_from_polished_vtt: Creates VTT content from polished VTT data
  • VideosController#download_vtt: Controller action that generates and serves VTT files

Caption Formatting

The system creates captions with:

  • Timing: Preserves original start/end timestamps from polished VTT data
  • Text: Uses polished text with company terminology corrections
  • VTT format: Standard WebVTT format with proper timestamps

Example VTT Output

WEBVTT

1
00:00:00.000 --> 00:00:05.000
Hello, welcome to our video about floor heating systems.

2
00:00:05.000 --> 00:00:10.000
Today we will discuss the benefits of radiant floor heating.
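
Output in this form can be produced from timed cues with a few lines of Ruby. The sketch below assumes each cue is a hash with :start and :end in milliseconds plus :text; the real structure of the polished VTT data inside structured_transcript_json may differ, and vtt_timestamp and build_vtt are hypothetical helper names.

```ruby
# Format a millisecond offset as a WebVTT timestamp (HH:MM:SS.mmm).
def vtt_timestamp(ms)
  hours, rem = ms.divmod(3_600_000)
  minutes, rem = rem.divmod(60_000)
  seconds, millis = rem.divmod(1000)
  format("%02d:%02d:%02d.%03d", hours, minutes, seconds, millis)
end

# Build a WebVTT document with numbered cues separated by blank lines.
def build_vtt(cues)
  blocks = cues.each_with_index.map do |cue, index|
    timing = "#{vtt_timestamp(cue[:start])} --> #{vtt_timestamp(cue[:end])}"
    "#{index + 1}\n#{timing}\n#{cue[:text]}"
  end
  (["WEBVTT"] + blocks).join("\n\n") + "\n"
end
```

Because the timestamps come straight from the stored cue offsets, regenerating the file after a text correction leaves the timing unchanged.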

Usage

For New Transcriptions

VTT captions are generated dynamically from polished data when requested.

For Existing Videos

VTT captions are automatically available for any video with structured transcript JSON data containing polished VTT.

Download VTT Files

  1. Navigate to the video show page
  2. Go to the "Transcript" tab
  3. Click "Download Original VTT" or "Download Polished VTT" in the respective panels

Programmatically

# Generate VTT content for a specific video
video = Video.find(video_id)
service = VideoProcessing::TranscriptionService.new(video)
vtt_content = service.generate_vtt_content_from_structured_transcript

# Download VTT file via controller action
# GET /videos/:id/download_vtt?type=original
# GET /videos/:id/download_vtt?type=polished

Rake Tasks

Overview

All video-related rake tasks are consolidated in lib/tasks/video.rake for easy management and organization.

Available Tasks

Transcription Tasks

  • video:transcription:process[VIDEO_ID] - Process transcription for specific video
  • video:transcription:process_all - Process all videos without transcripts
  • video:transcription:process_by_category[CAT] - Process videos by category
  • video:transcription:process_with_limit[LIMIT] - Process videos with limit
  • video:transcription:stats - Show transcription statistics

VTT Processing Tasks

  • video:vtt:retrieve_transcript[VIDEO_ID] - Step 1: Retrieve from AssemblyAI
  • video:vtt:polish_transcript[VIDEO_ID] - Step 2: Polish with terminology
  • video:vtt:summarize_video[VIDEO_ID] - Step 3: Generate metadata
  • video:vtt:test_processing[VIDEO_ID] - Test full workflow
  • video:vtt:process_all - Process all VTT
  • video:vtt:extract_and_transcribe - Extract audio & submit for transcription
  • video:vtt:list_available - List videos with structured data
  • video:vtt:test_generation[VIDEO_ID] - Test VTT generation

General Tasks

  • video:stats - Show comprehensive statistics
  • video:help - Show help message with all available tasks

Usage Examples

# Show all available tasks
bundle exec rake video:help

# Process transcription for specific video
bundle exec rake video:transcription:process[12345]

# Extract audio and submit for transcription
bundle exec rake video:vtt:extract_and_transcribe

# Show video statistics
bundle exec rake video:stats

API Integration

AssemblyAI Integration

The system integrates with AssemblyAI for high-quality transcription services.

Key Features

  • High Accuracy: Advanced speech recognition with 95%+ accuracy
  • Speaker Diarization: Automatic speaker identification and labeling
  • Timestamps: Precise word-level and segment-level timing
  • Multiple Formats: Support for various audio and video formats

API Endpoints Used

  • /v2/transcript - Submit transcription jobs
  • /v2/transcript/:id - Get transcription status and results
  • /v2/transcript/:id/vtt - Get VTT captions
  • /v2/transcript/:id/sentences - Get semantically segmented sentences

Configuration

# AssemblyAI client configuration
AssemblyaiClient.new(
  api_key: ENV['ASSEMBLYAI_API_KEY'],
  base_url: 'https://api.assemblyai.com/v2'
)

AssemblyAI LLM Gateway Integration

The system uses AssemblyAI's LLM Gateway for caption polishing, paragraph generation, and translations.

Configuration

The model parameters and all prompts are stored in Setting and editable via the CRM settings page:

Setting                                    Purpose
video_processing_llm_model                 LLM model (default: claude-sonnet-4-5-20250929)
video_processing_llm_max_tokens            Max tokens (default: 8000)
video_processing_llm_temperature           Temperature (default: 0.2)
video_processing_polish_system_prompt      Caption polishing system prompt
video_processing_polish_user_prompt        Caption polishing user prompt
video_processing_paragraph_system_prompt   Paragraph generation system prompt
video_processing_paragraph_user_prompt     Paragraph generation user prompt
video_processing_translate_system_prompt   Translation system prompt
video_processing_translate_user_prompt     Translation user prompt
transcription_spelling_corrections         Shared terminology corrections
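
To show how these settings might combine into a single request, the sketch below assembles a chat-style payload with the documented defaults. The payload shape (model, max_tokens, temperature, messages with system and user roles) is an assumption modelled on common chat-completion APIs; the request that AssemblyaiClient actually sends may differ, and chat_payload is a hypothetical helper.

```ruby
# Defaults as documented in the settings table.
LLM_DEFAULTS = {
  "video_processing_llm_model"       => "claude-sonnet-4-5-20250929",
  "video_processing_llm_max_tokens"  => 8000,
  "video_processing_llm_temperature" => 0.2
}.freeze

# Merge stored settings over the defaults and build a chat-style payload.
# Settings may arrive as strings, so numeric values are coerced.
def chat_payload(settings, system_prompt:, user_prompt:)
  merged = LLM_DEFAULTS.merge(settings.compact)
  {
    model: merged["video_processing_llm_model"],
    max_tokens: Integer(merged["video_processing_llm_max_tokens"]),
    temperature: Float(merged["video_processing_llm_temperature"]),
    messages: [
      { role: "system", content: system_prompt },
      { role: "user", content: user_prompt }
    ]
  }
end
```

Dropping nil settings before the merge means an unset value falls back to the documented default instead of failing coercion.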

Usage

# Caption polishing via LLM Gateway
transcription_service = VideoProcessing::TranscriptionService.new(video)
transcription_service.polish_vtt_text(vtt_original) # Returns polished VTT array

# Translation via LLM Gateway
translation_service = VideoProcessing::VideoTranslationService.new(video)
translation_service.translate_vtt_to_locale('fr-CA', 1, 3) # French-Canadian

# Paragraph generation via LLM Gateway
transcription_service.generate_paragraphs_from_polished_text(vtt_polished)

OpenAI Integration

The system uses RubyLLM (configured for OpenAI GPT-4o) for SEO content generation only.

Configuration

The API key is retrieved from Heatwave::Configuration.fetch(:openai, :api_key) and the SEO prompt template is stored in the database via Setting.video_processing_seo_prompt, making it editable through the admin UI.

Usage

# SEO content generation
seo_service = VideoProcessing::SeoService.new(video)
seo_content = seo_service.generate_seo_content
# Returns: { 'status' => 'success', 'sub_header' => '...', 'meta_title' => '...', 
#            'meta_description' => '...', 'expanded_description' => '...' }

# Called automatically during transcription workflow
transcription_service = VideoProcessing::TranscriptionService.new(video)
transcription_service.

Features

  • Structured JSON Output: Uses GPT-4o with response_format: { type: 'json_object' } for reliable parsing
  • Database-Driven Prompts: Editable prompt templates stored in settings
  • Character Limit Validation: Automatic validation against SEO best practices
  • Context Preservation: Incorporates existing metadata and video title for consistency
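
Even with JSON-object output, the response is worth validating before it is written to the video. The sketch below parses a raw response and checks that the four expected fields are present; the success shape mirrors the return value shown above, while the error shape and the parse_seo_response helper are assumptions made for the example.

```ruby
require "json"

SEO_KEYS = %w[sub_header meta_title meta_description expanded_description].freeze

# Parse the model's raw JSON text and verify that all expected fields are
# non-empty strings. Returns a hash with a "status" of "success" or "error".
def parse_seo_response(raw)
  data = JSON.parse(raw)
  return { "status" => "error", "error" => "expected a JSON object" } unless data.is_a?(Hash)

  missing = SEO_KEYS.reject { |key| data[key].is_a?(String) && !data[key].strip.empty? }
  unless missing.empty?
    return { "status" => "error", "error" => "missing fields: #{missing.join(', ')}" }
  end

  data.slice(*SEO_KEYS).merge("status" => "success")
rescue JSON::ParserError => e
  { "status" => "error", "error" => "invalid JSON: #{e.message}" }
end
```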

UI Components

Video Player Component

The system includes a reusable video player component for consistent video playback across the application.

Transcript Display

The transcript interface provides:

  • Structured Data Panels: Separate panels for original VTT, polished VTT, sentences, and paragraphs
  • Download Options: Direct download links for VTT files and structured data
  • HTML Preview: Formatted transcript display for video pages
  • Status Indicators: Real-time status updates for transcription progress

Transcription Options Interface

The transcription options page allows users to:

  • Select Processing Steps: Choose which transcription steps to execute
  • Configure Settings: Set speaker detection and other parameters
  • Monitor Progress: Track job status and completion
  • View Results: Access generated transcripts and metadata

Troubleshooting

Common Issues

Transcription Failures

  1. Check AssemblyAI Status: Verify transcription is completed in AssemblyAI dashboard
  2. Audio Extraction: Ensure video has audio track and extraction was successful
  3. API Limits: Check AssemblyAI API usage and limits
  4. File Format: Verify video format is supported by AssemblyAI

VTT Generation Issues

  1. Structured Data: Ensure video has structured transcript JSON data
  2. Polished VTT: Check that polished VTT data exists for generation
  3. Timing Data: Verify timing information is preserved in structured data

Background Job Issues

  1. Sidekiq Status: Check Sidekiq worker status and queue
  2. Job Logs: Review worker logs for error details
  3. Memory Usage: Monitor system resources during processing

Debug Commands

# Check video transcription status
bundle exec rake video:transcription:stats

# Test VTT generation for specific video
bundle exec rake video:vtt:test_generation[VIDEO_ID]

# Process specific video step by step
bundle exec rake video:vtt:retrieve_transcript[VIDEO_ID]
bundle exec rake video:vtt:polish_transcript[VIDEO_ID]
bundle exec rake video:vtt:summarize_video[VIDEO_ID]

Log Analysis

Key log entries to monitor:

  • VideoProcessing::TranscriptionService - Transcription service operations
  • VideoTranscriptionWorker - Background job processing
  • AssemblyaiClient - API interaction logs
  • TranscriptionPolisherService - Text correction operations

Best Practices

Performance Optimization

  1. Batch Processing: Use background jobs for large-scale transcription
  2. Caching: Cache generated VTT content for frequently accessed videos
  3. Resource Management: Monitor API usage and system resources
  4. Error Handling: Implement robust error handling and retry logic
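
For the retry logic mentioned in the last item, a small exponential-backoff helper is often sufficient. The following is a generic sketch, not code from this system: with_retries and its parameters are hypothetical, and a production version would normally rescue only errors known to be transient (timeouts, rate limits) instead of every StandardError.

```ruby
# Run the block, retrying on failure with exponentially growing delays
# (base_delay, 2 * base_delay, 4 * base_delay, ...). The last error is
# re-raised once `attempts` tries have been used. `sleeper` is injectable
# so the delays can be observed or skipped in tests.
def with_retries(attempts: 3, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    raise if tries >= attempts

    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end
```

Sidekiq also retries failed jobs on its own schedule, so a helper like this is best reserved for short-lived calls inside a single job step.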

Data Management

  1. Structured Storage: Use JSONB for flexible structured transcript storage
  2. Backup Strategy: Regular backups of transcription data
  3. Cleanup: Remove temporary files and unused uploads
  4. Validation: Validate transcription quality and completeness

Security Considerations

  1. API Keys: Secure storage of AssemblyAI and OpenAI API keys
  2. Access Control: Proper authorization for transcription operations
  3. Data Privacy: Ensure compliance with data protection regulations
  4. Audit Logging: Track transcription operations for security monitoring

Future Enhancements

Completed Features (December 2025)

  1. Multi-language Support: Caption translation to French (Quebec), Spanish (Mexico), Polish via LLM Gateway
  2. AI-Powered Polishing: LLM Gateway-based caption polishing with context awareness
  3. Unified AI Pipeline: Single AssemblyAI integration for transcription + AI processing

Planned Features

  1. Advanced Analytics: Detailed transcription analytics and insights
  2. Custom Models: Training custom transcription models for domain-specific content
  3. Real-time Processing: Live transcription for streaming content

Integration Opportunities

  1. Content Management: Integration with CMS for automated content updates
  2. Search Optimization: Enhanced search capabilities using transcript data
  3. Accessibility: Improved accessibility features using transcript data
  4. Analytics: Advanced analytics and reporting capabilities