Video System Documentation

This document describes the video system: upload, transcription, caption (VTT) generation, rake tasks, API integrations, UI components, and troubleshooting.

Overview
Video Upload Process
Video Transcription System
VTT Generation
Rake Tasks
API Integration
UI Components
Troubleshooting

Overview

The video system provides comprehensive video management capabilities including:

Video Upload: Direct creator uploads via Cloudflare Stream
Transcription: High-quality transcription with AssemblyAI
VTT Generation: Dynamic caption generation from structured transcripts
SEO Optimization: Automated metadata generation
Background Processing: Scalable job processing with Sidekiq

Key Components

Video Model: Core data model with structured transcript JSON storage
VideoProcessing::TranscriptionService: Main transcription orchestration
VideoProcessing::VideoTranslationService: Caption translation to FR/ES/PL
TranscriptionPolisherService: Fallback regexp-based text corrections
VideoProcessing::SeoService: AI-powered SEO content generation
AssemblyaiClient: AssemblyAI API integration (transcription + LLM Gateway)
VideoTranscriptionWorker: Background job processing

AI Processing via AssemblyAI LLM Gateway

As of December 2025, caption polishing, paragraph generation, and translation run through AssemblyAI's LLM Gateway. SEO generation remains on OpenAI:

Task	Previously	Now
Caption Polishing	Regex only	LLM Gateway (Claude) + regex fallback
Paragraph Generation	OpenAI GPT-4	LLM Gateway (Claude)
Translation	DeepL API	LLM Gateway (Claude)
SEO Generation	OpenAI GPT-4	OpenAI GPT-4o via RubyLLM (not routed through the LLM Gateway)

This consolidation provides:

Consistent quality: Same AI model for all text processing
Context awareness: LLM understands caption timing and flow
Better translations: Context-aware, preserves brand names
Simpler architecture: Single API for most AI tasks

Video Upload Process

Overview

The video upload process uses Cloudflare Stream's direct creator upload feature. An upload flows from Uppy in the browser, through Heatwave on the server, to Cloudflare Stream.

Sequence Diagram

The upload process follows this sequence:

Uppy Initialization: Client-side uploader setup
Heatwave Processing: Server-side video processing
Cloudflare Storage: Final video storage and streaming

Useful Links

Video Transcription System

Overview

The video transcription system uses AssemblyAI to produce transcripts with speaker diarization and timestamps, then generates SEO content from the resulting transcript.

Architecture

Services

VideoProcessing::TranscriptionService - Core transcription service with granular methods
TranscriptionPolisherService - Regexp-based text corrections and company terminology
VideoProcessing::SeoService - SEO content generation using RubyLLM (OpenAI)
AssemblyaiClient - Client for interacting with AssemblyAI API
VideoTranscriptionWorker - Background job for comprehensive transcription workflow

Service Responsibilities

AudioExtractionService: Pure audio extraction from file paths (reusable, testable)
VideoProcessing::AudioExtractionService: Video-specific audio extraction with upload storage
VideoProcessing::TranscriptionService: Core transcription logic with granular methods
TranscriptionPolisherService: Fast, reliable regexp-based text corrections
VideoProcessing::SeoService: Generates SEO content using RubyLLM (OpenAI GPT-4o with structured JSON output)
AssemblyaiClient: Handles all AssemblyAI API interactions
VideoTranscriptionWorker: Background job orchestrator with progress tracking

Three-Step Workflow

Step 1: Retrieve Original VTT and Sentences from AssemblyAI

Downloads raw VTT captions from AssemblyAI's /v2/transcript/:transcript_id/vtt endpoint
Retrieves semantically segmented sentences from /v2/transcript/:transcript_id/sentences endpoint
Stores data as vtt_original and sentences in structured_transcript_json
Ensures transcription status is completed before proceeding
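
The retrieval in Step 1 can be sketched roughly as follows. This is an illustration only, not the project's AssemblyaiClient: the helper name, BASE_URL constant, and injectable fetch lambda are hypothetical, and storing the VTT as a raw string is an assumption. It uses the three endpoints named above and refuses to continue unless the transcript status is completed.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical constant; the real client is configured elsewhere.
BASE_URL = "https://api.assemblyai.com"

# Default fetcher: authenticated GET that returns the response body.
# AssemblyAI's REST API expects the API key in the Authorization header.
DEFAULT_FETCH = lambda do |path, api_key|
  uri = URI.join(BASE_URL, path)
  request = Net::HTTP::Get.new(uri)
  request["Authorization"] = api_key
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
  raise "AssemblyAI request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

# Returns a hash shaped like the keys stored in structured_transcript_json.
# `fetch` is injectable so the logic can be exercised without network access.
def retrieve_original_transcript(transcript_id, api_key:, fetch: DEFAULT_FETCH)
  transcript = JSON.parse(fetch.call("/v2/transcript/#{transcript_id}", api_key))
  unless transcript["status"] == "completed"
    raise "Transcript #{transcript_id} is not completed (status: #{transcript['status']})"
  end

  vtt = fetch.call("/v2/transcript/#{transcript_id}/vtt", api_key)
  sentences = JSON.parse(fetch.call("/v2/transcript/#{transcript_id}/sentences", api_key))

  { "vtt_original" => vtt, "sentences" => sentences["sentences"] }
end
```

Passing the fetcher as an argument keeps the status check and key mapping testable with a stub in place of real HTTP calls.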

Step 2: Polish Transcript and Generate Paragraphs

Uses AssemblyAI LLM Gateway (Claude) for AI-powered polishing:
- Company terminology corrections (e.g., "Warmly Yours" → "WarmlyYours")
- Grammar, punctuation, and typo fixes
- Context-aware corrections that understand caption flow
- Falls back to TranscriptionPolisherService (regex) if LLM fails
Stores polished data as vtt_polished in structured_transcript_json
Uses LLM Gateway to generate natural paragraphs from polished text
Creates HTML transcript for video page display
Saves HTML transcript to video.transcript field

Prompts are configurable via Settings:

video_processing_polish_system_prompt
video_processing_polish_user_prompt
video_processing_paragraph_system_prompt
video_processing_paragraph_user_prompt
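
The LLM-first, regex-fallback behavior described in Step 2 can be sketched as below. The correction table, method names, and the shape of a caption cue (a hash with start, end, and text) are assumptions made for illustration. The real rules live in TranscriptionPolisherService and the transcription_spelling_corrections setting.

```ruby
# Hypothetical correction table, shown with the single example from this document.
CORRECTIONS = {
  /\bwarmly\s+yours\b/i => "WarmlyYours"
}.freeze

# Regex fallback: apply every correction to one caption text.
def regex_polish(text)
  CORRECTIONS.reduce(text) { |result, (pattern, replacement)| result.gsub(pattern, replacement) }
end

# Polish cues with an LLM callable, falling back to regex on any error.
# `llm` is anything responding to #call(cues) that returns the same number of cues.
def polish_cues(cues, llm:)
  polished = llm.call(cues)
  unless polished.length == cues.length
    raise ArgumentError, "LLM returned #{polished.length} cues, expected #{cues.length}"
  end

  polished
rescue StandardError => e
  warn "LLM polishing failed (#{e.message}); using regex fallback"
  # Timing is left untouched; only the text is corrected.
  cues.map { |cue| cue.merge("text" => regex_polish(cue["text"])) }
end
```

Checking that the cue count is unchanged is one way to catch an LLM response that merged or dropped captions, which would otherwise break the timing alignment.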

Step 3: Generate SEO Metadata

Uses VideoProcessing::SeoService (OpenAI GPT-4o via RubyLLM) to create SEO-friendly content from the transcript:
- meta_title (50-60 characters)
- meta_description (150-160 characters)
- sub_header (100-150 characters)
- expanded_description (200-300 words)
Updates video model fields directly
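
The length targets above lend themselves to a simple check before the fields are saved. The following is a minimal sketch, assuming the generated content arrives as a string-keyed hash. The constant and method names are hypothetical and this is not necessarily how the service validates its output.

```ruby
# Ranges taken from the list above; expanded_description is measured in words.
SEO_LIMITS = {
  "meta_title" => { range: 50..60, unit: :chars },
  "meta_description" => { range: 150..160, unit: :chars },
  "sub_header" => { range: 100..150, unit: :chars },
  "expanded_description" => { range: 200..300, unit: :words }
}.freeze

# Returns a hash of field => measured size for every field outside its range.
# An empty hash means all fields are within limits.
def seo_violations(content)
  SEO_LIMITS.each_with_object({}) do |(field, rule), violations|
    value = content[field].to_s
    size = rule[:unit] == :words ? value.split.length : value.length
    violations[field] = size unless rule[:range].cover?(size)
  end
end
```

A caller could log the violations, or ask the model to regenerate only the offending fields.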

Features

Transcription Options Interface

The system provides a granular transcription options interface that allows users to:

Select specific steps: Choose which parts of the transcription workflow to execute
Configure speaker detection: Set the expected number of speakers (1-10) for improved accuracy, or use "Auto Detect" for automatic speaker detection
Conditional execution: Skip steps that have already been completed
Progress tracking: Monitor job progress with detailed status updates
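
Step selection combined with conditional execution could look roughly like this. The step names, the force flag, and the completion checks are hypothetical. Only the vtt_original, sentences, and vtt_polished keys come from this document.

```ruby
# Workflow steps in order, each mapped to a check that reports
# whether the step already has output.
STEPS = {
  retrieve: ->(data, _video) { data.key?("vtt_original") && data.key?("sentences") },
  polish: ->(data, _video) { data.key?("vtt_polished") },
  seo: ->(_data, video) { !video["meta_title"].to_s.empty? }
}.freeze

# Returns the requested steps that still need to run, in workflow order.
# `force: true` re-runs requested steps even if their output already exists.
def steps_to_run(requested, data:, video:, force: false)
  STEPS.keys.select do |step|
    requested.include?(step) && (force || !STEPS[step].call(data, video))
  end
end
```

Iterating over STEPS.keys rather than the requested list keeps the execution order fixed regardless of how the user ticked the boxes.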

Speaker Diarization

Automatic speaker detection: Identifies different speakers in the audio
Speaker labeling: Labels speakers as "Speaker A", "Speaker B", etc.
Configurable speaker count: Users can specify expected number of speakers (1-10) for improved accuracy
Speaker statistics: Calculates talk time and word count for each speaker

Structured Data

The service retrieves and stores complete transcript data from AssemblyAI, including:

{
  "id": "transcript_id",
  "status": "completed",
  "confidence": 0.946,
  "audio_duration": 483.2,
  "utterances": [
    {
      "confidence": 0.98,
      "end": 5000,
      "speaker": "A",
      "start": 0,
      "text": "Hello, welcome to our video."
    }
  ]
}
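
The speaker statistics mentioned under Speaker Diarization can be derived from the utterances array shown above. This is a minimal sketch, assuming start and end are millisecond offsets (as in the sample) and counting words by splitting on whitespace. The method name is hypothetical.

```ruby
# Aggregate talk time (seconds) and word count per speaker label.
def speaker_statistics(utterances)
  stats = Hash.new { |hash, speaker| hash[speaker] = { talk_time_seconds: 0.0, word_count: 0 } }
  utterances.each do |utterance|
    entry = stats[utterance["speaker"]]
    entry[:talk_time_seconds] += (utterance["end"] - utterance["start"]) / 1000.0
    entry[:word_count] += utterance["text"].split.length
  end
  stats
end
```

For the single sample utterance above, this yields 5.0 seconds of talk time and 5 words for speaker A.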

Usage Examples

Basic Transcription

# Initialize service
transcription_service = VideoProcessing::TranscriptionService.new(video)

# Extract audio and submit for transcription
transcription_service.extract_audio
transcription_service.submit_transcription

# Retrieve and process transcript
transcription_service.retrieve_and_overwrite_structured_transcript
transcription_service.polish_transcript_with_company_terminology
transcription_service.summarize_video_and_update_metadata

Background Processing

# Queue transcription job (runs asynchronously in Sidekiq)
VideoTranscriptionWorker.perform_async(video.id, options)

# Run the same job inline, bypassing the queue (useful in a console when debugging)
VideoTranscriptionWorker.new.perform(video.id, options)

VTT Generation

Overview

The system generates VTT (WebVTT) caption files dynamically from the polished structured transcript JSON instead of storing them as uploads. This ensures that captions contain the same corrections and improvements applied to the transcript text.

Problem Solved

Previously, VTT files were retrieved directly from AssemblyAI using raw transcript data and stored as uploads. However, the structured transcript JSON goes through a polishing process that:

Fixes grammar and spelling mistakes
Corrects company terminology (e.g., "Warmly Yours" → "WarmlyYours")
Improves sentence structure and readability

The raw VTT file didn't include these corrections, creating a mismatch between transcript and captions.

Solution

The system now generates VTT captions on demand from the structured transcript JSON, ensuring that:

Captions match the polished transcript exactly
Timing information is preserved from the original structured data
Company terminology corrections are applied consistently
VTT files are always current and don't require regeneration

Implementation

Key Methods

VideoProcessing::TranscriptionService#generate_vtt_content_from_structured_transcript: Generates VTT content from structured transcript JSON
VideoProcessing::TranscriptionService#generate_vtt_content_from_polished_vtt: Creates VTT content from polished VTT data
VideosController#download_vtt: Controller action that generates and serves VTT files

Caption Formatting

The system creates captions with:

Timing: Preserves original start/end timestamps from polished VTT data
Text: Uses polished text with company terminology corrections
VTT format: Standard WebVTT format with proper timestamps

Example VTT Output

WEBVTT

1
00:00:00.000 --> 00:00:05.000
Hello, welcome to our video about floor heating systems.

2
00:00:05.000 --> 00:00:10.000
Today we will discuss the benefits of radiant floor heating.
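
Output in this shape can be produced from a list of timed cues. The sketch below assumes each cue is a hash with start and end in milliseconds plus a text field. The method names are hypothetical and are not the service's actual generate_vtt_content_* implementations.

```ruby
# Format a millisecond offset as a WebVTT timestamp (HH:MM:SS.mmm).
def vtt_timestamp(milliseconds)
  total_seconds, ms = milliseconds.divmod(1000)
  total_minutes, seconds = total_seconds.divmod(60)
  hours, minutes = total_minutes.divmod(60)
  format("%02d:%02d:%02d.%03d", hours, minutes, seconds, ms)
end

# Build a WebVTT document: header, then numbered cue blocks separated by blank lines.
def build_vtt(cues)
  blocks = cues.each_with_index.map do |cue, index|
    "#{index + 1}\n#{vtt_timestamp(cue['start'])} --> #{vtt_timestamp(cue['end'])}\n#{cue['text']}"
  end
  (["WEBVTT"] + blocks).join("\n\n") + "\n"
end
```

Because the timestamps are copied from the cue data and only the text is swapped for its polished version, captions stay aligned with the audio.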

Usage

For New Transcriptions

VTT captions are generated dynamically from polished data when requested.

For Existing Videos

VTT captions are automatically available for any video with structured transcript JSON data containing polished VTT.

Download VTT Files

Navigate to the video show page
Go to the "Transcript" tab
Click "Download Original VTT" or "Download Polished VTT" in the respective panels

Programmatically

# Generate VTT content for a specific video
video = Video.find(video_id)
service = VideoProcessing::TranscriptionService.new(video)
vtt_content = service.generate_vtt_content_from_structured_transcript

# Download VTT file via controller action
# GET /videos/:id/download_vtt?type=original
# GET /videos/:id/download_vtt?type=polished

Rake Tasks

Overview

All video-related rake tasks are consolidated in lib/tasks/video.rake for easy management and organization.

Available Tasks

Transcription Tasks

video:transcription:process[VIDEO_ID] - Process transcription for specific video
video:transcription:process_all - Process all videos without transcripts
video:transcription:process_by_category[CAT] - Process videos by category
video:transcription:process_with_limit[LIMIT] - Process videos with limit
video:transcription:stats - Show transcription statistics

VTT Processing Tasks

video:vtt:retrieve_transcript[VIDEO_ID] - Step 1: Retrieve from AssemblyAI
video:vtt:polish_transcript[VIDEO_ID] - Step 2: Polish with terminology
video:vtt:summarize_video[VIDEO_ID] - Step 3: Generate metadata
video:vtt:test_processing[VIDEO_ID] - Test full workflow
video:vtt:process_all - Process all VTT
video:vtt:extract_and_transcribe - Extract audio & submit for transcription
video:vtt:list_available - List videos with structured data
video:vtt:test_generation[VIDEO_ID] - Test VTT generation

General Tasks

video:stats - Show comprehensive statistics
video:help - Show help message with all available tasks

Usage Examples

# Show all available tasks
bundle exec rake video:help

# Process transcription for a specific video
# (in zsh, quote the task name so the brackets are not treated as a glob pattern)
bundle exec rake video:transcription:process[12345]

# Extract audio and submit for transcription
bundle exec rake video:vtt:extract_and_transcribe

# Show video statistics
bundle exec rake video:stats

API Integration

AssemblyAI Integration

The system integrates with AssemblyAI for high-quality transcription services.

Key Features

High Accuracy: Advanced speech recognition with 95%+ accuracy
Speaker Diarization: Automatic speaker identification and labeling
Timestamps: Precise word-level and segment-level timing
Multiple Formats: Support for various audio and video formats

API Endpoints Used

/v2/transcript - Submit transcription jobs
/v2/transcript/:id - Get transcription status and results
/v2/transcript/:id/vtt - Get VTT captions
/v2/transcript/:id/sentences - Get semantically segmented sentences

Configuration

# AssemblyAI client configuration
AssemblyaiClient.new(
  api_key: ENV['ASSEMBLYAI_API_KEY'],
  base_url: 'https://api.assemblyai.com/v2'
)

AssemblyAI LLM Gateway Integration

The system uses AssemblyAI's LLM Gateway for caption polishing, paragraph generation, and translations.

Configuration

The model parameters and all prompts are stored in Setting and are editable via the CRM settings page:

Setting	Purpose
`video_processing_llm_model`	LLM model (default: claude-sonnet-4-5-20250929)
`video_processing_llm_max_tokens`	Max tokens (default: 8000)
`video_processing_llm_temperature`	Temperature (default: 0.2)
`video_processing_polish_system_prompt`	Caption polishing system prompt
`video_processing_polish_user_prompt`	Caption polishing user prompt
`video_processing_paragraph_system_prompt`	Paragraph generation system prompt
`video_processing_paragraph_user_prompt`	Paragraph generation user prompt
`video_processing_translate_system_prompt`	Translation system prompt
`video_processing_translate_user_prompt`	Translation user prompt
`transcription_spelling_corrections`	Shared terminology corrections

Usage

# Caption polishing via LLM Gateway
transcription_service = VideoProcessing::TranscriptionService.new(video)
transcription_service.polish_vtt_text(vtt_original) # Returns polished VTT array

# Translation via LLM Gateway
translation_service = VideoProcessing::VideoTranslationService.new(video)
translation_service.translate_vtt_to_locale('fr-CA', 1, 3) # French-Canadian

# Paragraph generation via LLM Gateway
transcription_service.generate_paragraphs_from_polished_text(vtt_polished)

OpenAI Integration

The system uses RubyLLM (configured for OpenAI GPT-4o) for SEO content generation only.

Configuration

The API key is retrieved from Heatwave::Configuration.fetch(:openai, :api_key) and the SEO prompt template is stored in the database via Setting.video_processing_seo_prompt, making it editable through the admin UI.

Usage

# SEO content generation
seo_service = VideoProcessing::SeoService.new(video)
seo_content = seo_service.generate_seo_content
# Returns: { 'status' => 'success', 'sub_header' => '...', 'meta_title' => '...', 
#            'meta_description' => '...', 'expanded_description' => '...' }

# Called automatically during transcription workflow
transcription_service = VideoProcessing::TranscriptionService.new(video)
transcription_service.summarize_video_and_update_metadata

Features

Structured JSON Output: Uses GPT-4o with response_format: { type: 'json_object' } for reliable parsing
Database-Driven Prompts: Editable prompt templates stored in settings
Character Limit Validation: Automatic validation against SEO best practices
Context Preservation: Incorporates existing metadata and video title for consistency

UI Components

Video Player Component

The system includes a reusable video player component for consistent video playback across the application.

Transcript Display

The transcript interface provides:

Structured Data Panels: Separate panels for original VTT, polished VTT, sentences, and paragraphs
Download Options: Direct download links for VTT files and structured data
HTML Preview: Formatted transcript display for video pages
Status Indicators: Real-time status updates for transcription progress

Transcription Options Interface

The transcription options page allows users to:

Select Processing Steps: Choose which transcription steps to execute
Configure Settings: Set speaker detection and other parameters
Monitor Progress: Track job status and completion
View Results: Access generated transcripts and metadata

Troubleshooting

Common Issues

Transcription Failures

Check AssemblyAI Status: Verify transcription is completed in AssemblyAI dashboard
Audio Extraction: Ensure video has audio track and extraction was successful
API Limits: Check AssemblyAI API usage and limits
File Format: Verify video format is supported by AssemblyAI

VTT Generation Issues

Structured Data: Ensure video has structured transcript JSON data
Polished VTT: Check that polished VTT data exists for generation
Timing Data: Verify timing information is preserved in structured data

Background Job Issues

Sidekiq Status: Check Sidekiq worker status and queue
Job Logs: Review worker logs for error details
Memory Usage: Monitor system resources during processing

Debug Commands

# Check video transcription status
bundle exec rake video:transcription:stats

# Test VTT generation for specific video
bundle exec rake video:vtt:test_generation[VIDEO_ID]

# Process specific video step by step
bundle exec rake video:vtt:retrieve_transcript[VIDEO_ID]
bundle exec rake video:vtt:polish_transcript[VIDEO_ID]
bundle exec rake video:vtt:summarize_video[VIDEO_ID]

Log Analysis

Key log entries to monitor:

VideoProcessing::TranscriptionService - Transcription service operations
VideoTranscriptionWorker - Background job processing
AssemblyaiClient - API interaction logs
TranscriptionPolisherService - Text correction operations

Best Practices

Performance Optimization

Batch Processing: Use background jobs for large-scale transcription
Caching: Cache generated VTT content for frequently accessed videos
Resource Management: Monitor API usage and system resources
Error Handling: Implement robust error handling and retry logic
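
As an illustration of the retry logic mentioned above, here is a small exponential backoff helper for wrapping individual API calls. The helper name and its parameters are hypothetical. Sidekiq already retries failed jobs by default, so a helper like this is mainly useful for transient failures inside a single job run.

```ruby
# Run the block, retrying on StandardError with exponentially growing delays.
# `sleeper` is injectable so the delay can be observed or skipped in tests.
def with_retries(max_attempts: 3, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    # Give up and re-raise the original error once attempts are exhausted.
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

A call such as with_retries { client.fetch_transcript(id) } (where client.fetch_transcript stands in for any API call) would then wait 1 second and 2 seconds between its three attempts.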

Data Management

Structured Storage: Use JSONB for flexible structured transcript storage
Backup Strategy: Regular backups of transcription data
Cleanup: Remove temporary files and unused uploads
Validation: Validate transcription quality and completeness

Security Considerations

API Keys: Secure storage of AssemblyAI and OpenAI API keys
Access Control: Proper authorization for transcription operations
Data Privacy: Ensure compliance with data protection regulations
Audit Logging: Track transcription operations for security monitoring

Future Enhancements

Completed Features (December 2025)

Multi-language Support: Caption translation to French (Quebec), Spanish (Mexico), Polish via LLM Gateway
AI-Powered Polishing: LLM Gateway-based caption polishing with context awareness
Unified AI Pipeline: Single AssemblyAI integration for transcription + AI processing

Planned Features

Advanced Analytics: Detailed transcription analytics and insights
Custom Models: Training custom transcription models for domain-specific content
Real-time Processing: Live transcription for streaming content

Integration Opportunities

Content Management: Integration with CMS for automated content updates
Search Optimization: Enhanced search capabilities using transcript data
Accessibility: Improved accessibility features using transcript data
Analytics: Advanced analytics and reporting capabilities
