Class: Pdf::Utility::ImageExtractor

Inherits:

Object

Object
Pdf::Utility::ImageExtractor

Defined in: app/services/pdf/utility/image_extractor.rb

Overview

Extracts images from PDF publications for Vision AI analysis.

Uses HexaPDF to extract embedded images from PDF files.
Filters out small images (icons, bullets) and keeps only significant images
like diagrams, photos, and illustrations.

Examples:

extractor = Pdf::Utility::ImageExtractor.new
result = extractor.extract(item)
result.images.each { |img| process_with_vision(img[:path]) }

Defined Under Namespace

Classes: Result

Constant Summary

MIN_WIDTH, MIN_HEIGHT
Minimum dimensions for an image to be considered significant.

MIN_FILE_SIZE = 5_000
Minimum file size in bytes (skips tiny images such as spacers).

MAX_IMAGES
Maximum number of images to extract per PDF (avoids processing massive documents).
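
The thresholds above suggest a simple filter: drop images that are too small in either dimension or in file size, then cap the count. The following is a minimal sketch of that idea, not the class's actual implementation. Only the 5_000-byte file size appears in this documentation; the width, height, and cap defaults are placeholder assumptions, and the method name and hash keys other than :path are hypothetical.

```ruby
# Sketch only: min_width, min_height, and max_images are placeholder values.
# The 5_000-byte minimum file size is the one figure given in the docs above.
def significant_images(candidates, min_width: 100, min_height: 100,
                       min_file_size: 5_000, max_images: 20)
  candidates
    .select { |c| c[:width] >= min_width && c[:height] >= min_height }
    .select { |c| c[:bytes] >= min_file_size }
    .first(max_images)
end

# Hypothetical candidates; each hash stands in for one extracted image.
candidates = [
  { path: 'bullet.png',  width: 16,  height: 16,  bytes: 400 },
  { path: 'diagram.png', width: 800, height: 600, bytes: 120_000 },
  { path: 'spacer.png',  width: 400, height: 400, bytes: 900 }
]

significant_images(candidates).map { |c| c[:path] } # => ["diagram.png"]
```

The bullet fails the dimension check and the spacer fails the file size check, so only the diagram remains.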

Instance Method Summary

#extract(item) ⇒ Result
Result with extracted image paths.

Instance Method Details

#extract(item) ⇒ Result

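Returns a Result with the extracted image paths. The Result is unsuccessful, with an empty images list and an error message, when the item is not a publication, has no literature attached, points to a missing file, or is not a PDF, and also when extraction raises an error.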

Parameters:

item (Item) —
A publication item with literature attached

Returns:

(Result) —
Result with extracted image paths

# File 'app/services/pdf/utility/image_extractor.rb', line 31

def extract(item)
  return Result.new(success?: false, images: [], error: 'Not a publication') unless item.is_publication?
  return Result.new(success?: false, images: [], error: 'No literature attached') unless item.literature&.attachment
  return Result.new(success?: false, images: [], error: 'File not found') unless File.exist?(pdf_path(item))
  return Result.new(success?: false, images: [], error: 'Not a PDF') unless pdf_file?(item)

  images = extract_images_from_pdf(pdf_path(item), item)
  Result.new(success?: true, images: images, error: nil)
rescue StandardError => e
  Rails.logger.error "[Pdf::Utility::ImageExtractor] Error extracting images: #{e.message}"
  ErrorReporting.error(e)
  Result.new(success?: false, images: [], error: e.message)
end
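
Because #extract reports failures through the returned Result rather than by raising, callers are expected to check it before using the images. The sketch below shows one way to do that. SketchResult is a hypothetical stand-in for the Result class (whose definition is not shown here), and image_paths is an illustrative helper, not part of the class.

```ruby
# Hypothetical stand-in for ImageExtractor::Result; the real definition is not shown.
SketchResult = Struct.new(:success, :images, :error, keyword_init: true) do
  def success?
    success
  end
end

# Returns the image paths on success, or an empty list after logging the error.
def image_paths(result, log: $stderr)
  unless result.success?
    log.puts "Image extraction skipped: #{result.error}"
    return []
  end
  result.images.map { |img| img[:path] }
end

ok = SketchResult.new(success: true, images: [{ path: '/tmp/page1.png' }], error: nil)
failed = SketchResult.new(success: false, images: [], error: 'Not a PDF')

image_paths(ok)     # => ["/tmp/page1.png"]
image_paths(failed) # => [] and logs "Image extraction skipped: Not a PDF"
```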