Class: Pdf::Utility::ImageExtractor

Inherits:
Object
Defined in:
app/services/pdf/utility/image_extractor.rb

Overview

Extracts images from PDF publications for Vision AI analysis.

Uses HexaPDF to extract embedded images from PDF files.
Filters out small images (icons, bullets) and keeps only significant images
like diagrams, photos, and illustrations.

Examples:

extractor = Pdf::Utility::ImageExtractor.new
result = extractor.extract(item)
result.images.each { |img| process_with_vision(img[:path]) } if result.success?

Defined Under Namespace

Classes: Result

Constant Summary

MIN_WIDTH = 100
MIN_HEIGHT = 100

Minimum dimensions for an image to be considered significant

MIN_FILE_SIZE = 5_000

Minimum file size in bytes (skip tiny images like spacers)

MAX_IMAGES = 20

Maximum images to extract per PDF (avoid processing massive documents)
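
How these thresholds might combine can be sketched as follows. This is only an illustration: the module name ImageFilterSketch, the Candidate struct, and the helpers significant? and select_significant are hypothetical and do not appear in the class. Treating the minimums as inclusive (>=) and applying the MAX_IMAGES cap after filtering are also assumptions, since the private extraction logic is not shown here.

```ruby
# Hypothetical sketch of the significance filter; not the extractor's actual code.
module ImageFilterSketch
  # Values copied from the documented constants.
  MIN_WIDTH = 100
  MIN_HEIGHT = 100
  MIN_FILE_SIZE = 5_000
  MAX_IMAGES = 20

  # Assumed shape of a candidate image; the real internal representation is not documented.
  Candidate = Struct.new(:path, :width, :height, :byte_size, keyword_init: true)

  # An image counts as significant when it meets every minimum (assumed inclusive).
  def self.significant?(candidate)
    candidate.width >= MIN_WIDTH &&
      candidate.height >= MIN_HEIGHT &&
      candidate.byte_size >= MIN_FILE_SIZE
  end

  # Keep significant images only, then cap the list at MAX_IMAGES.
  def self.select_significant(candidates)
    candidates.select { |candidate| significant?(candidate) }.first(MAX_IMAGES)
  end
end

candidates = [
  ImageFilterSketch::Candidate.new(path: 'bullet.png', width: 16, height: 16, byte_size: 900),
  ImageFilterSketch::Candidate.new(path: 'diagram.png', width: 800, height: 600, byte_size: 48_000)
]
kept = ImageFilterSketch.select_significant(candidates)
puts kept.map(&:path).inspect
```

Under these assumptions, an icon-sized image is dropped for failing the dimension and size checks, while a full-size diagram passes all three.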

Instance Method Summary

  • #extract(item) ⇒ Result

    Returns Result with extracted image paths.

Instance Method Details

#extract(item) ⇒ Result

Returns Result with extracted image paths.

Parameters:

  • item (Item)

    A publication item with literature attached

Returns:

  • (Result)

    Result with extracted image paths



# File 'app/services/pdf/utility/image_extractor.rb', line 31

def extract(item)
  return Result.new(success?: false, images: [], error: 'Not a publication') unless item.is_publication?
  return Result.new(success?: false, images: [], error: 'No literature attached') unless item.literature&.attachment
  return Result.new(success?: false, images: [], error: 'File not found') unless File.exist?(pdf_path(item))
  return Result.new(success?: false, images: [], error: 'Not a PDF') unless pdf_file?(item)

  images = extract_images_from_pdf(pdf_path(item), item)
  Result.new(success?: true, images: images, error: nil)
rescue StandardError => e
  Rails.logger.error "[Pdf::Utility::ImageExtractor] Error extracting images: #{e.message}"
  ErrorReporting.error(e)
  Result.new(success?: false, images: [], error: e.message)
end
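
Because #extract reports every failure through the returned Result rather than raising, callers are expected to branch on success?. The sketch below shows one way to do that. The Result definition is an assumption (a keyword-initialized Struct that accepts the success?:, images:, and error: keywords used above); the actual nested class is not shown in this page. The ResultSketch module and image_paths helper are hypothetical, and the img[:path] access follows the class-level example.

```ruby
# Hypothetical sketch of consuming a Result; the real Result class may differ.
module ResultSketch
  # Assumed definition: a Struct whose members match the keywords passed in #extract.
  Result = Struct.new(:success?, :images, :error, keyword_init: true)

  # Return the extracted image paths, or an empty list when extraction failed.
  def self.image_paths(result)
    return [] unless result.success?

    result.images.map { |img| img[:path] }
  end
end

ok = ResultSketch::Result.new(success?: true, images: [{ path: '/tmp/figure-1.png' }], error: nil)
failed = ResultSketch::Result.new(success?: false, images: [], error: 'Not a PDF')

puts ResultSketch.image_paths(ok).inspect
puts "Extraction failed: #{failed.error}" unless failed.success?
```

Checking success? first lets callers distinguish a failed extraction from a PDF that simply contained no significant images, since both cases carry an empty images array.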