Class: Embedding::ContentChunker

Inherits:
Object
Defined in:
app/services/embedding/content_chunker.rb

Overview

Splits long content into overlapping chunks that are small enough to embed.
OpenAI's embedding models limit input length (8191 tokens for text-embedding-3-small).
This service splits content at paragraph and sentence boundaries.

Examples:

Basic usage

chunker = Embedding::ContentChunker.new(long_text)
chunker.chunks.each { |chunk| embed(chunk) }

With custom settings

chunker = Embedding::ContentChunker.new(long_text, max_chars: 2000, overlap: 100)

Constant Summary

DEFAULT_MAX_CHARS = 6000

~6000 chars ≈ 1500 tokens (safe for 8k token models)

DEFAULT_OVERLAP = 200

Overlap ensures context isn't lost at chunk boundaries
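
The character-to-token arithmetic behind DEFAULT_MAX_CHARS can be checked with a short sketch. It assumes roughly 4 characters per token, a common rule of thumb for English text; the helper name estimated_tokens and its chars_per_token parameter are hypothetical and not part of this class. Actual token counts depend on the tokenizer and the language of the content.

```ruby
# Hypothetical helper: rough token estimate for a chunk of char_count characters.
# Assumption: about 4 characters per token (English text rule of thumb).
def estimated_tokens(char_count, chars_per_token: 4.0)
  (char_count / chars_per_token).ceil
end

default_chunk_tokens = estimated_tokens(6000) # DEFAULT_MAX_CHARS
token_limit = 8191                            # limit quoted in the overview

puts default_chunk_tokens               # 1500
puts token_limit - default_chunk_tokens # 6691 tokens of headroom
```

The wide margin leaves room for text that tokenizes less efficiently than plain English prose, such as code or non-Latin scripts.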

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(content, max_chars: DEFAULT_MAX_CHARS, overlap: DEFAULT_OVERLAP) ⇒ ContentChunker

Returns a new instance of ContentChunker.

Parameters:

  • content (String)

    The text to chunk

  • max_chars (Integer) (defaults to: DEFAULT_MAX_CHARS)

    Maximum characters per chunk

  • overlap (Integer) (defaults to: DEFAULT_OVERLAP)

    Characters to overlap between chunks



# File 'app/services/embedding/content_chunker.rb', line 27

def initialize(content, max_chars: DEFAULT_MAX_CHARS, overlap: DEFAULT_OVERLAP)
  @content = content.to_s.strip
  @max_chars = max_chars
  @overlap = overlap
end
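
The constructor normalizes its input with to_s and strip, so nil and non-String values are accepted. The standalone normalize helper below is a hypothetical stand-in that mirrors that one line, shown only to illustrate the resulting values.

```ruby
# Hypothetical helper mirroring the normalization in #initialize.
def normalize(content)
  content.to_s.strip
end

p normalize(nil)          # ""
p normalize("  hello \n") # "hello"
p normalize(42)           # "42"
```

Because empty content never exceeds max_chars, #chunks as shown in the source below would return [""] for nil or blank input, so callers may want to skip blank content before embedding.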

Instance Attribute Details

#content ⇒ Object (readonly)

Returns the value of attribute content.



# File 'app/services/embedding/content_chunker.rb', line 22

def content
  @content
end

#max_chars ⇒ Object (readonly)

Returns the value of attribute max_chars.



# File 'app/services/embedding/content_chunker.rb', line 22

def max_chars
  @max_chars
end

#overlap ⇒ Object (readonly)

Returns the value of attribute overlap.



# File 'app/services/embedding/content_chunker.rb', line 22

def overlap
  @overlap
end

Instance Method Details

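#chunk_count ⇒ Integer

Estimates the chunk count from the content length without generating the chunks. The result may differ from chunks.length, because actual chunk boundaries follow paragraph and sentence breaks.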

Returns:

  • (Integer)

    Estimated number of chunks



# File 'app/services/embedding/content_chunker.rb', line 49

def chunk_count
  return 1 unless needs_chunking?

  # Estimate based on content length and overlap
  effective_chunk_size = max_chars - overlap
  (content.length.to_f / effective_chunk_size).ceil
end
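
A worked example of the estimate, as a standalone sketch: the function estimated_chunk_count is a hypothetical copy of the formula above that takes the content length directly. It also adds a guard that is not in the original. The constructor does not validate its arguments, and when overlap is greater than or equal to max_chars the effective chunk size is zero or negative, so the division no longer yields a meaningful count.

```ruby
# Hypothetical standalone copy of the #chunk_count formula, for illustration.
def estimated_chunk_count(length, max_chars: 6000, overlap: 200)
  return 1 unless length > max_chars

  # Guard added for this sketch only (not in the original source):
  # each chunk must advance by at least one new character.
  effective_chunk_size = max_chars - overlap
  unless effective_chunk_size.positive?
    raise ArgumentError, "overlap must be smaller than max_chars"
  end

  (length.to_f / effective_chunk_size).ceil
end

puts estimated_chunk_count(5_000)                                 # 1
puts estimated_chunk_count(15_000)                                # 3 (15000 / 5800 = 2.59)
puts estimated_chunk_count(15_000, max_chars: 2000, overlap: 100) # 8 (15000 / 1900 = 7.89)
```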

#chunks ⇒ Array<String>

Splits content into chunks. Returns the content as a single-element array when it already fits within max_chars.

Returns:

  • (Array<String>)

    Array of content chunks



# File 'app/services/embedding/content_chunker.rb', line 41

def chunks
  return [content] unless needs_chunking?

  split_into_chunks
end
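
The private split_into_chunks method is not shown on this page. The following is a minimal sketch of one way paragraph and sentence splitting with overlap could work; it is an assumption, not the actual implementation. It is written as a standalone function that takes the text and settings as arguments, and the greedy packing strategy, the regular expressions, and the use of a single space to join pieces are all choices made for this illustration.

```ruby
# Hypothetical sketch of boundary-aware chunking with overlap.
def split_into_chunks(text, max_chars: 6000, overlap: 200)
  # Room left for new text once the overlap prefix and a joining space are added.
  limit = max_chars - overlap - 1
  raise ArgumentError, "overlap too large for max_chars" unless limit.positive?

  # Prefer paragraph breaks, then sentence breaks, then hard slices.
  pieces = text.split(/\n{2,}/).flat_map do |paragraph|
    next [paragraph] if paragraph.length <= limit

    paragraph.split(/(?<=[.!?])\s+/).flat_map do |sentence|
      sentence.length <= limit ? [sentence] : sentence.scan(/.{1,#{limit}}/m)
    end
  end
  pieces = pieces.map(&:strip).reject(&:empty?)

  chunks = []
  current = ""
  pieces.each do |piece|
    candidate = current.empty? ? piece : "#{current} #{piece}"
    if candidate.length <= max_chars
      current = candidate
    else
      chunks << current
      # Start the next chunk with the tail of the finished one.
      tail = overlap.zero? ? "" : (current[-overlap, overlap] || current)
      current = [tail, piece].reject(&:empty?).join(" ")
    end
  end
  chunks << current unless current.empty?
  chunks
end

sample = Array.new(40) { |i| "Sentence number #{i} ends here." }.join(" ")
parts = split_into_chunks(sample, max_chars: 200, overlap: 20)
puts parts.length
puts parts.map(&:length).max <= 200 # true
```

Joining pieces with a single space discards the original paragraph breaks, which is usually acceptable for embedding input but would matter if chunks were displayed to users.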

#needs_chunking? ⇒ Boolean

Checks whether content is too long to embed as a single chunk.

Returns:

  • (Boolean)

    true if content exceeds max_chars



# File 'app/services/embedding/content_chunker.rb', line 35

def needs_chunking?
  content.length > max_chars
end