Class: Embedding::ContentChunker
- Inherits: Object
- Defined in: app/services/embedding/content_chunker.rb
Overview
Splits long content into embeddable chunks with overlap.
OpenAI's embedding models cap input length (8191 tokens for text-embedding-3-small).
This service splits content at paragraph or sentence boundaries so that each chunk stays under that limit.
Constant Summary
- DEFAULT_MAX_CHARS = 6000
  ~6000 chars ≈ 1500 tokens (safe for 8k token models)
- DEFAULT_OVERLAP = 200
  Overlap ensures context isn't lost at chunk boundaries
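The character budget can be checked with a little arithmetic. This is a minimal sketch that assumes roughly 4 characters per token, the ratio implied by the comment on DEFAULT_MAX_CHARS (6000 chars ≈ 1500 tokens); real token counts depend on the tokenizer and the text. `CHARS_PER_TOKEN`, `MODEL_TOKEN_LIMIT`, and `estimated_tokens` are hypothetical names used only for illustration.

```ruby
# Assumed heuristic: about 4 characters per token for typical English text.
CHARS_PER_TOKEN = 4
# Limit cited in the overview for text-embedding-3-small.
MODEL_TOKEN_LIMIT = 8191

# Rough token estimate for a given number of characters.
def estimated_tokens(char_count)
  (char_count.to_f / CHARS_PER_TOKEN).ceil
end

estimated_tokens(6000)                      # => 1500
MODEL_TOKEN_LIMIT - estimated_tokens(6000)  # => 6691 tokens of headroom
```

The wide margin leaves room for text that tokenizes less efficiently than the heuristic assumes, such as code or non-English content.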
Instance Attribute Summary
- #content ⇒ Object (readonly)
  Returns the value of attribute content.
- #max_chars ⇒ Object (readonly)
  Returns the value of attribute max_chars.
- #overlap ⇒ Object (readonly)
  Returns the value of attribute overlap.
Instance Method Summary
- #chunk_count ⇒ Integer
  Estimates the chunk count without generating chunks.
- #chunks ⇒ Array<String>
  Splits content into chunks.
- #initialize(content, max_chars: DEFAULT_MAX_CHARS, overlap: DEFAULT_OVERLAP) ⇒ ContentChunker (constructor)
  Returns a new instance of ContentChunker.
- #needs_chunking? ⇒ Boolean
  Checks whether content needs chunking.
Constructor Details
#initialize(content, max_chars: DEFAULT_MAX_CHARS, overlap: DEFAULT_OVERLAP) ⇒ ContentChunker
Returns a new instance of ContentChunker.
    # File 'app/services/embedding/content_chunker.rb', line 27
    def initialize(content, max_chars: DEFAULT_MAX_CHARS, overlap: DEFAULT_OVERLAP)
      @content = content.to_s.strip
      @max_chars = max_chars
      @overlap = overlap
    end
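The constructor normalizes its input with `to_s.strip`, so `nil` and non-string values are accepted and surrounding whitespace never counts toward the length check. The sketch below applies the same expression to plain values; `normalize_content` is a hypothetical helper, not part of the class.

```ruby
# Mirrors the normalization performed in #initialize (content.to_s.strip).
def normalize_content(content)
  content.to_s.strip
end

normalize_content(nil)           # => ""
normalize_content("  hello \n")  # => "hello"
normalize_content(42)            # => "42"
```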
Instance Attribute Details
#content ⇒ Object (readonly)
Returns the value of attribute content.
    # File 'app/services/embedding/content_chunker.rb', line 22
    def content
      @content
    end
#max_chars ⇒ Object (readonly)
Returns the value of attribute max_chars.
    # File 'app/services/embedding/content_chunker.rb', line 22
    def max_chars
      @max_chars
    end
#overlap ⇒ Object (readonly)
Returns the value of attribute overlap.
    # File 'app/services/embedding/content_chunker.rb', line 22
    def overlap
      @overlap
    end
Instance Method Details
#chunk_count ⇒ Integer
Estimates the chunk count without generating chunks. Returns 1 when no chunking is needed; otherwise divides the content length by the effective chunk size (max_chars minus overlap) and rounds up.

    # File 'app/services/embedding/content_chunker.rb', line 49
    def chunk_count
      return 1 unless needs_chunking?

      # Estimate based on content length and overlap
      effective_chunk_size = max_chars - overlap
      (content.length.to_f / effective_chunk_size).ceil
    end
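A worked example of the estimate, written as a stand-alone function so it can be run on its own. `estimated_chunk_count` is a hypothetical name, and the `ArgumentError` guard is an addition: the listing above does not validate that overlap is smaller than max_chars. Since the result depends on length alone, it may differ from the number of chunks that #chunks actually produces.

```ruby
# Stand-alone mirror of the chunk_count formula (not the class itself).
def estimated_chunk_count(length, max_chars: 6000, overlap: 200)
  return 1 unless length > max_chars

  # Assumption: this guard is not in the original listing.
  raise ArgumentError, "overlap must be smaller than max_chars" if overlap >= max_chars

  effective_chunk_size = max_chars - overlap
  (length.to_f / effective_chunk_size).ceil
end

estimated_chunk_count(5_000)   # => 1 (fits in a single chunk)
estimated_chunk_count(6_001)   # => 2 (6001 / 5800 rounds up to 2)
estimated_chunk_count(14_500)  # => 3 (14500 / 5800 = 2.5, rounded up)
```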
#chunks ⇒ Array<String>
Splits content into chunks. Returns the content as a single-element array when no chunking is needed.

    # File 'app/services/embedding/content_chunker.rb', line 41
    def chunks
      return [content] unless needs_chunking?

      split_into_chunks
    end
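The private `split_into_chunks` method is not shown on this page. The following is only a sketch of one way paragraph-aware splitting with overlap could work, not the actual implementation: it packs blank-line-separated paragraphs greedily and seeds each new chunk with the tail of the previous one. `split_with_overlap` is a hypothetical name, and sentence-level splitting is left out for brevity (oversized paragraphs are cut at fixed widths instead).

```ruby
# Hypothetical sketch: greedy paragraph packing with a character overlap.
def split_with_overlap(text, max_chars: 6000, overlap: 200)
  # Largest piece that still fits after an overlap seed plus a "\n\n" joiner.
  piece_limit = max_chars - overlap - 2
  raise ArgumentError, "overlap too large for max_chars" unless piece_limit.positive?

  chunks = []
  current = +""

  text.split(/\n{2,}/).each do |paragraph|
    # Cut paragraphs that are too long on their own into fixed-width pieces.
    paragraph.chars.each_slice(piece_limit).map(&:join).each do |piece|
      candidate = current.empty? ? piece : "#{current}\n\n#{piece}"
      if candidate.length <= max_chars
        current = candidate
      else
        chunks << current
        # Carry the last `overlap` characters into the next chunk.
        tail = overlap.zero? ? "" : (current[-overlap..] || current)
        current = tail.empty? ? piece : "#{tail}\n\n#{piece}"
      end
    end
  end

  chunks << current unless current.empty?
  chunks
end

text = ["a" * 40, "b" * 40, "c" * 40].join("\n\n")
parts = split_with_overlap(text, max_chars: 100, overlap: 10)
parts.length      # => 2
parts.map(&:length)  # => [82, 52]
```

In this sketch every chunk stays within max_chars, and the second chunk begins with the last 10 characters of the first, which is the context-preserving behavior the DEFAULT_OVERLAP comment describes.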
#needs_chunking? ⇒ Boolean
Checks whether content needs chunking. Returns true when the content is longer than max_chars.

    # File 'app/services/embedding/content_chunker.rb', line 35
    def needs_chunking?
      content.length > max_chars
    end