Class: Assistant::ContextCompactor

Inherits:
Object
Defined in:
app/services/assistant/context_compactor.rb

Overview

Two-layer context compaction to keep AI assistant conversations within
token budgets and reduce cost.

Strategy 1 — Ephemeral Tool Result Compaction (post-exchange):
After a complete exchange, replace ALL large tool result messages with
compact summaries. Tool results (blog HTML, SQL output, API JSON) are
only needed verbatim during the exchange that consumed them — once the
response is streamed they are stale. Keeping them verbatim forces every
subsequent API call to re-send the same payload, bloating context rapidly.
This mirrors how Cursor treats file reads: ephemeral per-request injection,
not accumulated history.

Strategy 2 — Sliding Window + Summary (pre-exchange):
When estimated context exceeds a threshold, compress older messages
into a cached summary and keep only recent messages verbatim.

Usage:

After a response completes (in the worker):

Assistant::ContextCompactor.compact_tool_results!(conversation)

Before building the LLM payload (in to_llm):

Assistant::ContextCompactor.ensure_context_summary!(conversation)

Lifecycle hooks (compact / summarize / ensure / fork / emergency_compact)
all share the same conversation scan + tool-call/message classification,
so splitting would scatter knowledge of the message lifecycle. The
exempt-list of schema-discovery suffixes also has to live next to the
truncation routine that consults it.

Constant Summary

TOOL_RESULT_CHAR_THRESHOLD =

Tool results shorter than this (chars) are left as-is.

1_500
SCHEMA_DISCOVERY_TOOL_SUFFIXES =

Schema-discovery results (column lists, table summaries) are referenced
repeatedly across tool calls in the same turn — Sunny needs to know what
columns exist on view_opportunities / view_activities every time it writes
SQL, not just the first time. Truncating these to 800 chars stripped the
answer from context and led to 290+ guessed-column SQL failures over a
10-day window. Exempt them from both the verbatim-truncation pass and the
repeated-call dedup so the column list stays in scope for the whole turn.

%w[
  _describe_available_data
  _get_object_details
  _list_objects
  _list_schemas
].freeze
CONTEXT_TOKEN_THRESHOLD =

When estimated context exceeds this many tokens, trigger sliding-window.
When actual input_tokens from the API are available (preferred over char
estimation), this threshold is applied directly against those real counts.
Kept conservatively low so that tool-definition overhead + mid-flight
tool-call additions don't push the next request over the 200k limit.

50_000
MAX_MESSAGES_BEFORE_SUMMARY =

Also trigger sliding-window when the raw message count crosses this threshold,
regardless of token estimates. Large-context models (Gemini 1M) never hit the
token ceiling naturally, so we need a count-based safety valve. 40 messages ≈
5-8 user turns with tool activity — enough history for coherent continuity.

40
MIN_RECENT_MESSAGES =

Keep at least this many messages verbatim in the recent window.
Ensures the LLM always sees enough context for coherent follow-ups.
6 messages ≈ 2 user/assistant exchanges — sufficient for continuity.
Combined with immediate tool-result compaction, these messages are short.

6
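
As a rough illustration of the recent-window guarantee, the sketch below picks a split point that always leaves MIN_RECENT_MESSAGES messages verbatim. The helper name split_index is hypothetical; the real sliding_window_split_index is not shown on this page and may weigh other factors.

```ruby
MIN_RECENT_MESSAGES = 6

# Hypothetical helper: messages before the returned index are "old"
# (eligible for summarization); messages from it onward stay verbatim.
def split_index(message_count, min_recent: MIN_RECENT_MESSAGES)
  [message_count - min_recent, 0].max
end

split_index(40) # 34 old messages, 6 kept verbatim
split_index(4)  # 0, so there is nothing to summarize
```

A result of 0 corresponds to the early return in ensure_context_summary! when the split index is not positive.
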
SUMMARIZER_MODEL =

Summarizer model.

AiModelConstants.id(:summarization)
CHARS_PER_TOKEN =

Rough chars-per-token ratio. The real ratio is ~3.5 for English, so
dividing character counts by 4 yields a slightly lower token estimate
than the true count.

4
MAX_FORK_DEPTH =

Maximum number of parent→child forks before we refuse to fork again.
Prevents infinite cascade: conv → cont → cont → cont → …

3
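
A minimal sketch of how such a depth guard could work, assuming each continuation keeps a link to its parent. The Conversation struct, fork_depth, and the >= comparison are assumptions for illustration; the real fork_depth_exceeded? is not shown on this page.

```ruby
MAX_FORK_DEPTH = 3

# Hypothetical stand-in for AssistantConversation: only the parent link matters here.
Conversation = Struct.new(:id, :parent)

# Count ancestors by walking parent links until the root is reached.
def fork_depth(conversation)
  depth = 0
  current = conversation
  while (current = current.parent)
    depth += 1
  end
  depth
end

# Assumed rule: refuse a new fork once the chain is already MAX_FORK_DEPTH deep.
def fork_depth_exceeded?(conversation)
  fork_depth(conversation) >= MAX_FORK_DEPTH
end

root  = Conversation.new(1, nil)
cont1 = Conversation.new(2, root)
cont2 = Conversation.new(3, cont1)
cont3 = Conversation.new(4, cont2)

fork_depth_exceeded?(cont2) # false: depth 2, one more fork allowed
fork_depth_exceeded?(cont3) # true: depth 3, refuse to fork again
```
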
OVERHEAD_TOKENS =

Estimated token overhead from system prompt and tool schemas that is NOT
reflected in stored message content. Conservative estimate based on typical
Sunny configurations: ~3K system prompt + ~2-15K per tool service.

15_000
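
One plausible way the thresholds above combine on the character-estimation path is sketched below. The names estimate_tokens and should_summarize? are hypothetical, and adding OVERHEAD_TOKENS to the estimate is an assumption; the real sliding_window_should_summarize? is not shown on this page and also has a path that uses real input_tokens from the API.

```ruby
CONTEXT_TOKEN_THRESHOLD     = 50_000
MAX_MESSAGES_BEFORE_SUMMARY = 40
CHARS_PER_TOKEN             = 4
OVERHEAD_TOKENS             = 15_000

# Hypothetical: estimate tokens from stored content length plus fixed overhead.
def estimate_tokens(contents)
  contents.sum(&:length) / CHARS_PER_TOKEN + OVERHEAD_TOKENS
end

# Hypothetical: summarize when either the token estimate or the raw
# message count crosses its threshold.
def should_summarize?(contents)
  estimate_tokens(contents) > CONTEXT_TOKEN_THRESHOLD ||
    contents.size > MAX_MESSAGES_BEFORE_SUMMARY
end

short_chat = Array.new(10) { 'x' * 400 }    # 4_000 chars in total
long_chat  = Array.new(10) { 'x' * 20_000 } # 200_000 chars in total
many_msgs  = Array.new(41) { 'hi' }         # tiny messages, but 41 of them

should_summarize?(short_chat) # false: 1_000 + 15_000 estimated tokens
should_summarize?(long_chat)  # true: 50_000 + 15_000 estimated tokens
should_summarize?(many_msgs)  # true: message count exceeds 40
```
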
DEDUP_STUB =

Placeholder content used to collapse repeated identical tool calls across all turns.

When the same tool+args combination appears N times in history (e.g. the model
called get_blog_post(868) in 6 different turns), the context re-sends the same
data on every subsequent API call. This replaces the content of all but the
MOST RECENT result for each unique (tool_name, arguments) signature with a
lightweight stub, dramatically reducing context for chatty read-only tools.

Skips messages already stubbed (content starts with "[Already retrieved").
Safe: the latest result is always preserved so the model can reference it.

'[Already retrieved earlier — omitted to reduce context]'
MIN_CALLS_TO_DEDUP =

Only dedup when at least this many identical calls exist.

2
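
The dedup pass described above can be sketched in memory as follows. This is an illustration under stated assumptions, not the real deduplicate_repeated_tool_calls!: the ToolResult struct, the helper names, the use of id to decide which result is most recent, and the sample tool names are all hypothetical.

```ruby
DEDUP_STUB = '[Already retrieved earlier — omitted to reduce context]'
MIN_CALLS_TO_DEDUP = 2
SCHEMA_DISCOVERY_TOOL_SUFFIXES = %w[
  _describe_available_data _get_object_details _list_objects _list_schemas
].freeze

# Hypothetical in-memory message shape; the real code works on stored records.
ToolResult = Struct.new(:id, :tool_name, :arguments, :content)

# Skip results that are already stubbed or that come from schema discovery.
def dedup_candidate?(result)
  return false if result.content.start_with?('[Already retrieved')

  SCHEMA_DISCOVERY_TOOL_SUFFIXES.none? { |suffix| result.tool_name.end_with?(suffix) }
end

def deduplicate!(results)
  groups = results.select { |r| dedup_candidate?(r) }
                  .group_by { |r| [r.tool_name, r.arguments] }
  groups.each_value do |group|
    next if group.size < MIN_CALLS_TO_DEDUP

    # Keep the most recent result (highest id); stub every older one.
    group.sort_by(&:id)[0...-1].each { |r| r.content = DEDUP_STUB }
  end
  results
end

history = [
  ToolResult.new(1, 'get_blog_post', { 'id' => 868 }, '<html>v1</html>'),
  ToolResult.new(2, 'crm_list_objects', {}, 'view_opportunities, view_activities'),
  ToolResult.new(3, 'get_blog_post', { 'id' => 868 }, '<html>v2</html>'),
  ToolResult.new(4, 'crm_list_objects', {}, 'view_opportunities, view_activities')
]
deduplicate!(history)
# Only message 1 is stubbed; message 3 and both schema results keep their content.
```
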
TRUNCATION_KEEP_CHARS =

Truncation threshold: keep this many leading chars from the tool result.
Enough to preserve key data points (IDs, titles, status, error messages)
without carrying full blog HTML or SQL result sets into future turns.

800
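
A minimal sketch of how the two size constants might interact, assuming a plain leading-character cut. The helper name truncate_tool_result and the trailing marker format are hypothetical; the real summarize_tool_result! is not shown on this page.

```ruby
TOOL_RESULT_CHAR_THRESHOLD = 1_500
TRUNCATION_KEEP_CHARS      = 800

# Hypothetical helper: short results pass through untouched; long ones keep
# their leading characters plus a marker noting how much was dropped.
def truncate_tool_result(content)
  return content if content.length < TOOL_RESULT_CHAR_THRESHOLD

  omitted = content.length - TRUNCATION_KEEP_CHARS
  "#{content[0, TRUNCATION_KEEP_CHARS]}\n[truncated #{omitted} chars]"
end

truncate_tool_result('ok')        # returned unchanged
truncate_tool_result('x' * 5_000) # first 800 chars plus a short marker
```
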
CONTEXT_OVERFLOW_PATTERNS =

Regex patterns that identify a provider-level context-length error across
Anthropic, OpenAI, and Gemini. RubyLLM raises BadRequestError (400) for all.

[
  /prompt is too long/i,
  /maximum context length/i,
  /context_length_exceeded/i,
  /exceeds.*context.*window/i,
  /input.*too long/i,
  /too many tokens/i,
  /tokens.*exceed/i,
  /prompt.*exceeds.*limit/i,
  /reduce.*size.*message/i
].freeze

Class Method Details

.compact_tool_results!(conversation) ⇒ Object

Summarize large tool-result messages from the most recent exchange.
Called after finalize_response in the worker so the assistant has
already consumed the raw data.

Parameters:

  • conversation (AssistantConversation)

# File 'app/services/assistant/context_compactor.rb', line 100

def self.compact_tool_results!(conversation)
  tool_messages = large_tool_results(conversation)

  unless tool_messages.empty?
    Rails.logger.info do
      "[ContextCompactor] Compacting #{tool_messages.size} tool result(s) " \
        "for conversation #{conversation.id}"
    end
    tool_messages.each { |msg| summarize_tool_result!(msg) }
  end

  # Deduplicate: when the same tool+args has been called multiple times across
  # turns, stub out older results so the context doesn't keep re-sending the
  # same data. The most recent result is preserved — it reflects current state.
  deduplicate_repeated_tool_calls!(conversation)
rescue StandardError => e
  # Compaction is best-effort — never break the main flow.
  Rails.logger.warn("[ContextCompactor] Tool result compaction failed: #{e.message}")
end

.compaction_cutoff_id(conversation) ⇒ Integer?

Return the ID of the last message covered by the cached summary.
Used by to_llm to filter old messages.

Parameters:

  • conversation (AssistantConversation)

Returns:

  • (Integer, nil)


# File 'app/services/assistant/context_compactor.rb', line 236

def self.compaction_cutoff_id(conversation)
  conversation.compaction_through_message_id
end
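
The sketch below shows one way a caller such as to_llm could combine the cached summary with the cutoff ID. The Message struct, the build_payload helper, the 'system' role, and the summary prefix are all assumptions for illustration; the actual to_llm is not shown on this page.

```ruby
# Hypothetical message shape for illustration.
Message = Struct.new(:id, :role, :content)

# Build the payload: one injected summary message, then only messages
# newer than the cutoff. With no summary, every message passes through.
def build_payload(messages, summary:, cutoff_id:)
  return messages if summary.nil? || cutoff_id.nil?

  recent = messages.select { |m| m.id > cutoff_id }
  [Message.new(nil, 'system', "Summary of earlier conversation:\n#{summary}")] + recent
end

messages = (1..10).map { |i| Message.new(i, i.odd? ? 'user' : 'assistant', "message #{i}") }
payload = build_payload(messages, summary: 'User asked about Q3 pipeline.', cutoff_id: 4)

payload.size              # 7: the summary plus messages 5..10
payload.map(&:id).compact # [5, 6, 7, 8, 9, 10]
```
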

.context_overflow?(error) ⇒ Boolean

Returns true when the error message matches a known context-overflow pattern.

Parameters:

  • error (StandardError)

Returns:

  • (Boolean)


# File 'app/services/assistant/context_compactor.rb', line 697

def self.context_overflow?(error)
  message_text = error.message.to_s
  CONTEXT_OVERFLOW_PATTERNS.any? { |pattern| message_text.match?(pattern) }
end
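
A standalone usage sketch follows, with the pattern list and method copied from above so it runs outside the class. The two error messages are illustrative strings, not verbatim provider output.

```ruby
CONTEXT_OVERFLOW_PATTERNS = [
  /prompt is too long/i,
  /maximum context length/i,
  /context_length_exceeded/i,
  /exceeds.*context.*window/i,
  /input.*too long/i,
  /too many tokens/i,
  /tokens.*exceed/i,
  /prompt.*exceeds.*limit/i,
  /reduce.*size.*message/i
].freeze

def context_overflow?(error)
  message_text = error.message.to_s
  CONTEXT_OVERFLOW_PATTERNS.any? { |pattern| message_text.match?(pattern) }
end

overflow = StandardError.new("This model's maximum context length is 8192 tokens.")
other    = StandardError.new('Invalid API key provided')

context_overflow?(overflow) # true
context_overflow?(other)    # false
```

Matching is on message text only, so any error class works, and broad patterns such as /input.*too long/i will also match unrelated messages that happen to contain those words.
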

.emergency_compact!(conversation, level: 1) ⇒ Boolean

Strategy 3: Emergency Compaction (context overflow recovery)

Called when the LLM rejects a request due to context length. Performs
progressively more aggressive compaction and returns true if the context
was reduced (caller should retry the LLM call).

Levels:

  1. Force-compact all tool results (including small ones from current turn)
    and regenerate the sliding-window summary with a tighter recent window.
  2. Nuclear option — summarize everything except the last 2 messages,
    compact ALL tool results regardless of size.

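Parameters:

  • conversation (AssistantConversation)

  • level (Integer) (defaults to: 1)

    Compaction level, 1 or 2 (see Levels above); any other value returns false
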
Returns:

  • (Boolean)

    true if compaction was performed (caller should retry)



# File 'app/services/assistant/context_compactor.rb', line 434

def self.emergency_compact!(conversation, level: 1)
  case level
  when 1 then emergency_compact_level_one!(conversation)
  when 2 then emergency_compact_level_two!(conversation)
  else false
  end
rescue StandardError => e
  Rails.logger.warn("[ContextCompactor] Emergency compaction level #{level} failed: #{e.message}")
  false
end
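
The retry flow described above (compact, then retry the LLM call, escalating the level) could look roughly like the sketch below. ContextOverflow, FakeCompactor, and call_with_recovery are hypothetical stand-ins; a real caller would presumably rescue the provider error and check it with context_overflow? first.

```ruby
class ContextOverflow < StandardError; end

# Hypothetical stand-in for the compactor: levels 1 and 2 succeed, others do not.
class FakeCompactor
  attr_reader :levels_used

  def initialize
    @levels_used = []
  end

  def emergency_compact!(_conversation, level: 1)
    return false unless [1, 2].include?(level)

    @levels_used << level
    true
  end
end

# Try the call; on overflow escalate through levels 1 and 2, retrying after
# each successful compaction. Re-raise once no level is left.
def call_with_recovery(conversation, compactor)
  level = 0
  begin
    yield
  rescue ContextOverflow
    level += 1
    raise unless compactor.emergency_compact!(conversation, level: level)

    retry
  end
end

attempts = 0
compactor = FakeCompactor.new
result = call_with_recovery(:conversation, compactor) do
  attempts += 1
  raise ContextOverflow, 'prompt is too long' if attempts < 3

  'response'
end

result                # "response"
compactor.levels_used # [1, 2]
```
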

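.ensure_context_summary!(conversation) ⇒ String?

Check whether the conversation needs a sliding-window summary.
If context exceeds the threshold and the cached summary is stale
(or absent), generate a new one.

Returns the summary text (or nil if compaction is not needed).
The caller (to_llm) uses this to decide whether to inject the
summary and truncate old messages.

Parameters:

  • conversation (AssistantConversation)

Returns:

  • (String, nil)

    summary text, or nil if not needed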



# File 'app/services/assistant/context_compactor.rb', line 137

def self.ensure_context_summary!(conversation)
  message_rows = sliding_window_message_rows(conversation)
  return nil unless sliding_window_should_summarize?(conversation, message_rows)

  split_index = sliding_window_split_index(message_rows)
  return nil if split_index <= 0

  split_message_id = message_rows[split_index - 1][0] # last message included in "old"
  cached = sliding_window_cached_summary(conversation, split_message_id)
  return cached if cached

  summary = generate_conversation_summary(message_rows[0...split_index], conversation: conversation)
  sliding_window_persist!(conversation, summary, split_message_id)

  summary
rescue StandardError => e
  Rails.logger.warn("[ContextCompactor] Sliding window summary failed: #{e.message}")
  nil
end

.fork_continuation!(conversation, pending_user_message: nil, tool_services: [], summarize: true) ⇒ AssistantConversation?

Strategy 4: Conversation Fork (manual / power-user)

Creates a new "continuation" conversation with the same owner, carrying a
compact summary of the full prior history as injected context. Used by
manual "Continue with context" and "Branch from here" actions.

NOTE: This is NOT used for automatic context overflow recovery — that is
handled by emergency_compact! + retry. Forking is a deliberate user action.

Parameters:

  • conversation (AssistantConversation)

    The source conversation

  • pending_user_message (String, nil) (defaults to: nil)

    The message to carry forward

  • tool_services (Array<String>) (defaults to: [])

    Tool services to carry forward

  • summarize (Boolean) (defaults to: true)

    When true (default) generates the context summary synchronously.
    Pass false for a fast, redirect-friendly fork — the worker will generate the summary
    before its first LLM call via generate_parent_summary!.

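Returns:

  • (AssistantConversation, nil)

    the continuation conversation, or nil if the fork depth limit is reached or the fork fails
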
# File 'app/services/assistant/context_compactor.rb', line 534

def self.fork_continuation!(conversation, pending_user_message: nil, tool_services: [], summarize: true)
  return nil if fork_depth_exceeded?(conversation)

  summary = fork_continuation_summary(conversation, summarize: summarize)
  continuation = build_fork_record(conversation, suffix: '(cont.)', summary: summary, tool_services: tool_services)
  continuation.save!

  log_fork(conversation, continuation, summary: summary, pending: pending_user_message.present?)
  continuation
rescue StandardError => e
  Rails.logger.error("[ContextCompactor] fork_continuation! failed: #{e.message}")
  nil
end

.fork_from_message!(conversation, message_id:, summarize: true) ⇒ Array(AssistantConversation, String)?

Fork a new conversation branching from a specific user message.

Parameters:

  • conversation (AssistantConversation)

    Source conversation

  • message_id (Integer)

    ID of the AssistantMessage to branch from (must be role 'user')

  • summarize (Boolean) (defaults to: true)

    When true (default) generates the context summary synchronously.
    Pass false for a fast, redirect-friendly fork — the worker will generate the summary
    before its first LLM call via generate_parent_summary!.

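Returns:

  • (Array(AssistantConversation, String), nil)

    the forked conversation and the content of the branch-point message, or nil if the message is not found or the fork fails
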
# File 'app/services/assistant/context_compactor.rb', line 597

def self.fork_from_message!(conversation, message_id:, summarize: true)
  fork_msg = conversation.assistant_messages.where(role: 'user').find_by(id: message_id)
  return nil unless fork_msg

  summary = summarize ? fork_from_message_prior_summary(conversation, fork_msg) : nil
  fork_convo = build_fork_record(conversation, suffix: '(fork)', summary: summary,
                                 tool_services: conversation.tool_services)
  # Store the fork-point so the worker can summarise the right slice
  # of the parent conversation when summarize: false is used.
  fork_convo. = fork_convo..merge('fork_message_id' => fork_msg.id)
  fork_convo.save!

  log_fork(conversation, fork_convo, summary: summary)
  [fork_convo, fork_msg.content.to_s]
rescue StandardError => e
  Rails.logger.error("[ContextCompactor] fork_from_message! failed: #{e.message}")
  nil
end

.generate_parent_summary!(conversation) ⇒ String?

Generate and persist the context summary for a forked conversation whose summary
was deferred at fork time (fork_from_message! with summarize: false).

Reads prior messages from the parent conversation up to the stored fork_message_id,
generates the LLM summary, and persists it directly via a JSONB merge so concurrent
writes (e.g. the AssistantChatWorker token-count sync) cannot clobber it.

Safe to call multiple times — no-op if the summary already exists.

Parameters:

  • conversation (AssistantConversation)

    The forked (child) conversation whose summary was deferred

Returns:

  • (String, nil)

    The generated summary, or nil if skipped / failed



# File 'app/services/assistant/context_compactor.rb', line 639

def self.generate_parent_summary!(conversation)
  return nil if conversation.parent_conversation_summary.present?

  parent = AssistantConversation.find_by(id: conversation.parent_conversation_id)
  return nil unless parent

  prior_msgs = parent_summary_prior_msgs(conversation, parent)
  return nil if prior_msgs.empty?

  summary = generate_conversation_summary(prior_msgs, conversation: parent)
  return nil if summary.blank?

  persist_parent_summary!(conversation, summary)
  summary
rescue StandardError => e
  Rails.logger.warn("[ContextCompactor] generate_parent_summary! failed: #{e.message}")
  nil
end

.schema_discovery_tool?(tool_name) ⇒ Boolean

Whether a tool name belongs to the schema-discovery family. Schema
discovery results are exempt from both verbatim-truncation and the
repeated-call dedup so Sunny keeps the column list in scope all turn.

Parameters:

  • tool_name (String, Symbol, nil)

Returns:

  • (Boolean)


# File 'app/services/assistant/context_compactor.rb', line 58

def self.schema_discovery_tool?(tool_name)
  name = tool_name.to_s
  SCHEMA_DISCOVERY_TOOL_SUFFIXES.any? { |suffix| name.end_with?(suffix) }
end
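
A standalone usage sketch, with the suffix list and method copied from above so it runs outside the class. The tool names warehouse_list_schemas and warehouse_run_query are hypothetical examples of a service prefix followed by a tool suffix.

```ruby
SCHEMA_DISCOVERY_TOOL_SUFFIXES = %w[
  _describe_available_data
  _get_object_details
  _list_objects
  _list_schemas
].freeze

def schema_discovery_tool?(tool_name)
  name = tool_name.to_s
  SCHEMA_DISCOVERY_TOOL_SUFFIXES.any? { |suffix| name.end_with?(suffix) }
end

schema_discovery_tool?('warehouse_list_schemas') # true: ends with _list_schemas
schema_discovery_tool?(:warehouse_run_query)     # false: symbols are accepted via to_s
schema_discovery_tool?(nil)                      # false: nil becomes an empty string
```

Because the check is suffix-based, any tool service gets the exemption as long as its tool names end with one of the listed suffixes.
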