Class: Assistant::ContextCompactor

Inherits: Object

Defined in: app/services/assistant/context_compactor.rb

Overview

Two-layer context compaction to keep AI assistant conversations within
token budgets and reduce cost.

Strategy 1 — Ephemeral Tool Result Compaction (post-exchange):
After a complete exchange, replace ALL large tool result messages with
compact summaries. Tool results (blog HTML, SQL output, API JSON) are
only needed verbatim during the exchange that consumed them — once the
response is streamed they are stale. Keeping them verbatim forces every
subsequent API call to re-send the same payload, bloating context rapidly.
This mirrors how Cursor treats file reads: ephemeral per-request injection,
not accumulated history.

Strategy 2 — Sliding Window + Summary (pre-exchange):
When estimated context exceeds a threshold, compress older messages
into a cached summary and keep only recent messages verbatim.
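The threshold check in Strategy 2 can be pictured with a rough sketch like the one below. It is only an illustration under stated assumptions: the ratio of 3 chars per token and the helper names `estimate_tokens` and `over_budget?` are hypothetical (the real CHARS_PER_TOKEN value and estimation routine are not shown on this page), and adding the overhead to the estimate is a guess at how the two constants combine. The 50_000 and 15_000 figures are the CONTEXT_TOKEN_THRESHOLD and OVERHEAD_TOKENS values listed under Constant Summary.

```ruby
# ASSUMED_CHARS_PER_TOKEN is an illustrative value, NOT taken from the source.
# The other two mirror constants documented on this page.
ASSUMED_CHARS_PER_TOKEN = 3
CONTEXT_TOKEN_THRESHOLD = 50_000
OVERHEAD_TOKENS = 15_000

# Estimate tokens from raw message contents. Dividing by a ratio below the
# real ~3.5 chars/token overestimates, so compaction triggers a bit early.
def estimate_tokens(contents, chars_per_token: ASSUMED_CHARS_PER_TOKEN)
  total_chars = contents.sum { |content| content.to_s.length }
  (total_chars.to_f / chars_per_token).ceil
end

# Hypothetical trigger: stored content plus fixed overhead versus the threshold.
def over_budget?(contents)
  estimate_tokens(contents) + OVERHEAD_TOKENS > CONTEXT_TOKEN_THRESHOLD
end

small = ['hello world'] * 10
large = ['x' * 30_000] * 4

puts estimate_tokens(small) # 110 chars / 3, rounded up
puts over_budget?(small)
puts over_budget?(large)
```

When real input_tokens from the API are available, the page states they are preferred over this kind of character-based estimate.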

Usage:

After a response completes (in the worker):

Assistant::ContextCompactor.compact_tool_results!(conversation)

Before building the LLM payload (in to_llm):

Assistant::ContextCompactor.ensure_context_summary!(conversation)

The class is deliberately kept as a single unit, which is why the
Metrics/ClassLength cop is disabled for it. The lifecycle hooks
(compact / summarize / ensure / fork / emergency_compact) all share the
same conversation scan and tool-call/message classification, so splitting
them would scatter knowledge of the message lifecycle across several
classes. The exempt-list of schema-discovery suffixes also has to live
next to the truncation routine that consults it.

Constant Summary

TOOL_RESULT_CHAR_THRESHOLD = Tool results shorter than this (chars) are left as-is.

1_500

SCHEMA_DISCOVERY_TOOL_SUFFIXES = Schema-discovery results (column lists, table summaries) are referenced repeatedly across tool calls in the same turn — Sunny needs to know what columns exist on view_opportunities / view_activities every time it writes SQL, not just the first time. Truncating these to 800 chars stripped the answer from context and led to 290+ guessed-column SQL failures over a 10-day window. Exempt them from both the verbatim-truncation pass and the repeated-call dedup so the column list stays in scope for the whole turn.

%w[
  _describe_available_data
  _get_object_details
  _list_objects
  _list_schemas
].freeze

CONTEXT_TOKEN_THRESHOLD = When estimated context exceeds this many tokens, trigger sliding-window. When actual input_tokens from the API are available (preferred over char estimation), this threshold is applied directly against those real counts. Kept conservatively low so that tool-definition overhead + mid-flight tool-call additions don't push the next request over the 200k limit.

50_000

MAX_MESSAGES_BEFORE_SUMMARY = Also trigger sliding-window when the raw message count crosses this threshold, regardless of token estimates. Large-context models (Gemini 1M) never hit the token ceiling naturally, so we need a count-based safety valve. 40 messages ≈ 5-8 user turns with tool activity — enough history for coherent continuity.

MIN_RECENT_MESSAGES = Keep at least this many messages verbatim in the recent window. Ensures the LLM always sees enough context for coherent follow-ups. 6 messages ≈ 2 user/assistant exchanges — sufficient for continuity. Combined with immediate tool-result compaction, these messages are short.

SUMMARIZER_MODEL = Summarizer model.

AiModelConstants.id(:summarization)

CHARS_PER_TOKEN = Rough chars-per-token ratio. Conservative (real ratio is ~3.5 for English) so we trigger compaction a bit early rather than too late.

MAX_FORK_DEPTH = Maximum number of parent→child forks before we refuse to fork again. Prevents infinite cascade: conv → cont → cont → cont → …

OVERHEAD_TOKENS = Estimated token overhead from system prompt and tool schemas that is NOT reflected in stored message content. Conservative estimate based on typical Sunny configurations: ~3K system prompt + ~2-15K per tool service. The same source comment also documents ensure_context_summary!: it checks whether the conversation needs a sliding-window summary and, if context exceeds the threshold and the cached summary is stale (or absent), generates a new one. It takes conversation [AssistantConversation] and returns (String, nil): the summary text, or nil if compaction is not needed. The caller (to_llm) uses this to decide whether to inject the summary and truncate old messages.

15_000

DEDUP_STUB = Collapse repeated identical tool calls across all turns. When the same tool+args combination appears N times in history (e.g. the model called get_blog_post(868) in 6 different turns), the context re-sends the same data on every subsequent API call. This replaces the content of all but the MOST RECENT result for each unique (tool_name, arguments) signature with a lightweight stub, dramatically reducing context for chatty read-only tools. Skips messages already stubbed (content starts with "[Already retrieved"). Safe: the latest result is always preserved so the model can reference it.

'[Already retrieved earlier — omitted to reduce context]'

MIN_CALLS_TO_DEDUP = only dedup when ≥ this many identical calls exist

TRUNCATION_KEEP_CHARS = Truncation threshold: keep this many leading chars from the tool result. Enough to preserve key data points (IDs, titles, status, error messages) without carrying full blog HTML or SQL result sets into future turns.

CONTEXT_OVERFLOW_PATTERNS = Regex patterns that identify a provider-level context-length error across Anthropic, OpenAI, and Gemini. RubyLLM raises BadRequestError (400) for all.

[
  /prompt is too long/i,
  /maximum context length/i,
  /context_length_exceeded/i,
  /exceeds.*context.*window/i,
  /input.*too long/i,
  /too many tokens/i,
  /tokens.*exceed/i,
  /prompt.*exceeds.*limit/i,
  /reduce.*size.*message/i
].freeze
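The repeated-call dedup described under DEDUP_STUB could look roughly like the sketch below. The `ToolResult` struct, the `dedup_tool_results!` helper, and the minimum-call value of 2 are assumptions made for illustration; the real private routine and the real MIN_CALLS_TO_DEDUP value are not shown on this page. The stub text and the schema-discovery suffixes are copied from the constants above.

```ruby
# Hypothetical stand-in for a stored tool-result message.
ToolResult = Struct.new(:id, :tool_name, :arguments, :content)

DEDUP_STUB = '[Already retrieved earlier — omitted to reduce context]'
ASSUMED_MIN_CALLS_TO_DEDUP = 2 # illustrative; the real value is not shown
SCHEMA_SUFFIXES = %w[
  _describe_available_data _get_object_details _list_objects _list_schemas
].freeze

def dedup_tool_results!(results, min_calls: ASSUMED_MIN_CALLS_TO_DEDUP)
  # Skip results that are already stubbed or belong to schema discovery.
  candidates = results.reject do |result|
    result.content.start_with?('[Already retrieved') ||
      SCHEMA_SUFFIXES.any? { |suffix| result.tool_name.end_with?(suffix) }
  end

  # One group per unique (tool_name, arguments) signature.
  candidates.group_by { |r| [r.tool_name, r.arguments] }.each_value do |group|
    next if group.size < min_calls

    # Keep the most recent result (highest id); stub every older one.
    group.sort_by(&:id)[0...-1].each { |result| result.content = DEDUP_STUB }
  end
  results
end

results = [
  ToolResult.new(1, 'get_blog_post', { 'id' => 868 }, '<html>v1</html>'),
  ToolResult.new(2, 'crm_list_schemas', {}, 'public, sales'),
  ToolResult.new(3, 'get_blog_post', { 'id' => 868 }, '<html>v2</html>'),
  ToolResult.new(4, 'crm_list_schemas', {}, 'public, sales')
]
dedup_tool_results!(results)
results.each { |result| puts "#{result.id}: #{result.content}" }
```

In this sketch only message 1 is stubbed: message 3 is the latest get_blog_post(868) result, and both crm_list_schemas results are exempt because of their suffix. The tool names are made up for the example.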

Class Method Summary

.compact_tool_results!(conversation) ⇒ Object
Summarize large tool-result messages from the most recent exchange.
.compaction_cutoff_id(conversation) ⇒ Integer?
Return the message ID through which the cached summary covers.
.context_overflow?(error) ⇒ Boolean
Returns true when the error message matches a known context-overflow pattern.
.emergency_compact!(conversation, level: 1) ⇒ Boolean
Strategy 3: Emergency Compaction (context overflow recovery).
.ensure_context_summary!(conversation) ⇒ Object
.fork_continuation!(conversation, pending_user_message: nil, tool_services: [], summarize: true) ⇒ AssistantConversation?
Strategy 4: Conversation Fork (manual / power-user).
.fork_from_message!(conversation, message_id:, summarize: true) ⇒ Array(AssistantConversation, String)?
Fork a new conversation branching from a specific user message.
.generate_parent_summary!(conversation) ⇒ String?
Generate and persist the context summary for a forked conversation whose summary was deferred at fork time (fork_from_message! with summarize: false).
.schema_discovery_tool?(tool_name) ⇒ Boolean
Whether a tool name belongs to the schema-discovery family.

Class Method Details

.compact_tool_results!(conversation) ⇒ Object

Summarize large tool-result messages from the most recent exchange.
Called after finalize_response in the worker so the assistant has
already consumed the raw data.

Parameters:

conversation (AssistantConversation)

# File 'app/services/assistant/context_compactor.rb', line 100

def self.compact_tool_results!(conversation)
  tool_messages = large_tool_results(conversation)

  unless tool_messages.empty?
    Rails.logger.info do
      "[ContextCompactor] Compacting #{tool_messages.size} tool result(s) " \
        "for conversation #{conversation.id}"
    end
    tool_messages.each { |msg| summarize_tool_result!(msg) }
  end

  # Deduplicate: when the same tool+args has been called multiple times across
  # turns, stub out older results so the context doesn't keep re-sending the
  # same data. The most recent result is preserved — it reflects current state.
  deduplicate_repeated_tool_calls!(conversation)
rescue StandardError => e
  # Compaction is best-effort — never break the main flow.
  Rails.logger.warn("[ContextCompactor] Tool result compaction failed: #{e.message}")
end
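A minimal sketch of the size-based compaction step, assuming plain leading-character truncation. The helper `compact_content`, the omission marker, and the 800-char keep length are assumptions: the page mentions 800 chars only in the schema-discovery note and does not show the TRUNCATION_KEEP_CHARS value or the body of summarize_tool_result!. The 1_500 threshold is TOOL_RESULT_CHAR_THRESHOLD from this page.

```ruby
TOOL_RESULT_CHAR_THRESHOLD = 1_500 # documented on this page
ASSUMED_KEEP_CHARS = 800           # assumption, not the confirmed constant

# Leave short results untouched; keep only the leading chars of large ones
# and append a hypothetical marker stating how much was dropped.
def compact_content(content, keep_chars: ASSUMED_KEEP_CHARS)
  return content if content.length < TOOL_RESULT_CHAR_THRESHOLD

  omitted = content.length - keep_chars
  "#{content[0, keep_chars]}\n[... #{omitted} chars omitted]"
end

short_result = 'status: ok'
large_result = '<p>blog html</p>' * 200 # 3_200 chars

puts compact_content(short_result)
puts compact_content(large_result).length
```

Keeping the head of the result matches the stated goal of preserving IDs, titles, status, and error messages, which usually appear early in a payload.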

.compaction_cutoff_id(conversation) ⇒ Integer?

Return the message ID through which the cached summary covers.
Used by to_llm to filter old messages.

Parameters:

conversation (AssistantConversation)

Returns:

(Integer, nil)



# File 'app/services/assistant/context_compactor.rb', line 236

def self.compaction_cutoff_id(conversation)
  conversation.compaction_through_message_id
end
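How a caller might use the cutoff ID is sketched below. `to_llm` itself is not shown on this page, so the `build_payload` helper, the `Message` struct, and the wording of the injected summary message are all hypothetical.

```ruby
# Hypothetical stand-in for a stored conversation message.
Message = Struct.new(:id, :role, :content)

# Drop every message covered by the summary (id <= cutoff_id) and inject the
# summary in their place. Without a summary, send the history unchanged.
def build_payload(messages, summary:, cutoff_id:)
  as_hash = ->(message) { { role: message.role, content: message.content } }
  return messages.map(&as_hash) if summary.nil? || cutoff_id.nil?

  recent = messages.select { |message| message.id > cutoff_id }
  header = { role: 'system', content: "Summary of earlier conversation:\n#{summary}" }
  [header] + recent.map(&as_hash)
end

history = [
  Message.new(10, 'user', 'Show me last quarter'),
  Message.new(11, 'assistant', 'Here is the report'),
  Message.new(12, 'user', 'Now break it down by region')
]

payload = build_payload(history, summary: 'User asked for Q3 revenue.', cutoff_id: 11)
payload.each { |entry| puts "#{entry[:role]}: #{entry[:content]}" }
```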

.context_overflow?(error) ⇒ Boolean

Returns true when the error message matches a known context-overflow pattern.

Parameters:

error (StandardError)

Returns:

(Boolean)

# File 'app/services/assistant/context_compactor.rb', line 697

def self.context_overflow?(error)
  message_text = error.message.to_s
  CONTEXT_OVERFLOW_PATTERNS.any? { |pattern| message_text.match?(pattern) }
end
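The check can be tried in isolation with the sketch below, which copies the pattern list and method body from this page. The two error messages are made-up examples, not captured provider output.

```ruby
CONTEXT_OVERFLOW_PATTERNS = [
  /prompt is too long/i,
  /maximum context length/i,
  /context_length_exceeded/i,
  /exceeds.*context.*window/i,
  /input.*too long/i,
  /too many tokens/i,
  /tokens.*exceed/i,
  /prompt.*exceeds.*limit/i,
  /reduce.*size.*message/i
].freeze

def context_overflow?(error)
  message_text = error.message.to_s
  CONTEXT_OVERFLOW_PATTERNS.any? { |pattern| message_text.match?(pattern) }
end

# Illustrative messages only.
overflow = StandardError.new("This model's maximum context length is 8192 tokens")
unrelated = StandardError.new('Invalid API key')

puts context_overflow?(overflow)  # true
puts context_overflow?(unrelated) # false
```

Because matching is done on message text alone, a broad pattern such as `/input.*too long/i` would also match an unrelated validation error that happens to contain those words, so callers may want to confirm the error class as well.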

.emergency_compact!(conversation, level: 1) ⇒ Boolean

Strategy 3: Emergency Compaction (context overflow recovery)

Called when the LLM rejects a request due to context length. Performs
progressively more aggressive compaction and returns true if the context
was reduced (caller should retry the LLM call).

Levels:

1. Force-compact all tool results (including small ones from the current turn)
   and regenerate the sliding-window summary with a tighter recent window.
2. Nuclear option: summarize everything except the last 2 messages and
   compact ALL tool results regardless of size.

Parameters:

conversation (AssistantConversation)
level (Integer) (defaults to: 1) —
1 or 2

Returns:

(Boolean) —
true if compaction was performed (caller should retry)

# File 'app/services/assistant/context_compactor.rb', line 434

def self.emergency_compact!(conversation, level: 1)
  case level
  when 1 then emergency_compact_level_one!(conversation)
  when 2 then emergency_compact_level_two!(conversation)
  else false
  end
rescue StandardError => e
  Rails.logger.warn("[ContextCompactor] Emergency compaction level #{level} failed: #{e.message}")
  false
end
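The caller-side retry that this method is designed for might look like the sketch below. The worker's real retry code is not shown on this page, so `with_overflow_recovery` and its lambda parameters are hypothetical; the lambdas stand in for `context_overflow?` and `emergency_compact!`.

```ruby
# Run the block; on a context-overflow error, escalate through compaction
# levels 1..max_level and retry after each successful compaction.
def with_overflow_recovery(compact:, overflow:, max_level: 2)
  level = 0
  begin
    yield
  rescue StandardError => e
    raise e unless overflow.call(e)    # unrelated error: propagate
    level += 1
    raise e if level > max_level       # every level already tried
    raise e unless compact.call(level) # nothing was reduced: give up
    retry
  end
end

attempts = 0
levels_used = []

result = with_overflow_recovery(
  compact: ->(level) { levels_used << level; true },
  overflow: ->(error) { error.message.match?(/prompt is too long/i) }
) do
  attempts += 1
  raise 'prompt is too long' if attempts < 3 # simulated provider rejection

  'response'
end

puts result
puts levels_used.inspect
```

In this simulated run the call fails twice, so level 1 and then level 2 are applied before the third attempt succeeds.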

.ensure_context_summary!(conversation) ⇒ Object

# File 'app/services/assistant/context_compactor.rb', line 137

def self.ensure_context_summary!(conversation)
  message_rows = sliding_window_message_rows(conversation)
  return nil unless sliding_window_should_summarize?(conversation, message_rows)

  split_index = sliding_window_split_index(message_rows)
  return nil if split_index <= 0

  split_message_id = message_rows[split_index - 1][0] # last message included in "old"
  cached = sliding_window_cached_summary(conversation, split_message_id)
  return cached if cached

  summary = generate_conversation_summary(message_rows[0...split_index], conversation: conversation)
  sliding_window_persist!(conversation, summary, split_message_id)

  summary
rescue StandardError => e
  Rails.logger.warn("[ContextCompactor] Sliding window summary failed: #{e.message}")
  nil
end
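The split between "old" and "recent" messages in the method above can be illustrated with a simplified sketch. The helper `split_rows` is hypothetical, and the window size of 6 is taken from the "6 messages" remark in the MIN_RECENT_MESSAGES description rather than from a confirmed value; the real sliding_window_split_index may apply additional rules. Rows are modelled as arrays with the message ID first, matching the `message_rows[split_index - 1][0]` access in the source.

```ruby
ASSUMED_MIN_RECENT = 6 # assumption based on the constant's description

# rows: [[id, role, content], ...] ordered oldest to newest.
# Everything before split_index is summarized; the rest stays verbatim.
def split_rows(rows, min_recent: ASSUMED_MIN_RECENT)
  split_index = [rows.size - min_recent, 0].max
  cutoff_id = split_index.positive? ? rows[split_index - 1][0] : nil
  { old: rows[0...split_index], recent: rows[split_index..], cutoff_id: cutoff_id }
end

rows = (1..10).map { |i| [i, i.odd? ? 'user' : 'assistant', "message #{i}"] }
parts = split_rows(rows)

puts parts[:old].map(&:first).inspect    # IDs covered by the summary
puts parts[:recent].map(&:first).inspect # IDs kept verbatim
puts parts[:cutoff_id].inspect
```

A split index of 0 means there is nothing old enough to summarize, which corresponds to the `return nil if split_index <= 0` guard above.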

.fork_continuation!(conversation, pending_user_message: nil, tool_services: [], summarize: true) ⇒ AssistantConversation?

Strategy 4: Conversation Fork (manual / power-user)

Creates a new "continuation" conversation with the same owner, carrying a
compact summary of the full prior history as injected context. Used by
manual "Continue with context" and "Branch from here" actions.

NOTE: This is NOT used for automatic context overflow recovery — that is
handled by emergency_compact! + retry. Forking is a deliberate user action.

Parameters:

conversation (AssistantConversation) —
The source conversation
pending_user_message (String, nil) (defaults to: nil) —
The message to carry forward
tool_services (Array<String>) (defaults to: []) —
Tool services to carry forward
summarize (Boolean) (defaults to: true) —
When true (default), generates the context summary synchronously.
Pass false for a fast, redirect-friendly fork; the worker will then generate the summary
before its first LLM call via generate_parent_summary!.

Returns:

(AssistantConversation, nil) —
The new continuation, or nil on failure

# File 'app/services/assistant/context_compactor.rb', line 534

def self.fork_continuation!(conversation, pending_user_message: nil, tool_services: [], summarize: true)
  return nil if fork_depth_exceeded?(conversation)

  summary = fork_continuation_summary(conversation, summarize: summarize)
  continuation = build_fork_record(conversation, suffix: '(cont.)', summary: summary, tool_services: tool_services)
  continuation.save!

  log_fork(conversation, continuation, summary: summary, pending: pending_user_message.present?)
  continuation
rescue StandardError => e
  Rails.logger.error("[ContextCompactor] fork_continuation! failed: #{e.message}")
  nil
end
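The fork-depth guard at the top of this method could be implemented along the lines of the sketch below. This is an assumption-laden illustration: the `Conversation` struct, the `fork_depth` helper, and the limit of 3 are hypothetical, since neither the body of fork_depth_exceeded? nor the MAX_FORK_DEPTH value appears on this page.

```ruby
# Hypothetical stand-in: each conversation points at the one it was forked from.
Conversation = Struct.new(:id, :parent)

ASSUMED_MAX_FORK_DEPTH = 3 # illustrative; the real value is not shown

# Count parent links up to the root. The seen-set guards against a cyclic chain.
def fork_depth(conversation)
  depth = 0
  seen = {}
  node = conversation
  while node.parent && !seen[node.id]
    seen[node.id] = true
    depth += 1
    node = node.parent
  end
  depth
end

def fork_depth_exceeded?(conversation, max_depth: ASSUMED_MAX_FORK_DEPTH)
  fork_depth(conversation) >= max_depth
end

root = Conversation.new(1, nil)
first = Conversation.new(2, root)
second = Conversation.new(3, first)
third = Conversation.new(4, second)

puts fork_depth(third)
puts fork_depth_exceeded?(second)
puts fork_depth_exceeded?(third)
```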

.fork_from_message!(conversation, message_id:, summarize: true) ⇒ Array(AssistantConversation, String)?

Fork a new conversation branching from a specific user message.

Parameters:

conversation (AssistantConversation) —
Source conversation
message_id (Integer) —
ID of the AssistantMessage to branch from (must be role 'user')
summarize (Boolean) (defaults to: true) —
When true (default), generates the context summary synchronously.
Pass false for a fast, redirect-friendly fork; the worker will then generate the summary
before its first LLM call via generate_parent_summary!.

Returns:

(Array(AssistantConversation, String), nil) —
[fork, prefill_text], or nil on failure

# File 'app/services/assistant/context_compactor.rb', line 597

def self.fork_from_message!(conversation, message_id:, summarize: true)
  fork_msg = conversation.assistant_messages.where(role: 'user').find_by(id: message_id)
  return nil unless fork_msg

  summary = summarize ? fork_from_message_prior_summary(conversation, fork_msg) : nil
  fork_convo = build_fork_record(conversation, suffix: '(fork)', summary: summary,
                                               tool_services: conversation.tool_services)
  # Store the fork-point so the worker can summarise the right slice
  # of the parent conversation when summarize: false is used.
  fork_convo.metadata = fork_convo.metadata.merge('fork_message_id' => fork_msg.id)
  fork_convo.save!

  log_fork(conversation, fork_convo, summary: summary)
  [fork_convo, fork_msg.content.to_s]
rescue StandardError => e
  Rails.logger.error("[ContextCompactor] fork_from_message! failed: #{e.message}")
  nil
end

.generate_parent_summary!(conversation) ⇒ String?

Generate and persist the context summary for a forked conversation whose summary
was deferred at fork time (fork_from_message! with summarize: false).

Reads prior messages from the parent conversation up to the stored fork_message_id,
generates the LLM summary, and persists it directly via a JSONB merge so concurrent
writes (e.g. the AssistantChatWorker token-count sync) cannot clobber it.

Safe to call multiple times — no-op if the summary already exists.

Parameters:

conversation (AssistantConversation)

Returns:

(String, nil) —
The generated summary, or nil if skipped / failed

# File 'app/services/assistant/context_compactor.rb', line 639

def self.generate_parent_summary!(conversation)
  return nil if conversation.parent_conversation_summary.present?

  parent = AssistantConversation.find_by(id: conversation.parent_conversation_id)
  return nil unless parent

  prior_msgs = parent_summary_prior_msgs(conversation, parent)
  return nil if prior_msgs.empty?

  summary = generate_conversation_summary(prior_msgs, conversation: parent)
  return nil if summary.blank?

  persist_parent_summary!(conversation, summary)
  summary
rescue StandardError => e
  Rails.logger.warn("[ContextCompactor] generate_parent_summary! failed: #{e.message}")
  nil
end

.schema_discovery_tool?(tool_name) ⇒ Boolean

Whether a tool name belongs to the schema-discovery family. Schema
discovery results are exempt from both verbatim-truncation and the
repeated-call dedup so Sunny keeps the column list in scope all turn.

Parameters:

tool_name (String, Symbol, nil)

Returns:

(Boolean)

# File 'app/services/assistant/context_compactor.rb', line 58

def self.schema_discovery_tool?(tool_name)
  name = tool_name.to_s
  SCHEMA_DISCOVERY_TOOL_SUFFIXES.any? { |suffix| name.end_with?(suffix) }
end
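A quick standalone check of the suffix match, using the constant and method body from this page. The service prefixes in the tool names (`crm_`, `warehouse_`) are invented for the example.

```ruby
SCHEMA_DISCOVERY_TOOL_SUFFIXES = %w[
  _describe_available_data
  _get_object_details
  _list_objects
  _list_schemas
].freeze

def schema_discovery_tool?(tool_name)
  name = tool_name.to_s
  SCHEMA_DISCOVERY_TOOL_SUFFIXES.any? { |suffix| name.end_with?(suffix) }
end

puts schema_discovery_tool?('crm_list_schemas')              # true
puts schema_discovery_tool?(:warehouse_get_object_details)   # true, symbols are accepted
puts schema_discovery_tool?('crm_run_sql')                   # false
puts schema_discovery_tool?(nil)                             # false, nil becomes ""
```

Matching on the suffix rather than the full name lets any tool service expose these discovery tools under its own prefix and still be exempt.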
