Class: Assistant::ChatService

Inherits: Object

Includes: PromptComposer

Defined in: app/services/assistant/chat_service.rb

Overview

Service for AI-powered assistant chat using RubyLLM's acts_as_chat.
Uses tool-based architecture: the LLM calls registered tools (DB, content search, etc.)
rather than generating raw SQL. Conversation history is managed by RubyLLM automatically.

Defined Under Namespace

Classes: Result

Constant Summary

THINKING_BUDGET_LOW = Extended Thinking configuration — gives reasoning models a scratchpad for multi-step problems (SQL construction, analytical reasoning). Budget is in tokens; Anthropic models require it, Gemini uses it as a cap.

4_000

THINKING_BUDGET_MEDIUM = Simple tool queries

8_000

THINKING_BUDGET_HIGH = Analytical queries with JOINs/aggregation

16_000

THINKING_BUDGET_MAX = Complex multi-step reasoning (Opus only)

64_000

EFFORT_RANK = Relative ordering of thinking-effort tiers, lowest → highest. Lets the router pick the higher of a query-driven floor and a model's configured default, so a :max-default model is never capped at :high.

{ low: 0, medium: 1, high: 2, max: 3 }.freeze
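A minimal sketch of how a rank table like EFFORT_RANK can be used to take the higher of two tiers. The helper name `higher_effort` and the stand-in constant are illustrative assumptions, not part of ChatService; the router's real code is not shown on this page.

```ruby
# Stand-in copy of the ranking table so the sketch runs on its own.
EFFORT_RANK_SKETCH = { low: 0, medium: 1, high: 2, max: 3 }.freeze

# Hypothetical helper: returns whichever tier ranks higher.
def higher_effort(query_floor, model_default)
  [query_floor, model_default].max_by { |tier| EFFORT_RANK_SKETCH.fetch(tier) }
end

higher_effort(:high, :max) # => :max  (a :max-default model is not capped at :high)
higher_effort(:high, :low) # => :high (the query-driven floor wins)
```

Comparing ranks rather than symbols keeps the ordering in one table, so adding a tier only means adding one entry.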

THINKING_QUERY_PATTERNS = Patterns that indicate the query would benefit from extended thinking

/\b(compare|analyze|trend|correlat|calculate|forecast|predict|why|root.?cause|deep.?dive|break.?down|step.?by.?step|optimize|investigate|audit|reconcil|year.?over.?year|month.?over.?month)\b/i
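The following sketch exercises the pattern above against a few sample queries (the queries are made up for illustration). It also shows a property of the regex as written: because the group is wrapped in `\b` on both sides, stem-like alternatives such as `correlat` and `trend` only match when they end at a word boundary. Whether that is intended is not stated in the source.

```ruby
# Pattern copied from THINKING_QUERY_PATTERNS above.
THINKING_PATTERN_SKETCH = /\b(compare|analyze|trend|correlat|calculate|forecast|predict|why|root.?cause|deep.?dive|break.?down|step.?by.?step|optimize|investigate|audit|reconcil|year.?over.?year|month.?over.?month)\b/i

# Whole-word alternatives match as expected.
THINKING_PATTERN_SKETCH.match?('Why did returns spike year over year?') # => true
THINKING_PATTERN_SKETCH.match?('List open orders')                       # => false

# The trailing \b requires the alternative to end at a word boundary, so
# stems do not match longer words that merely start with them.
THINKING_PATTERN_SKETCH.match?('Show the correlation between A and B')   # => false
THINKING_PATTERN_SKETCH.match?('Sales trends by region')                 # => false
```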

LLM_NETWORK_RETRY_EXCEPTIONS = Transient provider / TLS failures (AppSignal #4527: Faraday::SSLError SSL_read EOF).

[
  Faraday::SSLError,
  Faraday::ConnectionFailed,
  Faraday::TimeoutError,
  OpenSSL::SSL::SSLError,
  RubyLLM::ServiceUnavailableError, # HTTP 502/503/504 — upstream gateway transient
  RubyLLM::OverloadedError          # HTTP 529 — Anthropic "service overloaded" transient
].freeze
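A list like this is typically splatted into a `rescue` clause. The sketch below shows that pattern with standard-library errors standing in for the Faraday, OpenSSL, and RubyLLM classes. The helper name, attempt limit, and backoff are assumptions for illustration; the service's own `with_llm_network_retries` is not shown on this page.

```ruby
require 'timeout'

# Stand-in list; the real constant names Faraday, OpenSSL, and RubyLLM errors.
RETRYABLE_SKETCH = [Timeout::Error, Errno::ECONNRESET].freeze

# Hypothetical helper: re-run the block on a listed error, up to max_attempts.
def with_network_retries_sketch(max_attempts: 3, base_delay: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *RETRYABLE_SKETCH
    raise if attempts >= max_attempts

    sleep(base_delay * attempts) # linear backoff; 0 keeps the sketch fast
    retry
  end
end

result = with_network_retries_sketch do |attempt|
  raise Errno::ECONNRESET if attempt < 3

  "succeeded on attempt #{attempt}"
end
# result => "succeeded on attempt 3"
```

Errors outside the list propagate immediately, so only failures classed as transient are replayed.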

MODELS = Available models with their configurations. Model IDs come from AiModelConstants, the single source of truth. supports_thinking: whether the model supports RubyLLM's with_thinking (extended reasoning). thinking_effort_default: the default effort level when thinking is activated (:low, :medium, :high, :max).

{
  'claude-haiku'  => { id: AiModelConstants.id(:anthropic_haiku),  provider: :anthropic, label: 'Claude Haiku 4.5 (Fast)',      cost: :low,    supports_thinking: false },
  'claude-sonnet' => { id: AiModelConstants.id(:anthropic_sonnet), provider: :anthropic, label: 'Claude Sonnet 4.6 (Balanced)', cost: :medium, supports_thinking: true, thinking_effort_default: :medium },
  'claude-opus'   => { id: AiModelConstants.id(:anthropic_opus),   provider: :anthropic, label: 'Claude Opus 4.8 (Best — highest cost)', cost: :high, supports_thinking: true, thinking_effort_default: :high },
  # Same Opus model, opened up to the 1M-token context window via the
  # Anthropic context-1m beta header (see configure_conversation). Top rung
  # of the complexity-escalation ladder for marathon / huge-context sessions.
  'claude-opus-1m' => { id: AiModelConstants.id(:anthropic_opus),  provider: :anthropic, label: 'Claude Opus 4.8 (1M context)', cost: :high, supports_thinking: true, thinking_effort_default: :high, context_1m: true },
  'gpt-5'         => { id: AiModelConstants.id(:openai_gpt5),      provider: :openai,    label: 'GPT-5 (OpenAI)',               cost: :medium, supports_thinking: false },
  'gpt-5.5'       => { id: AiModelConstants.id(:openai_gpt55),     provider: :openai,    label: 'GPT-5.5 (OpenAI Latest)',      cost: :medium, supports_thinking: false },
  'gpt-5-mini'    => { id: AiModelConstants.id(:openai_gpt5_mini), provider: :openai,    label: 'GPT-5 Mini (Fast)',            cost: :low,    supports_thinking: false },
  'gemini-flash'  => { id: AiModelConstants.id(:gemini_flash),     provider: :gemini,    label: 'Gemini 3.5 Flash (Recommended)', cost: :low, supports_thinking: true, thinking_effort_default: :low },
  'gemini-pro'    => { id: AiModelConstants.id(:gemini_pro),       provider: :gemini,    label: 'Gemini 3.5 Flash · High Reasoning (Google)', cost: :medium, supports_thinking: true, thinking_effort_default: :high }
}.freeze

DEFAULT_MODEL = Default model.

'gemini-flash'

CONTEXT_1M_BETA = Anthropic beta token that unlocks Opus's 1M-token context window, applied only to the 'claude-opus-1m' model (context_1m: true) via with_headers. Validated live against api.anthropic.com on 2026-06-03 — accepted (HTTP 200), as was adaptive thinking at effort=max on claude-opus-4-8.

'context-1m-2025-08-07'

MAX_PLAN_COST_USD = Hard cap on estimated plan execution cost (USD) across isolated step + assembly LLM calls. NOTE: plan_cost underestimates because run_plan_step_executor returns only the FINAL API round's tokens (not the cumulative total across tool-call rounds within a step). Real per-step cost is typically 5-10× higher than reported. The primary cost guard is the ToolLoopGuard's per-step call limit, not this cap.

2.00

MAX_PLAN_STEP_DURATION = Wall-clock timeout per plan step — driven from ToolLoopGuard so both the outer Timeout and the inner guard share a single source of truth.

Assistant::ToolLoopGuard::MAX_STEP_DURATION.seconds

STEP_RESULT_SUMMARIZE_THRESHOLD = Above this size, step output is summarized with a cheap model before the next step.

2_000

MID_TURN_COMPACT_THRESHOLD = Mid-turn compaction thresholds (see install_mid_turn_compaction!)

2_000

MID_TURN_KEEP_CHARS = Mid turn keep chars.

MID_TURN_SKIP_PREFIXES = Mid turn skip prefixes.

['[Compacted', '[Truncated', '[Already retrieved'].freeze

COMPLEX_QUERY_PATTERNS = Keywords indicating complex analytical or reasoning queries (need better models)

/\b(why|trend|pattern|anomaly|recommend|insight|correlation|predict|forecast|explain|root.?cause|deep.?dive|strategic|analyze|summarize|evaluate|pros?.and.cons|trade.?off)\b/i

COMPARISON_QUERY_PATTERNS = Keywords indicating multi-step comparison or research queries (need balanced models)

/\b(compare|vs|versus|between|difference|change|growth|decline|year.?over.?year|month.?over.?month|yoy|mom|research|investigate|audit)\b/i

SIMPLE_QUERY_PATTERNS = Keywords indicating simple lookup or factual queries (fast models are fine)

/\b(show|list|get|total|count|how many|what is|what are|sum|average|find|look up|search|where is|who is|when did)\b/i

COMPOSE_QUERY_PATTERNS = Phrases that indicate the user is drafting/composing a short message (email, follow-up, outreach, internal summary). These are quick content-generation tasks where Flash is fast and good enough — Pro's extended thinking is wasted budget here, and on long prompts (e.g. pasted email threads) we'd otherwise route them to Pro and time out.

/\b(reply|respond|send|email|follow.?up|outreach|reach out|thank.?you note|summary email)\b/i

WRITING_QUERY_PATTERNS = Phrases that indicate long-form editorial work (blog posts, articles, FAQs, rewrites). Flash produces noticeably weaker prose here — see /assistant/1639, where a Buffalo bathroom blog post written under Flash drew "wrote very poorly" feedback from the editor. Content-authoring tasks now route to Claude Sonnet: every Gemini tier proved slow and unreliable on long HTML body edits — the old gemini-3.1-pro preview intermittently 400'd (#3808) and ground out the full 600s plan-step timeout on complex edits (#4714, conv 3098), which is why the Gemini Pro snapshots were dropped from the registry entirely. Opus is intentionally excluded as too expensive for routine editorial work.

/\b(rewrite|polish|copyedit|copy.?edit|long.?form|article|blog post|blog ?article|blog ?entry|essay|narrative|edit blog|write the blog|draft the blog|update the blog|update the article|expand this section|tighten this|story|landing page copy|product description|press release|case study|whitepaper|white ?paper|content brief|seo copy|meta description|page copy|h(?:ero|eading) copy|body copy|email template|email campaign|email blast|email copy|email design|newsletter)\b/i

CONTENT_AUTHORING_SERVICES = Tool services whose presence marks a content-authoring turn. When the classifier routes a turn to these, it gets Claude regardless of the query wording (covers follow-ups like "now add a CTA" that lack writing keywords).

%w[blog_management email_management].freeze

WRITING_MODEL_DEFAULT = Model we auto-route content-authoring work to. The one place we deliberately auto-pick Anthropic — Claude is materially more reliable + faster at HTML body editing than any Gemini tier. Opus stays opt-in (cost) for general editorial; blog editing is the exception — see BLOG_AUTHORING_MODEL.

'claude-sonnet'

WRITING_MODEL_CLAUDE = Claude model key that auto_select_model routes content-authoring (email/general editorial) turns to.

'claude-sonnet'

BLOG_AUTHORING_MODEL = Blog editing is the heaviest content-authoring workload: large HTML bodies, many block-level tool calls, long multi-turn sessions. On Gemini — and even Sonnet — these turns repeatedly tripped the body-less Gemini 400 (#3808) and the 600s plan-step timeout (#4714), and large posts got shredded by mid-turn compaction — leaving the model editing from truncated HTML and looping until it timed out (convs 3105/3109, Julia). Route blog editing to Opus 4.8 on the 1M-token context window from the FIRST turn so the model has both the capability and the context headroom to finish without choking, instead of starting cheap and escalating only after it has already failed. Cost is the deliberate tradeoff for blog work specifically — email/general editorial stay on Sonnet. Defined as a constant so the tier is easy to retune.

'claude-opus-1m'

BLOG_AUTHORING_SERVICES = Classifier tool services that mark a blog authoring turn (vs. email).

%w[blog_management].freeze

BLOG_AUTHORING_PATTERNS = Query wording that signals blog editing even without a classifier tool hint (e.g. tests, or a turn the classifier abstained on). Deliberately blog-ONLY: generic "article"/"the article" wording is left to content_authoring_turn? → Sonnet, so a plain editorial edit isn't forced onto the pricier Opus-1M tier. Also matches a pasted WarmlyYours blog-post URL (…/posts/) and the "for this/the blog" lead-in: the common way an editor kicks off a blog task is to paste the post URL ("for this blog https://…/posts/…/preview"), which carries no other blog keyword and otherwise fell through to Gemini and hit the intermittent body-less 400 (#3808, conv 3150). %r{} so the /posts/ path needs no escaping.

%r{\b(blog post|blog ?article|blog ?entry|edit (?:the )?blog|update (?:the )?blog|write (?:the )?blog|draft (?:the )?blog|rewrite (?:the )?blog(?: post| article| entry)?|for (?:this|the) blog)\b|/posts/[\w-]+}i
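A short sketch of what the pattern above accepts and rejects. The sample queries and the example.com URL are made up for illustration.

```ruby
# Pattern copied from BLOG_AUTHORING_PATTERNS above.
BLOG_PATTERN_SKETCH = %r{\b(blog post|blog ?article|blog ?entry|edit (?:the )?blog|update (?:the )?blog|write (?:the )?blog|draft (?:the )?blog|rewrite (?:the )?blog(?: post| article| entry)?|for (?:this|the) blog)\b|/posts/[\w-]+}i

BLOG_PATTERN_SKETCH.match?('Please edit the blog intro')        # => true
BLOG_PATTERN_SKETCH.match?('https://example.com/posts/my-post') # => true  (URL path alone)
BLOG_PATTERN_SKETCH.match?('Update the article headline')       # => false (generic wording)
```

The last case is the deliberate gap described above: generic "article" wording is left to the content-authoring check rather than forced onto the blog tier.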

WRITING_ELIGIBLE_MODELS = Model keys that writing work may run on: the two writing models plus the blog authoring model, deduplicated.

[WRITING_MODEL_DEFAULT, WRITING_MODEL_CLAUDE, BLOG_AUTHORING_MODEL].uniq.freeze

MODEL_COST_TIER = Cost tiers for model affinity decisions. Switching models mid-conversation loses accumulated reasoning context, so we only switch when escalating to a higher tier (never laterally).

MODELS.transform_values { |c| c[:cost] }.freeze

Constants included from PromptComposer

PromptComposer::AGENT_PROMPTS_DIR, PromptComposer::ANALYTICS_SERVICES, PromptComposer::DOMAIN_TOOL_REQUIREMENTS, PromptComposer::INSTRUCTIONS_TEMPLATE_PATH, PromptComposer::MESSAGE_DOMAIN_PATTERNS

Instance Attribute Summary

#model_key ⇒ String readonly
The concrete model key this turn resolved to (e.g. 'claude-sonnet').

Class Method Summary

.auto_select_candidate(query, history_length: 0, current_model: nil) ⇒ Object
.auto_select_model(query, history_length: 0, current_model: nil, classifier_result: nil, active_services: []) ⇒ Hash
Auto-select the best model based on query complexity.
.available_models ⇒ Object
Class method to get available models for UI (includes Auto option first).
.estimate_tokens(text) ⇒ Integer
Rough token estimate (1 token ≈ 4 chars for English).
.label_for_model(model_key) ⇒ Object
Resolve a stored model preference / llm_model_name (e.g. 'gemini-pro') to a human-readable label that includes the actual underlying model id (e.g. "Gemini 3.5 Flash · High Reasoning (Google)").

Instance Method Summary

#call(&block) ⇒ Object
Execute the chat with streaming response.
#complete_only(&block) ⇒ Object
Retry path after emergency compaction: reconfigure the conversation and call complete() directly.
#emit_status(message) ⇒ Object protected
Emit a status update for the UI (non-content, just progress indicator).
#initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: []) ⇒ ChatService constructor
A new instance of ChatService.
#stream_content(content) ⇒ Object protected
Stream content to client AND capture for conversation history.
#with_instrumented_llm_call(feature:, source: 'sunny') { ... } ⇒ Object protected
Wraps an LLM call with PaperTrail audit context, CurrentScope user, instrumentation metadata, and transient network retries.

Constructor Details

#initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: []) ⇒ `ChatService`

Returns a new instance of ChatService.

Parameters:

conversation (AssistantConversation) —
The conversation record (acts_as_chat)
user_message (String) —
The user's query
model (String) (defaults to: 'auto') —
LLM model key or 'auto'
tool_services (Array<String>) (defaults to: []) —
Service keys for tool access
permitted_services (Array<String>) (defaults to: []) —
All service keys the user's role allows (for tool suggestion prompt)
user_context (Hash) (defaults to: {}) —
User identity for personalized queries
on_status (Proc) (defaults to: nil) —
Callback for status events
cancel_check (Proc) (defaults to: nil) —
Returns true when the caller wants to abort (e.g. user clicked Stop)
attachments (Array<Pathname>) (defaults to: []) —
Optional file paths to attach to the message (PDFs, images, etc.)

# File 'app/services/assistant/chat_service.rb', line 349

def initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: [])
  @conversation = conversation
  @user_message = user_message
  @tool_services = Array(tool_services).compact_blank
  @permitted_services = Array(permitted_services).compact_blank
  @user_context = user_context || {}
  @on_status = on_status
  @cancel_check = cancel_check
  @attachments = Array(attachments).select do |p|
    if p.respond_to?(:exist?)
      p.exist?                      # Pathname — check local file
    elsif p.to_s.start_with?('http://', 'https://')
      true                          # URL — pass through to RubyLLM
    else
      File.exist?(p.to_s)           # String path — check local file
    end
  end
  @auto_selected = false
  @model_selection_reason = nil

  # Derive role from user context for tool access control.
  # user_context is a serialized Hash from the controller with 'is_admin' and 'is_manager' keys.
  @user_role = if @user_context['is_admin']
                 :admin
               elsif @user_context['is_manager']
                 :manager
               else
                 :employee
               end

  # Resolve data domain access from the user's CanCanCan roles.
  # This narrows which views/tables the AI tools can query.
  @account = Account.find_by(id: @user_context['account_id']) if @user_context['account_id']
  @allowed_objects = @account ? Assistant::DataPolicy.allowed_objects_for_account(@account) : nil
  @analytics_domains = Array(@user_context['analytics_domains'])

  history_length = @conversation.assistant_messages.count

  # Handle 'auto' model selection
  if model == 'auto' || !MODELS.key?(model)
    selection = self.class.auto_select_model(
      user_message,
      history_length: history_length,
      current_model: @conversation.llm_model_name,
      active_services: Array(@conversation.tool_services)
    )
    @model_key = selection[:model]
    @model_selection_reason = selection[:reason]
    @auto_selected = true
  else
    @model_key = model
  end

  # Complexity-aware upgrade: a session that STARTED cheap but has since
  # revealed its complexity — the model declared a multi-step plan, or the
  # conversation has grown long — climbs the model ladder. Only in auto mode
  # (never override an explicit user pick) and only while the user's monthly
  # budget allows; out of budget → stay on the cheap tier. See
  # Assistant::MonthlyBudget and
  # doc/tasks/202606031730_SUNNY_BUDGET_AND_AUTO_ESCALATION.md.
  #
  # Gate on the stored preference, not @auto_selected: the controller often
  # pre-resolves 'auto' to a concrete key before this point (one classifier
  # pass picks tools + tier), which would otherwise hide auto mode here.
  if auto_model_mode?
    upgrade = Assistant::ComplexityEscalator.upgrade(
      current_model: @model_key,
      plan_step_count: Array(@conversation.execution_plan&.dig('steps')).size,
      history_length: history_length,
      user_context: @user_context
    )
    if upgrade
      @model_key = upgrade[:model]
      @model_selection_reason = upgrade[:reason]
    end
  end

  @model_config = MODELS[@model_key]
end
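The attachment filter in the constructor accepts three input shapes: objects that respond to `exist?` (such as Pathname), URL strings, and plain string paths. The sketch below isolates that branch logic so it can be run without Rails. The method name `usable_attachments` and the sample paths are hypothetical.

```ruby
require 'pathname'
require 'tmpdir'

# Hypothetical stand-alone version of the attachment filter in #initialize.
def usable_attachments(attachments)
  Array(attachments).select do |p|
    if p.respond_to?(:exist?)
      p.exist?                                   # Pathname: check local file
    elsif p.to_s.start_with?('http://', 'https://')
      true                                       # URL: passed through unchecked
    else
      File.exist?(p.to_s)                        # String path: check local file
    end
  end
end

existing = Pathname.new(Dir.tmpdir)
missing  = Pathname.new(File.join(Dir.tmpdir, 'no-such-file-for-sketch.pdf'))
kept = usable_attachments([existing, missing, 'https://example.com/report.pdf'])
# kept holds the existing Pathname and the URL; the missing path is dropped.
```

Note that URLs are never fetched or validated at this stage, so a dead link survives the filter and only fails later.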

Instance Attribute Details

#model_key ⇒ `String` (readonly)

The concrete model key this turn resolved to (e.g. 'claude-sonnet'). For
an explicit pick this equals the requested model; for 'auto' it is the key
the complexity/affinity selector chose. Lets callers base a decision on the
backend that actually ran rather than the requested alias, e.g.
AssistantChatWorker's transient-400 recovery, so an 'auto' turn that
resolved to claude-sonnet retries on a different backend instead of
replaying the model that just 400'd.

Returns:

(String)




# File 'app/services/assistant/chat_service.rb', line 479

def model_key
  @model_key
end

Class Method Details

.auto_select_candidate(query, history_length: 0, current_model: nil) ⇒ `Object`

# File 'app/services/assistant/chat_service.rb', line 283

def self.auto_select_candidate(query, history_length: 0, current_model: nil)
  query_lower = query.downcase.strip
  token_count = estimate_tokens(query)

  is_writing    = query_lower.match?(WRITING_QUERY_PATTERNS)
  is_complex    = query_lower.match?(COMPLEX_QUERY_PATTERNS)
  is_comparison = query_lower.match?(COMPARISON_QUERY_PATTERNS)
  is_compose    = query_lower.match?(COMPOSE_QUERY_PATTERNS)
  is_simple     = query_lower.match?(SIMPLE_QUERY_PATTERNS) && !is_comparison && !is_complex && !is_writing
  long_conversation = history_length > 20

  # Long-form editorial work (blog posts, articles, rewrites) must always run
  # on a Pro/Sonnet-tier model — Flash produces noticeably weaker prose. We
  # force_switch so a conversation that started on Flash doesn't hold writing
  # turns hostage via model affinity. Stay on Sonnet only if the conversation
  # is already on a Claude model; otherwise default to Claude Sonnet.
  if is_writing
    chosen = current_model == WRITING_MODEL_CLAUDE ? WRITING_MODEL_CLAUDE : WRITING_MODEL_DEFAULT
    return { model: chosen, reason: 'Writing/editorial task', force_switch: true }
  end

  # Compose/email tasks are short content generation, not analysis. Keep them
  # on Flash even when the prompt is long (pasted email threads inflate token
  # counts but don't require deep reasoning) — Pro burns most of
  # MAX_TURN_DURATION on extended thinking before any tool runs.
  return { model: 'gemini-flash', reason: 'Compose/email task', force_switch: true } if is_compose && !is_complex

  if is_complex || token_count > 200
    { model: 'gemini-pro', reason: 'Complex analytical query' }
  elsif is_comparison || token_count > 80
    { model: 'gemini-flash', reason: 'Multi-step query' }
  elsif is_simple && !long_conversation
    { model: 'gemini-flash', reason: 'Simple query' }
  else
    { model: 'gemini-flash', reason: long_conversation ? 'Long conversation context' : 'Standard query' }
  end
end

.auto_select_model(query, history_length: 0, current_model: nil, classifier_result: nil, active_services: []) ⇒ `Hash`

Auto-select the best model based on query complexity.
Works for both analytics and general assistant queries.

Design goals:

1. Prefer the AI classifier's tier when present: it sees the whole prompt
   holistically (multi-task structure, spelling variants, compound asks).
2. Fall back to regex-based candidate selection when the classifier abstained
   or wasn't run (e.g. tests that bypass the LLM call).
3. Default to Gemini Flash for all queries: cheapest option with good quality.
4. Escalate to the Gemini reasoning tier (same gemini-3.5-flash, higher
   thinking-effort budget) only for genuinely complex analytical queries.
5. Claude models are otherwise opt-in (explicit user selection), keeping
   Anthropic costs near zero for auto users, EXCEPT content-authoring turns,
   which always route to Claude because the Gemini tiers are slow/unreliable
   on long HTML edits: blog editing goes to Opus on the 1M-token context
   (see blog_authoring_turn? / BLOG_AUTHORING_MODEL), email and general
   editorial go to Sonnet (see content_authoring_turn? / WRITING_MODEL_CLAUDE).
6. Model affinity: if the conversation already uses a model, prefer keeping it
   unless the new query demands a higher cost tier. Lateral switches lose
   accumulated reasoning context for no benefit.

Parameters:

query (String) —
The user's question
history_length (Integer) (defaults to: 0) —
Number of messages in conversation history (informational only)
current_model (String, nil) (defaults to: nil) —
Model key currently in use (for affinity)
classifier_result (Assistant::QueryClassifier::Result, nil) (defaults to: nil) —
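Pre-computed classification carrying a model_tier hint. When provided AND its tier is set, this overrides the regex candidate.
active_services (Array<String>) (defaults to: []) —
The conversation's enabled tool services plus forced chips; used for session-aware blog and content-authoring detection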

Returns:

(Hash) —
{ model: 'model-key', reason: 'explanation' }

# File 'app/services/assistant/chat_service.rb', line 191

def self.auto_select_model(query, history_length: 0, current_model: nil, classifier_result: nil, active_services: [])
  # Blog editing routes to Opus 4.8 (1M context) from the first turn — ahead of
  # everything else. These turns are large + tool-heavy and were choking on the
  # cheaper tiers (Gemini 400s #3808, 600s timeouts #4714, truncated-HTML
  # edit loops on large posts — convs 3105/3109). Give them the capable,
  # large-context model up front rather than escalating after failure.
  #
  # Detection is SESSION-aware, not just per-message: a blog session keeps Opus
  # on every turn even when the message itself carries no blog signal (e.g.
  # "retry", "yes confirmed", "clean up the HTML"). Without this, a short
  # continuation in a blog session silently fell back to Flash and thrashed
  # (conv 3117). active_services carries the conversation's enabled tool
  # services + forced chips.
  return { model: BLOG_AUTHORING_MODEL, reason: 'Blog session → Claude Opus 4.8 (1M context)' } if blog_authoring_turn?(query, classifier_result, active_services)

  # Other content-authoring (email/general editorial) turns route to Claude
  # Sonnet, ahead of the classifier/regex candidate AND model affinity.
  # Gemini is slow/unreliable on long HTML edits: the dropped 3.1-pro-preview
  # snapshot 400'd intermittently (#3808) and burned the full 600s
  # plan-step timeout on complex edits (#4714, conv 3098).
  return { model: WRITING_MODEL_CLAUDE, reason: 'Content-authoring (email/editorial) → Claude' } if content_authoring_turn?(query, classifier_result, active_services)

  candidate = if classifier_result&.model_tier
                candidate_from_classifier(classifier_result)
              else
                auto_select_candidate(query, history_length: history_length, current_model: current_model)
              end

  # Some intents (compose/email, writing) are strong enough signals that we
  # override model affinity. Compose pulls down to Flash so a long pasted
  # email thread doesn't burn the whole turn budget on Pro's extended
  # thinking (PR #618 / conv 1233). Writing pushes UP to Pro/Sonnet so we
  # never produce blog content on Flash (conv 1639 / Julia's feedback).
  return candidate.except(:force_switch) if candidate[:force_switch]

  if current_model.present? && MODELS.key?(current_model)
    candidate_tier = COST_TIER_RANK[MODEL_COST_TIER[candidate[:model]]] || 0
    current_tier   = COST_TIER_RANK[MODEL_COST_TIER[current_model]] || 0

    return { model: current_model, reason: "#{candidate[:reason]} (keeping #{current_model} for context continuity)" } if candidate_tier <= current_tier
  end

  candidate
end
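The affinity step at the end of the method can be sketched in isolation. The source references COST_TIER_RANK without showing its value, so the rank table below is an assumption, as are the trimmed tier map and the helper name `apply_affinity`.

```ruby
# Assumed rank table; COST_TIER_RANK is referenced by the source but its
# value is not shown on this page.
COST_RANK_SKETCH = { low: 0, medium: 1, high: 2 }.freeze
# Trimmed stand-in for MODEL_COST_TIER.
TIER_SKETCH = { 'gemini-flash' => :low, 'gemini-pro' => :medium, 'claude-opus' => :high }.freeze

# Hypothetical helper mirroring the affinity rule: keep the current model
# unless the candidate sits on a strictly higher cost tier.
def apply_affinity(candidate, current_model)
  return candidate unless current_model && TIER_SKETCH.key?(current_model)

  candidate_rank = COST_RANK_SKETCH[TIER_SKETCH[candidate[:model]]] || 0
  current_rank   = COST_RANK_SKETCH[TIER_SKETCH[current_model]] || 0
  return candidate if candidate_rank > current_rank

  { model: current_model, reason: "#{candidate[:reason]} (keeping #{current_model} for context continuity)" }
end

apply_affinity({ model: 'gemini-flash', reason: 'Simple query' }, 'gemini-pro')[:model]
# => "gemini-pro" (no downgrade mid-conversation)
apply_affinity({ model: 'gemini-pro', reason: 'Complex analytical query' }, 'gemini-flash')[:model]
# => "gemini-pro" (escalation to a higher tier is allowed)
```

Because the comparison is `<=` in the original, a same-tier candidate also loses to the current model, which is what rules out lateral switches.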

.available_models ⇒ `Object`

Class method to get available models for UI (includes Auto option first)

# File 'app/services/assistant/chat_service.rb', line 430

def self.available_models
  auto_option = [{ key: 'auto', label: 'Auto (Smart Select)', cost: :auto, model_id: nil }]
  model_options = MODELS.map do |key, config|
    { key: key, label: config[:label], cost: config[:cost], model_id: config[:id] }
  end
  auto_option + model_options
end

.estimate_tokens(text) ⇒ `Integer`

Rough token estimate (1 token ≈ 4 chars for English).
Used only for heuristic model-complexity selection, not billing.

Parameters:

text (String) —
Text to estimate tokens for

Returns:

(Integer) —
Estimated token count

# File 'app/services/assistant/chat_service.rb', line 325

def self.estimate_tokens(text)
  return 0 if text.blank?

  (text.length / 4.0).ceil
end
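A plain-Ruby sketch of the same arithmetic, runnable without ActiveSupport. `blank?` is replaced by a hand-written nil/whitespace check, and the method name is a stand-in.

```ruby
# Plain-Ruby stand-in for estimate_tokens (blank? comes from ActiveSupport,
# so this sketch checks nil and whitespace-only strings by hand).
def estimate_tokens_sketch(text)
  return 0 if text.nil? || text.strip.empty?

  (text.length / 4.0).ceil
end

estimate_tokens_sketch('abcd')    # => 1
estimate_tokens_sketch('abcde')   # => 2 (rounds up)
estimate_tokens_sketch(nil)       # => 0
estimate_tokens_sketch('x' * 801) # => 201, past the token_count > 200 check in auto_select_candidate
```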

.label_for_model(model_key) ⇒ `Object`

Resolve a stored model preference / llm_model_name (e.g. 'gemini-pro') to a
human-readable label that includes the actual underlying model id (e.g.
"Gemini 3.5 Flash · High Reasoning (Google)"). Used by the chat picker and
history badges so users can see WHAT model actually ran a turn, not just the
dropdown alias.
Returns the stored value verbatim when no MODELS entry matches.

# File 'app/services/assistant/chat_service.rb', line 443

def self.label_for_model(model_key)
  return 'Auto (Smart Select)' if model_key.to_s == 'auto'

  config = MODELS[model_key.to_s]
  return model_key.to_s if config.nil?

  config[:label]
end

Instance Method Details

#call(&block) ⇒ `Object`

Execute the chat with streaming response.
Messages auto-persist to assistant_messages via acts_as_chat
with token tracking, tool calls, and thinking traces.

Yields content chunks as they're generated.
Returns a Result with content and usage stats.

# File 'app/services/assistant/chat_service.rb', line 487

def call(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation

  # Tell the conversation who the actual sender is so AssistantMessage
  # can stamp sender_id on the persisted user message.
  @conversation.current_sender_id = @user_context['party_id']

  # Stream the response — conversation.ask() auto-persists user + assistant messages.
  # The return value of ask() is the fully-assembled StreamAccumulator message with
  # correct input/output token counts (not the last streaming chunk, which has nil tokens).
  streamer_proc = build_streamer_proc

  final_message = with_instrumented_llm_call(feature: 'assistant_chat') do
    if @attachments.present?
      ask_with_attachments(user_message, @attachments, &streamer_proc)
    else
      @conversation.ask(user_message, &streamer_proc)
    end
  end

  halt_result = handle_halt(final_message, streamer_proc, label: 'call')
  return halt_result if halt_result

  build_result(final_message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (call) — conversation #{@conversation.id}")
  build_cancelled_result
rescue RubyLLM::ContextLengthExceededError => e
  Rails.logger.error("[Assistant::ChatService] Context length exceeded: #{e.message}")
  raise
end

#complete_only(&block) ⇒ `Object`

Retry path after emergency compaction: reconfigure the conversation and
call complete() directly. The user message is already persisted from the
prior attempt — to_llm replays it from DB. Skips ask() to avoid duplicates.

# File 'app/services/assistant/chat_service.rb', line 528

def complete_only(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation
  @conversation.current_sender_id = @user_context['party_id']

  streamer_proc = build_streamer_proc

  final_message = with_instrumented_llm_call(feature: 'assistant_chat') do
    llm_chat = @conversation.to_llm
    llm_chat.complete(&streamer_proc)
  end

  halt_result = handle_halt(final_message, streamer_proc, label: 'complete_only')
  return halt_result if halt_result

  build_result(final_message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (complete_only) — conversation #{@conversation.id}")
  build_cancelled_result
end

#emit_status(message) ⇒ `Object` (protected)

Emit a status update for the UI (non-content, just progress indicator).
Also used by Assistant::PlanOrchestrator (via Object#send).




# File 'app/services/assistant/chat_service.rb', line 1093

def emit_status(message)
  @on_status&.call(message)
end

#stream_content(content) ⇒ `Object` (protected)

Stream content to client AND capture for conversation history.
Also used by Assistant::PlanOrchestrator (via Object#send).

# File 'app/services/assistant/chat_service.rb', line 973

def stream_content(content)
  @full_response << content
  streamer.call(content)
end
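The stream-and-capture pattern can be shown with a small stand-in class. `StreamCaptureSketch` is hypothetical: it keeps only the buffer and the caller's block, and leaves out everything else ChatService does.

```ruby
# Hypothetical stand-in showing the stream-and-capture pattern.
class StreamCaptureSketch
  attr_reader :full_response

  def initialize(&streamer)
    @streamer = streamer
    @full_response = +'' # unfrozen String buffer
  end

  def stream_content(content)
    @full_response << content # capture for conversation history
    @streamer.call(content)   # forward the chunk to the client
  end
end

received = []
sketch = StreamCaptureSketch.new { |chunk| received << chunk }
sketch.stream_content('Hello, ')
sketch.stream_content('world')
# received             => ["Hello, ", "world"]
# sketch.full_response => "Hello, world"
```

The unary `+''` matters when string literals are frozen: `<<` on a frozen String would raise FrozenError.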

#with_instrumented_llm_call(feature:, source: 'sunny') { ... } ⇒ `Object` (protected)

Wraps an LLM call with PaperTrail audit context, CurrentScope user, instrumentation
metadata, and transient network retries. Every LLM round (ask, complete, agent.ask)
should go through this so audit trail, cost logging, and retries are consistent.
Also used by Assistant::PlanOrchestrator (via Object#send). On a body-less
RubyLLM::BadRequestError it attaches the outgoing request shape to AppSignal
(#3808) before re-raising.

Parameters:

feature (String) —
instrumentation feature tag (e.g. 'assistant_chat')
source (String) (defaults to: 'sunny') —
PaperTrail controller-info source (default 'sunny')

Yields:

the LLM call to instrument and retry

Returns:

(Object) —
the yielded block's return value (e.g. the final RubyLLM message)

Raises:

(RubyLLM::BadRequestError) —
re-raised after diagnostics are attached

# File 'app/services/assistant/chat_service.rb', line 607

def with_instrumented_llm_call(feature:, source: 'sunny', &)
  sender_id = @user_context['party_id'] || @conversation.user_id
  whodunnit = sender_id.to_s.presence || 'Sunny'
  account_id = Account.where(party_id: sender_id).pick(:id)

  # Multi-round agentic turns: the global instrumentation subscriber only
  # sees the FINAL tool-loop round's usage (RubyLLM passes the final
  # response back up through every recursive complete()), undercounting
  # Sunny chat ~2x; turns average ~18 billed rounds. assistant_chat is excluded from that
  # subscriber (MANUALLY_LOGGED_FEATURES); instead we sum this turn's
  # per-round assistant_messages in log_turn_usage! below, which reconcile
  # to the Anthropic Cost API within ~10%.
  sum_turn = feature == 'assistant_chat'
  since_id = sum_turn ? @conversation.assistant_messages.maximum(:id).to_i : nil

  result = PaperTrail.request(
    whodunnit: whodunnit,
    controller_info: {
      source: source,
      sender_id: sender_id,
      sender_name: @user_context['full_name'],
      conversation_id: @conversation.id,
      conversation_url: "/en-US/assistant/#{@conversation.id}"
    }
  ) do
    CurrentScope.with_user_id(sender_id) do
      RubyLLM::Instrumentation.with(
        feature: feature,
        conversation_id: @conversation.id,
        log_subject: @conversation,
        log_account_id: account_id
      ) do
        with_llm_network_retries(&)
      end
    end
  end

  log_turn_usage!(since_id, account_id) if sum_turn
  result
rescue RubyLLM::BadRequestError => e
  # A 400 on a STREAMING turn arrives body-less, so RubyLLM surfaces the
  # generic "Invalid request - please check your input" with no provider
  # detail — which left AppSignal #3808 undiagnosable for months (the real
  # reason is in the REQUEST we sent, not the empty response). Snapshot the
  # outgoing request shape onto the AppSignal transaction so the NEXT
  # occurrence names the offending payload (after #1069 fixed the dominant
  # Opus-4.7+-temperature cause, any residual cause is otherwise opaque).
  # Diagnostics must never mask the real error — re-raise unconditionally.
  attach_llm_request_diagnostics(e)
  raise
end
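The `since_id` snapshot taken before the call makes it possible to sum usage over only the rows created during this turn. `log_turn_usage!` itself is not shown on this page, so the sketch below is an assumption about how such a sum could work: the struct, field names, and numbers are made up for illustration.

```ruby
# Hypothetical per-round rows; the real records are assistant_messages.
UsageRoundSketch = Struct.new(:id, :input_tokens, :output_tokens)

rounds = [
  UsageRoundSketch.new(101, 900, 50),    # earlier turn, at or below since_id
  UsageRoundSketch.new(102, 1_200, 80),  # this turn, round 1
  UsageRoundSketch.new(103, 1_500, 300)  # this turn, final round
]
since_id = 101 # highest id seen before the LLM call started

# Sum only rows created after the snapshot.
turn_rounds = rounds.select { |r| r.id > since_id }
turn_usage = {
  input_tokens: turn_rounds.sum(&:input_tokens),
  output_tokens: turn_rounds.sum(&:output_tokens)
}
# turn_usage => { input_tokens: 2700, output_tokens: 380 }
# Counting only the final round would report 1500 input / 300 output.
```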
