Class: Assistant::ChatService

Inherits:
Object
Includes:
PromptComposer
Defined in:
app/services/assistant/chat_service.rb

Overview

Service for AI-powered assistant chat using RubyLLM's acts_as_chat.
Uses tool-based architecture: the LLM calls registered tools (DB, content search, etc.)
rather than generating raw SQL. Conversation history is managed by RubyLLM automatically.

Defined Under Namespace

Classes: Result

Constant Summary

THINKING_BUDGET_LOW =

Extended Thinking configuration — gives reasoning models a scratchpad
for multi-step problems (SQL construction, analytical reasoning).
Budget is in tokens; Anthropic models require it, Gemini uses it as a cap.

4_000
THINKING_BUDGET_MEDIUM =

Simple tool queries

8_000
THINKING_BUDGET_HIGH =

Analytical queries with JOINs/aggregation

16_000
THINKING_BUDGET_MAX =

Complex multi-step reasoning (Opus only)

64_000
EFFORT_RANK =

Relative ordering of thinking-effort tiers, lowest → highest. Lets the
router pick the higher of a query-driven floor and a model's configured
default, so a :max-default model is never capped at :high.

{ low: 0, medium: 1, high: 2, max: 3 }.freeze
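
The tier comparison described for EFFORT_RANK can be sketched as below. This is a minimal illustration, not the service's router: `higher_effort` and `EFFORT_RANK_SKETCH` are hypothetical names, and only the rank table itself is copied from the constant above.

```ruby
# Rank table copied from EFFORT_RANK above.
EFFORT_RANK_SKETCH = { low: 0, medium: 1, high: 2, max: 3 }.freeze

# Hypothetical helper (not part of ChatService): return whichever of the
# query-driven floor and the model's configured default ranks higher.
def higher_effort(query_floor, model_default)
  [query_floor, model_default].max_by { |tier| EFFORT_RANK_SKETCH.fetch(tier) }
end

higher_effort(:high, :max) # => :max  (a :max-default model is not capped at :high)
higher_effort(:high, :low) # => :high (the query floor wins over a lower default)
```

Using `fetch` rather than `[]` makes an unknown tier fail loudly instead of silently ranking as nil.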
THINKING_QUERY_PATTERNS =

Patterns that indicate the query would benefit from extended thinking

/\b(compare|analyze|trend|correlat|calculate|forecast|predict|why|root.?cause|deep.?dive|break.?down|step.?by.?step|optimize|investigate|audit|reconcil|year.?over.?year|month.?over.?month)\b/i
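
A quick way to see what this pattern does and does not catch, assuming it is applied with `match?` as the other query patterns are. `thinking_query?` is a hypothetical helper; the regex is copied verbatim. Note that the trailing `\b` applies to every alternative, so stems such as `correlat` and `reconcil` only match when they stand alone as whole words.

```ruby
THINKING_PATTERN_SKETCH = /\b(compare|analyze|trend|correlat|calculate|forecast|predict|why|root.?cause|deep.?dive|break.?down|step.?by.?step|optimize|investigate|audit|reconcil|year.?over.?year|month.?over.?month)\b/i

# Hypothetical helper: true when the query contains a thinking keyword.
def thinking_query?(query)
  query.match?(THINKING_PATTERN_SKETCH)
end

thinking_query?('Why did March revenue drop?')        # => true  ("why")
thinking_query?('Find the root cause of the backlog') # => true  ("root cause")
thinking_query?('List open orders')                   # => false (no keyword)
thinking_query?('Show the correlation with weather')  # => false ("correlat" is followed by "ion", so \b fails)
```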
LLM_NETWORK_RETRY_EXCEPTIONS =

Transient provider / TLS failures (AppSignal #4527: Faraday::SSLError SSL_read EOF).

[
  Faraday::SSLError,
  Faraday::ConnectionFailed,
  Faraday::TimeoutError,
  OpenSSL::SSL::SSLError,
  RubyLLM::ServiceUnavailableError, # HTTP 502/503/504 — upstream gateway transient
  RubyLLM::OverloadedError          # HTTP 529 — Anthropic "service overloaded" transient
].freeze
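
A list like this is typically consumed by a rescue-and-retry wrapper. The sketch below shows that shape under stated assumptions: `with_sketch_retries`, its `max_attempts` and `base_delay` parameters, and the stand-in error class are all hypothetical, and the service's own with_llm_network_retries is not shown on this page and may differ.

```ruby
# Stand-in for the transient errors listed above, so the sketch runs without
# Faraday or RubyLLM loaded.
class SketchTransientError < StandardError; end

SKETCH_RETRYABLE = [SketchTransientError].freeze

# Hypothetical helper: retry the block on a listed exception, up to
# max_attempts total tries, then let the last error propagate.
def with_sketch_retries(max_attempts: 3, base_delay: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *SKETCH_RETRYABLE
    raise if attempts >= max_attempts

    sleep(base_delay * attempts) # linear backoff; 0 keeps the demo instant
    retry
  end
end

result = with_sketch_retries do |attempt|
  raise SketchTransientError, 'SSL_read EOF' if attempt < 3

  "succeeded on attempt #{attempt}"
end
```

Errors outside the list are not rescued, so a genuine bad request surfaces immediately instead of being retried.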
MODELS =

Available models with their configurations.
Model IDs come from AiModelConstants — the single source of truth.
supports_thinking: whether the model supports RubyLLM's with_thinking (extended reasoning)
thinking_effort_default: the default effort level when thinking is activated (:low, :medium, :high, :max)

{
  'claude-haiku'  => { id: AiModelConstants.id(:anthropic_haiku),  provider: :anthropic, label: 'Claude Haiku 4.5 (Fast)',      cost: :low,    supports_thinking: false },
  'claude-sonnet' => { id: AiModelConstants.id(:anthropic_sonnet), provider: :anthropic, label: 'Claude Sonnet 4.6 (Balanced)', cost: :medium, supports_thinking: true, thinking_effort_default: :medium },
  'claude-opus'   => { id: AiModelConstants.id(:anthropic_opus),   provider: :anthropic, label: 'Claude Opus 4.8 (Best — highest cost)', cost: :high, supports_thinking: true, thinking_effort_default: :high },
  # Same Opus model, opened up to the 1M-token context window via the
  # Anthropic context-1m beta header (see configure_conversation). Top rung
  # of the complexity-escalation ladder for marathon / huge-context sessions.
  'claude-opus-1m' => { id: AiModelConstants.id(:anthropic_opus),  provider: :anthropic, label: 'Claude Opus 4.8 (1M context)', cost: :high, supports_thinking: true, thinking_effort_default: :high, context_1m: true },
  'gpt-5'         => { id: AiModelConstants.id(:openai_gpt5),      provider: :openai,    label: 'GPT-5 (OpenAI)',               cost: :medium, supports_thinking: false },
  'gpt-5.5'       => { id: AiModelConstants.id(:openai_gpt55),     provider: :openai,    label: 'GPT-5.5 (OpenAI Latest)',      cost: :medium, supports_thinking: false },
  'gpt-5-mini'    => { id: AiModelConstants.id(:openai_gpt5_mini), provider: :openai,    label: 'GPT-5 Mini (Fast)',            cost: :low,    supports_thinking: false },
  'gemini-flash'  => { id: AiModelConstants.id(:gemini_flash),     provider: :gemini,    label: 'Gemini 3.5 Flash (Recommended)', cost: :low, supports_thinking: true, thinking_effort_default: :low },
  'gemini-pro'    => { id: AiModelConstants.id(:gemini_pro),       provider: :gemini,    label: 'Gemini 3.5 Flash · High Reasoning (Google)', cost: :medium, supports_thinking: true, thinking_effort_default: :high }
}.freeze
DEFAULT_MODEL =

Default model key: Gemini Flash, the cheapest option with good quality.

'gemini-flash'
CONTEXT_1M_BETA =

Anthropic beta token that unlocks Opus's 1M-token context window, applied
only to the 'claude-opus-1m' model (context_1m: true) via with_headers.
Validated live against api.anthropic.com on 2026-06-03 — accepted (HTTP 200),
as was adaptive thinking at effort=max on claude-opus-4-8.

'context-1m-2025-08-07'
MAX_PLAN_COST_USD =

Hard cap on estimated plan execution cost (USD) across isolated step + assembly LLM calls.
NOTE: plan_cost underestimates because run_plan_step_executor returns only the FINAL
API round's tokens (not the cumulative total across tool-call rounds within a step).
Real per-step cost is typically 5-10× higher than reported. The primary cost guard is
the ToolLoopGuard's per-step call limit, not this cap.

2.00
MAX_PLAN_STEP_DURATION =

Wall-clock timeout per plan step — driven from ToolLoopGuard so both
the outer Timeout and the inner guard share a single source of truth.

Assistant::ToolLoopGuard::MAX_STEP_DURATION.seconds
STEP_RESULT_SUMMARIZE_THRESHOLD =

Above this size, step output is summarized with a cheap model before the next step.

2_000
MID_TURN_COMPACT_THRESHOLD =

Mid-turn compaction thresholds (see install_mid_turn_compaction!)

2_000
MID_TURN_KEEP_CHARS =

Number of characters kept when mid-turn compaction applies (see install_mid_turn_compaction!).

600
MID_TURN_SKIP_PREFIXES =

Prefixes that mid-turn compaction skips (see install_mid_turn_compaction!).

['[Compacted', '[Truncated', '[Already retrieved'].freeze
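
One plausible reading of how the three mid-turn constants interact is sketched below. install_mid_turn_compaction! is not shown on this page, so the function name, the marker text, and the exact truncation rule are assumptions made for illustration only.

```ruby
# Values copied from MID_TURN_COMPACT_THRESHOLD, MID_TURN_KEEP_CHARS and
# MID_TURN_SKIP_PREFIXES above.
SKETCH_COMPACT_THRESHOLD = 2_000
SKETCH_KEEP_CHARS = 600
SKETCH_SKIP_PREFIXES = ['[Compacted', '[Truncated', '[Already retrieved'].freeze

# Hypothetical compaction rule: leave short or already-marked text alone,
# otherwise keep the first SKETCH_KEEP_CHARS characters behind a marker.
def compact_sketch(text)
  return text if text.length <= SKETCH_COMPACT_THRESHOLD
  return text if text.start_with?(*SKETCH_SKIP_PREFIXES)

  "[Compacted #{text.length} chars] #{text[0, SKETCH_KEEP_CHARS]}"
end

compacted = compact_sketch('x' * 5_000)
```

Under this reading the skip prefixes keep compaction from re-processing text that an earlier pass has already marked.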
COMPLEX_QUERY_PATTERNS =

Keywords indicating complex analytical or reasoning queries (need better models)

/\b(why|trend|pattern|anomaly|recommend|insight|correlation|predict|forecast|explain|root.?cause|deep.?dive|strategic|analyze|summarize|evaluate|pros?.and.cons|trade.?off)\b/i
COMPARISON_QUERY_PATTERNS =

Keywords indicating multi-step comparison or research queries (need balanced models)

/\b(compare|vs|versus|between|difference|change|growth|decline|year.?over.?year|month.?over.?month|yoy|mom|research|investigate|audit)\b/i
SIMPLE_QUERY_PATTERNS =

Keywords indicating simple lookup or factual queries (fast models are fine)

/\b(show|list|get|total|count|how many|what is|what are|sum|average|find|look up|search|where is|who is|when did)\b/i
COMPOSE_QUERY_PATTERNS =

Phrases that indicate the user is drafting/composing a short message (email,
follow-up, outreach, internal summary). These are quick content-generation
tasks where Flash is fast and good enough — Pro's extended thinking is
wasted budget here, and on long prompts (e.g. pasted email threads) we'd
otherwise route them to Pro and time out.

/\b(reply|respond|send|email|follow.?up|outreach|reach out|thank.?you note|summary email)\b/i
WRITING_QUERY_PATTERNS =

Phrases that indicate long-form editorial work (blog posts, articles, FAQs,
rewrites). Flash produces noticeably weaker prose here — see /assistant/1639,
where a Buffalo bathroom blog post written under Flash drew "wrote very poorly"
feedback from the editor. Content-authoring tasks now route to Claude Sonnet:
every Gemini tier proved slow and unreliable on long HTML body edits — the
old gemini-3.1-pro preview intermittently 400'd (#3808) and ground out the
full 600s plan-step timeout on complex edits (#4714, conv 3098), which is why
the Gemini Pro snapshots were dropped from the registry entirely.
Opus is intentionally excluded as too expensive for routine editorial work.

/\b(rewrite|polish|copyedit|copy.?edit|long.?form|article|blog post|blog ?article|blog ?entry|essay|narrative|edit blog|write the blog|draft the blog|update the blog|update the article|expand this section|tighten this|story|landing page copy|product description|press release|case study|whitepaper|white ?paper|content brief|seo copy|meta description|page copy|h(?:ero|eading) copy|body copy|email template|email campaign|email blast|email copy|email design|newsletter)\b/i
CONTENT_AUTHORING_SERVICES =

Tool services whose presence marks a content-authoring turn. When the
classifier routes a turn to these, it gets Claude regardless of the query
wording (covers follow-ups like "now add a CTA" that lack writing keywords).

%w[blog_management email_management].freeze
WRITING_MODEL_DEFAULT =

Model we auto-route content-authoring work to. The one place we deliberately
auto-pick Anthropic — Claude is materially more reliable + faster at HTML
body editing than any Gemini tier. Opus stays opt-in (cost)
for general editorial; blog editing is the exception — see BLOG_AUTHORING_MODEL.

'claude-sonnet'
WRITING_MODEL_CLAUDE =

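Claude model key that content-authoring turns route to in auto_select_model; a conversation already on this model stays on it (see auto_select_candidate).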

'claude-sonnet'
BLOG_AUTHORING_MODEL =

Blog editing is the heaviest content-authoring workload: large HTML bodies,
many block-level tool calls, long multi-turn sessions. On Gemini — and even
Sonnet — these turns repeatedly tripped the body-less Gemini 400 (#3808) and
the 600s plan-step timeout (#4714), and large posts got shredded by mid-turn
compaction — leaving the model editing from truncated HTML and looping until
it timed out (convs 3105/3109, Julia). Route blog editing to Opus 4.8 on the
1M-token context window from the FIRST turn so the model has both the
capability and the context headroom to finish without choking, instead of
starting cheap and escalating only after it has already failed. Cost is the
deliberate tradeoff for blog work specifically — email/general editorial
stay on Sonnet. Defined as a constant so the tier is easy to retune.

'claude-opus-1m'
BLOG_AUTHORING_SERVICES =

Classifier tool services that mark a blog authoring turn (vs. email).

%w[blog_management].freeze
BLOG_AUTHORING_PATTERNS =

Query wording that signals blog editing even without a classifier tool hint
(e.g. tests, or a turn the classifier abstained on). Deliberately blog-ONLY:
generic "article"/"the article" wording is left to content_authoring_turn? →
Sonnet, so a plain editorial edit isn't forced onto the pricier Opus-1M tier.

Also matches a pasted WarmlyYours blog-post URL (…/posts/) and the
"for this/the blog" lead-in: the common way an editor kicks off a blog task
is to paste the post URL ("for this blog https://…/posts/…/preview"), which
carries no other blog keyword and otherwise fell through to Gemini and hit
the intermittent body-less 400 (#3808, conv 3150). %r{} so the /posts/ path
needs no escaping.

%r{\b(blog post|blog ?article|blog ?entry|edit (?:the )?blog|update (?:the )?blog|write (?:the )?blog|draft (?:the )?blog|rewrite (?:the )?blog(?: post| article| entry)?|for (?:this|the) blog)\b|/posts/[\w-]+}i
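
The cases called out above (pasted post URL, "for this blog" lead-in, generic article wording) can be checked directly against the pattern. `blog_wording?` is a hypothetical helper and the example.com URLs are placeholders; the regex is copied verbatim.

```ruby
BLOG_PATTERN_SKETCH = %r{\b(blog post|blog ?article|blog ?entry|edit (?:the )?blog|update (?:the )?blog|write (?:the )?blog|draft (?:the )?blog|rewrite (?:the )?blog(?: post| article| entry)?|for (?:this|the) blog)\b|/posts/[\w-]+}i

# Hypothetical helper: true when the wording alone signals blog editing.
def blog_wording?(query)
  query.match?(BLOG_PATTERN_SKETCH)
end

blog_wording?('for this blog https://example.com/posts/radiant-floors/preview') # => true
blog_wording?('https://example.com/posts/radiant-floors/preview')               # => true  (URL alone)
blog_wording?('rewrite the blog')                                               # => true
blog_wording?('update the article intro')                                       # => false (left to the Sonnet route)
```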
WRITING_ELIGIBLE_MODELS =

De-duplicated list of the model keys that writing and blog-authoring turns can route to.

[WRITING_MODEL_DEFAULT, WRITING_MODEL_CLAUDE, BLOG_AUTHORING_MODEL].uniq.freeze
MODEL_COST_TIER =

Cost tiers for model affinity decisions.
Switching models mid-conversation loses accumulated reasoning context,
so we only switch when escalating to a higher tier (never laterally).

MODELS.transform_values { |c| c[:cost] }.freeze

Constants included from PromptComposer

PromptComposer::AGENT_PROMPTS_DIR, PromptComposer::ANALYTICS_SERVICES, PromptComposer::DOMAIN_TOOL_REQUIREMENTS, PromptComposer::INSTRUCTIONS_TEMPLATE_PATH, PromptComposer::MESSAGE_DOMAIN_PATTERNS

Constructor Details

#initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: []) ⇒ ChatService

Returns a new instance of ChatService.

Parameters:

  • conversation (AssistantConversation)

    The conversation record (acts_as_chat)

  • user_message (String)

    The user's query

  • model (String) (defaults to: 'auto')

    LLM model key or 'auto'

  • tool_services (Array<String>) (defaults to: [])

    Service keys for tool access

  • permitted_services (Array<String>) (defaults to: [])

    All service keys the user's role allows (for tool suggestion prompt)

  • user_context (Hash) (defaults to: {})

    User identity for personalized queries

  • on_status (Proc) (defaults to: nil)

    Callback for status events

  • cancel_check (Proc) (defaults to: nil)

    Returns true when the caller wants to abort (e.g. user clicked Stop)

  • attachments (Array<Pathname>) (defaults to: [])

    Optional file paths to attach to the message (PDFs, images, etc.)



# File 'app/services/assistant/chat_service.rb', line 349

def initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: [])
  @conversation = conversation
  @user_message = user_message
  @tool_services = Array(tool_services).compact_blank
  @permitted_services = Array(permitted_services).compact_blank
  @user_context = user_context || {}
  @on_status = on_status
  @cancel_check = cancel_check
  @attachments = Array(attachments).select do |p|
    if p.respond_to?(:exist?)
      p.exist?                      # Pathname — check local file
    elsif p.to_s.start_with?('http://', 'https://')
      true                          # URL — pass through to RubyLLM
    else
      File.exist?(p.to_s)           # String path — check local file
    end
  end
  @auto_selected = false
  @model_selection_reason = nil

  # Derive role from user context for tool access control.
  # user_context is a serialized Hash from the controller with 'is_admin' and 'is_manager' keys.
  @user_role = if @user_context['is_admin']
                 :admin
               elsif @user_context['is_manager']
                 :manager
               else
                 :employee
               end

  # Resolve data domain access from the user's CanCanCan roles.
  # This narrows which views/tables the AI tools can query.
  @account = Account.find_by(id: @user_context['account_id']) if @user_context['account_id']
  @allowed_objects = @account ? Assistant::DataPolicy.(@account) : nil
  @analytics_domains = Array(@user_context['analytics_domains'])

  history_length = @conversation.assistant_messages.count

  # Handle 'auto' model selection
  if model == 'auto' || !MODELS.key?(model)
    selection = self.class.auto_select_model(
      user_message,
      history_length: history_length,
      current_model: @conversation.llm_model_name,
      active_services: Array(@conversation.tool_services)
    )
    @model_key = selection[:model]
    @model_selection_reason = selection[:reason]
    @auto_selected = true
  else
    @model_key = model
  end

  # Complexity-aware upgrade: a session that STARTED cheap but has since
  # revealed its complexity — the model declared a multi-step plan, or the
  # conversation has grown long — climbs the model ladder. Only in auto mode
  # (never override an explicit user pick) and only while the user's monthly
  # budget allows; out of budget → stay on the cheap tier. See
  # Assistant::MonthlyBudget and
  # doc/tasks/202606031730_SUNNY_BUDGET_AND_AUTO_ESCALATION.md.
  #
  # Gate on the stored preference, not @auto_selected: the controller often
  # pre-resolves 'auto' to a concrete key before this point (one classifier
  # pass picks tools + tier), which would otherwise hide auto mode here.
  if auto_model_mode?
    upgrade = Assistant::ComplexityEscalator.upgrade(
      current_model: @model_key,
      plan_step_count: Array(@conversation.execution_plan&.dig('steps')).size,
      history_length: history_length,
      user_context: @user_context
    )
    if upgrade
      @model_key = upgrade[:model]
      @model_selection_reason = upgrade[:reason]
    end
  end

  @model_config = MODELS[@model_key]
end
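
The attachment filter in the constructor can be exercised on its own. The sketch below reproduces that select block as a standalone function; `usable_attachments` is a hypothetical name, and the missing path and URL are placeholders (the URL is never fetched).

```ruby
require 'pathname'
require 'tmpdir'

# Mirrors the select block in #initialize: keep Pathnames and string paths
# that exist locally, and pass http(s) URLs through without checking them.
def usable_attachments(attachments)
  Array(attachments).select do |item|
    if item.respond_to?(:exist?)
      item.exist?                 # Pathname: check the local file
    elsif item.to_s.start_with?('http://', 'https://')
      true                        # URL: passed through unchecked
    else
      File.exist?(item.to_s)      # String path: check the local file
    end
  end
end

existing = Pathname.new(Dir.tmpdir)          # a path that exists on this machine
missing  = '/no/such/dir/report.pdf'         # placeholder for a missing file
url      = 'https://example.com/invoice.pdf' # placeholder URL

kept = usable_attachments([existing, missing, url, nil])
# kept contains only `existing` and `url`; the missing path and nil are dropped.
```

Because missing files are dropped silently, a caller that needs to report a bad attachment has to compare the input and output lists itself.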

Instance Attribute Details

#model_key ⇒ String (readonly)

The concrete model key this turn resolved to (e.g. 'claude-sonnet'). For
an explicit pick this equals the requested model; for 'auto' it is the key
the complexity/affinity selector chose. Lets callers base a decision on the
backend that actually ran rather than the requested alias, e.g.
AssistantChatWorker's transient-400 recovery, so that an 'auto' turn that
resolved to claude-sonnet retries on a different backend instead of
replaying the model that just 400'd.

Returns:

  • (String)


# File 'app/services/assistant/chat_service.rb', line 479

def model_key
  @model_key
end

Class Method Details

.auto_select_candidate(query, history_length: 0, current_model: nil) ⇒ Object



# File 'app/services/assistant/chat_service.rb', line 283

def self.auto_select_candidate(query, history_length: 0, current_model: nil)
  query_lower = query.downcase.strip
  token_count = estimate_tokens(query)

  is_writing    = query_lower.match?(WRITING_QUERY_PATTERNS)
  is_complex    = query_lower.match?(COMPLEX_QUERY_PATTERNS)
  is_comparison = query_lower.match?(COMPARISON_QUERY_PATTERNS)
  is_compose    = query_lower.match?(COMPOSE_QUERY_PATTERNS)
  is_simple     = query_lower.match?(SIMPLE_QUERY_PATTERNS) && !is_comparison && !is_complex && !is_writing
  long_conversation = history_length > 20

  # Long-form editorial work (blog posts, articles, rewrites) must always run
  # on a Pro/Sonnet-tier model — Flash produces noticeably weaker prose. We
  # force_switch so a conversation that started on Flash doesn't hold writing
  # turns hostage via model affinity. Stay on Sonnet only if the conversation
  # is already on a Claude model; otherwise default to Claude Sonnet.
  if is_writing
    chosen = current_model == WRITING_MODEL_CLAUDE ? WRITING_MODEL_CLAUDE : WRITING_MODEL_DEFAULT
    return { model: chosen, reason: 'Writing/editorial task', force_switch: true }
  end

  # Compose/email tasks are short content generation, not analysis. Keep them
  # on Flash even when the prompt is long (pasted email threads inflate token
  # counts but don't require deep reasoning) — Pro burns most of
  # MAX_TURN_DURATION on extended thinking before any tool runs.
  return { model: 'gemini-flash', reason: 'Compose/email task', force_switch: true } if is_compose && !is_complex

  if is_complex || token_count > 200
    { model: 'gemini-pro', reason: 'Complex analytical query' }
  elsif is_comparison || token_count > 80
    { model: 'gemini-flash', reason: 'Multi-step query' }
  elsif is_simple && !long_conversation
    { model: 'gemini-flash', reason: 'Simple query' }
  else
    { model: 'gemini-flash', reason: long_conversation ? 'Long conversation context' : 'Standard query' }
  end
end

.auto_select_model(query, history_length: 0, current_model: nil, classifier_result: nil, active_services: []) ⇒ Hash

Auto-select the best model based on query complexity.
Works for both analytics and general assistant queries.

Design goals:

  • Prefer the AI classifier's tier when present — it sees the whole prompt
    holistically (multi-task structure, spelling variants, compound asks).
  • Fall back to regex-based candidate selection when the classifier abstained
    or wasn't run (e.g. tests that bypass the LLM call).
  • Default to Gemini Flash for all queries — cheapest option with good quality.
  • Escalate to the Gemini reasoning tier (same gemini-3.5-flash, higher
    thinking-effort budget) only for genuinely complex analytical queries.
  • Claude models are otherwise opt-in (explicit user selection), keeping
    Anthropic costs near zero for auto users — EXCEPT content-authoring
    (blog/email) turns, which always route to Claude Sonnet because the
    Gemini tiers are slow/unreliable on long HTML edits (see
    content_authoring_turn? / WRITING_MODEL_CLAUDE).
  • Model affinity: if the conversation already uses a model, prefer keeping it
    unless the new query demands a higher cost tier. Lateral switches lose
    accumulated reasoning context for no benefit.

Parameters:

  • query (String)

    The user's question

  • history_length (Integer) (defaults to: 0)

    Number of messages in conversation history (informational only)

  • current_model (String, nil) (defaults to: nil)

    Model key currently in use (for affinity)

  • classifier_result (Assistant::QueryClassifier::Result, nil) (defaults to: nil)

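    Pre-computed classification carrying a model_tier hint. When provided AND
    its tier is set, this overrides the regex candidate.

  • active_services (Array<String>) (defaults to: [])

    The conversation's enabled tool services plus forced chips (used for
    session-aware blog/content-authoring detection)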

Returns:

  • (Hash)

    { model: 'model-key', reason: 'explanation' }



# File 'app/services/assistant/chat_service.rb', line 191

def self.auto_select_model(query, history_length: 0, current_model: nil, classifier_result: nil, active_services: [])
  # Blog editing routes to Opus 4.8 (1M context) from the first turn — ahead of
  # everything else. These turns are large + tool-heavy and were choking on the
  # cheaper tiers (Gemini 400s #3808, 600s timeouts #4714, truncated-HTML
  # edit loops on large posts — convs 3105/3109). Give them the capable,
  # large-context model up front rather than escalating after failure.
  #
  # Detection is SESSION-aware, not just per-message: a blog session keeps Opus
  # on every turn even when the message itself carries no blog signal (e.g.
  # "retry", "yes confirmed", "clean up the HTML"). Without this, a short
  # continuation in a blog session silently fell back to Flash and thrashed
  # (conv 3117). active_services carries the conversation's enabled tool
  # services + forced chips.
  return { model: BLOG_AUTHORING_MODEL, reason: 'Blog session → Claude Opus 4.8 (1M context)' } if blog_authoring_turn?(query, classifier_result, active_services)

  # Other content-authoring (email/general editorial) turns route to Claude
  # Sonnet, ahead of the classifier/regex candidate AND model affinity. The
  # Gemini tiers are slow/unreliable on long HTML edits: the dropped
  # 3.1-pro-preview snapshot 400'd intermittently (#3808) and burned the full
  # 600s plan-step timeout on complex edits (#4714, conv 3098).
  return { model: WRITING_MODEL_CLAUDE, reason: 'Content-authoring (email/editorial) → Claude' } if content_authoring_turn?(query, classifier_result, active_services)

  candidate = if classifier_result&.model_tier
                candidate_from_classifier(classifier_result)
              else
                auto_select_candidate(query, history_length: history_length, current_model: current_model)
              end

  # Some intents (compose/email, writing) are strong enough signals that we
  # override model affinity. Compose pulls down to Flash so a long pasted
  # email thread doesn't burn the whole turn budget on Pro's extended
  # thinking (PR #618 / conv 1233). Writing pushes UP to Pro/Sonnet so we
  # never produce blog content on Flash (conv 1639 / Julia's feedback).
  return candidate.except(:force_switch) if candidate[:force_switch]

  if current_model.present? && MODELS.key?(current_model)
    candidate_tier = COST_TIER_RANK[MODEL_COST_TIER[candidate[:model]]] || 0
    current_tier   = COST_TIER_RANK[MODEL_COST_TIER[current_model]] || 0

    return { model: current_model, reason: "#{candidate[:reason]} (keeping #{current_model} for context continuity)" } if candidate_tier <= current_tier
  end

  candidate
end
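
The model-affinity rule at the end of auto_select_model (switch only when escalating to a higher cost tier) can be sketched in isolation. COST_TIER_RANK is referenced in the listing but its value is not shown on this page, so the ranking below is an assumption; `apply_affinity` and the `SKETCH_` constants are hypothetical names, and the cost tiers are a subset copied from MODELS.

```ruby
# Assumed ranking; the real COST_TIER_RANK is not shown on this page.
SKETCH_TIER_RANK = { low: 0, medium: 1, high: 2 }.freeze

# Subset of MODEL_COST_TIER, taken from the MODELS table.
SKETCH_COST_TIER = {
  'gemini-flash' => :low, 'gemini-pro' => :medium,
  'claude-sonnet' => :medium, 'claude-opus' => :high
}.freeze

# Hypothetical helper: keep the current model unless the candidate sits on a
# strictly higher cost tier.
def apply_affinity(candidate, current_model)
  return candidate unless SKETCH_COST_TIER.key?(current_model)

  candidate_rank = SKETCH_TIER_RANK.fetch(SKETCH_COST_TIER[candidate[:model]], 0)
  current_rank   = SKETCH_TIER_RANK.fetch(SKETCH_COST_TIER[current_model], 0)
  return candidate if candidate_rank > current_rank

  { model: current_model, reason: "#{candidate[:reason]} (keeping #{current_model} for context continuity)" }
end

escalated = apply_affinity({ model: 'gemini-pro', reason: 'Complex analytical query' }, 'gemini-flash')
kept      = apply_affinity({ model: 'gemini-flash', reason: 'Simple query' }, 'claude-sonnet')
lateral   = apply_affinity({ model: 'gemini-pro', reason: 'Complex analytical query' }, 'claude-sonnet')
# escalated[:model] => 'gemini-pro'    (low to medium is an escalation)
# kept[:model]      => 'claude-sonnet' (never step down mid-conversation)
# lateral[:model]   => 'claude-sonnet' (medium to medium is a lateral move)
```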

.available_models ⇒ Object

Class method to get available models for UI (includes Auto option first)



# File 'app/services/assistant/chat_service.rb', line 430

def self.available_models
  auto_option = [{ key: 'auto', label: 'Auto (Smart Select)', cost: :auto, model_id: nil }]
  model_options = MODELS.map do |key, config|
    { key: key, label: config[:label], cost: config[:cost], model_id: config[:id] }
  end
  auto_option + model_options
end

.estimate_tokens(text) ⇒ Integer

Rough token estimate (1 token ≈ 4 chars for English).
Used only for heuristic model-complexity selection, not billing.

Parameters:

  • text (String)

    Text to estimate tokens for

Returns:

  • (Integer)

    Estimated token count



# File 'app/services/assistant/chat_service.rb', line 325

def self.estimate_tokens(text)
  return 0 if text.blank?

  (text.length / 4.0).ceil
end
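
A worked example of the 4-characters-per-token estimate, and of what the `token_count > 80` and `token_count > 200` thresholds in auto_select_candidate mean in characters. `estimate_tokens_sketch` is a hypothetical standalone copy; it swaps ActiveSupport's `blank?` for a plain-Ruby check so it runs outside Rails.

```ruby
# Standalone copy of the heuristic: one token per 4 characters, rounded up.
def estimate_tokens_sketch(text)
  return 0 if text.nil? || text.strip.empty?

  (text.length / 4.0).ceil
end

estimate_tokens_sketch('abcd')    # => 1
estimate_tokens_sketch('abcde')   # => 2 (rounds up)
estimate_tokens_sketch('x' * 320) # => 80  (not yet above the 80-token threshold)
estimate_tokens_sketch('x' * 321) # => 81  (first length that exceeds it)
estimate_tokens_sketch('x' * 801) # => 201 (first length that exceeds 200 tokens)
```

In other words, under this heuristic a query needs more than 320 characters to count as multi-step by length alone, and more than 800 to count as complex.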

.label_for_model(model_key) ⇒ Object

Resolve a stored model preference / llm_model_name (e.g. 'gemini-pro') to a
human-readable label that includes the actual underlying model id (e.g.
"Gemini 3.5 Flash · High Reasoning (Google)"). Used by the chat picker and history badges so
users can see WHAT model actually ran a turn — not just the dropdown alias.
Returns the stored value verbatim when no MODELS entry matches.



# File 'app/services/assistant/chat_service.rb', line 443

def self.label_for_model(model_key)
  return 'Auto (Smart Select)' if model_key.to_s == 'auto'

  config = MODELS[model_key.to_s]
  return model_key.to_s if config.nil?

  config[:label]
end

Instance Method Details

#call(&block) ⇒ Object

Execute the chat with streaming response.
Messages auto-persist to assistant_messages via acts_as_chat
with token tracking, tool calls, and thinking traces.

Yields content chunks as they're generated.
Returns a Result with content and usage stats.



# File 'app/services/assistant/chat_service.rb', line 487

def call(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation

  # Tell the conversation who the actual sender is so AssistantMessage
  # can stamp sender_id on the persisted user message.
  @conversation.current_sender_id = @user_context['party_id']

  # Stream the response — conversation.ask() auto-persists user + assistant messages.
  # The return value of ask() is the fully-assembled StreamAccumulator message with
  # correct input/output token counts (not the last streaming chunk, which has nil tokens).
  streamer_proc = build_streamer_proc

  final_message = with_instrumented_llm_call(feature: 'assistant_chat') do
    if @attachments.present?
      ask_with_attachments(user_message, @attachments, &streamer_proc)
    else
      @conversation.ask(user_message, &streamer_proc)
    end
  end

  halt_result = handle_halt(final_message, streamer_proc, label: 'call')
  return halt_result if halt_result

  build_result(final_message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (call) — conversation #{@conversation.id}")
  build_cancelled_result
rescue RubyLLM::ContextLengthExceededError => e
  Rails.logger.error("[Assistant::ChatService] Context length exceeded: #{e.message}")
  raise
end
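
The streaming contract of #call (a block is mandatory, each chunk is yielded as it arrives, and the full text is accumulated alongside) can be illustrated with a stand-in class. `StreamingSketch` is hypothetical and has no LLM behind it; unlike the real method, which returns a Result, it simply returns the accumulated string.

```ruby
# Hypothetical stand-in that replays fixed chunks instead of calling an LLM.
class StreamingSketch
  def initialize(chunks)
    @chunks = chunks
  end

  # Same guard as ChatService#call: streaming requires a block.
  def call(&block)
    raise ArgumentError, 'Block required for streaming' unless block_given?

    full_response = +'' # unfrozen buffer, as in the original
    @chunks.each do |chunk|
      full_response << chunk # capture for the final result
      block.call(chunk)      # forward to the caller as it arrives
    end
    full_response
  end
end

received = []
final = StreamingSketch.new(%w[Hel lo]).call { |chunk| received << chunk }
# received => ["Hel", "lo"], final => "Hello"
```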

#complete_only(&block) ⇒ Object

Retry path after emergency compaction: reconfigure the conversation and
call complete() directly. The user message is already persisted from the
prior attempt — to_llm replays it from DB. Skips ask() to avoid duplicates.



# File 'app/services/assistant/chat_service.rb', line 528

def complete_only(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation
  @conversation.current_sender_id = @user_context['party_id']

  streamer_proc = build_streamer_proc

  final_message = with_instrumented_llm_call(feature: 'assistant_chat') do
    llm_chat = @conversation.to_llm
    llm_chat.complete(&streamer_proc)
  end

  halt_result = handle_halt(final_message, streamer_proc, label: 'complete_only')
  return halt_result if halt_result

  build_result(final_message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (complete_only) — conversation #{@conversation.id}")
  build_cancelled_result
end

#emit_status(message) ⇒ Object (protected)

Emit a status update for the UI (non-content, just progress indicator).
Also used by Assistant::PlanOrchestrator (via Object#send).



# File 'app/services/assistant/chat_service.rb', line 1093

def emit_status(message)
  @on_status&.call(message)
end

#stream_content(content) ⇒ Object (protected)

Stream content to client AND capture for conversation history.
Also used by Assistant::PlanOrchestrator (via Object#send).



# File 'app/services/assistant/chat_service.rb', line 973

def stream_content(content)
  @full_response << content
  streamer.call(content)
end

#with_instrumented_llm_call(feature:, source: 'sunny') { ... } ⇒ Object (protected)

Wraps an LLM call with PaperTrail audit context, CurrentScope user, instrumentation
metadata, and transient network retries. Every LLM round (ask, complete, agent.ask)
should go through this so audit trail, cost logging, and retries are consistent.
Also used by Assistant::PlanOrchestrator (via Object#send). On a body-less
RubyLLM::BadRequestError it attaches the outgoing request shape to AppSignal
(#3808) before re-raising.

Parameters:

  • feature (String)

    instrumentation feature tag (e.g. 'assistant_chat')

  • source (String) (defaults to: 'sunny')

    PaperTrail controller-info source (default 'sunny')

Yields:

  • the LLM call to instrument and retry

Returns:

  • (Object)

    the yielded block's return value (e.g. the final RubyLLM message)

Raises:

  • (RubyLLM::BadRequestError)

    re-raised after diagnostics are attached



# File 'app/services/assistant/chat_service.rb', line 607

def with_instrumented_llm_call(feature:, source: 'sunny', &)
  sender_id = @user_context['party_id'] || @conversation.user_id
  whodunnit = sender_id.to_s.presence || 'Sunny'
  account_id = Account.where(party_id: sender_id).pick(:id)

  # Multi-round agentic turns: the global instrumentation subscriber only
  # sees the FINAL tool-loop round's usage (RubyLLM returns the final
  # response up every recursive complete()), undercounting Sunny chat ~2x
  # (turns average ~18 billed rounds). assistant_chat is excluded from that
  # subscriber (MANUALLY_LOGGED_FEATURES); instead we sum this turn's
  # per-round assistant_messages in log_turn_usage! below, which reconcile
  # to the Anthropic Cost API within ~10%.
  sum_turn = feature == 'assistant_chat'
  since_id = sum_turn ? @conversation.assistant_messages.maximum(:id).to_i : nil

  result = PaperTrail.request(
    whodunnit: whodunnit,
    controller_info: {
      source: source,
      sender_id: sender_id,
      sender_name: @user_context['full_name'],
      conversation_id: @conversation.id,
      conversation_url: "/en-US/assistant/#{@conversation.id}"
    }
  ) do
    CurrentScope.with_user_id(sender_id) do
      RubyLLM::Instrumentation.with(
        feature: feature,
        conversation_id: @conversation.id,
        log_subject: @conversation,
        log_account_id: account_id
      ) do
        with_llm_network_retries(&)
      end
    end
  end

  log_turn_usage!(since_id, account_id) if sum_turn
  result
rescue RubyLLM::BadRequestError => e
  # A 400 on a STREAMING turn arrives body-less, so RubyLLM surfaces the
  # generic "Invalid request - please check your input" with no provider
  # detail — which left AppSignal #3808 undiagnosable for months (the real
  # reason is in the REQUEST we sent, not the empty response). Snapshot the
  # outgoing request shape onto the AppSignal transaction so the NEXT
  # occurrence names the offending payload (after #1069 fixed the dominant
  # Opus-4.7+-temperature cause, any residual cause is otherwise opaque).
  # Diagnostics must never mask the real error — re-raise unconditionally.
  attach_llm_request_diagnostics(e)
  raise
end
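
The per-turn usage summation described in the comments (snapshot the highest message id before the call, then sum every round persisted after it) can be sketched with plain structs. log_turn_usage! itself is not shown on this page; `SketchMessage`, `sum_turn_usage`, and the token figures are hypothetical.

```ruby
# Hypothetical stand-in for a persisted assistant message row.
SketchMessage = Struct.new(:id, :input_tokens, :output_tokens)

# Sum usage over every round persisted after the since_id snapshot, rather
# than reading only the final round's counts.
def sum_turn_usage(messages, since_id)
  rounds = messages.select { |message| message.id > since_id }
  {
    rounds: rounds.size,
    input_tokens: rounds.sum { |message| message.input_tokens.to_i },
    output_tokens: rounds.sum { |message| message.output_tokens.to_i }
  }
end

history = [
  SketchMessage.new(10, 900, 120),   # earlier turn, excluded by since_id
  SketchMessage.new(11, 1_000, 50),  # tool-call round 1
  SketchMessage.new(12, 1_400, 80),  # tool-call round 2
  SketchMessage.new(13, 1_900, 300)  # final round
]

usage = sum_turn_usage(history, 10)
# usage => { rounds: 3, input_tokens: 4300, output_tokens: 430 }
# Reading only the final round would report 1900 input tokens instead of 4300.
```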