Class: Assistant::ChatService

Inherits: Object
Includes: PromptComposer
Defined in: app/services/assistant/chat_service.rb

Overview

Service for AI-powered assistant chat using RubyLLM's acts_as_chat.
Uses tool-based architecture: the LLM calls registered tools (DB, content search, etc.)
rather than generating raw SQL. Conversation history is managed by RubyLLM automatically.

Defined Under Namespace

Classes: Result

Constant Summary

THINKING_BUDGET_LOW =

Extended Thinking configuration — gives reasoning models a scratchpad
for multi-step problems (SQL construction, analytical reasoning).
Budget is in tokens; Anthropic models require it, while Gemini uses it as a cap.

4_000
THINKING_BUDGET_MEDIUM =

Simple tool queries

8_000
THINKING_BUDGET_HIGH =

Analytical queries with JOINs/aggregation

16_000
THINKING_QUERY_PATTERNS =

Patterns that indicate the query would benefit from extended thinking

/\b(compare|analyze|trend|correlat|calculate|forecast|predict|why|root.?cause|deep.?dive|break.?down|step.?by.?step|optimize|investigate|audit|reconcil|year.?over.?year|month.?over.?month)\b/i
LLM_NETWORK_RETRY_EXCEPTIONS =

Transient provider / TLS failures (AppSignal #4527: Faraday::SSLError SSL_read EOF).

[
  Faraday::SSLError,
  Faraday::ConnectionFailed,
  Faraday::TimeoutError,
  OpenSSL::SSL::SSLError
].freeze
MODELS =

Available models with their configurations.
Model IDs come from AiModelConstants — the single source of truth.

  • supports_thinking: whether the model supports RubyLLM's with_thinking (extended reasoning)
  • thinking_effort_default: the default effort level when thinking is activated (:low, :medium, :high)

{
  'claude-haiku'  => { id: AiModelConstants.id(:anthropic_haiku),  provider: :anthropic, label: 'Claude Haiku 4.5 (Fast)',      cost: :low,    supports_thinking: false },
  'claude-sonnet' => { id: AiModelConstants.id(:anthropic_sonnet), provider: :anthropic, label: 'Claude Sonnet 4.6 (Balanced)', cost: :medium, supports_thinking: true, thinking_effort_default: :medium },
  'claude-opus'   => { id: AiModelConstants.id(:anthropic_opus),   provider: :anthropic, label: 'Claude Opus 4.6 (Best)',       cost: :high,   supports_thinking: true, thinking_effort_default: :high },
  'gpt-5'         => { id: AiModelConstants.id(:openai_gpt5),      provider: :openai,    label: 'GPT-5 (OpenAI)',               cost: :medium, supports_thinking: false },
  'gpt-5.4'       => { id: AiModelConstants.id(:openai_gpt54),     provider: :openai,    label: 'GPT-5.4 (OpenAI Latest)',      cost: :medium, supports_thinking: false },
  'gpt-5-mini'    => { id: AiModelConstants.id(:openai_gpt5_mini), provider: :openai,    label: 'GPT-5 Mini (Fast)',            cost: :low,    supports_thinking: false },
  'gemini-flash'  => { id: AiModelConstants.id(:gemini_flash),     provider: :gemini,    label: 'Gemini 3 Flash (Google)',       cost: :low,    supports_thinking: true,  thinking_effort_default: :low },
  'gemini-pro'    => { id: AiModelConstants.id(:gemini_pro),       provider: :gemini,    label: 'Gemini 3.1 Pro (Google)',       cost: :medium, supports_thinking: true, thinking_effort_default: :medium }
}.freeze
DEFAULT_MODEL =
'gemini-flash'
MAX_PLAN_COST_USD =

Hard cap on estimated plan execution cost (USD) across isolated step + assembly LLM calls.
NOTE: plan_cost underestimates because run_plan_step_executor returns only the FINAL
API round's tokens (not the cumulative total across tool-call rounds within a step).
Real per-step cost is typically 5-10× higher than reported. The primary cost guard is
the ToolLoopGuard's per-step call limit, not this cap.

2.00
MAX_PLAN_STEP_DURATION =

Wall-clock timeout per plan step — driven from ToolLoopGuard so both
the outer Timeout and the inner guard share a single source of truth.

Assistant::ToolLoopGuard::MAX_STEP_DURATION.seconds
STEP_RESULT_SUMMARIZE_THRESHOLD =

Above this size, step output is summarized with a cheap model before the next step.

2_000
MID_TURN_COMPACT_THRESHOLD =

Mid-turn compaction thresholds (see install_mid_turn_compaction!)

2_000
MID_TURN_KEEP_CHARS =
600
MID_TURN_SKIP_PREFIXES =
['[Compacted', '[Truncated', '[Already retrieved'].freeze
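The three mid-turn constants imply a simple rule: compact oversized tool output down to the keep limit, but leave small or already-marked payloads alone. A hypothetical standalone sketch — the real logic lives in install_mid_turn_compaction!, and the "[Compacted]" marker format below is an assumption:

```ruby
THRESHOLD = 2_000
KEEP_CHARS = 600
SKIP_PREFIXES = ['[Compacted', '[Truncated', '[Already retrieved'].freeze

def compact(text)
  return text if text.length <= THRESHOLD                        # small enough, keep as-is
  return text if SKIP_PREFIXES.any? { |p| text.start_with?(p) }  # already compacted upstream

  "[Compacted] #{text[0, KEEP_CHARS]}"
end
```

Idempotence matters here: without the prefix skip, a second compaction pass would re-wrap and re-truncate output that was already shrunk.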
COMPLEX_QUERY_PATTERNS =

Keywords indicating complex analytical or reasoning queries (need better models)

/\b(why|trend|pattern|anomaly|recommend|insight|correlation|predict|forecast|explain|root.?cause|deep.?dive|strategic|analyze|summarize|evaluate|pros?.and.cons|trade.?off)\b/i
COMPARISON_QUERY_PATTERNS =

Keywords indicating multi-step comparison or research queries (need balanced models)

/\b(compare|vs|versus|between|difference|change|growth|decline|year.?over.?year|month.?over.?month|yoy|mom|research|investigate|audit)\b/i
SIMPLE_QUERY_PATTERNS =

Keywords indicating simple lookup or factual queries (fast models are fine)

/\b(show|list|get|total|count|how many|what is|what are|sum|average|find|look up|search|where is|who is|when did)\b/i
COMPOSE_QUERY_PATTERNS =

Phrases that indicate the user is drafting/composing a short message (email,
follow-up, outreach, internal summary). These are quick content-generation
tasks where Flash is fast and good enough — Pro's extended thinking is
wasted budget here, and on long prompts (e.g. pasted email threads) we'd
otherwise route them to Pro and time out.

/\b(reply|respond|send|email|follow.?up|outreach|reach out|thank.?you note|summary email)\b/i
WRITING_QUERY_PATTERNS =

Phrases that indicate long-form editorial work (blog posts, articles, FAQs,
rewrites). Flash produces noticeably weaker prose here — see /assistant/1639,
where a Buffalo bathroom blog post written under Flash drew "wrote very poorly"
feedback from the editor. Writing tasks always escalate to Gemini 3.1 Pro
(or Claude Sonnet 4.6 when the conversation is already on the Claude family);
Opus is intentionally excluded as too expensive for routine editorial work.

/\b(rewrite|polish|copyedit|copy.?edit|long.?form|article|blog post|blog ?article|blog ?entry|essay|narrative|edit blog|write the blog|draft the blog|update the blog|update the article|expand this section|tighten this|story|landing page copy|product description|press release|case study|whitepaper|white ?paper|content brief|seo copy|meta description|page copy|h(?:ero|eading) copy|body copy)\b/i
WRITING_MODEL_DEFAULT =

Models we'll auto-route to for writing work. Keep tier ordering sensible
(medium cost; never auto-pick Opus, which is reserved for explicit choice).

'gemini-pro'
WRITING_MODEL_CLAUDE =
'claude-sonnet'
WRITING_ELIGIBLE_MODELS =
[WRITING_MODEL_DEFAULT, WRITING_MODEL_CLAUDE].freeze
MODEL_COST_TIER =

Cost tiers for model affinity decisions.
Switching models mid-conversation loses accumulated reasoning context,
so we only switch when escalating to a higher tier (never laterally).

MODELS.transform_values { |c| c[:cost] }.freeze

Constants included from PromptComposer

PromptComposer::AGENT_PROMPTS_DIR, PromptComposer::ANALYTICS_SERVICES, PromptComposer::DOMAIN_TOOL_REQUIREMENTS, PromptComposer::INSTRUCTIONS_TEMPLATE_PATH, PromptComposer::MESSAGE_DOMAIN_PATTERNS

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: []) ⇒ ChatService

Returns a new instance of ChatService.

Parameters:

  • conversation (AssistantConversation)

    The conversation record (acts_as_chat)

  • user_message (String)

    The user's query

  • model (String) (defaults to: 'auto')

    LLM model key or 'auto'

  • tool_services (Array<String>) (defaults to: [])

    Service keys for tool access

  • permitted_services (Array<String>) (defaults to: [])

    All service keys the user's role allows (for tool suggestion prompt)

  • user_context (Hash) (defaults to: {})

    User identity for personalized queries

  • on_status (Proc) (defaults to: nil)

    Callback for status events

  • cancel_check (Proc) (defaults to: nil)

    Returns true when the caller wants to abort (e.g. user clicked Stop)

  • attachments (Array<Pathname>) (defaults to: [])

    Optional file paths to attach to the message (PDFs, images, etc.)



# File 'app/services/assistant/chat_service.rb', line 208

def initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: [])
  @conversation = conversation
  @user_message = user_message
  @tool_services = Array(tool_services).select(&:present?)
  @permitted_services = Array(permitted_services).select(&:present?)
  @user_context = user_context || {}
  @on_status = on_status
  @cancel_check = cancel_check
  @attachments = Array(attachments).select { |p|
    if p.respond_to?(:exist?)
      p.exist?                      # Pathname — check local file
    elsif p.to_s.start_with?('http://', 'https://')
      true                          # URL — pass through to RubyLLM
    else
      File.exist?(p.to_s)           # String path — check local file
    end
  }
  @auto_selected = false
  @model_selection_reason = nil

  # Derive role from user context for tool access control.
  # user_context is a serialized Hash from the controller with 'is_admin' and 'is_manager' keys.
  @user_role = if @user_context['is_admin']
                 :admin
               elsif @user_context['is_manager']
                 :manager
               else
                 :employee
               end

  # Resolve data domain access from the user's CanCanCan roles.
  # This narrows which views/tables the AI tools can query.
  @account = Account.find_by(id: @user_context['account_id']) if @user_context['account_id']
  @allowed_objects = @account ? Assistant::DataPolicy.(@account) : nil
  @analytics_domains = Array(@user_context['analytics_domains'])

  history_length = @conversation.assistant_messages.count

  # Handle 'auto' model selection
  if model == 'auto' || !MODELS.key?(model)
    selection = self.class.auto_select_model(
      user_message,
      history_length: history_length,
      current_model: @conversation.llm_model_name
    )
    @model_key = selection[:model]
    @model_selection_reason = selection[:reason]
    @auto_selected = true
  else
    @model_key = model
  end

  @model_config = MODELS[@model_key]
end

Class Method Details

.auto_select_candidate(query, history_length: 0, current_model: nil) ⇒ Object



# File 'app/services/assistant/chat_service.rb', line 142

def self.auto_select_candidate(query, history_length: 0, current_model: nil)
  query_lower = query.downcase.strip
  token_count = estimate_tokens(query)

  is_writing    = query_lower.match?(WRITING_QUERY_PATTERNS)
  is_complex    = query_lower.match?(COMPLEX_QUERY_PATTERNS)
  is_comparison = query_lower.match?(COMPARISON_QUERY_PATTERNS)
  is_compose    = query_lower.match?(COMPOSE_QUERY_PATTERNS)
  is_simple     = query_lower.match?(SIMPLE_QUERY_PATTERNS) && !is_comparison && !is_complex && !is_writing
  long_conversation = history_length > 20

  # Long-form editorial work (blog posts, articles, rewrites) must always run
  # on a Pro/Sonnet-tier model — Flash produces noticeably weaker prose. We
  # force_switch so a conversation that started on Flash doesn't hold writing
  # turns hostage via model affinity. Stay on Sonnet only if the conversation
  # is already on a Claude model; otherwise default to Gemini 3.1 Pro.
  if is_writing
    chosen = current_model == WRITING_MODEL_CLAUDE ? WRITING_MODEL_CLAUDE : WRITING_MODEL_DEFAULT
    return { model: chosen, reason: 'Writing/editorial task', force_switch: true }
  end

  # Compose/email tasks are short content generation, not analysis. Keep them
  # on Flash even when the prompt is long (pasted email threads inflate token
  # counts but don't require deep reasoning) — Pro burns most of
  # MAX_TURN_DURATION on extended thinking before any tool runs.
  return { model: 'gemini-flash', reason: 'Compose/email task', force_switch: true } if is_compose && !is_complex

  if is_complex || token_count > 200
    { model: 'gemini-pro', reason: 'Complex analytical query' }
  elsif is_comparison || token_count > 80
    { model: 'gemini-flash', reason: 'Multi-step query' }
  elsif is_simple && !long_conversation
    { model: 'gemini-flash', reason: 'Simple query' }
  else
    { model: 'gemini-flash', reason: long_conversation ? 'Long conversation context' : 'Standard query' }
  end
end
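The tiered routing above can be condensed into a standalone sketch. The regexes here are abbreviated stand-ins for the real pattern constants, and the comparison tier, token-count thresholds, and force_switch branches are omitted for brevity:

```ruby
WRITING_RE = /\b(rewrite|blog post|article|press release)\b/i
COMPLEX_RE = /\b(why|trend|analyze|forecast|root.?cause)\b/i
SIMPLE_RE  = /\b(show|list|count|how many|total)\b/i

def route(query)
  q = query.downcase.strip
  return 'gemini-pro'   if q.match?(WRITING_RE) || q.match?(COMPLEX_RE)  # escalate
  return 'gemini-flash' if q.match?(SIMPLE_RE)                           # cheap lookup

  'gemini-flash' # standard fallback
end

route('why did churn trend up?')       # => "gemini-pro"
route('list open invoices for March')  # => "gemini-flash"
```

Note that order matters: writing/complex checks must run before the simple check, mirroring how is_simple in the real method is suppressed when a complex or writing pattern also matches.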

.auto_select_model(query, history_length: 0, current_model: nil) ⇒ Hash

Auto-select the best model based on query complexity.
Works for both analytics and general assistant queries.

Design goals:

  • Default to Gemini Flash for all queries — cheapest option with good quality.
  • Escalate to Gemini Pro only for genuinely complex analytical queries.
  • Claude models (Sonnet, Opus, Haiku) remain available via explicit user selection
    but are never auto-selected, keeping Anthropic costs near zero for auto users.
  • Model affinity: if the conversation already uses a model, prefer keeping it
    unless the new query demands a higher cost tier. Lateral switches lose
    accumulated reasoning context for no benefit.

Parameters:

  • query (String)

    The user's question

  • history_length (Integer) (defaults to: 0)

    Number of messages in conversation history (informational only)

  • current_model (String, nil) (defaults to: nil)

    Model key currently in use (for affinity)

Returns:

  • (Hash)

    { model: 'model-key', reason: 'explanation' }



# File 'app/services/assistant/chat_service.rb', line 117

def self.auto_select_model(query, history_length: 0, current_model: nil)
  candidate = auto_select_candidate(query, history_length: history_length, current_model: current_model)

  # Some intents (compose/email, writing) are strong enough signals that we
  # override model affinity. Compose pulls down to Flash so a long pasted
  # email thread doesn't burn the whole turn budget on Pro's extended
  # thinking (PR #618 / conv 1233). Writing pushes UP to Pro/Sonnet so we
  # never produce blog content on Flash (conv 1639 / Julia's feedback).
  return candidate.except(:force_switch) if candidate[:force_switch]

  if current_model.present? && MODELS.key?(current_model)
    candidate_tier = COST_TIER_RANK[MODEL_COST_TIER[candidate[:model]]] || 0
    current_tier   = COST_TIER_RANK[MODEL_COST_TIER[current_model]] || 0

    if candidate_tier <= current_tier
      return { model: current_model, reason: "#{candidate[:reason]} (keeping #{current_model} for context continuity)" }
    end
  end

  candidate
end
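The affinity comparison can be sketched in isolation. COST_TIER_RANK is defined elsewhere in the service, so the mapping below is an assumed stand-in with the obvious low < medium < high ordering, and MODEL_COST_TIER is trimmed to three entries:

```ruby
COST_TIER_RANK = { low: 0, medium: 1, high: 2 }.freeze
MODEL_COST_TIER = {
  'gemini-flash' => :low,
  'gemini-pro'   => :medium,
  'claude-opus'  => :high
}.freeze

# True when the candidate's tier is at or below the current model's tier —
# a lateral or downward move we suppress to keep accumulated reasoning context.
def keep_current_model?(candidate, current)
  COST_TIER_RANK.fetch(MODEL_COST_TIER[candidate], 0) <=
    COST_TIER_RANK.fetch(MODEL_COST_TIER[current], 0)
end

keep_current_model?('gemini-flash', 'gemini-pro')  # lateral/down: keep current
keep_current_model?('claude-opus', 'gemini-pro')   # escalation: switch allowed
```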

.available_models ⇒ Object

Class method to get available models for UI (includes Auto option first)



# File 'app/services/assistant/chat_service.rb', line 264

def self.available_models
  auto_option = [{ key: 'auto', label: 'Auto (Smart Select)', cost: :auto, model_id: nil }]
  model_options = MODELS.map do |key, config|
    { key: key, label: config[:label], cost: config[:cost], model_id: config[:id] }
  end
  auto_option + model_options
end

.estimate_tokens(text) ⇒ Integer

Rough token estimate (1 token ≈ 4 chars for English).
Used only for heuristic model-complexity selection, not billing.

Parameters:

  • text (String)

    Text to estimate tokens for

Returns:

  • (Integer)

    Estimated token count



# File 'app/services/assistant/chat_service.rb', line 184

def self.estimate_tokens(text)
  return 0 if text.blank?

  (text.length / 4.0).ceil
end
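A worked example of the heuristic, rewritten without Rails' blank? so it runs standalone:

```ruby
def estimate_tokens(text)
  return 0 if text.nil? || text.strip.empty?

  (text.length / 4.0).ceil
end

estimate_tokens('compare revenue trends')  # 22 chars => ceil(5.5) => 6
estimate_tokens('')                        # => 0
```

The ceiling keeps the estimate conservative: any non-blank text counts as at least one token, so threshold checks like token_count > 200 never undercount short inputs to zero.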

.label_for_model(model_key) ⇒ Object

Resolve a stored model preference / llm_model_name (e.g. 'gemini-pro') to the
human-readable label from MODELS (e.g. "Gemini 3.1 Pro (Google)"). Used by the
chat picker and history badges so users can see which model actually ran a
turn, not just the dropdown alias.
Returns the stored value verbatim when no MODELS entry matches.



# File 'app/services/assistant/chat_service.rb', line 277

def self.label_for_model(model_key)
  return 'Auto (Smart Select)' if model_key.to_s == 'auto'

  config = MODELS[model_key.to_s]
  return model_key.to_s if config.nil?

  config[:label]
end
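The lookup-with-fallback behavior, shown against a trimmed stand-in for MODELS:

```ruby
MODELS = { 'gemini-pro' => { label: 'Gemini 3.1 Pro (Google)' } }.freeze

def label_for_model(model_key)
  return 'Auto (Smart Select)' if model_key.to_s == 'auto'

  config = MODELS[model_key.to_s]
  config ? config[:label] : model_key.to_s
end

label_for_model('auto')         # => "Auto (Smart Select)"
label_for_model('gemini-pro')   # => "Gemini 3.1 Pro (Google)"
label_for_model('legacy-gpt4')  # stored value returned verbatim
```

The verbatim fallback means history badges for retired model keys degrade gracefully to the raw key instead of raising or showing a blank label.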

Instance Method Details

#call(&block) ⇒ Object

Execute the chat with streaming response.
Messages auto-persist to assistant_messages via acts_as_chat
with token tracking, tool calls, and thinking traces.

Yields content chunks as they're generated.
Returns a Result with content and usage stats.



# File 'app/services/assistant/chat_service.rb', line 310

def call(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation

  # Tell the conversation who the actual sender is so AssistantMessage
  # can stamp sender_id on the persisted user message.
  @conversation.current_sender_id = @user_context['party_id']

  # Stream the response — conversation.ask() auto-persists user + assistant messages.
  # The return value of ask() is the fully-assembled StreamAccumulator message with
  # correct input/output token counts (not the last streaming chunk, which has nil tokens).
  streamer_proc = build_streamer_proc

  final_message = with_instrumented_llm_call(feature: 'assistant_chat') do
    if @attachments.present?
      ask_with_attachments(user_message, @attachments, &streamer_proc)
    else
      @conversation.ask(user_message, &streamer_proc)
    end
  end

  halt_result = handle_halt(final_message, streamer_proc, label: 'call')
  return halt_result if halt_result

  build_result(final_message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (call) — conversation #{@conversation.id}")
  build_cancelled_result
rescue RubyLLM::ContextLengthExceededError => err
  Rails.logger.error("[Assistant::ChatService] Context length exceeded: #{err.message}")
  raise
end
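The streaming contract can be illustrated in isolation: the lambda below plays the role of the consumer wired up by build_streamer_proc, receiving content chunks as they arrive while the service accumulates the full text (the chunk source here is simulated; in the real flow, chunks come from the provider stream via conversation.ask):

```ruby
full_response = +''                              # mirrors @full_response
streamer = ->(chunk) { full_response << chunk }  # stand-in for the caller's block

['Revenue ', 'grew ', '12% in Q3.'].each { |chunk| streamer.call(chunk) }

full_response  # => "Revenue grew 12% in Q3."
```

This is why the docs stress using ask()'s return value for token counts: individual chunks carry content but not usage totals, so only the fully-assembled message is authoritative.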

#complete_only(&block) ⇒ Object

Retry path after emergency compaction: reconfigure the conversation and
call complete() directly. The user message is already persisted from the
prior attempt — to_llm replays it from DB. Skips ask() to avoid duplicates.



# File 'app/services/assistant/chat_service.rb', line 351

def complete_only(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation
  @conversation.current_sender_id = @user_context['party_id']

  streamer_proc = build_streamer_proc

  final_message = with_instrumented_llm_call(feature: 'assistant_chat') do
    llm_chat = @conversation.to_llm
    llm_chat.complete(&streamer_proc)
  end

  halt_result = handle_halt(final_message, streamer_proc, label: 'complete_only')
  return halt_result if halt_result

  build_result(final_message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (complete_only) — conversation #{@conversation.id}")
  build_cancelled_result
end

#emit_status(message) ⇒ Object (protected)

Emit a status update for the UI (non-content, just progress indicator).
Also used by Assistant::PlanOrchestrator (via Object#send).



# File 'app/services/assistant/chat_service.rb', line 710

def emit_status(message)
  @on_status&.call(message)
end

#stream_content(content) ⇒ Object (protected)

Stream content to client AND capture for conversation history.
Also used by Assistant::PlanOrchestrator (via Object#send).



# File 'app/services/assistant/chat_service.rb', line 641

def stream_content(content)
  @full_response << content
  streamer.call(content)
end

#with_instrumented_llm_call(feature:, source: 'sunny') ⇒ Object (protected)

Wraps an LLM call with PaperTrail audit context, CurrentScope user, instrumentation
metadata, and transient network retries. Every LLM round (ask, complete, agent.ask)
should go through this so audit trail, cost logging, and retries are consistent.
Also used by Assistant::PlanOrchestrator (via Object#send).



# File 'app/services/assistant/chat_service.rb', line 414

def with_instrumented_llm_call(feature:, source: 'sunny')
  sender_id = @user_context['party_id'] || @conversation.user_id
  whodunnit = sender_id.to_s.presence || 'Sunny'

  PaperTrail.request(
    whodunnit: whodunnit,
    controller_info: {
      source: source,
      sender_id: sender_id,
      sender_name: @user_context['full_name'],
      conversation_id: @conversation.id,
      conversation_url: "/en-US/assistant/#{@conversation.id}"
    }
  ) do
    CurrentScope.with_user_id(sender_id) do
      RubyLLM::Instrumentation.with(
        feature: feature,
        conversation_id: @conversation.id,
        log_subject: @conversation,
        log_account_id: Account.where(party_id: sender_id).pick(:id)
      ) do
        with_llm_network_retries { yield }
      end
    end
  end
end
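A hypothetical sketch of the transient-retry helper referenced above (with_llm_network_retries is defined elsewhere in the service); stdlib Errno errors stand in for the Faraday/OpenSSL exception list, and the attempt count is an assumption:

```ruby
RETRYABLE = [Errno::ECONNRESET, Errno::ETIMEDOUT].freeze

def with_network_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *RETRYABLE
    retry if attempts < max_attempts  # transient failure: try again
    raise                             # budget exhausted: surface the error
  end
end

calls = 0
result = with_network_retries do
  calls += 1
  raise Errno::ECONNRESET, 'SSL_read EOF' if calls < 3  # fail twice, then succeed
  :ok
end
# result == :ok after two transient failures and one success
```

Keeping the retry loop innermost (inside PaperTrail, CurrentScope, and Instrumentation) means each retry re-runs only the LLM call itself, while a single audit and instrumentation context spans the whole attempt sequence.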