Class: Assistant::ChatService
- Inherits: Object
- Includes: PromptComposer
- Defined in: app/services/assistant/chat_service.rb
Overview
Service for AI-powered assistant chat using RubyLLM's acts_as_chat.
Uses tool-based architecture: the LLM calls registered tools (DB, content search, etc.)
rather than generating raw SQL. Conversation history is managed by RubyLLM automatically.
Defined Under Namespace
Classes: Result
Constant Summary
- THINKING_BUDGET_LOW =
  Extended Thinking configuration — gives reasoning models a scratchpad
  for multi-step problems (SQL construction, analytical reasoning).
  Budget is in tokens; Anthropic models require it, Gemini uses it as a cap.
  4_000
- THINKING_BUDGET_MEDIUM =
  Simple tool queries
  8_000
- THINKING_BUDGET_HIGH =
  Analytical queries with JOINs/aggregation
  16_000
- THINKING_QUERY_PATTERNS =
  Patterns that indicate the query would benefit from extended thinking
  /\b(compare|analyze|trend|correlat|calculate|forecast|predict|why|root.?cause|deep.?dive|break.?down|step.?by.?step|optimize|investigate|audit|reconcil|year.?over.?year|month.?over.?month)\b/i
- LLM_NETWORK_RETRY_EXCEPTIONS =
  Transient provider / TLS failures (AppSignal #4527: Faraday::SSLError SSL_read EOF).
  [Faraday::SSLError, Faraday::ConnectionFailed, Faraday::TimeoutError, OpenSSL::SSL::SSLError].freeze
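How the class maps THINKING_QUERY_PATTERNS onto the three budgets is not shown in this excerpt. A minimal standalone sketch of one plausible mapping — `thinking_budget_for` and the JOIN/aggregation check are assumptions, not part of ChatService:

```ruby
# Hypothetical helper — not part of ChatService. Plain lookups stay on the
# low budget, analytical phrasing earns a bigger scratchpad, and aggregation
# hints (an assumed heuristic) bump it to the high tier.
THINKING_BUDGET_LOW    = 4_000
THINKING_BUDGET_MEDIUM = 8_000
THINKING_BUDGET_HIGH   = 16_000
THINKING_QUERY_PATTERNS =
  /\b(compare|analyze|trend|correlat|calculate|forecast|predict|why)\b/i

def thinking_budget_for(query)
  return THINKING_BUDGET_LOW unless query.match?(THINKING_QUERY_PATTERNS)

  # JOINs/aggregation suggest multi-step SQL reasoning (assumed check).
  if query.match?(/\b(join|group|aggregate|sum|average)\b/i)
    THINKING_BUDGET_HIGH
  else
    THINKING_BUDGET_MEDIUM
  end
end
```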
- MODELS =
  Available models with their configurations.
  Model IDs come from AiModelConstants — the single source of truth.
  supports_thinking: whether the model supports RubyLLM's with_thinking (extended reasoning)
  thinking_effort_default: the default effort level when thinking is activated (:low, :medium, :high)
  {
    'claude-haiku' => { id: AiModelConstants.id(:anthropic_haiku), provider: :anthropic,
                        label: 'Claude Haiku 4.5 (Fast)', cost: :low, supports_thinking: false },
    'claude-sonnet' => { id: AiModelConstants.id(:anthropic_sonnet), provider: :anthropic,
                         label: 'Claude Sonnet 4.6 (Balanced)', cost: :medium, supports_thinking: true,
                         thinking_effort_default: :medium },
    'claude-opus' => { id: AiModelConstants.id(:anthropic_opus), provider: :anthropic,
                       label: 'Claude Opus 4.6 (Best)', cost: :high, supports_thinking: true,
                       thinking_effort_default: :high },
    'gpt-5' => { id: AiModelConstants.id(:openai_gpt5), provider: :openai,
                 label: 'GPT-5 (OpenAI)', cost: :medium, supports_thinking: false },
    'gpt-5.4' => { id: AiModelConstants.id(:openai_gpt54), provider: :openai,
                   label: 'GPT-5.4 (OpenAI Latest)', cost: :medium, supports_thinking: false },
    'gpt-5-mini' => { id: AiModelConstants.id(:openai_gpt5_mini), provider: :openai,
                      label: 'GPT-5 Mini (Fast)', cost: :low, supports_thinking: false },
    'gemini-flash' => { id: AiModelConstants.id(:gemini_flash), provider: :gemini,
                        label: 'Gemini 3 Flash (Google)', cost: :low, supports_thinking: true,
                        thinking_effort_default: :low },
    'gemini-pro' => { id: AiModelConstants.id(:gemini_pro), provider: :gemini,
                      label: 'Gemini 3.1 Pro (Google)', cost: :medium, supports_thinking: true,
                      thinking_effort_default: :medium }
  }.freeze
- DEFAULT_MODEL =
  'gemini-flash'
- MAX_PLAN_COST_USD =
  Hard cap on estimated plan execution cost (USD) across isolated step + assembly LLM calls.
  NOTE: plan_cost underestimates because run_plan_step_executor returns only the FINAL
  API round's tokens (not the cumulative total across tool-call rounds within a step).
  Real per-step cost is typically 5-10× higher than reported. The primary cost guard is
  the ToolLoopGuard's per-step call limit, not this cap.
  2.00
- MAX_PLAN_STEP_DURATION =
  Wall-clock timeout per plan step — driven from ToolLoopGuard so both
  the outer Timeout and the inner guard share a single source of truth.
  Assistant::ToolLoopGuard::MAX_STEP_DURATION.seconds
- STEP_RESULT_SUMMARIZE_THRESHOLD =
  Above this size, step output is summarized with a cheap model before the next step.
  2_000
- MID_TURN_COMPACT_THRESHOLD =
  Mid-turn compaction thresholds (see install_mid_turn_compaction!)
  2_000
- MID_TURN_KEEP_CHARS =
  600
- MID_TURN_SKIP_PREFIXES =
  ['[Compacted', '[Truncated', '[Already retrieved'].freeze
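install_mid_turn_compaction! itself is not shown in this excerpt. A minimal sketch of how these three thresholds plausibly interact — `compact_tool_output` is a hypothetical name, not the real method:

```ruby
MID_TURN_COMPACT_THRESHOLD = 2_000
MID_TURN_KEEP_CHARS = 600
MID_TURN_SKIP_PREFIXES = ['[Compacted', '[Truncated', '[Already retrieved'].freeze

# Hypothetical compaction pass: oversized tool output is cut to a short prefix
# behind a marker, and already-marked output is never compacted twice.
def compact_tool_output(text)
  return text if text.length <= MID_TURN_COMPACT_THRESHOLD
  return text if MID_TURN_SKIP_PREFIXES.any? { |prefix| text.start_with?(prefix) }

  "[Compacted from #{text.length} chars] #{text[0, MID_TURN_KEEP_CHARS]}"
end
```

The skip-prefix check is what makes repeated passes over the same history safe: a marker from one turn short-circuits compaction on the next.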
- COMPLEX_QUERY_PATTERNS =
  Keywords indicating complex analytical or reasoning queries (need better models)
  /\b(why|trend|pattern|anomaly|recommend|insight|correlation|predict|forecast|explain|root.?cause|deep.?dive|strategic|analyze|summarize|evaluate|pros?.and.cons|trade.?off)\b/i
- COMPARISON_QUERY_PATTERNS =
  Keywords indicating multi-step comparison or research queries (need balanced models)
  /\b(compare|vs|versus|between|difference|change|growth|decline|year.?over.?year|month.?over.?month|yoy|mom|research|investigate|audit)\b/i
- SIMPLE_QUERY_PATTERNS =
  Keywords indicating simple lookup or factual queries (fast models are fine)
  /\b(show|list|get|total|count|how many|what is|what are|sum|average|find|look up|search|where is|who is|when did)\b/i
- COMPOSE_QUERY_PATTERNS =
  Phrases that indicate the user is drafting/composing a short message (email,
  follow-up, outreach, internal summary). These are quick content-generation
  tasks where Flash is fast and good enough — Pro's extended thinking is
  wasted budget here, and on long prompts (e.g. pasted email threads) we'd
  otherwise route them to Pro and time out.
  /\b(reply|respond|send|email|follow.?up|outreach|reach out|thank.?you note|summary email)\b/i
- WRITING_QUERY_PATTERNS =
  Phrases that indicate long-form editorial work (blog posts, articles, FAQs,
  rewrites). Flash produces noticeably weaker prose here — see /assistant/1639,
  where a Buffalo bathroom blog post written under Flash drew "wrote very poorly"
  feedback from the editor. Writing tasks always escalate to Gemini 3.1 Pro
  (or Claude Sonnet 4.6 when the conversation is already on the Claude family);
  Opus is intentionally excluded as too expensive for routine editorial work.
  /\b(rewrite|polish|copyedit|copy.?edit|long.?form|article|blog post|blog ?article|blog ?entry|essay|narrative|edit blog|write the blog|draft the blog|update the blog|update the article|expand this section|tighten this|story|landing page copy|product description|press release|case study|whitepaper|white ?paper|content brief|seo copy|meta description|page copy|h(?:ero|eading) copy|body copy)\b/i
- WRITING_MODEL_DEFAULT =
  Models we'll auto-route to for writing work. Keep tier ordering sensible
  (medium cost; never auto-pick Opus, which is reserved for explicit choice).
  'gemini-pro'
- WRITING_MODEL_CLAUDE =
  'claude-sonnet'
- WRITING_ELIGIBLE_MODELS =
  [WRITING_MODEL_DEFAULT, WRITING_MODEL_CLAUDE].freeze
- MODEL_COST_TIER =
  Cost tiers for model affinity decisions.
  Switching models mid-conversation loses accumulated reasoning context,
  so we only switch when escalating to a higher tier (never laterally).
  MODELS.transform_values { |c| c[:cost] }.freeze
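MODEL_COST_TIER feeds a rank comparison in auto_select_model. COST_TIER_RANK is not listed in this summary; a definition consistent with that comparison, as a sketch — the rank values and the abbreviated tier table are assumptions:

```ruby
# Assumed rank table: higher number = more expensive tier.
COST_TIER_RANK = { low: 0, medium: 1, high: 2 }.freeze

# Abbreviated stand-in for MODELS.transform_values { |c| c[:cost] }.
MODEL_COST_TIER = {
  'gemini-flash' => :low,
  'gemini-pro'   => :medium,
  'claude-opus'  => :high
}.freeze

# Affinity rule: keep the current model unless the candidate is a strict
# cost-tier escalation (lateral and downward switches lose context for nothing).
def keep_current_model?(current_model, candidate_model)
  candidate_tier = COST_TIER_RANK[MODEL_COST_TIER[candidate_model]] || 0
  current_tier   = COST_TIER_RANK[MODEL_COST_TIER[current_model]] || 0
  candidate_tier <= current_tier
end
```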
Constants included from PromptComposer
PromptComposer::AGENT_PROMPTS_DIR, PromptComposer::ANALYTICS_SERVICES, PromptComposer::DOMAIN_TOOL_REQUIREMENTS, PromptComposer::INSTRUCTIONS_TEMPLATE_PATH, PromptComposer::MESSAGE_DOMAIN_PATTERNS
Class Method Summary
- .auto_select_candidate(query, history_length: 0, current_model: nil) ⇒ Object
- .auto_select_model(query, history_length: 0, current_model: nil) ⇒ Hash
  Auto-select the best model based on query complexity.
- .available_models ⇒ Object
  Class method to get available models for UI (includes Auto option first).
- .estimate_tokens(text) ⇒ Integer
  Rough token estimate (1 token ≈ 4 chars for English).
- .label_for_model(model_key) ⇒ Object
  Resolve a stored model preference / llm_model_name (e.g. 'gemini-pro') to a human-readable label that includes the actual underlying model id (e.g. "Gemini 3.1 Pro Preview").
Instance Method Summary
- #call(&block) ⇒ Object
  Execute the chat with streaming response.
- #complete_only(&block) ⇒ Object
  Retry path after emergency compaction: reconfigure the conversation and call complete() directly.
- #emit_status(message) ⇒ Object (protected)
  Emit a status update for the UI (non-content, just progress indicator).
- #initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: []) ⇒ ChatService (constructor)
  A new instance of ChatService.
- #stream_content(content) ⇒ Object (protected)
  Stream content to client AND capture for conversation history.
- #with_instrumented_llm_call(feature:, source: 'sunny') ⇒ Object (protected)
  Wraps an LLM call with PaperTrail audit context, CurrentScope user, instrumentation metadata, and transient network retries.
Constructor Details
#initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: []) ⇒ ChatService
Returns a new instance of ChatService.
# File 'app/services/assistant/chat_service.rb', line 208

def initialize(conversation:, user_message:, model: 'auto', tool_services: [],
               permitted_services: [], user_context: {}, on_status: nil,
               cancel_check: nil, attachments: [])
  @conversation = conversation
  @user_message = user_message
  @tool_services = Array(tool_services).select(&:present?)
  @permitted_services = Array(permitted_services).select(&:present?)
  @user_context = user_context || {}
  @on_status = on_status
  @cancel_check = cancel_check
  @attachments = Array(attachments).select { |p|
    if p.respond_to?(:exist?)
      p.exist? # Pathname — check local file
    elsif p.to_s.start_with?('http://', 'https://')
      true # URL — pass through to RubyLLM
    else
      File.exist?(p.to_s) # String path — check local file
    end
  }
  @auto_selected = false
  @model_selection_reason = nil

  # Derive role from user context for tool access control.
  # user_context is a serialized Hash from the controller with 'is_admin' and 'is_manager' keys.
  @user_role =
    if @user_context['is_admin']
      :admin
    elsif @user_context['is_manager']
      :manager
    else
      :employee
    end

  # Resolve data domain access from the user's CanCanCan roles.
  # This narrows which views/tables the AI tools can query.
  @account = Account.find_by(id: @user_context['account_id']) if @user_context['account_id']
  @allowed_objects = @account ? Assistant::DataPolicy.allowed_objects_for_account(@account) : nil
  @analytics_domains = Array(@user_context['analytics_domains'])

  history_length = @conversation.messages.count

  # Handle 'auto' model selection
  if model == 'auto' || !MODELS.key?(model)
    selection = self.class.auto_select_model(
      user_message,
      history_length: history_length,
      current_model: @conversation.llm_model_name
    )
    @model_key = selection[:model]
    @model_selection_reason = selection[:reason]
    @auto_selected = true
  else
    @model_key = model
  end

  @model_config = MODELS[@model_key]
end
Class Method Details
.auto_select_candidate(query, history_length: 0, current_model: nil) ⇒ Object
# File 'app/services/assistant/chat_service.rb', line 142

def self.auto_select_candidate(query, history_length: 0, current_model: nil)
  query_lower = query.downcase.strip
  token_count = estimate_tokens(query)

  is_writing = query_lower.match?(WRITING_QUERY_PATTERNS)
  is_complex = query_lower.match?(COMPLEX_QUERY_PATTERNS)
  is_comparison = query_lower.match?(COMPARISON_QUERY_PATTERNS)
  is_compose = query_lower.match?(COMPOSE_QUERY_PATTERNS)
  is_simple = query_lower.match?(SIMPLE_QUERY_PATTERNS) && !is_comparison && !is_complex && !is_writing
  long_conversation = history_length > 20

  # Long-form editorial work (blog posts, articles, rewrites) must always run
  # on a Pro/Sonnet-tier model — Flash produces noticeably weaker prose. We
  # force_switch so a conversation that started on Flash doesn't hold writing
  # turns hostage via model affinity. Stay on Sonnet only if the conversation
  # is already on a Claude model; otherwise default to Gemini 3.1 Pro.
  if is_writing
    chosen = current_model == WRITING_MODEL_CLAUDE ? WRITING_MODEL_CLAUDE : WRITING_MODEL_DEFAULT
    return { model: chosen, reason: 'Writing/editorial task', force_switch: true }
  end

  # Compose/email tasks are short content generation, not analysis. Keep them
  # on Flash even when the prompt is long (pasted email threads inflate token
  # counts but don't require deep reasoning) — Pro burns most of
  # MAX_TURN_DURATION on extended thinking before any tool runs.
  return { model: 'gemini-flash', reason: 'Compose/email task', force_switch: true } if is_compose && !is_complex

  if is_complex || token_count > 200
    { model: 'gemini-pro', reason: 'Complex analytical query' }
  elsif is_comparison || token_count > 80
    { model: 'gemini-flash', reason: 'Multi-step query' }
  elsif is_simple && !long_conversation
    { model: 'gemini-flash', reason: 'Simple query' }
  else
    { model: 'gemini-flash', reason: long_conversation ? 'Long conversation context' : 'Standard query' }
  end
end
.auto_select_model(query, history_length: 0, current_model: nil) ⇒ Hash
Auto-select the best model based on query complexity.
Works for both analytics and general assistant queries.
Design goals:
- Default to Gemini Flash for all queries — cheapest option with good quality.
- Escalate to Gemini Pro only for genuinely complex analytical queries.
- Claude models (Sonnet, Opus, Haiku) remain available via explicit user selection
  but are never auto-selected, keeping Anthropic costs near zero for auto users.
- Model affinity: if the conversation already uses a model, prefer keeping it
  unless the new query demands a higher cost tier. Lateral switches lose
  accumulated reasoning context for no benefit.
# File 'app/services/assistant/chat_service.rb', line 117

def self.auto_select_model(query, history_length: 0, current_model: nil)
  candidate = auto_select_candidate(query, history_length: history_length, current_model: current_model)

  # Some intents (compose/email, writing) are strong enough signals that we
  # override model affinity. Compose pulls down to Flash so a long pasted
  # email thread doesn't burn the whole turn budget on Pro's extended
  # thinking (PR #618 / conv 1233). Writing pushes UP to Pro/Sonnet so we
  # never produce blog content on Flash (conv 1639 / Julia's feedback).
  return candidate.except(:force_switch) if candidate[:force_switch]

  if current_model.present? && MODELS.key?(current_model)
    candidate_tier = COST_TIER_RANK[MODEL_COST_TIER[candidate[:model]]] || 0
    current_tier = COST_TIER_RANK[MODEL_COST_TIER[current_model]] || 0
    if candidate_tier <= current_tier
      return { model: current_model, reason: "#{candidate[:reason]} (keeping #{current_model} for context continuity)" }
    end
  end

  candidate
end
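The precedence implemented above (writing first, then compose, then the complexity ladder) can be staged in miniature. The regexes here are abbreviated from the real constants, and `route` is a hypothetical condensation of auto_select_candidate plus auto_select_model, not code from the class:

```ruby
# Abbreviated pattern stand-ins — the real constants are far broader.
WRITING = /\b(rewrite|blog post|article|press release)\b/i
COMPOSE = /\b(reply|respond|email|follow.?up)\b/i
COMPLEX = /\b(why|analyze|forecast|root.?cause)\b/i

def route(query)
  # Writing always escalates to a Pro-tier model, overriding affinity.
  return 'gemini-pro' if query.match?(WRITING)
  # Compose stays on Flash even for long pasted threads, unless also complex.
  return 'gemini-flash' if query.match?(COMPOSE) && !query.match?(COMPLEX)

  query.match?(COMPLEX) ? 'gemini-pro' : 'gemini-flash'
end
```

The ordering matters: a prompt like "rewrite this blog post and reply to Julia" matches both writing and compose, and the writing branch must win so editorial work never lands on Flash.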
.available_models ⇒ Object
Class method to get available models for UI (includes Auto option first)
# File 'app/services/assistant/chat_service.rb', line 264

def self.available_models
  auto_option = [{ key: 'auto', label: 'Auto (Smart Select)', cost: :auto, model_id: nil }]
  model_options = MODELS.map do |key, config|
    { key: key, label: config[:label], cost: config[:cost], model_id: config[:id] }
  end
  auto_option + model_options
end
.estimate_tokens(text) ⇒ Integer
Rough token estimate (1 token ≈ 4 chars for English).
Used only for heuristic model-complexity selection, not billing.
# File 'app/services/assistant/chat_service.rb', line 184

def self.estimate_tokens(text)
  return 0 if text.blank?

  (text.length / 4.0).ceil
end
.label_for_model(model_key) ⇒ Object
Resolve a stored model preference / llm_model_name (e.g. 'gemini-pro') to a
human-readable label that includes the actual underlying model id (e.g.
"Gemini 3.1 Pro Preview"). Used by the chat picker and history badges so
users can see WHAT model actually ran a turn — not just the dropdown alias.
Returns the stored value verbatim when no MODELS entry matches.
# File 'app/services/assistant/chat_service.rb', line 277

def self.label_for_model(model_key)
  return 'Auto (Smart Select)' if model_key.to_s == 'auto'

  config = MODELS[model_key.to_s]
  return model_key.to_s if config.nil?

  config[:label]
end
Instance Method Details
#call(&block) ⇒ Object
Execute the chat with streaming response.
Messages auto-persist to assistant_messages via acts_as_chat
with token tracking, tool calls, and thinking traces.
Yields content chunks as they're generated.
Returns a Result with content and usage stats.
# File 'app/services/assistant/chat_service.rb', line 310

def call(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation

  # Tell the conversation who the actual sender is so AssistantMessage
  # can stamp sender_id on the persisted user message.
  @conversation.current_sender_id = @user_context['party_id']

  # Stream the response — conversation.ask() auto-persists user + assistant messages.
  # The return value of ask() is the fully-assembled StreamAccumulator message with
  # correct input/output token counts (not the last streaming chunk, which has nil tokens).
  streamer_proc = build_streamer_proc
  message = with_instrumented_llm_call(feature: 'assistant_chat') do
    if @attachments.present?
      @conversation.ask(@user_message, with: @attachments, &streamer_proc)
    else
      @conversation.ask(@user_message, &streamer_proc)
    end
  end

  halt_result = handle_halt(message, streamer_proc, label: 'call')
  return halt_result if halt_result

  build_result(message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (call) — conversation #{@conversation.id}")
  build_cancelled_result
rescue RubyLLM::ContextLengthExceededError => err
  Rails.logger.error("[Assistant::ChatService] Context length exceeded: #{err.message}")
  raise
end
#complete_only(&block) ⇒ Object
Retry path after emergency compaction: reconfigure the conversation and
call complete() directly. The user message is already persisted from the
prior attempt — to_llm replays it from DB. Skips ask() to avoid duplicates.
# File 'app/services/assistant/chat_service.rb', line 351

def complete_only(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation
  @conversation.current_sender_id = @user_context['party_id']

  streamer_proc = build_streamer_proc
  message = with_instrumented_llm_call(feature: 'assistant_chat') do
    llm_chat = @conversation.to_llm
    llm_chat.complete(&streamer_proc)
  end

  halt_result = handle_halt(message, streamer_proc, label: 'complete_only')
  return halt_result if halt_result

  build_result(message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (complete_only) — conversation #{@conversation.id}")
  build_cancelled_result
end
#emit_status(message) ⇒ Object (protected)
Emit a status update for the UI (non-content, just progress indicator).
Also used by Assistant::PlanOrchestrator (via Object#send).
# File 'app/services/assistant/chat_service.rb', line 710

def emit_status(message)
  @on_status&.call(message)
end
#stream_content(content) ⇒ Object (protected)
Stream content to client AND capture for conversation history.
Also used by Assistant::PlanOrchestrator (via Object#send).
# File 'app/services/assistant/chat_service.rb', line 641

def stream_content(content)
  @full_response << content
  streamer.call(content)
end
#with_instrumented_llm_call(feature:, source: 'sunny') ⇒ Object (protected)
Wraps an LLM call with PaperTrail audit context, CurrentScope user, instrumentation
metadata, and transient network retries. Every LLM round (ask, complete, agent.ask)
should go through this so audit trail, cost logging, and retries are consistent.
Also used by Assistant::PlanOrchestrator (via Object#send).
# File 'app/services/assistant/chat_service.rb', line 414

def with_instrumented_llm_call(feature:, source: 'sunny')
  sender_id = @user_context['party_id'] || @conversation.user_id
  whodunnit = sender_id.to_s.presence || 'Sunny'

  PaperTrail.request(
    whodunnit: whodunnit,
    controller_info: {
      source: source,
      sender_id: sender_id,
      sender_name: @user_context['full_name'],
      conversation_id: @conversation.id,
      conversation_url: "/en-US/assistant/#{@conversation.id}"
    }
  ) do
    CurrentScope.with_user_id(sender_id) do
      RubyLLM::Instrumentation.with(
        feature: feature,
        conversation_id: @conversation.id,
        log_subject: @conversation,
        log_account_id: Account.where(party_id: sender_id).pick(:id)
      ) do
        with_llm_network_retries { yield }
      end
    end
  end
end
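with_llm_network_retries is referenced above but not defined in this excerpt. A minimal sketch of what such a wrapper typically looks like — the attempt count, backoff, and the stand-in exception classes are all assumptions, not the real implementation:

```ruby
# Stand-ins for the Faraday/OpenSSL classes in LLM_NETWORK_RETRY_EXCEPTIONS,
# so this sketch runs without those gems loaded.
RETRYABLE = [IOError, Errno::ECONNRESET].freeze

def with_llm_network_retries(max_attempts: 3)
  attempts = 0
  begin
    yield
  rescue *RETRYABLE
    attempts += 1
    raise if attempts >= max_attempts # exhausted: surface the transient failure

    sleep(attempts * 0.1) # brief linear backoff before retrying
    retry
  end
end
```

Only the listed transient network classes are retried; application errors (cancellation, context-length overflow) propagate immediately, which matches how call() rescues them separately.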