Class: Assistant::ChatService

Inherits: Object
Includes: PromptComposer
Defined in: app/services/assistant/chat_service.rb

Overview

Service for AI-powered assistant chat using RubyLLM's acts_as_chat.
Uses tool-based architecture: the LLM calls registered tools (DB, content search, etc.)
rather than generating raw SQL. Conversation history is managed by RubyLLM automatically.

Defined Under Namespace

Classes: Result

Constant Summary

THINKING_BUDGET_LOW =

Extended Thinking configuration — gives reasoning models a scratchpad
for multi-step problems (SQL construction, analytical reasoning).
Budget is in tokens; Anthropic models require it, while Gemini uses it as a cap.

4_000
THINKING_BUDGET_MEDIUM =

Simple tool queries

8_000
THINKING_BUDGET_HIGH =

Analytical queries with JOINs/aggregation

16_000
THINKING_QUERY_PATTERNS =

Patterns that indicate the query would benefit from extended thinking

/\b(compare|analyze|trend|correlat|calculate|forecast|predict|why|root.?cause|deep.?dive|break.?down|step.?by.?step|optimize|investigate|audit|reconcil|year.?over.?year|month.?over.?month)\b/i
LLM_NETWORK_RETRY_EXCEPTIONS =

Transient provider / TLS failures (AppSignal #4527: Faraday::SSLError SSL_read EOF).

[
  Faraday::SSLError,
  Faraday::ConnectionFailed,
  Faraday::TimeoutError,
  OpenSSL::SSL::SSLError
].freeze
MODELS =

Available models with their configurations.
Model IDs come from AiModelConstants — the single source of truth.

  • supports_thinking: whether the model supports RubyLLM's with_thinking (extended reasoning)
  • thinking_effort_default: the default effort level when thinking is activated (:low, :medium, :high)

{
  'claude-haiku'  => { id: AiModelConstants.id(:anthropic_haiku),  provider: :anthropic, label: 'Claude Haiku 4.5 (Fast)',      cost: :low,    supports_thinking: false },
  'claude-sonnet' => { id: AiModelConstants.id(:anthropic_sonnet), provider: :anthropic, label: 'Claude Sonnet 4.6 (Balanced)', cost: :medium, supports_thinking: true, thinking_effort_default: :medium },
  'claude-opus'   => { id: AiModelConstants.id(:anthropic_opus),   provider: :anthropic, label: 'Claude Opus 4.6 (Best)',       cost: :high,   supports_thinking: true, thinking_effort_default: :high },
  'gpt-5'         => { id: AiModelConstants.id(:openai_gpt5),      provider: :openai,    label: 'GPT-5 (OpenAI)',               cost: :medium, supports_thinking: false },
  'gpt-5.4'       => { id: AiModelConstants.id(:openai_gpt54),     provider: :openai,    label: 'GPT-5.4 (OpenAI Latest)',      cost: :medium, supports_thinking: false },
  'gpt-5-mini'    => { id: AiModelConstants.id(:openai_gpt5_mini), provider: :openai,    label: 'GPT-5 Mini (Fast)',            cost: :low,    supports_thinking: false },
  'gemini-flash'  => { id: AiModelConstants.id(:gemini_flash),     provider: :gemini,    label: 'Gemini 3 Flash (Google)',       cost: :low,    supports_thinking: true,  thinking_effort_default: :low },
  'gemini-pro'    => { id: AiModelConstants.id(:gemini_pro),       provider: :gemini,    label: 'Gemini 3.1 Pro (Google)',       cost: :medium, supports_thinking: true, thinking_effort_default: :medium }
}.freeze
DEFAULT_MODEL =
'gemini-flash'
MAX_PLAN_COST_USD =

Hard cap on estimated plan execution cost (USD) across isolated step + assembly LLM calls.
NOTE: plan_cost underestimates because run_plan_step_executor returns only the FINAL
API round's tokens (not the cumulative total across tool-call rounds within a step).
Real per-step cost is typically 5-10× higher than reported. The primary cost guard is
the ToolLoopGuard's per-step call limit, not this cap.

2.00
MAX_PLAN_STEP_DURATION =

Wall-clock timeout per plan step — driven from ToolLoopGuard so both
the outer Timeout and the inner guard share a single source of truth.

Assistant::ToolLoopGuard::MAX_STEP_DURATION.seconds
STEP_RESULT_SUMMARIZE_THRESHOLD =

Above this size, step output is summarized with a cheap model before the next step.

2_000
MID_TURN_COMPACT_THRESHOLD =

Mid-turn compaction thresholds (see install_mid_turn_compaction!)

2_000
MID_TURN_KEEP_CHARS =
600
MID_TURN_SKIP_PREFIXES =
['[Compacted', '[Truncated', '[Already retrieved'].freeze
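The three mid-turn constants imply a simple rule: compact oversized tool output down to the keep limit, but leave small or already-marked payloads alone. A hypothetical standalone sketch — the real logic lives in install_mid_turn_compaction!, and the "[Compacted]" marker format below is an assumption:

```ruby
THRESHOLD = 2_000
KEEP_CHARS = 600
SKIP_PREFIXES = ['[Compacted', '[Truncated', '[Already retrieved'].freeze

def compact(text)
  return text if text.length <= THRESHOLD                        # small enough, keep as-is
  return text if SKIP_PREFIXES.any? { |p| text.start_with?(p) }  # already compacted upstream

  "[Compacted] #{text[0, KEEP_CHARS]}"
end
```

Idempotence matters here: without the prefix skip, a second compaction pass would re-wrap and re-truncate output that was already shrunk.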
COMPLEX_QUERY_PATTERNS =

Keywords indicating complex analytical or reasoning queries (need better models)

/\b(why|trend|pattern|anomaly|recommend|insight|correlation|predict|forecast|explain|root.?cause|deep.?dive|strategic|analyze|summarize|evaluate|pros?.and.cons|trade.?off)\b/i
COMPARISON_QUERY_PATTERNS =

Keywords indicating multi-step comparison or research queries (need balanced models)

/\b(compare|vs|versus|between|difference|change|growth|decline|year.?over.?year|month.?over.?month|yoy|mom|research|investigate|audit)\b/i
SIMPLE_QUERY_PATTERNS =

Keywords indicating simple lookup or factual queries (fast models are fine)

/\b(show|list|get|total|count|how many|what is|what are|sum|average|find|look up|search|where is|who is|when did)\b/i
COMPOSE_QUERY_PATTERNS =

Phrases that indicate the user is drafting/composing a short message (email,
follow-up, outreach, internal summary). These are quick content-generation
tasks where Flash is fast and good enough — Pro's extended thinking is
wasted budget here, and on long prompts (e.g. pasted email threads) we'd
otherwise route them to Pro and time out.

/\b(reply|respond|send|email|follow.?up|outreach|reach out|thank.?you note|summary email)\b/i
WRITING_QUERY_PATTERNS =

Phrases that indicate long-form editorial work (blog posts, articles, FAQs,
rewrites). Flash produces noticeably weaker prose here — see /assistant/1639,
where a Buffalo bathroom blog post written under Flash drew "wrote very poorly"
feedback from the editor. Writing tasks always escalate to Gemini 3.1 Pro
(or Claude Sonnet 4.6 when the conversation is already on the Claude family);
Opus is intentionally excluded as too expensive for routine editorial work.

/\b(rewrite|polish|copyedit|copy.?edit|long.?form|article|blog post|blog ?article|blog ?entry|essay|narrative|edit blog|write the blog|draft the blog|update the blog|update the article|expand this section|tighten this|story|landing page copy|product description|press release|case study|whitepaper|white ?paper|content brief|seo copy|meta description|page copy|h(?:ero|eading) copy|body copy)\b/i
WRITING_MODEL_DEFAULT =

Models we'll auto-route to for writing work. Keep tier ordering sensible
(medium cost; never auto-pick Opus, which is reserved for explicit choice).

'gemini-pro'
WRITING_MODEL_CLAUDE =
'claude-sonnet'
WRITING_ELIGIBLE_MODELS =
[WRITING_MODEL_DEFAULT, WRITING_MODEL_CLAUDE].freeze
MODEL_COST_TIER =

Cost tiers for model affinity decisions.
Switching models mid-conversation loses accumulated reasoning context,
so we only switch when escalating to a higher tier (never laterally).

MODELS.transform_values { |c| c[:cost] }.freeze

Constants included from PromptComposer

PromptComposer::AGENT_PROMPTS_DIR, PromptComposer::ANALYTICS_SERVICES, PromptComposer::DOMAIN_TOOL_REQUIREMENTS, PromptComposer::INSTRUCTIONS_TEMPLATE_PATH, PromptComposer::MESSAGE_DOMAIN_PATTERNS

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: []) ⇒ ChatService

Returns a new instance of ChatService.

Parameters:

  • conversation (AssistantConversation)

    The conversation record (acts_as_chat)

  • user_message (String)

    The user's query

  • model (String) (defaults to: 'auto')

    LLM model key or 'auto'

  • tool_services (Array<String>) (defaults to: [])

    Service keys for tool access

  • permitted_services (Array<String>) (defaults to: [])

    All service keys the user's role allows (for tool suggestion prompt)

  • user_context (Hash) (defaults to: {})

    User identity for personalized queries

  • on_status (Proc) (defaults to: nil)

    Callback for status events

  • cancel_check (Proc) (defaults to: nil)

    Returns true when the caller wants to abort (e.g. user clicked Stop)

  • attachments (Array<Pathname>) (defaults to: [])

    Optional file paths to attach to the message (PDFs, images, etc.)



# File 'app/services/assistant/chat_service.rb', line 208

def initialize(conversation:, user_message:, model: 'auto', tool_services: [], permitted_services: [], user_context: {}, on_status: nil, cancel_check: nil, attachments: [])
  @conversation = conversation
  @user_message = user_message
  @tool_services = Array(tool_services).select(&:present?)
  @permitted_services = Array(permitted_services).select(&:present?)
  @user_context = user_context || {}
  @on_status = on_status
  @cancel_check = cancel_check
  @attachments = Array(attachments).select { |p|
    if p.respond_to?(:exist?)
      p.exist?                      # Pathname — check local file
    elsif p.to_s.start_with?('http://', 'https://')
      true                          # URL — pass through to RubyLLM
    else
      File.exist?(p.to_s)           # String path — check local file
    end
  }
  @auto_selected = false
  @model_selection_reason = nil

  # Derive role from user context for tool access control.
  # user_context is a serialized Hash from the controller with 'is_admin' and 'is_manager' keys.
  @user_role = if @user_context['is_admin']
                 :admin
               elsif @user_context['is_manager']
                 :manager
               else
                 :employee
               end

  # Resolve data domain access from the user's CanCanCan roles.
  # This narrows which views/tables the AI tools can query.
  @account = Account.find_by(id: @user_context['account_id']) if @user_context['account_id']
  @allowed_objects = @account ? Assistant::DataPolicy.(@account) : nil
  @analytics_domains = Array(@user_context['analytics_domains'])

  history_length = @conversation.assistant_messages.count

  # Handle 'auto' model selection
  if model == 'auto' || !MODELS.key?(model)
    selection = self.class.auto_select_model(
      user_message,
      history_length: history_length,
      current_model: @conversation.llm_model_name
    )
    @model_key = selection[:model]
    @model_selection_reason = selection[:reason]
    @auto_selected = true
  else
    @model_key = model
  end

  @model_config = MODELS[@model_key]
end

Class Method Details

.auto_select_candidate(query, history_length: 0, current_model: nil) ⇒ Object



# File 'app/services/assistant/chat_service.rb', line 142

def self.auto_select_candidate(query, history_length: 0, current_model: nil)
  query_lower = query.downcase.strip
  token_count = estimate_tokens(query)

  is_writing    = query_lower.match?(WRITING_QUERY_PATTERNS)
  is_complex    = query_lower.match?(COMPLEX_QUERY_PATTERNS)
  is_comparison = query_lower.match?(COMPARISON_QUERY_PATTERNS)
  is_compose    = query_lower.match?(COMPOSE_QUERY_PATTERNS)
  is_simple     = query_lower.match?(SIMPLE_QUERY_PATTERNS) && !is_comparison && !is_complex && !is_writing
  long_conversation = history_length > 20

  # Long-form editorial work (blog posts, articles, rewrites) must always run
  # on a Pro/Sonnet-tier model — Flash produces noticeably weaker prose. We
  # force_switch so a conversation that started on Flash doesn't hold writing
  # turns hostage via model affinity. Stay on Sonnet only if the conversation
  # is already on a Claude model; otherwise default to Gemini 3.1 Pro.
  if is_writing
    chosen = current_model == WRITING_MODEL_CLAUDE ? WRITING_MODEL_CLAUDE : WRITING_MODEL_DEFAULT
    return { model: chosen, reason: 'Writing/editorial task', force_switch: true }
  end

  # Compose/email tasks are short content generation, not analysis. Keep them
  # on Flash even when the prompt is long (pasted email threads inflate token
  # counts but don't require deep reasoning) — Pro burns most of
  # MAX_TURN_DURATION on extended thinking before any tool runs.
  return { model: 'gemini-flash', reason: 'Compose/email task', force_switch: true } if is_compose && !is_complex

  if is_complex || token_count > 200
    { model: 'gemini-pro', reason: 'Complex analytical query' }
  elsif is_comparison || token_count > 80
    { model: 'gemini-flash', reason: 'Multi-step query' }
  elsif is_simple && !long_conversation
    { model: 'gemini-flash', reason: 'Simple query' }
  else
    { model: 'gemini-flash', reason: long_conversation ? 'Long conversation context' : 'Standard query' }
  end
end
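The tiered routing above can be condensed into a standalone sketch. The regexes here are abbreviated stand-ins for the real pattern constants, and the comparison tier, token-count thresholds, and force_switch branches are omitted for brevity:

```ruby
WRITING_RE = /\b(rewrite|blog post|article|press release)\b/i
COMPLEX_RE = /\b(why|trend|analyze|forecast|root.?cause)\b/i
SIMPLE_RE  = /\b(show|list|count|how many|total)\b/i

def route(query)
  q = query.downcase.strip
  return 'gemini-pro'   if q.match?(WRITING_RE) || q.match?(COMPLEX_RE)  # escalate
  return 'gemini-flash' if q.match?(SIMPLE_RE)                           # cheap lookup

  'gemini-flash' # standard fallback
end

route('why did churn trend up?')       # => "gemini-pro"
route('list open invoices for March')  # => "gemini-flash"
```

Note that order matters: writing/complex checks must run before the simple check, mirroring how is_simple in the real method is suppressed when a complex or writing pattern also matches.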

.auto_select_model(query, history_length: 0, current_model: nil) ⇒ Hash

Auto-select the best model based on query complexity.
Works for both analytics and general assistant queries.

Design goals:

  • Default to Gemini Flash for all queries — cheapest option with good quality.
  • Escalate to Gemini Pro only for genuinely complex analytical queries.
  • Claude models (Sonnet, Opus, Haiku) remain available via explicit user selection
    but are never auto-selected, keeping Anthropic costs near zero for auto users.
  • Model affinity: if the conversation already uses a model, prefer keeping it
    unless the new query demands a higher cost tier. Lateral switches lose
    accumulated reasoning context for no benefit.

Parameters:

  • query (String)

    The user's question

  • history_length (Integer) (defaults to: 0)

    Number of messages in conversation history (informational only)

  • current_model (String, nil) (defaults to: nil)

    Model key currently in use (for affinity)

Returns:

  • (Hash)

    { model: 'model-key', reason: 'explanation' }



# File 'app/services/assistant/chat_service.rb', line 117

def self.auto_select_model(query, history_length: 0, current_model: nil)
  candidate = auto_select_candidate(query, history_length: history_length, current_model: current_model)

  # Some intents (compose/email, writing) are strong enough signals that we
  # override model affinity. Compose pulls down to Flash so a long pasted
  # email thread doesn't burn the whole turn budget on Pro's extended
  # thinking (PR #618 / conv 1233). Writing pushes UP to Pro/Sonnet so we
  # never produce blog content on Flash (conv 1639 / Julia's feedback).
  return candidate.except(:force_switch) if candidate[:force_switch]

  if current_model.present? && MODELS.key?(current_model)
    candidate_tier = COST_TIER_RANK[MODEL_COST_TIER[candidate[:model]]] || 0
    current_tier   = COST_TIER_RANK[MODEL_COST_TIER[current_model]] || 0

    if candidate_tier <= current_tier
      return { model: current_model, reason: "#{candidate[:reason]} (keeping #{current_model} for context continuity)" }
    end
  end

  candidate
end
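The affinity comparison can be sketched in isolation. COST_TIER_RANK is defined elsewhere in the service, so the mapping below is an assumed stand-in with the obvious low < medium < high ordering, and MODEL_COST_TIER is trimmed to three entries:

```ruby
COST_TIER_RANK = { low: 0, medium: 1, high: 2 }.freeze
MODEL_COST_TIER = {
  'gemini-flash' => :low,
  'gemini-pro'   => :medium,
  'claude-opus'  => :high
}.freeze

# True when the candidate's tier is at or below the current model's tier —
# a lateral or downward move we suppress to keep accumulated reasoning context.
def keep_current_model?(candidate, current)
  COST_TIER_RANK.fetch(MODEL_COST_TIER[candidate], 0) <=
    COST_TIER_RANK.fetch(MODEL_COST_TIER[current], 0)
end

keep_current_model?('gemini-flash', 'gemini-pro')  # lateral/down: keep current
keep_current_model?('claude-opus', 'gemini-pro')   # escalation: switch allowed
```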

.available_models ⇒ Object

Class method to get available models for UI (includes Auto option first)



# File 'app/services/assistant/chat_service.rb', line 264

def self.available_models
  auto_option = [{ key: 'auto', label: 'Auto (Smart Select)', cost: :auto, model_id: nil }]
  model_options = MODELS.map do |key, config|
    { key: key, label: config[:label], cost: config[:cost], model_id: config[:id] }
  end
  auto_option + model_options
end

.estimate_tokens(text) ⇒ Integer

Rough token estimate (1 token ≈ 4 chars for English).
Used only for heuristic model-complexity selection, not billing.

Parameters:

  • text (String)

    Text to estimate tokens for

Returns:

  • (Integer)

    Estimated token count



# File 'app/services/assistant/chat_service.rb', line 184

def self.estimate_tokens(text)
  return 0 if text.blank?

  (text.length / 4.0).ceil
end
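A worked example of the heuristic, rewritten without Rails' blank? so it runs standalone:

```ruby
def estimate_tokens(text)
  return 0 if text.nil? || text.strip.empty?

  (text.length / 4.0).ceil
end

estimate_tokens('compare revenue trends')  # 22 chars => ceil(5.5) => 6
estimate_tokens('')                        # => 0
```

The ceiling keeps the estimate conservative: any non-blank text counts as at least one token, so threshold checks like token_count > 200 never undercount short inputs to zero.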

.label_for_model(model_key) ⇒ Object

Resolve a stored model preference / llm_model_name (e.g. 'gemini-pro') to the
human-readable label from MODELS (e.g. "Gemini 3.1 Pro (Google)"). Used by the
chat picker and history badges so users can see which model actually ran a
turn, not just the dropdown alias.
Returns the stored value verbatim when no MODELS entry matches.



# File 'app/services/assistant/chat_service.rb', line 277

def self.label_for_model(model_key)
  return 'Auto (Smart Select)' if model_key.to_s == 'auto'

  config = MODELS[model_key.to_s]
  return model_key.to_s if config.nil?

  config[:label]
end
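The lookup-with-fallback behavior, shown against a trimmed stand-in for MODELS:

```ruby
MODELS = { 'gemini-pro' => { label: 'Gemini 3.1 Pro (Google)' } }.freeze

def label_for_model(model_key)
  return 'Auto (Smart Select)' if model_key.to_s == 'auto'

  config = MODELS[model_key.to_s]
  config ? config[:label] : model_key.to_s
end

label_for_model('auto')         # => "Auto (Smart Select)"
label_for_model('gemini-pro')   # => "Gemini 3.1 Pro (Google)"
label_for_model('legacy-gpt4')  # stored value returned verbatim
```

The verbatim fallback means history badges for retired model keys degrade gracefully to the raw key instead of raising or showing a blank label.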

Instance Method Details

#call(&block) ⇒ Object

Execute the chat with streaming response.
Messages auto-persist to assistant_messages via acts_as_chat
with token tracking, tool calls, and thinking traces.

Yields content chunks as they're generated.
Returns a Result with content and usage stats.



# File 'app/services/assistant/chat_service.rb', line 310

def call(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation

  # Tell the conversation who the actual sender is so AssistantMessage
  # can stamp sender_id on the persisted user message.
  @conversation.current_sender_id = @user_context['party_id']

  # Stream the response — conversation.ask() auto-persists user + assistant messages.
  # The return value of ask() is the fully-assembled StreamAccumulator message with
  # correct input/output token counts (not the last streaming chunk, which has nil tokens).
  streamer_proc = build_streamer_proc

  final_message = with_instrumented_llm_call(feature: 'assistant_chat') do
    if @attachments.present?
      ask_with_attachments(user_message, @attachments, &streamer_proc)
    else
      @conversation.ask(user_message, &streamer_proc)
    end
  end

  halt_result = handle_halt(final_message, streamer_proc, label: 'call')
  return halt_result if halt_result

  build_result(final_message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (call) — conversation #{@conversation.id}")
  build_cancelled_result
rescue RubyLLM::ContextLengthExceededError => err
  Rails.logger.error("[Assistant::ChatService] Context length exceeded: #{err.message}")
  raise
end
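The streaming contract can be illustrated in isolation: the lambda below plays the role of the consumer wired up by build_streamer_proc, receiving content chunks as they arrive while the service accumulates the full text (the chunk source here is simulated; in the real flow, chunks come from the provider stream via conversation.ask):

```ruby
full_response = +''                              # mirrors @full_response
streamer = ->(chunk) { full_response << chunk }  # stand-in for the caller's block

['Revenue ', 'grew ', '12% in Q3.'].each { |chunk| streamer.call(chunk) }

full_response  # => "Revenue grew 12% in Q3."
```

This is why the docs stress using ask()'s return value for token counts: individual chunks carry content but not usage totals, so only the fully-assembled message is authoritative.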

#complete_only(&block) ⇒ Object

Retry path after emergency compaction: reconfigure the conversation and
call complete() directly. The user message is already persisted from the
prior attempt — to_llm replays it from DB. Skips ask() to avoid duplicates.



# File 'app/services/assistant/chat_service.rb', line 351

def complete_only(&block)
  raise ArgumentError, 'Block required for streaming' unless block_given?

  @streamer = block
  @start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  @full_response = +''

  configure_conversation
  @conversation.current_sender_id = @user_context['party_id']

  streamer_proc = build_streamer_proc

  final_message = with_instrumented_llm_call(feature: 'assistant_chat') do
    llm_chat = @conversation.to_llm
    llm_chat.complete(&streamer_proc)
  end

  halt_result = handle_halt(final_message, streamer_proc, label: 'complete_only')
  return halt_result if halt_result

  build_result(final_message)
rescue Assistant::Cancelled
  Rails.logger.info("[Assistant::ChatService] Cancelled by user (complete_only) — conversation #{@conversation.id}")
  build_cancelled_result
end

#emit_status(message) ⇒ Object (protected)

Emit a status update for the UI (non-content, just progress indicator).
Also used by Assistant::PlanOrchestrator (via Object#send).



# File 'app/services/assistant/chat_service.rb', line 710

def emit_status(message)
  @on_status&.call(message)
end

#stream_content(content) ⇒ Object (protected)

Stream content to client AND capture for conversation history.
Also used by Assistant::PlanOrchestrator (via Object#send).



# File 'app/services/assistant/chat_service.rb', line 641

def stream_content(content)
  @full_response << content
  streamer.call(content)
end

#with_instrumented_llm_call(feature:, source: 'sunny') ⇒ Object (protected)

Wraps an LLM call with PaperTrail audit context, CurrentScope user, instrumentation
metadata, and transient network retries. Every LLM round (ask, complete, agent.ask)
should go through this so audit trail, cost logging, and retries are consistent.
Also used by Assistant::PlanOrchestrator (via Object#send).



# File 'app/services/assistant/chat_service.rb', line 414

def with_instrumented_llm_call(feature:, source: 'sunny')
  sender_id = @user_context['party_id'] || @conversation.user_id
  whodunnit = sender_id.to_s.presence || 'Sunny'

  PaperTrail.request(
    whodunnit: whodunnit,
    controller_info: {
      source: source,
      sender_id: sender_id,
      sender_name: @user_context['full_name'],
      conversation_id: @conversation.id,
      conversation_url: "/en-US/assistant/#{@conversation.id}"
    }
  ) do
    CurrentScope.with_user_id(sender_id) do
      RubyLLM::Instrumentation.with(
        feature: feature,
        conversation_id: @conversation.id,
        log_subject: @conversation,
        log_account_id: Account.where(party_id: sender_id).pick(:id)
      ) do
        with_llm_network_retries { yield }
      end
    end
  end
end
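A hypothetical sketch of the transient-retry helper referenced above (with_llm_network_retries is defined elsewhere in the service); stdlib Errno errors stand in for the Faraday/OpenSSL exception list, and the attempt count is an assumption:

```ruby
RETRYABLE = [Errno::ECONNRESET, Errno::ETIMEDOUT].freeze

def with_network_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *RETRYABLE
    retry if attempts < max_attempts  # transient failure: try again
    raise                             # budget exhausted: surface the error
  end
end

calls = 0
result = with_network_retries do
  calls += 1
  raise Errno::ECONNRESET, 'SSL_read EOF' if calls < 3  # fail twice, then succeed
  :ok
end
# result == :ok after two transient failures and one success
```

Keeping the retry loop innermost (inside PaperTrail, CurrentScope, and Instrumentation) means each retry re-runs only the LLM call itself, while a single audit and instrumentation context spans the whole attempt sequence.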