
Cost Tracking & Optimization

KodaCode does two separate things here:

  • it makes cost visible
  • it gives you a few explicit ways to reduce cost without hiding runtime behavior

This page focuses on those user-facing optimization levers and on how to inspect what they actually saved.

These are the highest-leverage user-facing cost controls:

| Control | What it changes | Best when |
|---|---|---|
| default sessions.response_style: terse | Shorter ordinary assistant replies | You want less narration and lower output token cost |
| /compress | Shrinks durable workspace prompt sources | Your AGENTS.md or project memory has become repetitive |
| utility_model | Routes runtime utility text work to a cheaper model | You want to keep the main coding model but reduce background cost |
| output_budgets | Caps the output tokens requested by role | Provider/model ceilings are much larger than your normal turn needs |
| compaction_threshold and /compact | Summarizes older stored history so long sessions do not replay full history forever | You work in long-lived sessions |
| workflow.review_model | Sends review passes to a cheaper model | You use engineer with review flow and do not need the full primary model for review |
| budget and total_budget | Warns early and stops future turns | You want hard spending guardrails |

One reason the product defaults to terse mode is that it is usually the cheapest and simplest output-side cost cut.

KodaCode also cuts cost automatically. These optimizations try to remove repetition, schema clutter, or old raw output, not useful task context. /cost and /trace show the labels when savings apply.

| Label | What saves cost | Quality impact |
|---|---|---|
| prompt compaction | Sends a shorter rule-based copy of structured instructions | Best when instructions use headings and bullets. Long prose with subtle nuance may compress less safely. |
| history compaction | Replays a durable summary of older turns instead of full raw history | Keeps long sessions usable. Exact old details may need to be recovered from files or /trace. |
| current-turn projection | Replaces earlier large tool results in the same turn with placeholders after later work has used them | Reduces noise. If exact old output matters again, the model may need to re-read or rerun a focused tool call. |
| tool catalog compression | Shortens provider-facing tool schemas and descriptions while preserving required fields | Usually improves focus. Tools that need rich guidance keep it. |
| batch efficiency | Handles compatible independent tool calls in one tool step | Does not remove information. Mutating and blocking tools still create runtime boundaries. |

These savings are estimates, not provider invoices. KodaCode converts avoided input tokens to estimated dollars only when model pricing is known.
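
To make the "only when pricing is known" rule concrete, here is a minimal sketch in Python. It is not KodaCode's implementation: the function name and the per-million-token price unit are assumptions chosen for illustration.

```python
from typing import Optional


def estimated_savings_usd(
    avoided_input_tokens: int,
    input_price_per_mtok: Optional[float],  # assumed unit: USD per million input tokens
) -> Optional[float]:
    """Return estimated dollars saved, or None when the model has no known price."""
    if input_price_per_mtok is None:
        # Avoided tokens can still be reported; only the dollar figure is missing.
        return None
    return avoided_input_tokens / 1_000_000 * input_price_per_mtok


if __name__ == "__main__":
    print(estimated_savings_usd(40_000, 3.00))  # priced model: a dollar estimate
    print(estimated_savings_usd(40_000, None))  # unpriced model: None
```

Returning None instead of 0 keeps "no pricing" distinct from "saved nothing", which matches the way the page separates token accounting from dollar estimates.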

For best results, keep durable instructions concise and structured:

## Engineering Priorities
- Correctness first
- Clear ownership boundaries
- Run focused tests before finishing

Avoid burying important policy in long repeated prose. It is harder to compact and harder for humans to maintain.

response_style: terse reduces ordinary model reply length, which directly cuts output token cost for interactive sessions. Terse is already the default, so the only change you would make here is to opt out:

sessions:
  response_style: default # set this only if you want fuller ordinary narration

This is the fastest user-facing cost cut because it does not change tools, routing, or the TUI. It just tells the model to keep ordinary prose brief.

Terse mode does not shorten:

  • safety warnings
  • permission explanations
  • destructive-action confirmations
  • ambiguity clarifications that still need to be explicit

For the full prompt-level behavior, see Context Management.

utility_model routes background tasks to a cheaper model while keeping your primary route for real coding turns:

utility_model: openai/gpt-5-mini
utility_model_timeout_seconds: 20 # default: 0 (no timeout)
utility_model_retry_attempts: 1 # default: 1
utility_model_retry_after_max_seconds: 5 # default: 5

Tasks that use the utility model:

  • session title generation
  • workspace prompt-source compression via /compress
  • history summaries for long sessions

These are tracked separately in the cost dialog so you can see what utility work costs versus agent turns.

output_budgets separates normal requested output from provider hard ceilings. This matters for models that can emit very large responses: the provider ceiling is still available when needed, but ordinary turns do not ask for the maximum by default.

output_budgets:
  agent_turn: 8192
  agent_turn_thinking: 16000
  review: 4096
  utility_text: 2048

For one model, use a model override:

model_overrides:
  - ref: anthropic/claude-sonnet-4-6
    max_output_tokens: 64000
    default_output_tokens: 12000

max_output_tokens is the ceiling. default_output_tokens is the ordinary agent-turn request, and KodaCode clamps it to the ceiling.
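
The clamping rule can be sketched in a few lines of Python. The function name is hypothetical and this is an illustration of the described behavior, not the runtime's actual code.

```python
def requested_output_tokens(max_output_tokens: int, default_output_tokens: int) -> int:
    """Ordinary agent-turn request: the default, never above the model ceiling."""
    return min(default_output_tokens, max_output_tokens)


if __name__ == "__main__":
    # Values from the override above: the default is below the ceiling, so it is used as-is.
    print(requested_output_tokens(max_output_tokens=64000, default_output_tokens=12000))
    # A default larger than the ceiling is clamped down to the ceiling.
    print(requested_output_tokens(max_output_tokens=8000, default_output_tokens=12000))
```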

If you use engineer workflow review, you can configure a cheaper review model without changing the main agent model:

workflow:
  review_mode: manual
  review_model:
    primary: openai/gpt-5-mini

This reduces the cost of review passes while keeping the main model for implementation turns.

/compress rewrites workspace AGENTS.md and project memory entries through the same utility-model path used for other runtime text tasks. Unlike prompt compaction, this changes the durable source files on disk, so future turns start from smaller inputs instead of only receiving a provider-compacted rendering.

Use it when:

  • repo instructions have accreted duplicated sections
  • project memory entries are factual but wordy
  • you want lower steady-state prompt cost for future sessions in this repo

Budgets are not just reporting. They warn early and stop future turns once the limit is reached.

Use them when you want a hard product-level boundary on spending, not just lower average cost.
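
As a rough mental model of "warn early, then stop", consider the Python sketch below. The 80% warning fraction, the function name, and the state names are all assumptions made for illustration; the actual thresholds are described on the Budgets page.

```python
from enum import Enum


class BudgetState(Enum):
    OK = "ok"
    WARN = "warn"
    STOP = "stop"


def check_budget(spent_usd: float, limit_usd: float, warn_fraction: float = 0.8) -> BudgetState:
    """Classify estimated spend against a limit.

    warn_fraction is a hypothetical early-warning threshold, not a documented default.
    """
    if spent_usd >= limit_usd:
        return BudgetState.STOP  # limit reached: future turns are refused
    if spent_usd >= limit_usd * warn_fraction:
        return BudgetState.WARN  # approaching the limit: warn, but keep working
    return BudgetState.OK


if __name__ == "__main__":
    for spent in (1.0, 9.0, 10.0):
        print(spent, check_budget(spent, limit_usd=10.0).value)
```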

See Budgets for the full behavior.

For providers that report cache activity, the cost dialog shows:

  • cache read tokens and cache write tokens broken out from regular input tokens
  • whether cache pricing was applied or was unavailable for each turn
  • estimated cache discount savings where cache pricing is known

For OpenAI-compatible providers, cached tokens are reported as a subset of input tokens and are normalized accordingly. DeepSeek prompt-cache hit tokens are reported as cache reads. Anthropic cache creation tokens are reported as cache writes. Gemini cached-content tokens are reported as cache reads.
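
The difference between the two reporting styles can be illustrated with a short Python sketch. The field names follow the public usage objects of the OpenAI and Anthropic APIs as commonly documented; the NormalizedUsage type and both helper functions are hypothetical and are not KodaCode internals.

```python
from dataclasses import dataclass


@dataclass
class NormalizedUsage:
    uncached_input: int  # input tokens billed at the regular rate
    cache_read: int      # tokens served from the provider cache
    cache_write: int     # tokens written into the provider cache


def from_openai_style(usage: dict) -> NormalizedUsage:
    """Cached tokens are a subset of prompt_tokens, so subtract them out."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return NormalizedUsage(usage["prompt_tokens"] - cached, cached, 0)


def from_anthropic_style(usage: dict) -> NormalizedUsage:
    """Cache reads and writes are reported alongside input_tokens, not inside it."""
    return NormalizedUsage(
        usage["input_tokens"],
        usage.get("cache_read_input_tokens", 0),
        usage.get("cache_creation_input_tokens", 0),
    )


if __name__ == "__main__":
    print(from_openai_style({"prompt_tokens": 1000,
                             "prompt_tokens_details": {"cached_tokens": 400}}))
    print(from_anthropic_style({"input_tokens": 600,
                                "cache_read_input_tokens": 400,
                                "cache_creation_input_tokens": 50}))
```

Normalizing both shapes into the same three counters is what allows a cost view to break cache reads and writes out from regular input tokens regardless of provider.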

For OpenAI Responses reasoning models, KodaCode defaults to stateless requests with responses_store: false and encrypted reasoning replay enabled. This adds reasoning.encrypted_content to eligible Responses requests, then stores and replays the returned encrypted reasoning item locally. That lets later tool-loop requests continue from the provider’s encrypted reasoning state instead of making the model reconstruct it from visible transcript alone.

providers:
  - id: openai
    responses_store: false
    encrypted_reasoning_replay: true

Set encrypted_reasoning_replay: false to prevent KodaCode from requesting, persisting, or replaying encrypted OpenAI reasoning items.

Inside the TUI:

  • /cost opens the session cost dialog
  • /trace [turn-number] opens a per-turn detail view

Use /cost when you want the session-level answer:

[Screenshot: example Session Cost dialog showing token totals, cache activity, deterministic context, and savings mix.]

| Label | Meaning |
|---|---|
| Estimated session total | Current estimated spend for the session, or a priced subtotal when some model pricing is missing |
| Provider tokens | Provider-reported input, output, cache, and reasoning token counts when available |
| Provider activity | Number of assistant roundtrips and provider calls |
| Batch efficiency | Tool calls grouped into batches, plus the estimated provider calls avoided |
| Reported cache activity | Cache read/write tokens reported by the provider and whether cache pricing was applied |
| Deterministic context | Input tokens added by enabled context_packet sections, or omitted under input pressure |
| Estimated cumulative input savings | Avoided input tokens and estimated dollar savings from prompt, history, current-turn, and tool-catalog optimizations |
| Savings mix | Breakdown of which automatic optimizations produced those avoided tokens |
| Highest priced turn | The turn that contributed the most estimated spend |

This makes it possible to answer both:

  • “What did this session cost?”
  • “Which optimization actually saved money?”

Use /trace [turn-number] when you want the turn-level explanation:

[Screenshot: example Turn Trace dialog showing request mix, savings mix, likely spend drivers, and provider calls.]

| Label | Meaning |
|---|---|
| Estimated request mix | How much of the request was prompt, conversation replay, and tool surface |
| Dominant request driver | The largest contributor to request size, such as conversation replay or tool surface |
| Savings mix | The same avoided-token breakdown, scoped to that turn |
| Likely spend drivers | Plain-language reasons the turn was expensive or hard to price |
| Provider Calls | Per-call model, duration, token counts, selected route, tool execution outcome, and request mix |
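
To read the request mix and dominant driver, it can help to see how such a breakdown might be derived from token estimates. The Python sketch below is an assumption about the arithmetic, with hypothetical names and inputs; the trace dialog may compute these values differently.

```python
from typing import Dict, Optional, Tuple


def request_mix(prompt: int, replay: int, tool_surface: int) -> Tuple[Dict[str, float], Optional[str]]:
    """Return each component's share of the request and the largest contributor."""
    parts = {
        "prompt": prompt,
        "conversation replay": replay,
        "tool surface": tool_surface,
    }
    total = sum(parts.values())
    if total == 0:
        return {}, None  # nothing was sent, so there is no mix to report
    shares = {name: count / total for name, count in parts.items()}
    dominant = max(shares, key=shares.get)
    return shares, dominant


if __name__ == "__main__":
    shares, dominant = request_mix(prompt=2000, replay=6000, tool_surface=2000)
    print(shares)
    print("dominant request driver:", dominant)
```

A turn dominated by conversation replay points toward history compaction, while one dominated by tool surface points toward tool catalog compression.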

If pricing is unavailable, token accounting still works. Dollar estimates are incomplete until the model has pricing metadata.

For the cost estimates to be useful, the runtime needs to know the model’s input and output price. Built-in providers populate this from the remote model catalog. For local or custom providers, use model_overrides:

model_overrides:
  - ref: local/qwen2.5-coder-32b-instruct
    name: Local Qwen Coder
    context_size: 32768
    tool_calls: true
    cost_input: 0
    cost_output: 0

Setting both costs to 0 means local turns appear in the cost dialog with zero estimated cost rather than missing pricing.
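
The difference between a zero price and a missing price shows up when turns are summed. The Python sketch below is illustrative only: the Turn type, the session_total function, and the per-million-token price unit are assumptions, not the runtime's data model.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Turn(NamedTuple):
    input_tokens: int
    output_tokens: int
    cost_input: Optional[float]   # assumed unit: USD per million tokens; None = no pricing metadata
    cost_output: Optional[float]


def session_total(turns: Iterable[Turn]) -> Tuple[float, bool]:
    """Return (estimated USD, complete). complete is False if any turn lacked pricing."""
    total = 0.0
    complete = True
    for turn in turns:
        if turn.cost_input is None or turn.cost_output is None:
            complete = False  # tokens are still counted elsewhere; dollars become a subtotal
            continue
        total += (turn.input_tokens * turn.cost_input
                  + turn.output_tokens * turn.cost_output) / 1_000_000
    return total, complete


if __name__ == "__main__":
    local = Turn(12_000, 800, 0, 0)           # local model priced at zero
    unpriced = Turn(12_000, 800, None, None)  # custom model with no pricing metadata
    print(session_total([local]))             # zero cost, estimate is complete
    print(session_total([local, unpriced]))   # zero cost, but only a priced subtotal
```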