# Cap LLM spend per agent

> Set token budgets that the egress proxy enforces. The agent sees a 429 when over budget; the denial lands in OTLP for alerting.

You don't want one stuck agent loop to burn $400 of Anthropic credit
overnight. The fix is a `TokenBudgetFilter` on the `AIProviderRoute`
the agent's traffic flows through. The proxy refuses overage requests
before they reach the provider, and the denial is observable.

## What gets counted

`TokenBudgetFilter` lives on `AIProviderRoute.spec.rules[].filters[]`.
One enforced knob, plus two accepted-but-not-yet-wired:

- **`maxTokensPerDay`** - caps cumulative tokens across all
  executions hitting this route within a UTC calendar-day window.
  This is the only field enforced today (a counter per route, per
  EgressGateway, per UTC day). The right tool for "this agent should
  never cost more than X."
- **`maxTokensPerExecution`** - intended to cap total tokens (input +
  output) per agent execution. The field is accepted by the schema
  (manifests apply cleanly) but **not enforced by any code path
  today** - like `RateLimitPolicy`, treat it as coming soon, not a
  working cap.
- **`maxOutputTokensPerRequest`** - intended to cap output tokens on
  a single API call. Also accepted by the schema but **not enforced
  today**; do not rely on it to bound model output yet.

Counts come from the provider's own usage fields. The Anthropic
parser reads `usage.input_tokens` and `usage.output_tokens` from the
response body; OpenAI and Google do the equivalent. Streaming
responses are attributed when the stream ends.
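
To make the accounting concrete, here is a minimal sketch of reading those usage fields from an Anthropic-style response body. The `total_tokens` helper and the sample payload are illustrative, not CLRK code, and the sketch assumes input and output tokens both count toward the daily total.

```python
import json


def total_tokens(response_body: str) -> int:
    """Sum input and output tokens from an Anthropic-style response body.

    Assumption: both fields count toward the route's daily counter.
    """
    usage = json.loads(response_body).get("usage", {})
    return usage.get("input_tokens", 0) + usage.get("output_tokens", 0)


# Illustrative payload; only the usage fields matter here.
sample = json.dumps({
    "type": "message",
    "usage": {"input_tokens": 1200, "output_tokens": 350},
})

print(total_tokens(sample))  # 1550
```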

## Worked example

Add a daily-token cap to the Anthropic route from [Hide credentials
from agents](./hide-credentials-from-agents):

```yaml
apiVersion: clrk.apoxy.dev/v1alpha1
kind: AIProviderRoute
metadata:
  name: anthropic
spec:
  parentRefs:
    - group: clrk.apoxy.dev
      kind: EgressGateway
      name: echo-bot
  rules:
    - matches:
        - provider: anthropic
          endpoints:
            - /v1/messages
      filters:
        - type: TokenBudget
          tokenBudget:
            maxTokensPerDay: 100000
```

`maxTokensPerDay` is the only field that takes effect today, so the
example sets only that.

Apply, watch a few invocations burn tokens, then exceed the budget
intentionally to see what happens.

## What the agent sees over budget

The proxy returns HTTP 429 to the agent's outbound call with a
body that names the route that blocked it:

```
HTTP/1.1 429 Too Many Requests

clrk: token budget exceeded for route anthropic
```

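Your agent code has to handle 429. For curl scripts: curl exits 0 on
a 429 unless you pass `--fail`, so check the captured status code
(for example via `-w '%{http_code}'`) as well as the exit status. For
Python: catch the 429 response, or the exception your HTTP client
raises for it, and either back off, fail the run with a clear error,
or fall back to a cheaper model.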
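
Providers can return 429 for their own rate limits too, and a short backoff that helps there will not clear a daily budget before the UTC window resets. A minimal sketch of telling the two apart follows. `Reply`, `BudgetExceeded`, and `call_with_budget_check` are hypothetical names, and matching on the body text assumes the message shown above stays stable.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: the denial body keeps this prefix, as in the example above.
BUDGET_MARKER = "clrk: token budget exceeded"


@dataclass
class Reply:
    """Minimal stand-in for an HTTP response (hypothetical type)."""
    status: int
    body: str


class BudgetExceeded(RuntimeError):
    """Raised when the proxy denies a call because the route is over budget."""


def is_budget_denial(reply: Reply) -> bool:
    # A provider rate limit is also a 429, so check the body as well.
    return reply.status == 429 and BUDGET_MARKER in reply.body


def call_with_budget_check(send: Callable[[], Reply]) -> Reply:
    """Run one outbound call and fail loudly on a budget denial."""
    reply = send()
    if is_budget_denial(reply):
        raise BudgetExceeded(reply.body.strip())
    return reply
```

Wrap your real HTTP call in `send` and catch `BudgetExceeded` at the top of the run, where you can decide between failing the run and switching to a cheaper model.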

## What lands in OTLP

Every pre-flight denial emits an OTLP span carrying
`clrk.budget.denied=true` and two companion attributes:

- `clrk.budget.daily_used` - current consumption inside the window.
- `clrk.budget.daily_max` - the cap that triggered the denial.

Wire your OTLP backend to alert on `clrk.budget.denied=true`. That's
your "an agent just hit its ceiling" signal. See [Send telemetry to
OTLP endpoints](./send-telemetry-to-otlp) for collector config.
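
If you post-process spans yourself rather than alerting in the backend, the check reduces to a lookup on those three attributes. The sketch below assumes the attributes arrive as a plain mapping and that the denied flag is either a boolean or the string `"true"`. The helper name and the sample values are made up for illustration.

```python
from typing import Mapping, Optional


def denial_alert(attributes: Mapping[str, object]) -> Optional[str]:
    """Return an alert message for a budget-denied span, else None."""
    # Assumption: the flag is a boolean True or the string "true".
    if attributes.get("clrk.budget.denied") not in (True, "true"):
        return None
    used = attributes.get("clrk.budget.daily_used")
    cap = attributes.get("clrk.budget.daily_max")
    return f"token budget denied: {used} of {cap} tokens used today"


# Sample attribute values are illustrative only.
span_attributes = {
    "clrk.budget.denied": True,
    "clrk.budget.daily_used": 100412,
    "clrk.budget.daily_max": 100000,
}

print(denial_alert(span_attributes))
```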

## When budgets bite at the wrong moment

- **Streaming completions attribute at stream end.** A streaming
  response that pushes the route's daily total over
  `maxTokensPerDay` still completes (you can't refuse mid-stream);
  the next request on this route is denied once the daily counter is
  over cap. Size budgets with headroom - usage is accounted after the
  call finishes, so the cap can be crossed by the in-flight request.
- **Pre-flight checks the daily counter, not the request.**
  Pre-flight runs at request-headers time, before the body is
  buffered, and compares only the route's running daily total against
  `maxTokensPerDay`. It does not read the request's declared
  `max_tokens`, so an oversized single request is not rejected
  up front. Denials start only once the daily total is already over
  cap.
- **The window is a UTC calendar day.** "Daily" means a calendar
  day in UTC, not a rolling 24-hour window. A burn-down at 23:59
  UTC followed by a burst at 00:01 UTC will succeed, because the
  counter starts fresh at midnight UTC. `maxTokensPerDay` is the only
  enforced field today, so if you need tighter bounding, lower the
  cap itself.
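
The three behaviors above can be seen in a toy model of the counter. This is a sketch of the semantics as described here, not the proxy's implementation. The `DailyBudget` class is hypothetical, and the exact comparison at the boundary (deny once usage has reached the cap) is an assumption.

```python
from datetime import datetime, timezone


class DailyBudget:
    """Toy model of a per-route token counter keyed by UTC calendar day."""

    def __init__(self, max_tokens_per_day: int) -> None:
        self.max_tokens_per_day = max_tokens_per_day
        self.used_by_day: dict = {}

    def _day(self, now: datetime):
        # The window is the UTC calendar date, not a rolling 24 hours.
        return now.astimezone(timezone.utc).date()

    def preflight_allows(self, now: datetime) -> bool:
        # Only the running total is checked, never the request's size.
        # Assumption: deny once usage has reached the cap.
        return self.used_by_day.get(self._day(now), 0) < self.max_tokens_per_day

    def record_usage(self, now: datetime, tokens: int) -> None:
        # Usage is added after the response (or stream) finishes.
        day = self._day(now)
        self.used_by_day[day] = self.used_by_day.get(day, 0) + tokens


budget = DailyBudget(max_tokens_per_day=100_000)
before_midnight = datetime(2025, 1, 1, 23, 59, tzinfo=timezone.utc)
after_midnight = datetime(2025, 1, 2, 0, 1, tzinfo=timezone.utc)

first_call_allowed = budget.preflight_allows(before_midnight)   # nothing used yet
budget.record_usage(before_midnight, 150_000)                   # one call crosses the cap
second_call_allowed = budget.preflight_allows(before_midnight)  # now over cap
next_day_allowed = budget.preflight_allows(after_midnight)      # new UTC day

print(first_call_allowed, second_call_allowed, next_day_allowed)  # True False True
```

The single 150,000-token call goes through even though the cap is 100,000, which is why the cap needs headroom for the largest response you expect.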

## Designing budgets that hold

- **Per-agent budgets**: one `AIProviderRoute` per agent (or per
  group of agents) so the per-day cap is scoped to the right unit.
  Sharing one route across many agents means they share the budget
  too.
- **A tight `maxTokensPerDay` is your runaway-loop guard.** It's the
  only enforced cap today, so an agent stuck in a recursive tool-call
  is bounded by the daily total - not by any per-execution limit.
  Set `maxTokensPerDay` low enough that a single runaway run can't
  exhaust a meaningful budget. (Revisit this once
  `maxTokensPerExecution` is wired.)
- **Pair with shorter `spec.timeout`.** Budget caps tokens; timeouts
  cap wall-clock. Both matter: a slow agent burning tokens steadily
  needs both to stop.
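
One way to pick a number for `maxTokensPerDay` is to work backward from the spend you can tolerate per day. The sketch below is plain arithmetic under stated assumptions: the price per million tokens is a placeholder you must replace with your provider's actual rate, and the headroom term reflects that one in-flight request can cross the cap before the counter catches up.

```python
def max_tokens_for_spend(
    spend_usd: float,
    usd_per_million_tokens: float,
    largest_request_tokens: int = 0,
) -> int:
    """Translate a daily dollar ceiling into a token cap.

    Subtracts headroom for one in-flight request, since usage is
    accounted only after the call finishes.
    """
    tokens = int(spend_usd * 1_000_000 // usd_per_million_tokens)
    return max(tokens - largest_request_tokens, 0)


# Placeholder numbers: a $5/day ceiling, a hypothetical blended price of
# $15 per million tokens, and 8,000 tokens of headroom.
cap = max_tokens_for_spend(5, 15, largest_request_tokens=8_000)
print(cap)  # 325333
```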

## A note on `RateLimitPolicy`

CLRK ships a `RateLimitPolicy` CRD for request-rate (rather than
token-volume) limits - coming soon, not enforced today. If you need
QPS-style limits on the egress side this quarter, talk to us. The
budget filter described above is what's wired and works.

## Where to next

- Send budget-denied alerts to your pager - see [Send telemetry to
  OTLP endpoints](./send-telemetry-to-otlp).
- Make sure the budget is the only knob between an agent and your
  bill - pair with [Lock down agent
  egress](./lock-down-agent-egress) so the agent can't simply call a
  different provider.
- Understand which inbound request triggered an over-budget call -
  see [Trace requests through agents](./trace-requests-through-agents).
