
Cap LLM spend per agent

Set token budgets that the egress proxy enforces. The agent sees a 429 when over budget; the denial lands in OTLP for alerting.

You don't want one stuck agent loop to burn $400 of Anthropic credit overnight. The fix is a TokenBudgetFilter on the AIProviderRoute the agent's traffic flows through. The proxy refuses over-budget requests before they reach the provider, and each denial is recorded in OTLP.

What gets counted

TokenBudgetFilter lives on AIProviderRoute.spec.rules[].filters[]. It has one enforced field, plus two that the schema accepts but nothing enforces yet:

  • maxTokensPerDay - caps cumulative tokens across all executions hitting this route within a UTC calendar-day window. This is the only field enforced today (a counter per route, per EgressGateway, per UTC day). The right tool for "this agent should never cost more than X."
  • maxTokensPerExecution - intended to cap total tokens (input + output) per agent execution. The field is accepted by the schema (manifests apply cleanly) but not enforced by any code path today - like RateLimitPolicy, treat it as coming soon, not a working cap.
  • maxOutputTokensPerRequest - intended to cap output tokens on a single API call. Also accepted by the schema but not enforced today; do not rely on it to bound model output yet.
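
Because the two unenforced fields apply cleanly, a manifest can look protected when it is not. A minimal sketch of a pre-apply check, assuming the manifest has already been loaded into a Python dict (the function name and the finding format are hypothetical; the field names are the ones listed above):

```python
# Fields the guide describes as accepted by the schema but not enforced today.
# This set is an assumption tied to the guide's current wording; re-check it
# whenever enforcement changes.
ACCEPTED_NOT_ENFORCED = {"maxTokensPerExecution", "maxOutputTokensPerRequest"}


def unenforced_budget_fields(route: dict) -> list[str]:
    """Return paths of TokenBudget fields that are set but will not be enforced."""
    findings = []
    rules = route.get("spec", {}).get("rules", [])
    for i, rule in enumerate(rules):
        for j, flt in enumerate(rule.get("filters", [])):
            if flt.get("type") != "TokenBudget":
                continue
            for field in sorted(flt.get("tokenBudget", {})):
                if field in ACCEPTED_NOT_ENFORCED:
                    findings.append(f"rules[{i}].filters[{j}].tokenBudget.{field}")
    return findings


# Example manifest (already parsed) that sets one enforced and one unenforced field.
route = {
    "kind": "AIProviderRoute",
    "spec": {
        "rules": [
            {
                "filters": [
                    {
                        "type": "TokenBudget",
                        "tokenBudget": {
                            "maxTokensPerDay": 100000,
                            "maxTokensPerExecution": 20000,
                        },
                    }
                ]
            }
        ]
    },
}

findings = unenforced_budget_fields(route)
for path in findings:
    print(f"warning: {path} is accepted but not enforced")
```

A check like this could run in CI so that nobody mistakes a per-execution value for a working cap.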

Counts come from the provider's own usage fields. The Anthropic parser reads usage.input_tokens and usage.output_tokens from the response body; OpenAI and Google do the equivalent. Streaming responses are attributed when the stream ends.
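
To estimate what a call will add to the daily counter, you can read the same usage fields on the client side. The sketch below is not the proxy's parser; it assumes the counter advances by input plus output tokens, and the sample body is invented for illustration:

```python
import json


def tokens_from_anthropic_body(body: str) -> int:
    """Sum usage.input_tokens and usage.output_tokens from a response body.

    Assumption: the daily counter advances by input + output tokens.
    Missing fields are treated as zero.
    """
    usage = json.loads(body).get("usage", {})
    return int(usage.get("input_tokens", 0)) + int(usage.get("output_tokens", 0))


# Illustrative body; only the usage object matters here.
sample = '{"id": "msg_example", "usage": {"input_tokens": 1200, "output_tokens": 350}}'
total = tokens_from_anthropic_body(sample)
print(total)
```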

Worked example

Add a daily-token cap to the Anthropic route from Hide credentials from agents:

    apiVersion: clrk.apoxy.dev/v1alpha1
    kind: AIProviderRoute
    metadata:
      name: anthropic
    spec:
      parentRefs:
        - group: clrk.apoxy.dev
          kind: EgressGateway
          name: echo-bot
      rules:
        - matches:
            - provider: anthropic
              endpoints:
                - /v1/messages
          filters:
            - type: TokenBudget
              tokenBudget:
                maxTokensPerDay: 100000

maxTokensPerDay is the only field that takes effect today, so the example sets only that.

Apply, watch a few invocations burn tokens, then exceed the budget intentionally to see what happens.

What the agent sees over budget

The proxy returns HTTP 429 to the agent's outbound call with a body that names the route that blocked it:

    HTTP/1.1 429 Too Many Requests

    clrk: token budget exceeded for route anthropic

Your agent code has to handle 429. For curl scripts: curl exits 0 on a 429 unless you pass --fail, so either capture the status code (for example with -w '%{http_code}') or use --fail and check the exit status. For Python: catch the 429 and either back off, fail the run with a clear error, or fall back to a cheaper model.
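
A minimal Python sketch of that handling, kept independent of any HTTP library: `send` is a hypothetical callable that performs the request and returns the status code and body text. Telling a budget denial apart from a provider rate limit by matching the body text is an assumption based on the sample response above.

```python
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when the egress proxy denies a call because the route is over budget."""


def is_budget_denial(status: int, body: str) -> bool:
    # Assumption: budget denials carry the message shown in the sample body.
    return status == 429 and "token budget exceeded" in body


def call_once(send: Callable[[], Tuple[int, str]]) -> str:
    """Run one request and turn a budget denial into a clear, typed error."""
    status, body = send()
    if is_budget_denial(status, body):
        # Fail the run with a message that names the blocking route.
        raise BudgetExceeded(body.strip())
    if status == 429:
        # Some other 429, for example the provider's own rate limit.
        raise RuntimeError("rate limited; retry with backoff")
    if status >= 400:
        raise RuntimeError(f"request failed with HTTP {status}")
    return body


# Stand-in for a real HTTP call, returning the denial from the example above.
def denied_call() -> Tuple[int, str]:
    return 429, "clrk: token budget exceeded for route anthropic"


try:
    call_once(denied_call)
except BudgetExceeded as err:
    print(f"stopping run: {err}")
```

Since the counter is per UTC day, a short backoff is unlikely to clear a budget denial; failing fast or switching to a fallback is usually the more useful branch for this error.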

What lands in OTLP

Every pre-flight denial emits an OTLP span carrying clrk.budget.denied=true and two companion attributes:

  • clrk.budget.daily_used - current consumption inside the window.
  • clrk.budget.daily_max - the cap that triggered the denial.

Wire your OTLP backend to alert on clrk.budget.denied=true. That's your "an agent just hit its ceiling" signal. See Send telemetry to OTLP endpoints for collector config.
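
Most backends express this alert in their own query language. As a backend-neutral illustration, the sketch below filters already-exported spans represented as plain dicts; that dict shape, the span names, and the possibility of the flag arriving as the string "true" are all assumptions, while the attribute keys are the ones listed above.

```python
def budget_denials(spans: list[dict]) -> list[dict]:
    """Pick out budget-denial spans and summarize how far over the cap they were."""
    denials = []
    for span in spans:
        attrs = span.get("attributes", {})
        # Accept a boolean or its string form, since exporters differ (assumption).
        if attrs.get("clrk.budget.denied") in (True, "true"):
            denials.append(
                {
                    "span": span.get("name"),
                    "used": attrs.get("clrk.budget.daily_used"),
                    "max": attrs.get("clrk.budget.daily_max"),
                }
            )
    return denials


# Hypothetical exported spans: one normal call, one denial.
spans = [
    {"name": "egress.request", "attributes": {"http.status_code": 200}},
    {
        "name": "egress.request",
        "attributes": {
            "clrk.budget.denied": True,
            "clrk.budget.daily_used": 100450,
            "clrk.budget.daily_max": 100000,
        },
    },
]

alerts = budget_denials(spans)
for alert in alerts:
    print(f"budget ceiling hit: {alert['used']} of {alert['max']} tokens used")
```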

When budgets bite at the wrong moment

  • Streaming completions attribute at stream end. A streaming response that pushes the route's daily total over maxTokensPerDay still completes (you can't refuse mid-stream); the next request on this route is denied once the daily counter is over cap. Size budgets with headroom - usage is accounted after the call finishes, so the cap can be crossed by the in-flight request.
  • Pre-flight checks the daily counter, not the request. Pre-flight runs at request-headers time, before the body is buffered, and compares only the route's running daily total against maxTokensPerDay. It does not read the request's declared max_tokens, so an oversized single request is not rejected up front - a denial happens only once the daily total is already over cap.
  • Window math is UTC-rolled-daily. "Daily" means a calendar window in UTC, not a 24-hour rolling window. A route that exhausts its budget at 23:59 UTC can burst again at 00:01, because the counter resets at 00:00 UTC; around that boundary, roughly twice the cap can be spent within a few minutes. Use a smaller window if you need tighter bounding.
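
The three behaviours above can be seen in a toy model. This is not the proxy's implementation, only a sketch under stated assumptions: the counter is keyed by UTC calendar date, pre-flight looks only at the running total, usage is recorded after the call, and a request is denied once the total has reached the cap (the exact boundary comparison is a guess).

```python
from datetime import datetime, timezone


class DailyBudget:
    """Toy per-route counter keyed by UTC calendar day."""

    def __init__(self, max_tokens_per_day: int):
        self.max_tokens_per_day = max_tokens_per_day
        self.used_by_day = {}  # UTC date -> tokens recorded so far

    @staticmethod
    def _day(now: datetime):
        return now.astimezone(timezone.utc).date()

    def preflight(self, now: datetime) -> bool:
        """Return True to allow. Only the running total is checked, never the request."""
        return self.used_by_day.get(self._day(now), 0) < self.max_tokens_per_day

    def record(self, now: datetime, tokens: int) -> None:
        """Account usage after the response (or stream) has finished."""
        day = self._day(now)
        self.used_by_day[day] = self.used_by_day.get(day, 0) + tokens


budget = DailyBudget(max_tokens_per_day=100_000)
late = datetime(2025, 1, 1, 23, 58, tzinfo=timezone.utc)
later = datetime(2025, 1, 1, 23, 59, tzinfo=timezone.utc)
next_day = datetime(2025, 1, 2, 0, 1, tzinfo=timezone.utc)

allowed_first = budget.preflight(late)      # counter is at 0, so the call goes through
budget.record(late, 120_000)                # one large call overshoots the cap
denied_after = budget.preflight(later)      # now over cap: denied
allowed_next_day = budget.preflight(next_day)  # new UTC day, fresh counter

print(allowed_first, denied_after, allowed_next_day)
```

The overshoot in the middle step is why the cap should be sized with headroom: the request that crosses the line is never the one that gets refused.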

Designing budgets that hold

  • Per-agent budgets: one AIProviderRoute per agent (or per group of agents) so the per-day cap is scoped to the right unit. Sharing one route across many agents means they share the budget too.
  • A tight maxTokensPerDay is your runaway-loop guard. It's the only enforced cap today, so an agent stuck in a recursive tool-call is bounded by the daily total - not by any per-execution limit. Set maxTokensPerDay low enough that a single runaway run can't exhaust a meaningful budget. (Revisit this once maxTokensPerExecution is wired.)
  • Pair with a shorter spec.timeout. Budget caps tokens; timeouts cap wall-clock. Both matter: a slow agent burning tokens steadily needs both to stop.
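
The budget is expressed in tokens, while the ceiling you care about is usually in dollars. A rough conversion sketch follows; the price used is a made-up placeholder, not any provider's actual rate, and a single blended price is itself a simplification because input and output tokens are typically priced differently.

```python
def tokens_for_dollars(
    daily_dollar_ceiling: float,
    blended_usd_per_million_tokens: float,
    headroom_tokens: int = 0,
) -> int:
    """Convert a daily dollar ceiling into a maxTokensPerDay value.

    headroom_tokens is subtracted to leave room for the one in-flight
    request that can cross the cap before the counter catches up.
    """
    raw = int(daily_dollar_ceiling / blended_usd_per_million_tokens * 1_000_000)
    return max(raw - headroom_tokens, 0)


# Hypothetical numbers: a $20/day ceiling at a placeholder price of $10 per
# million tokens, leaving 50,000 tokens of headroom for one large request.
cap = tokens_for_dollars(20.0, 10.0, headroom_tokens=50_000)
print(cap)
```

The result is what you would put in maxTokensPerDay on the route dedicated to that agent.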

A note on RateLimitPolicy

CLRK ships a RateLimitPolicy CRD for request-rate (rather than token-volume) limits - coming soon, not enforced today. If you need QPS-style limits on the egress side this quarter, talk to us. The budget filter described above is what's wired and works.

Where to next