Cap LLM spend per agent
Set token budgets that the egress proxy enforces. The agent sees a 429 when over budget; the denial lands in OTLP for alerting.
You don't want one stuck agent loop to burn $400 of Anthropic credit
overnight. The fix is a TokenBudgetFilter on the AIProviderRoute
the agent's traffic flows through. The proxy refuses overage requests
before they reach the provider, and the denial is observable.
What gets counted
TokenBudgetFilter lives on AIProviderRoute.spec.rules[].filters[].
One enforced knob, plus two accepted-but-not-yet-wired:
- maxTokensPerDay - caps cumulative tokens across all executions hitting this route within a UTC calendar-day window. This is the only field enforced today (a counter per route, per EgressGateway, per UTC day). The right tool for "this agent should never cost more than X."
- maxTokensPerExecution - intended to cap total tokens (input + output) per agent execution. The field is accepted by the schema (manifests apply cleanly) but not enforced by any code path today - like RateLimitPolicy, treat it as coming soon, not a working cap.
- maxOutputTokensPerRequest - intended to cap output tokens on a single API call. Also accepted by the schema but not enforced today; do not rely on it to bound model output yet.
Counts come from the provider's own usage fields. The Anthropic
parser reads usage.input_tokens and usage.output_tokens from the
response body; OpenAI and Google do the equivalent. Streaming
responses are attributed when the stream ends.
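
To make the counting concrete, here is a minimal Python sketch of reading the two usage fields named above from an Anthropic response body. It is an illustration, not the proxy's parser: the function name is hypothetical, and summing input and output tokens into one number is an assumption about how the counter is fed.

```python
import json


def tokens_from_anthropic_response(body: str) -> int:
    """Sum usage.input_tokens and usage.output_tokens from a response body.

    Hypothetical helper for illustration; missing fields count as zero.
    """
    usage = json.loads(body).get("usage", {})
    return int(usage.get("input_tokens", 0)) + int(usage.get("output_tokens", 0))


# Made-up response body containing only the fields this sketch reads.
sample = '{"id": "msg_example", "usage": {"input_tokens": 1200, "output_tokens": 350}}'
print(tokens_from_anthropic_response(sample))  # prints 1550
```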
Worked example
Add a daily-token cap to the Anthropic route from Hide credentials from agents:
apiVersion: clrk.apoxy.dev/v1alpha1
kind: AIProviderRoute
metadata:
  name: anthropic
spec:
  parentRefs:
    - group: clrk.apoxy.dev
      kind: EgressGateway
      name: echo-bot
  rules:
    - matches:
        - provider: anthropic
          endpoints:
            - /v1/messages
      filters:
        - type: TokenBudget
          tokenBudget:
            maxTokensPerDay: 100000

maxTokensPerDay is the only field that takes effect today, so the
example sets only that.
Apply, watch a few invocations burn tokens, then exceed the budget intentionally to see what happens.
What the agent sees over budget
The proxy returns HTTP 429 to the agent's outbound call with a body that names the route that blocked it:
HTTP/1.1 429 Too Many Requests
clrk: token budget exceeded for route anthropic

Your agent code has to handle 429. For curl scripts: check the exit
status and the captured status code. For Python: catch the 429 and
either back off, fail the run with a clear error, or fall back to a
cheaper model.
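
One way to structure the Python side is sketched below. It assumes a hypothetical `send` callable that wraps whatever HTTP client the agent uses and returns a `(status, body)` pair; the model names are placeholders. The sketch tries each model in order, moves on when it sees a 429, and fails the run with a clear error if every attempt is denied.

```python
from typing import Callable, Sequence, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when every attempted call was denied with HTTP 429."""


def call_model(send: Callable[[str], Tuple[int, str]], models: Sequence[str]) -> str:
    """Try each model in order; fall through on 429, fail clearly if all are denied.

    `send` is a hypothetical wrapper around the agent's HTTP client.
    """
    denials = []
    for model in models:
        status, body = send(model)
        if status == 429:
            # Keep the proxy's message so the final error names the blocking route.
            denials.append(f"{model}: {body.strip()}")
            continue
        if status >= 400:
            raise RuntimeError(f"{model} failed with HTTP {status}: {body.strip()}")
        return body
    raise BudgetExceeded("; ".join(denials))


def fake_send(model: str) -> Tuple[int, str]:
    """Stand-in transport: the primary model is over budget, the fallback is not."""
    if model == "primary-model":
        return 429, "clrk: token budget exceeded for route anthropic"
    return 200, "ok"


print(call_model(fake_send, ["primary-model", "fallback-model"]))  # prints ok
```

Because the daily counter is kept per route, a fallback only helps if the cheaper model is reached through a different route; a second call on the same over-budget route would presumably be denied as well.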
What lands in OTLP
Every pre-flight denial emits an OTLP span carrying
clrk.budget.denied=true and two companion attributes:
- clrk.budget.daily_used - current consumption inside the window.
- clrk.budget.daily_max - the cap that triggered the denial.
Wire your OTLP backend to alert on clrk.budget.denied=true. That's
your "an agent just hit its ceiling" signal. See Send telemetry to
OTLP endpoints for collector config.
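
If alerts are generated in code rather than in a backend rule, the check reduces to a filter on span attributes. The sketch below assumes the span's attributes have already been decoded into a plain mapping and that the denied flag may arrive as either a boolean or the string "true"; the function name and message format are hypothetical.

```python
from typing import Mapping, Optional


def budget_alert(attrs: Mapping[str, object]) -> Optional[str]:
    """Return an alert message for a budget-denied span, or None otherwise."""
    # Accept both True and "true", since the decoded value type is an assumption.
    if str(attrs.get("clrk.budget.denied")).lower() != "true":
        return None
    used = attrs.get("clrk.budget.daily_used")
    cap = attrs.get("clrk.budget.daily_max")
    return f"token budget denied: {used} used of {cap} daily cap"


span_attrs = {
    "clrk.budget.denied": True,
    "clrk.budget.daily_used": 120000,
    "clrk.budget.daily_max": 100000,
}
print(budget_alert(span_attrs))
```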
When budgets bite at the wrong moment
- Streaming completions attribute at stream end. A streaming response that pushes the route's daily total over maxTokensPerDay still completes (you can't refuse mid-stream); the next request on this route is denied once the daily counter is over cap. Size budgets with headroom - usage is accounted after the call finishes, so the cap can be crossed by the in-flight request.
- Pre-flight checks the daily counter, not the request. Pre-flight runs at request-headers time, before the body is buffered, and compares only the route's running daily total against maxTokensPerDay. It does not read the request's declared max_tokens, so an oversized single request is not rejected up front - the denial lands only once the daily total is already over cap.
- Window math is UTC-rolled-daily. "Daily" means a calendar window in UTC, not a 24-hour rolling window. A burn-down at 23:59 UTC followed by a burst at 00:01 will succeed; that's the point of the daily window. Use a smaller window if you need tighter bounding.
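
The three behaviors above can be reproduced with a toy model of the counter. This is a sketch for reasoning about timing, not CLRK's implementation: the class and method names are hypothetical, and the exact comparison at the boundary (whether a total exactly equal to the cap is allowed) is an assumption.

```python
from datetime import datetime, timezone


class DailyBudgetModel:
    """Toy per-route counter keyed by UTC calendar day (illustrative only)."""

    def __init__(self, max_tokens_per_day: int) -> None:
        self.max_tokens_per_day = max_tokens_per_day
        self.used_by_day = {}

    @staticmethod
    def _day(now: datetime) -> str:
        # The window is the UTC calendar date, not a rolling 24 hours.
        return now.astimezone(timezone.utc).date().isoformat()

    def preflight_allows(self, now: datetime) -> bool:
        # Looks only at the running total; the request's own size is never read.
        return self.used_by_day.get(self._day(now), 0) < self.max_tokens_per_day

    def record_usage(self, now: datetime, tokens: int) -> None:
        # Usage is added after the response (or stream) has finished.
        day = self._day(now)
        self.used_by_day[day] = self.used_by_day.get(day, 0) + tokens


budget = DailyBudgetModel(max_tokens_per_day=100_000)
late = datetime(2024, 1, 1, 23, 50, tzinfo=timezone.utc)
budget.record_usage(late, 90_000)

allowed_before = budget.preflight_allows(late)   # under cap, so the call goes out
budget.record_usage(late, 30_000)                # in-flight call pushes total to 120000
denied_after = not budget.preflight_allows(late) # next call the same day is denied
next_day = datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc)
allowed_next_day = budget.preflight_allows(next_day)  # new UTC day, fresh counter

print(allowed_before, denied_after, allowed_next_day)  # prints True True True
```

The run shows the in-flight request overshooting the cap by 20,000 tokens, which is the headroom to plan for when sizing the budget.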
Designing budgets that hold
- Per-agent budgets: one AIProviderRoute per agent (or per group of agents) so the per-day cap is scoped to the right unit. Sharing one route across many agents means they share the budget too.
- A tight maxTokensPerDay is your runaway-loop guard. It's the only enforced cap today, so an agent stuck in a recursive tool-call is bounded by the daily total - not by any per-execution limit. Set maxTokensPerDay low enough that a single runaway run can't exhaust a meaningful budget. (Revisit this once maxTokensPerExecution is wired.)
- Pair with shorter spec.timeout. Budget caps tokens; timeouts cap wall-clock. Both matter: a slow agent burning tokens steadily needs both to stop.
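
To turn a dollar ceiling into a maxTokensPerDay value, a rough back-of-the-envelope conversion can help. The sketch below is an assumption-laden illustration: the price is a placeholder rather than any provider's real rate, a single blended price ignores the difference between input and output token pricing, and the headroom fraction is an arbitrary allowance for the in-flight request that can cross the cap.

```python
def max_tokens_for_budget(
    daily_usd: float,
    usd_per_million_tokens: float,
    headroom_fraction: float = 0.1,
) -> int:
    """Convert a daily dollar ceiling into a token cap, leaving some headroom.

    Hypothetical helper; all inputs are values you supply, not CLRK settings.
    """
    raw_tokens = daily_usd / usd_per_million_tokens * 1_000_000
    return int(round(raw_tokens * (1 - headroom_fraction)))


# Placeholder numbers: a $5/day ceiling at a blended $10 per million tokens.
print(max_tokens_for_budget(daily_usd=5.0, usd_per_million_tokens=10.0))  # prints 450000
```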
A note on RateLimitPolicy
CLRK ships a RateLimitPolicy CRD for request-rate (rather than
token-volume) limits - coming soon, not enforced today. If you need
QPS-style limits on the egress side this quarter, talk to us. The
budget filter described above is what's wired and works.
Where to next
- Send budget-denied alerts to your pager - see Send telemetry to OTLP endpoints.
- Make sure the budget is the only knob between an agent and your bill - pair with Lock down agent egress so the agent can't simply call a different provider.
- Understand which inbound request triggered an over-budget call - see Trace requests through agents.