Persist state across runs

Pattern: keep agent state in external storage so a multi-turn conversation can resume on the next invocation.

Pattern guide

This is a pattern guide, not a runnable tutorial. The example below is a sketch; treat it as the architecture, not a copy-paste recipe.

CLRK sandboxes have no durable, cross-worker storage by default. Each invocation starts from a fresh rootfs; /tmp and any filesystem state baked into the image reset on every run. (TaskAgent.spec.state offers a worker-local mount shared across executions of the same agent on one worker, which is fine for lighter cases; its backend today is SQLite.) This guide covers durable state that survives across workers and clusters: if your agent needs to remember anything between invocations (a conversation history, a workflow checkpoint, a counter), that memory has to live outside the sandbox, in storage the agent reaches over egress.

This guide describes the pattern for the canonical case: a Slack-thread-resuming agent built on LangGraph's Postgres checkpointer.

The pattern

The agent is a thin compute layer over a database. Each invocation:

  1. Loads state from the external store, keyed by something the trigger provides (a Slack thread ID, a ticket ID, a request ID).
  2. Continues whatever logic that state represents - for LangGraph, that's resuming a graph at the last checkpoint.
  3. Persists the updated state.
  4. Replies (Slack message, HTTP response, downstream API call).
  5. Exits.

The sandbox is short-lived and stateless. The conversation lives in Postgres.
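
The five steps above can be sketched as a single function. This is a minimal illustration, not CLRK or LangGraph API: `StateStore`, `InMemoryStore`, `run_turn`, and the state shape are hypothetical names, and the in-memory dict only stands in for the external database.

```python
from typing import Callable, Optional, Protocol


class StateStore(Protocol):
    """Hypothetical interface over the external store (Postgres, Redis, ...)."""

    def load(self, key: str) -> Optional[dict]: ...

    def save(self, key: str, state: dict) -> None: ...


class InMemoryStore:
    """Stand-in for the real database; only useful for local experiments."""

    def __init__(self) -> None:
        self._rows = {}

    def load(self, key: str) -> Optional[dict]:
        return self._rows.get(key)

    def save(self, key: str, state: dict) -> None:
        self._rows[key] = state


def run_turn(store: StateStore, key: str, user_text: str,
             reply: Callable[[str], None]) -> dict:
    # 1. Load state by the trigger-provided key (fresh state on first contact).
    state = store.load(key) or {"messages": []}
    # 2. Continue the logic; this placeholder only appends the message and echoes it.
    messages = state["messages"] + [{"role": "user", "content": user_text}]
    answer = f"turn {len(messages)}: {user_text}"
    # 3. Persist the updated state before replying.
    new_state = {"messages": messages}
    store.save(key, new_state)
    # 4. Reply. 5. Return, after which the process can exit.
    reply(answer)
    return new_state
```

Persisting before replying means a crash between the two steps loses a reply rather than a turn of state, which is usually the easier failure to recover from.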

Slack thread → LangGraph checkpoint

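For a threaded reply, Slack's event_callback for a message event carries a thread_ts that identifies the thread the message landed in. A top-level message has only its own ts, which becomes the thread_ts of any replies, so fall back to ts when thread_ts is absent. That's your stable key. Use it as LangGraph's thread_id: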

```python
#!/usr/bin/env python3
import json
import os
import sys

import requests
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# Build the graph (your application logic).
builder = StateGraph(...)
# ... add nodes and edges ...

# Read the inbound Slack event from the CloudEvents envelope on stdin.
envelope = json.load(sys.stdin)
event = envelope["data"]["event"]
# Replies carry thread_ts; a top-level message starts a thread keyed by its own ts.
thread_id = event.get("thread_ts") or event["ts"]
user_text = event["text"]

# Connect to the external Postgres via the egress allowlist.
# from_conn_string is a context manager, so the connection closes on exit.
with PostgresSaver.from_conn_string(os.environ["DATABASE_URL"]) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables if they are missing
    graph = builder.compile(checkpointer=checkpointer)

    # Resume at this thread's checkpoint and continue.
    config = {"configurable": {"thread_id": thread_id}}
    result = graph.invoke(
        {"messages": [{"role": "user", "content": user_text}]},
        config=config,
    )

# Reply in the same Slack thread, after the checkpoint has been written.
# Assumes the graph state stores messages as plain dicts; LangChain message
# objects expose .content instead.
requests.post(
    "https://slack.com/api/chat.postMessage",
    json={
        "channel": event["channel"],
        "thread_ts": thread_id,
        "text": result["messages"][-1]["content"],
    },
    headers={"Authorization": "Bearer placeholder-injected-by-proxy"},
    timeout=10,
)

# Acknowledge the webhook.
print(json.dumps({"ok": True}))
```

The first time a thread sees a message, LangGraph creates a fresh checkpoint row. Every subsequent message in the same thread resumes from there. The sandbox doesn't need to know it's the second turn - the checkpointer handles continuity.
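
The key derivation in the example above can be isolated into a small helper. The sketch below assumes the same envelope shape (`data.event`); `thread_key` is a hypothetical name. It also prefixes the channel ID, on the assumption that Slack timestamps are only guaranteed unique within a channel; drop the prefix if your agent serves a single channel.

```python
def thread_key(envelope: dict) -> str:
    """Return a stable per-thread key for a Slack message event."""
    event = envelope["data"]["event"]
    # Replies carry thread_ts; a top-level message has only ts, which becomes
    # the thread_ts of any replies to it.
    root_ts = event.get("thread_ts") or event["ts"]
    return f'{event["channel"]}:{root_ts}'
```

A top-level message and every reply in its thread map to the same key, so the first turn and all later turns share one checkpoint history.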

CLRK pieces this pattern needs

  • An auth proxy in front of the CLRK ingress to verify Slack's request signature against your signing secret. Forward the verified event to your TaskAgent with X-Clrk-TaskAgent set.
  • Egress allowlist for the two external destinations the agent needs: your Postgres (by hostname or CIDR) and slack.com:443. Deny everything else by default.
  • Credential injection for the Slack bot token and the key for the LLM provider the agent calls. The Postgres password is the exception: CLRK's credential injection swaps HTTP headers, not Postgres protocol auth, so the password is typically supplied via the DATABASE_URL env var. Set it as a literal spec.template.spec.env value (same nesting as image/command) containing a connection string that includes the password, and rotate by re-applying.
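
The signature check in the auth proxy can be sketched with the standard library. This follows Slack's published v0 signing scheme (HMAC-SHA256 over `v0:{timestamp}:{raw body}`, compared with the `X-Slack-Signature` header). The function name and the five-minute replay window are choices made for this example, and how the proxy then forwards the event to CLRK is not shown.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: Optional[float] = None,
                           max_skew_seconds: int = 300) -> bool:
    """Check X-Slack-Signature for a raw request body.

    timestamp is the X-Slack-Request-Timestamp header value and signature is
    the X-Slack-Signature header value.
    """
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale requests to limit replay attacks.
    if abs(now - sent_at) > max_skew_seconds:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison; compare bytes so odd header values cannot raise.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Verify against the raw request bytes, before any JSON parsing, because re-serializing the body can change it and invalidate the signature.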

Concurrency caveat

Slack fires events as fast as messages arrive. Two messages in the same thread within a second is normal. If one turn is still in flight, with its state update or side effects not yet committed, a second concurrent invocation can read stale state and clobber the first turn's update.

Two mitigations:

  • Use LangGraph's checkpointer transactions. The Postgres checkpointer takes row-level locks around updates. Your application code on top must not let its own side effects land before the checkpoint commits: make database writes inside the same transaction, and send the Slack message after the commit.
  • Cap concurrency. Set spec.maxConcurrent: 1 on the TaskAgent to serialize per-agent. Coarse but effective for low-throughput agents. For higher throughput, partition by thread ID at the ingress layer.
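
Partitioning by thread ID can be as simple as a stable hash. The sketch below is one possible approach, with `partition_for` as a hypothetical helper name. What the ingress does with the index (for example, routing to one of N TaskAgents that each run with spec.maxConcurrent: 1) is a deployment assumption, not something CLRK prescribes.

```python
import hashlib


def partition_for(thread_id: str, partitions: int) -> int:
    """Map a thread ID to a partition index that is stable across processes."""
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    # Python's built-in hash() is salted per process for strings, so use a
    # real digest to get the same answer on every ingress replica.
    digest = hashlib.sha256(thread_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

Because every message in a thread lands on the same partition, turns within a thread are serialized while different threads still run in parallel.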

What does NOT fit this pattern

  • Hot streaming back to Slack. The TaskAgent response model is request/response. If you need to stream tokens into a Slack message as the LLM generates them, you need a DaemonAgent with its own outbound Slack connection, plus a queue between the webhook and the daemon.
  • Multi-MB conversation contexts. A 5 MB JSON blob round-tripped to Postgres on every turn will dominate your latency. Either prune the context as it grows (LangGraph supports message-window summarization) or move bulky artifacts to S3 and store pointers in Postgres.
  • Cross-cluster failover with shared state. External Postgres can replicate across clusters; CLRK doesn't help or hinder that. Standard database patterns apply.
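
For the multi-MB context case above, the simplest form of pruning is a byte budget over the most recent messages. The sketch below is plain truncation, not LangGraph's summarization; `prune_messages` and the budget parameter are hypothetical, and it assumes messages are JSON-serializable dicts.

```python
import json


def prune_messages(messages: list, max_bytes: int) -> list:
    """Keep the newest run of messages whose JSON encodings total at most max_bytes."""
    kept = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        size = len(json.dumps(message).encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(message)
        used += size
    kept.reverse()  # restore chronological order
    return kept
```

Truncation loses older context outright; if the agent needs it, summarize the dropped prefix or store it elsewhere and keep a pointer in the state.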

Alternative stores

The pattern (key in, state in, state out, key out) works against whatever your team already runs:

  • Redis when state is small and access is hot. Eviction makes it unsuitable as the only durable store; pair with Postgres as the authoritative copy.
  • S3 + versioning when state is large, append-only, and you want an audit trail of every turn. Higher latency than Postgres.
  • DynamoDB / Spanner / Cosmos for serverless / multi-region. Same pattern, different SDK.

LangGraph specifically ships checkpointers for Postgres and SQLite in-tree; community checkpointers exist for Redis and others.
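
The Redis-plus-Postgres pairing mentioned above is a read-through cache in front of an authoritative store. A minimal sketch, with plain dicts standing in for both backends; `CachedStore` is a hypothetical name, and a real version would use the respective client libraries and serialize state.

```python
from typing import Optional


class CachedStore:
    """Read-through cache in front of an authoritative store.

    Both arguments are dict stand-ins: cache for Redis, durable for Postgres.
    """

    def __init__(self, cache: dict, durable: dict) -> None:
        self.cache = cache
        self.durable = durable

    def load(self, key: str) -> Optional[dict]:
        if key in self.cache:
            return self.cache[key]
        state = self.durable.get(key)
        if state is not None:
            self.cache[key] = state  # repopulate after a miss or eviction
        return state

    def save(self, key: str, state: dict) -> None:
        # Write the authoritative copy first; the cache is only an optimization.
        self.durable[key] = state
        self.cache[key] = state
```

If the cache evicts a key, the next load falls back to the durable copy, so eviction costs latency rather than data.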

Where to next