Persist state across runs
Pattern: keep agent state in external storage so a multi-turn conversation can resume on the next invocation.
This is a pattern guide, not a runnable tutorial. The example below is a sketch; treat it as the architecture, not a copy-paste recipe.
CLRK sandboxes have no durable, cross-worker storage by default. Each invocation starts from a fresh rootfs; /tmp and any image-baked filesystem state reset on every run. (TaskAgent.spec.state offers a worker-local mount shared across executions of the same agent on a worker - fine for lighter cases; its backend today is SQLite.)
This guide covers durable, cross-worker/cross-cluster state: if your agent needs to remember anything between invocations - a conversation history, a workflow checkpoint, a counter - that memory has to live outside the sandbox, in storage the agent reaches over egress.
This guide describes the pattern for the canonical case: a Slack-thread-resuming agent built on LangGraph's Postgres checkpointer.
The pattern
The agent is a thin compute layer over a database. Each invocation:
- Loads state from the external store, keyed by something the trigger provides (a Slack thread ID, a ticket ID, a request ID).
- Continues whatever logic that state represents - for LangGraph, that's resuming a graph at the last checkpoint.
- Persists the updated state.
- Replies (Slack message, HTTP response, downstream API call).
- Exits.
The sandbox is short-lived and stateless. The conversation lives in Postgres.
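The loop above can be sketched without any framework. This is a minimal illustration, not CLRK or LangGraph code: an in-memory SQLite database stands in for the external store, and the table name `agent_state` and the helpers `load_state`, `save_state`, and `handle_turn` are hypothetical names chosen for the example.

```python
import json
import sqlite3


def load_state(db: sqlite3.Connection, key: str) -> dict:
    """Return the stored state for key, or a fresh state on the first turn."""
    row = db.execute("SELECT state FROM agent_state WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else {"messages": []}


def save_state(db: sqlite3.Connection, key: str, state: dict) -> None:
    """Write the state back so the next invocation can resume from it."""
    db.execute(
        "INSERT OR REPLACE INTO agent_state (key, state) VALUES (?, ?)",
        (key, json.dumps(state)),
    )
    db.commit()


def handle_turn(db: sqlite3.Connection, key: str, user_text: str) -> str:
    """One invocation: load, continue, persist, then produce the reply."""
    state = load_state(db, key)
    state["messages"].append({"role": "user", "content": user_text})
    # Placeholder for the real agent logic (an LLM call, a graph step, ...).
    reply = f"turn {len(state['messages'])}: {user_text}"
    save_state(db, key, state)
    return reply


# In-memory SQLite is only a stand-in for the external database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE agent_state (key TEXT PRIMARY KEY, state TEXT NOT NULL)")
print(handle_turn(db, "thread-1", "hello"))
print(handle_turn(db, "thread-1", "again"))
```

The second call sees the message stored by the first, even though nothing is held in process memory between the two calls.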
Slack thread → LangGraph checkpoint
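A Slack event_callback payload for a message event carries a thread_ts when the message is a reply in a thread. A message that starts a thread has only ts, and that value becomes the thread_ts of every reply. Either way you get a stable key. Use it as LangGraph's thread_id: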
#!/usr/bin/env python3
import json
import os
import sys

import requests
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# Build the graph (your application logic).
builder = StateGraph(...)
# ... add nodes and edges ...

# Read the inbound Slack event from the CloudEvents envelope on stdin.
envelope = json.load(sys.stdin)
event = envelope["data"]["event"]
# Replies carry thread_ts; the first message of a thread has only ts.
thread_id = event.get("thread_ts") or event["ts"]
user_text = event["text"]

# Connect to the external Postgres via the egress allowlist.
# from_conn_string is a context manager; the connection closes when the block exits.
with PostgresSaver.from_conn_string(os.environ["DATABASE_URL"]) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables if they are missing
    graph = builder.compile(checkpointer=checkpointer)

    # Resume at this thread's checkpoint and continue.
    config = {"configurable": {"thread_id": thread_id}}
    result = graph.invoke(
        {"messages": [{"role": "user", "content": user_text}]},
        config=config,
    )

# Reply in the same Slack thread.
# Assumes the graph state stores messages as dicts; LangChain message objects
# expose .content instead.
requests.post(
    "https://slack.com/api/chat.postMessage",
    json={
        "channel": event["channel"],
        "thread_ts": thread_id,
        "text": result["messages"][-1]["content"],
    },
    headers={"Authorization": "Bearer placeholder-injected-by-proxy"},
    timeout=10,
)

# Acknowledge the webhook.
print(json.dumps({"ok": True}))
The first time a thread sees a message, LangGraph creates a fresh checkpoint row. Every subsequent message in the same thread resumes from there. The sandbox doesn't need to know it's the second turn - the checkpointer handles continuity.
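To make the key derivation concrete, here is a small sketch showing that the first message of a thread and a later reply map to the same checkpoint key. The helper name `checkpoint_key` is hypothetical. Prefixing the channel ID is an optional hardening that the script above does not do; it is suggested here because Slack timestamps are unique per channel rather than globally.

```python
def checkpoint_key(event: dict) -> str:
    """Derive a stable key for the Slack thread a message belongs to."""
    # Replies carry thread_ts; the first message of a thread has only ts.
    root_ts = event.get("thread_ts") or event["ts"]
    # Assumption for this sketch: include the channel so two channels
    # can never share a key.
    return f"{event['channel']}:{root_ts}"


# Example payload fragments with made-up IDs and timestamps.
first = {"channel": "C123", "ts": "1700000000.000100", "text": "hi"}
reply = {
    "channel": "C123",
    "ts": "1700000042.000200",
    "thread_ts": "1700000000.000100",
    "text": "more",
}

# Both messages resolve to the same key, so they share one checkpoint.
print(checkpoint_key(first) == checkpoint_key(reply))
```

If you adopt the channel prefix, use it from the first deployment: changing the key format later orphans existing checkpoints.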
CLRK pieces this pattern needs
- An auth proxy in front of the CLRK ingress to validate Slack's signing secret. Forward the verified event to your TaskAgent with X-Clrk-TaskAgent set.
- Egress allowlist for the two external destinations the agent needs: your Postgres (by hostname or CIDR) and slack.com:443. Deny everything else by default.
- Credential injection for the Slack bot token and the LLM provider key the agent calls. The Postgres password is the exception: CLRK's credential injection swaps HTTP headers, not Postgres protocol auth, so supply it through the DATABASE_URL env var. Use a literal spec.template.spec.env value (same nesting as image/command) containing a connection string that includes the password, and rotate by re-applying.
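For the auth proxy item above, the check follows Slack's documented signing scheme: an HMAC-SHA256 over `v0:{timestamp}:{raw body}`, compared against the X-Slack-Signature header, with the timestamp taken from X-Slack-Request-Timestamp. The sketch below shows only that check. The function name, the `now` parameter (added for testability), and the example secret are assumptions; how the proxy is deployed in front of CLRK is not shown.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
    max_skew_seconds: int = 300,
) -> bool:
    """Return True if signature matches the raw body and the request is recent."""
    current = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject old requests to limit replay attacks.
    if abs(current - sent_at) > max_skew_seconds:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks.
    return hmac.compare_digest(expected.encode(), signature.encode())


# Made-up values for illustration only.
secret = "example-signing-secret"
ts = "1700000000"
raw_body = b'{"type":"event_callback"}'
good = "v0=" + hmac.new(
    secret.encode(), b"v0:" + ts.encode() + b":" + raw_body, hashlib.sha256
).hexdigest()
print(verify_slack_signature(secret, ts, raw_body, good, now=1700000010))
```

Verify against the raw request bytes, before any JSON parsing or re-serialization, since re-encoding the body changes the signature input.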
Concurrency caveat
Slack fires events as fast as messages arrive. Two messages in the same thread within a second is normal. If one turn has loaded state but not yet committed its update, a second concurrent invocation can read the same stale checkpoint and clobber the first turn's update.
Two mitigations:
- Use LangGraph's checkpointer transactions. The Postgres checkpointer takes row-level locks around updates. Your application code on top has to commit its own side effects (Slack message, database writes) inside the same transaction or after the checkpoint commits, not before.
- Cap concurrency. Set spec.maxConcurrent: 1 on the TaskAgent to serialize per-agent. Coarse but effective for low-throughput agents. For higher throughput, partition by thread ID at the ingress layer.
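Partitioning by thread ID can be as simple as a stable hash. The sketch below maps a thread ID to one of N partitions, so all messages from one thread land on the same serialized lane while different threads spread out. The function name and the partition count are assumptions, and how a partition maps onto CLRK resources depends on your ingress setup and is not shown.

```python
import hashlib


def partition_for(thread_id: str, partitions: int) -> int:
    """Map a thread ID to a partition index in [0, partitions)."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    # Use a cryptographic digest rather than built-in hash(), which is
    # randomized per process and would route inconsistently across workers.
    digest = hashlib.sha256(thread_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partitions


# The same thread always lands on the same partition.
lane = partition_for("1700000000.000100", 8)
print(lane == partition_for("1700000000.000100", 8))
```

Changing the partition count remaps threads, so drain in-flight work before resizing.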
What does NOT fit this pattern
- Hot streaming back to Slack. The TaskAgent response model is request/response. If you need to stream tokens to a Slack message as the LLM generates them, you need a DaemonAgent + its own outbound Slack connection, plus a queue between the webhook and the daemon.
- Multi-MB conversation contexts. A 5 MB JSON blob round-tripped to Postgres on every turn will dominate your latency. Either prune the context as it grows (LangGraph supports message-window summarization) or move bulky artifacts to S3 and store pointers in Postgres.
- Cross-cluster failover with shared state. External Postgres can replicate across clusters; CLRK doesn't help or hinder that. Standard database patterns apply.
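For the context-size point above, the simplest form of pruning is a byte-budgeted window over the most recent messages. The sketch below is plain Python, not LangGraph's API, and it drops old messages rather than summarizing them. The function name and the budget are assumptions, and the size estimate ignores list delimiters, so treat it as approximate.

```python
import json


def prune_messages(messages: list, max_bytes: int) -> list:
    """Keep the most recent messages whose JSON encoding fits in max_bytes.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        size = len(json.dumps(message).encode())
        if kept and used + size > max_bytes:
            break
        kept.append(message)
        used += size
    kept.reverse()  # restore chronological order
    return kept


history = [{"role": "user", "content": str(i) * 50} for i in range(5)]
one_message = len(json.dumps(history[0]).encode())
window = prune_messages(history, one_message * 2)
print(len(window))
```

Run the pruning step before persisting, so the stored state stays bounded from turn to turn.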
Alternative stores
The pattern (key in, state in, state out, key out) works against whatever your team already runs:
- Redis when state is small and access is hot. Eviction makes it unsuitable as the only durable store; pair with Postgres as the authoritative copy.
- S3 + versioning when state is large, append-only, and you want an audit trail of every turn. Higher latency than Postgres.
- DynamoDB / Spanner / Cosmos for serverless / multi-region. Same pattern, different SDK.
LangGraph specifically ships checkpointers for Postgres and SQLite in-tree; community checkpointers exist for Redis and others.
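Pairing a fast cache with an authoritative store, as in the Redis item above, can be sketched as a write-through wrapper over two objects with the same load/save shape. Everything here is illustrative: `StateStore`, `DictStore`, and `WriteThroughStore` are hypothetical names, and the dict-backed store stands in for real Redis or Postgres clients.

```python
from typing import Optional, Protocol


class StateStore(Protocol):
    def load(self, key: str) -> Optional[dict]: ...
    def save(self, key: str, state: dict) -> None: ...


class DictStore:
    """In-memory stand-in for any backend (Redis, Postgres, DynamoDB, ...)."""

    def __init__(self) -> None:
        self.rows: dict = {}

    def load(self, key: str) -> Optional[dict]:
        return self.rows.get(key)

    def save(self, key: str, state: dict) -> None:
        self.rows[key] = dict(state)


class WriteThroughStore:
    """Read from the cache first; always write to the durable store."""

    def __init__(self, cache: StateStore, durable: StateStore) -> None:
        self.cache = cache
        self.durable = durable

    def load(self, key: str) -> Optional[dict]:
        state = self.cache.load(key)
        if state is None:
            # Cache miss (for example after eviction): fall back and refill.
            state = self.durable.load(key)
            if state is not None:
                self.cache.save(key, state)
        return state

    def save(self, key: str, state: dict) -> None:
        # Durable write first, so the cache never holds state that was lost.
        self.durable.save(key, state)
        self.cache.save(key, state)


cache, durable = DictStore(), DictStore()
store = WriteThroughStore(cache, durable)
store.save("thread-1", {"turns": 3})
cache.rows.clear()  # simulate eviction
print(store.load("thread-1"))
```

After the simulated eviction, the load falls back to the durable copy and repopulates the cache.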
Where to next
- Stand up the egress allowlist this pattern depends on - see Lock down agent egress.
- Validate Slack's signing secret before the request reaches CLRK - see Authenticate users before agents.
- Watch how this multi-turn flow looks in your observability backend