# Persist state across runs

> Pattern: keep agent state in external storage so a multi-turn conversation can resume on the next invocation.

<Callout label="Pattern guide">
  This is a pattern guide, not a runnable tutorial. The example below
  is a sketch; treat it as the architecture, not a copy-paste recipe.
</Callout>

CLRK sandboxes have no durable, cross-worker storage by default. Each
invocation starts from a fresh rootfs; `/tmp` and any image-baked
filesystem state reset on every run. (`TaskAgent.spec.state` offers a
worker-local mount shared across executions of the same agent on a
worker - fine for lighter cases; its backend today is `sqlite`.)
This guide covers durable state that survives across workers and
clusters: if your agent needs to remember anything between
invocations - a conversation history, a workflow checkpoint, a
counter - that memory has to live outside the sandbox, in storage the
agent reaches over egress.

This guide describes the pattern for the canonical case: a
Slack-thread-resuming agent built on LangGraph's Postgres
checkpointer.

## The pattern

```mermaid
flowchart TB
  S[Slack webhook] --> A[Auth proxy]
  A --> I[CLRK ingress]
  I --> X[Sandbox: agent.py]
  X -- "SELECT/INSERT" --> PG[(External Postgres<br/>LangGraph checkpoints)]
  X -- "chat.postMessage" --> SL[Slack API]
```

The agent is a thin compute layer over a database. Each invocation:

1. Loads state from the external store, keyed by something the
   trigger provides (a Slack thread ID, a ticket ID, a request ID).
2. Continues whatever logic that state represents - for LangGraph,
   that's resuming a graph at the last checkpoint.
3. Persists the updated state.
4. Replies (Slack message, HTTP response, downstream API call).
5. Exits.

The sandbox is short-lived and stateless. The conversation lives in
Postgres.
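
The five steps can be sketched without any framework. This is a minimal illustration, not CLRK or LangGraph code: `handle_turn` and the `agent_state` table are hypothetical names, and an in-process SQLite connection stands in for the external store purely so the sketch runs anywhere.

```python
import json
import sqlite3


def handle_turn(conn: sqlite3.Connection, key: str, user_text: str) -> str:
    """Run one stateless invocation: load, continue, persist, reply."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent_state "
        "(key TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )

    # 1. Load state keyed by the trigger's identifier (empty on the first turn).
    row = conn.execute(
        "SELECT state FROM agent_state WHERE key = ?", (key,)
    ).fetchone()
    state = json.loads(row[0]) if row else {"messages": []}

    # 2. Continue the logic the state represents (placeholder: record the turn).
    state["messages"].append(user_text)
    reply = f"turn {len(state['messages'])}: {user_text}"

    # 3. Persist the updated state; the `with` block commits the write.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO agent_state (key, state) VALUES (?, ?)",
            (key, json.dumps(state)),
        )

    # 4. Hand the reply back to the caller, which sends it and then (5) exits.
    return reply
```

Nothing in the function survives between calls except what it wrote to the store, which is the property the sandbox model requires.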

## Slack thread → LangGraph checkpoint

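Slack delivers message events inside an `event_callback` envelope. A
reply inside a thread carries a `thread_ts` that identifies the
thread; a top-level message has only its own `ts`, which becomes the
`thread_ts` of any later replies. Falling back from `thread_ts` to
`ts` therefore gives you a stable key. Use it as LangGraph's
`thread_id`; if one agent serves several channels, prefix it with the
channel ID, because Slack timestamps are unique only within a channel: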

```python
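#!/usr/bin/env python3
import json
import os
import sys

import requests
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph

# Build the graph (your application logic).
builder = StateGraph(...)
# ... add nodes and edges ...

# Read the inbound Slack event from the CloudEvents envelope on stdin.
envelope = json.load(sys.stdin)
event = envelope["data"]["event"]
# Top-level messages carry no thread_ts; their own ts starts the thread.
thread_ts = event.get("thread_ts") or event["ts"]
user_text = event["text"]

# Slack timestamps are unique per channel, so include the channel in the key.
config = {"configurable": {"thread_id": f"{event['channel']}:{thread_ts}"}}

# Connect to the external Postgres via the egress allowlist.
# from_conn_string() is a context manager in current
# langgraph-checkpoint-postgres releases; it closes the connection on exit.
with PostgresSaver.from_conn_string(os.environ["DATABASE_URL"]) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
    # Resume at this thread's checkpoint and continue.
    result = graph.invoke(
        {"messages": [{"role": "user", "content": user_text}]}, config=config
    )

# Depending on the state schema, the last message is a dict or a message object.
last = result["messages"][-1]
reply_text = last["content"] if isinstance(last, dict) else last.content

# Reply in the same Slack thread.
resp = requests.post(
    "https://slack.com/api/chat.postMessage",
    json={
        "channel": event["channel"],
        "thread_ts": thread_ts,
        "text": reply_text,
    },
    headers={"Authorization": "Bearer placeholder-injected-by-proxy"},
    timeout=10,
)
resp.raise_for_status()
if not resp.json().get("ok"):  # Slack reports API errors in the body with HTTP 200
    sys.exit(f"chat.postMessage failed: {resp.json().get('error')}")

# Acknowledge the webhook.
print(json.dumps({"ok": True}))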
```

The first time a thread sees a message, LangGraph creates a fresh
checkpoint row. Every subsequent message in the same thread resumes
from there. The sandbox doesn't need to know it's the second turn - 
the checkpointer handles continuity.

## CLRK pieces this pattern needs

- **An [auth proxy](./authenticate-users-before-agents)** in front
  of the CLRK ingress to verify Slack's request signature against
  your signing secret. Forward the verified event to your TaskAgent
  with `X-Clrk-TaskAgent` set.
- **[Egress allowlist](./lock-down-agent-egress)** for the two
  external destinations the agent needs: your Postgres (by hostname
  or CIDR) and `slack.com:443`. Deny everything else by default.
- **[Credential injection](./hide-credentials-from-agents)** for the
  Slack bot token and the LLM provider key the agent calls. The
  Postgres password is the exception: CLRK's credential injection
  swaps HTTP headers, not Postgres protocol auth, so the password
  has to reach the agent inside `DATABASE_URL`. Set that as a literal
  `spec.template.spec.env` value (same nesting as `image`/`command`)
  containing the full connection string, and rotate the password by
  re-applying the spec.
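
For orientation, the signature check the auth proxy performs looks roughly like the sketch below. It follows Slack's published `v0` signing scheme (HMAC-SHA256 over `v0:{timestamp}:{raw body}`, compared against the `X-Slack-Signature` header, with the timestamp taken from `X-Slack-Request-Timestamp`). The function name and the five-minute skew default are illustrative choices, and how the proxy is wired to CLRK is covered in the linked guide, not here.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
    max_skew_seconds: int = 300,
) -> bool:
    """Return True if `signature` matches the raw request body and is fresh."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale (or far-future) requests to limit replay attacks.
    if abs(now - sent_at) > max_skew_seconds:
        return False

    # Sign the raw bytes exactly as received; re-serialized JSON will not match.
    base = b"v0:" + timestamp.encode("utf-8") + b":" + body
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256)
    expected = "v0=" + digest.hexdigest()

    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Run this in the proxy, before the request reaches the sandbox, so the signing secret never has to be visible to the agent.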

## Concurrency caveat

Slack fires events as fast as messages arrive, and two messages in
the same thread within a second is normal. If one invocation is still
mid-turn and has not yet committed its checkpoint, a second concurrent
invocation for the same thread loads the stale state, and whichever
finishes last overwrites the other turn's update.

Two mitigations:

- **Use LangGraph's checkpointer transactions.** The Postgres
  checkpointer takes row-level locks around updates. Order your own
  side effects around that commit: database writes go inside the same
  transaction or after the checkpoint commits, and non-transactional
  effects such as the Slack message go after it, never before.
- **Cap concurrency.** Set `spec.maxConcurrent: 1` on the TaskAgent
  to serialize per-agent. Coarse but effective for low-throughput
  agents. For higher throughput, partition by thread ID at the
  ingress layer.
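
Partitioning by thread ID only works if the same thread always maps to the same lane. A minimal sketch of that mapping follows; `partition_for` is a hypothetical helper, and how each lane maps onto a TaskAgent (or any other serialized worker) is an assumption left to your ingress layer.

```python
import hashlib


def partition_for(thread_id: str, partitions: int) -> int:
    """Map a thread ID to a stable partition index in [0, partitions)."""
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    # Python's built-in hash() is salted per process, so it would route the
    # same thread differently on each ingress replica. SHA-256 is stable.
    digest = hashlib.sha256(thread_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

With each lane capped at one concurrent invocation, messages in a single thread are serialized while different threads still run in parallel. Changing the partition count remaps threads, so do it while no turns are in flight.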

## What does NOT fit this pattern

- **Hot streaming back to Slack.** The TaskAgent response model is
  request/response. If you need to stream tokens to a Slack message
  as the LLM generates them, you need a DaemonAgent + its own
  outbound Slack connection, plus a queue between the webhook and
  the daemon.
- **Multi-MB conversation contexts.** A 5 MB JSON blob round-tripped
  to Postgres on every turn will dominate your latency. Either prune
  the context as it grows (LangGraph supports message-window
  summarization) or move bulky artifacts to S3 and store pointers
  in Postgres.
- **Cross-cluster failover with shared state.** External Postgres
  can replicate across clusters; CLRK doesn't help or hinder that.
  Standard database patterns apply.
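
As a rough illustration of pruning, the sketch below keeps only the newest messages that fit a byte budget. It is a plain sliding window written without any library, not LangGraph's summarization support; `prune_messages` and the budget parameter are hypothetical names.

```python
import json


def prune_messages(messages: list[dict], max_bytes: int) -> list[dict]:
    """Keep the newest messages whose combined JSON size fits in max_bytes.

    The newest message is always kept, even if it alone exceeds the budget,
    so the current turn is never dropped.
    """
    kept: list[dict] = []
    used = 0
    # Walk from newest to oldest and stop once the budget is exhausted.
    for message in reversed(messages):
        size = len(json.dumps(message).encode("utf-8"))
        if kept and used + size > max_bytes:
            break
        kept.append(message)
        used += size
    kept.reverse()  # restore chronological order
    return kept
```

A window like this loses older context outright; summarizing the dropped messages, or storing them elsewhere and keeping a pointer, preserves more at the cost of extra work per turn.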

## Alternative stores

The pattern (look up state by key, update it, write it back under the
same key) works against whatever your team already runs:

- **Redis** when state is small and access is hot. Eviction makes it
  unsuitable as the only durable store; pair with Postgres as the
  authoritative copy.
- **S3 + versioning** when state is large, append-only, and you want
  an audit trail of every turn. Higher latency than Postgres.
- **DynamoDB / Spanner / Cosmos** for serverless / multi-region. Same
  pattern, different SDK.

LangGraph specifically ships checkpointers for Postgres and SQLite
in-tree; community checkpointers exist for Redis and others.
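
The Redis-plus-Postgres pairing above amounts to a write-through cache in front of an authoritative store. The sketch below shows the shape of it under stated assumptions: `StateStore`, `DictStore`, and `CachedStore` are hypothetical names, and in-memory dictionaries stand in for both backends so the example runs without any services.

```python
from typing import Optional, Protocol


class StateStore(Protocol):
    def load(self, key: str) -> Optional[dict]: ...
    def save(self, key: str, state: dict) -> None: ...


class DictStore:
    """In-memory stand-in for a real backend such as Redis or Postgres."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def load(self, key: str) -> Optional[dict]:
        return self._data.get(key)

    def save(self, key: str, state: dict) -> None:
        self._data[key] = dict(state)

    def evict(self, key: str) -> None:
        self._data.pop(key, None)  # simulates cache eviction


class CachedStore:
    """Serve reads from the cache; treat the authoritative store as truth."""

    def __init__(self, cache: StateStore, authoritative: StateStore) -> None:
        self.cache = cache
        self.authoritative = authoritative

    def load(self, key: str) -> Optional[dict]:
        state = self.cache.load(key)
        if state is None:
            # Cache miss (or eviction): fall back and repopulate the cache.
            state = self.authoritative.load(key)
            if state is not None:
                self.cache.save(key, state)
        return state

    def save(self, key: str, state: dict) -> None:
        # Durable write first, so a crash never leaves cache-only state.
        self.authoritative.save(key, state)
        self.cache.save(key, state)
```

Because the agent only sees `load` and `save`, swapping either backend changes the constructor arguments and nothing else.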

## Where to next

- Stand up the egress allowlist this pattern depends on - see [Lock
  down agent egress](./lock-down-agent-egress).
- Verify Slack's request signature before the request reaches CLRK -
  see [Authenticate users before
  agents](./authenticate-users-before-agents).
- Watch how this multi-turn flow looks in your observability
  backend - see [Trace requests through
  agents](./trace-requests-through-agents).
