Agent Lifecycle — Suspend, Resume & Recovery
Beyond the basic start / stop pair, Scion gives you finer control over an
agent’s lifecycle: you can suspend an agent and later resume it with its
harness conversation intact, recover an agent that crashed, and rely on the
Hub to auto-suspend agents that have stalled in order to reclaim resources.
This page is for power users driving agents from the CLI. For the conceptual model behind phases and activities, see Core Concepts: Agent State Model.
Stop vs. Suspend: two ways to wind down
Section titled “Stop vs. Suspend: two ways to wind down”Both stop and suspend tear down the agent’s container, but they record very
different intent:
scion stop | scion suspend | |
|---|---|---|
| Phase after | stopped | suspended |
Next start | Fresh harness session | Continues the previous conversation |
| Use when | The task is done, or you want a clean slate | You’ll come back and want the agent to pick up where it left off |
| Harness requirement | None | Harness must support session resume |
Suspend & Resume
Section titled “Suspend & Resume”Suspending an agent
Section titled “Suspending an agent”scion suspend <agent-name>This stops the agent’s container but marks its phase as suspended — a signal
that you intend to resume it later. Only a running agent can be suspended.
To suspend every running agent in the current project at once:
scion suspend --allSuspend requires a harness that supports session resume. If the agent’s harness
does not (for example, the generic harness), the command is rejected with an
error and you should use scion stop instead. When using --all, unsupported
agents are skipped rather than failing the whole batch.
Resuming an agent
Section titled “Resuming an agent”scion resume <agent-name> [task]resume re-launches the container and continues the prior harness conversation
by passing the harness-specific resume flag (--continue for Claude Code,
--resume for Gemini CLI, and so on). Any [task] arguments you supply are
appended to the resumed session as a new prompt, if the harness supports it.
| Flag | Description |
|---|---|
-a, --attach | Attach to the agent’s session immediately after resuming. |
start and resume are intent-aware
Section titled “start and resume are intent-aware”You do not have to remember which command to use — Scion looks at the agent’s saved phase and does the right thing:
scion starton a suspended agent performs an implicit resume: the harness session is continued, exactly as if you had runscion resume.scion resumeon a stopped agent starts a fresh session — there is no prior conversation to continue, so it falls back to a clean start.
In other words, the agent’s phase decides whether the session is continued or started fresh; the command name is just a hint.
Harness support
Section titled “Harness support”Session resume is a per-harness capability:
| Harness | Resume support |
|---|---|
| Claude Code | ✅ Yes (--continue) |
| Gemini CLI | ✅ Yes (--resume) |
| Generic | ❌ No — use stop/start |
Crash Recovery: the error phase
Section titled “Crash Recovery: the error phase”When an agent’s process or container exits non-zero — a real crash, an
out-of-memory kill, or a SIGKILL — the agent transitions to the error phase
with a descriptive message such as Agent crashed with exit code 137.
Scion is careful to distinguish a crash from an orderly shutdown. The harness
runs inside tmux, and sciontool recovers the real exit code when the session
ends, then classifies it:
| Outcome | Phase | Activity |
|---|---|---|
| Clean exit (code 0) | stopped | — |
| Limits reached (turns, model calls, or duration) | stopped | limits_exceeded |
Crash / OOM / SIGKILL (non-zero) | error | — (cleared) |
A crash surfaces as the error phase — the activity is cleared, and the
crash detail is carried in the agent’s message (e.g. Agent crashed with exit code 137). Two paths can set error: sciontool reports it from the recovered
exit code (the authoritative path), and the Hub also derives error from a
non-zero container exit reported in the broker heartbeat — which covers cases
where the container died before sciontool could report.
The error phase is restartable. Starting the agent again clears the error
and runs a fresh session:
scion start <agent-name>Because the crash discarded the previous run, this is a clean start rather than a session continuation.
Auto-Suspend of Stalled Agents
Section titled “Auto-Suspend of Stalled Agents”To reclaim resources from agents that are no longer making progress, the Hub can automatically suspend agents that have stalled.
An agent is marked stalled by the platform when its heartbeat is still being
received (the process is alive) but no activity events have arrived within the
stall threshold (default: 5 minutes). After it remains stalled for an
additional grace period (a further 5 minutes, so roughly 10 minutes of
inactivity in total), the Hub auto-suspends it — provided that:
- the agent’s harness supports session resume, and
- the container is still alive.
Auto-suspend uses the same machinery as a manual scion suspend, so the agent’s
phase becomes suspended and its harness session is preserved. The agent is
resumed automatically on the next message sent to it, continuing right where
it left off.
Runtime caveats
Section titled “Runtime caveats”Session continuation works only when the agent’s home directory — where the harness stores its conversation state — survives the container being reclaimed. Treat suspend/resume and auto-suspend session continuation as a Docker-proven capability, with this caveat for other runtimes:
- Docker — the proven path. The agent home is a host bind-mount that survives the container being reclaimed, so suspend/resume and auto-suspend continue the harness session with no additional configuration. The same holds for any setup with a persistent or NFS-backed home.
- Kubernetes / Cloud Run — these runtimes can have an ephemeral home. Without durable home persistence, resume restarts the container but the harness session may not continue. Durable home persistence (for example, object storage such as GCS) is future work, and these runtimes are gated on NFS-style persistence regardless — so do not assume full suspend/resume parity here yet.