
Running Scion on Kubernetes

Scion supports running agents as Pods in a Kubernetes cluster. This enables remote execution, resource management, and scaling beyond a single machine.

Prerequisites:

  • A running Kubernetes cluster (GKE, EKS, AKS, or self-managed).
  • kubectl configured with access to the target cluster.
  • Scion agent images available to the cluster (pushed to a container registry accessible by the cluster).
  • RBAC permissions as described in the Required Permissions section.

Use scion doctor to verify prerequisites before starting agents.

Configure the Kubernetes runtime in your global ~/.scion/settings.yaml:

runtimes:
  k8s:
    type: kubernetes
    context: my-cluster-context   # kubectl context (optional, defaults to current)
    namespace: scion-agents       # target namespace (default: "default")
    gke: false                    # enable GKE-specific features
    list_all_namespaces: false    # list agents across all namespaces
profiles:
  default:
    runtime: k8s

Per-agent or per-template Kubernetes settings go under the kubernetes key in ~/.scion/settings.yaml:

kubernetes:
  namespace: custom-namespace     # override runtime namespace
  context: alternate-context      # override runtime context
  serviceAccountName: agent-sa    # Workload Identity / IRSA
  runtimeClassName: gvisor        # sandboxed runtime (gVisor, Kata, etc.)
  imagePullPolicy: IfNotPresent   # Always, IfNotPresent, or Never
  nodeSelector:
    pool: agents
    accelerator: gpu
  tolerations:
    - key: dedicated
      operator: Equal
      value: agents
      effect: NoSchedule
  resources:
    requests:
      nvidia.com/gpu: "1"
    limits:
      nvidia.com/gpu: "1"

Standard compute resources use the common resources field:

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
  disk: "20Gi"   # maps to ephemeral-storage (both requests and limits)

Extended resources (GPUs, custom devices) use kubernetes.resources.
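To make the split concrete, here is a minimal sketch of an agent that requests both standard compute and a GPU. It assumes the common resources field and the kubernetes key sit side by side in the same settings block; that placement is an assumption and is not confirmed by the examples above.

```yaml
# Sketch only: assumes `resources` and `kubernetes` are siblings
# in the same per-agent or per-template settings block.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
  disk: "20Gi"              # becomes ephemeral-storage requests and limits
kubernetes:
  resources:                # extended resources only
    requests:
      nvidia.com/gpu: "1"
    limits:
      nvidia.com/gpu: "1"
```

Kubernetes requires requests and limits to be equal for extended resources such as nvidia.com/gpu, which is why both values are set to "1".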

When running in Google Kubernetes Engine (GKE), Scion natively supports Workload Identity for secure access to GCP APIs (like Vertex AI or Cloud Storage) without passing long-lived service account keys.

  1. Enable the gke: true flag in your runtime configuration.
  2. Ensure your cluster is configured with Workload Identity.
  3. Bind a Kubernetes Service Account to a Google Service Account.
  4. Set the serviceAccountName in the agent’s Kubernetes configuration to match the bound KSA.

This provides the agent container with an ambient identity, which the underlying harness (e.g., Gemini or Claude via Vertex) can automatically resolve using Application Default Credentials (ADC).
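As a rough illustration of step 3, the Kubernetes Service Account is usually linked to a Google Service Account through the iam.gke.io/gcp-service-account annotation. The manifest below is a sketch: agent-sa and scion-agents come from the examples in this guide, while the Google Service Account email and project ID are hypothetical placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: agent-sa                 # must match serviceAccountName in the agent config
  namespace: scion-agents
  annotations:
    # Hypothetical GSA email; replace with your own service account and project.
    iam.gke.io/gcp-service-account: scion-agent@my-project.iam.gserviceaccount.com
```

The annotation alone is not sufficient. The Google Service Account must also grant roles/iam.workloadIdentityUser to this Kubernetes Service Account, which is configured on the GCP side rather than in the cluster.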

| Volume Type | Status | Notes |
| --- | --- | --- |
| EmptyDir (workspace) | Supported | Default workspace volume, always created |
| GCS FUSE CSI | Supported | Requires gcsfuse.csi.storage.gke.io CSI driver; GKE only |
| Local/bind-mount | Not supported | Logged as warning, skipped. Use tar sync instead |
| PersistentVolumeClaim | Not supported | Future enhancement |

| Secret Mode | Status | Prerequisites |
| --- | --- | --- |
| Native K8s Secret | Supported (default) | Secret create/delete RBAC |
| GKE Secret Store CSI | Supported | gke: true, Secrets Store CSI Driver + GCP provider, SecretProviderClass CRD |
| ResolvedAuth files | Supported | Injected via K8s Secret volumes (not hostPath) |

Secrets are composable: ResolvedAuth and ResolvedSecrets are applied independently (not mutually exclusive).

| Sync Mode | Status | Notes |
| --- | --- | --- |
| Tar snapshot | Supported | Default. Full workspace snapshot via pods/exec streaming |
| GCS volume sync | Supported | For GCS-mounted volumes via gcloud storage rsync |

Tar sync retries transient errors (connection resets, broken pipes, timeouts) up to 3 times with exponential backoff (1s, 2s, 4s).

| Pod Feature | Status |
| --- | --- |
| Resource requests/limits | Supported |
| Extended resources (GPUs) | Supported |
| Ephemeral storage (disk) | Supported (requests + limits) |
| RuntimeClassName | Supported |
| ServiceAccountName | Supported |
| NodeSelector | Supported |
| Tolerations | Supported |
| ImagePullPolicy | Supported (Always, IfNotPresent, Never) |
| FSGroup security context | Supported (auto-set from host GID) |

| Namespace Feature | Status |
| --- | --- |
| Default namespace | Supported |
| Per-agent namespace | Supported (via config or labels) |
| Multi-namespace listing | Supported (list_all_namespaces: true) |
| Namespace/pod ID format | Supported (namespace/podname for all operations) |
| Namespace annotation | Supported (scion.namespace persisted on pod) |
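To show where the settings listed above end up, here is a sketch of the standard Pod spec fields they correspond to. It is illustrative only: the Pod that Scion actually generates may differ, and the pod name, container name, image, mount path, fsGroup value, and annotation value are all assumptions made for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-agent                  # hypothetical pod name
  namespace: scion-agents
  annotations:
    scion.namespace: scion-agents      # annotation key from the table; value assumed
spec:
  serviceAccountName: agent-sa
  runtimeClassName: gvisor
  nodeSelector:
    pool: agents
  tolerations:
    - key: dedicated
      operator: Equal
      value: agents
      effect: NoSchedule
  securityContext:
    fsGroup: 1000                      # assumed host GID
  containers:
    - name: agent                      # hypothetical container name
      image: registry.example.com/scion-agent:latest   # hypothetical image
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
          ephemeral-storage: "20Gi"    # from disk: "20Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "2"
          memory: "4Gi"
          ephemeral-storage: "20Gi"
          nvidia.com/gpu: "1"
      volumeMounts:
        - name: workspace
          mountPath: /workspace        # assumed mount path
  volumes:
    - name: workspace
      emptyDir: {}                     # default workspace volume
```

Comparing this shape against the output of kubectl get pod -o yaml for a running agent is a quick way to check that a setting was applied.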

The user or service account running scion needs the following RBAC permissions in the target namespace:

Core permissions:

| Resource | Verbs |
| --- | --- |
| pods | create, get, list, delete |
| pods/exec | create |
| pods/log | get |
| secrets | create, list, delete |

Additional permissions for GKE Secret Store CSI mode:

| Resource | Verbs |
| --- | --- |
| secretproviderclasses (secrets-store.csi.x-k8s.io) | create, list, delete |

Additional permissions for multi-namespace listing (list_all_namespaces: true):

| Resource | Verbs |
| --- | --- |
| namespaces | get, list |
| pods (cluster-wide) | list |

Example ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: scion-agent-manager
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/log"]
    verbs: ["create", "get"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "list", "delete"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list"]

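A ClusterRole grants nothing until it is bound to a subject. The sketch below adds a separate ClusterRole for the SecretProviderClass permissions and binds scion-agent-manager inside the scion-agents namespace. The binding name, the extra role name, and the user are hypothetical and should be replaced with your own.

```yaml
# Optional extra rule for GKE Secret Store CSI mode.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: scion-agent-manager-gke        # hypothetical name
rules:
  - apiGroups: ["secrets-store.csi.x-k8s.io"]
    resources: ["secretproviderclasses"]
    verbs: ["create", "list", "delete"]
---
# Grants the scion-agent-manager rules within one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scion-agent-manager-binding    # hypothetical name
  namespace: scion-agents
subjects:
  - kind: User
    name: scion-operator@example.com   # hypothetical user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: scion-agent-manager
  apiGroup: rbac.authorization.k8s.io
```

A RoleBinding that references a ClusterRole only grants access to namespaced resources in its own namespace. The namespaces rule and cluster-wide pod listing need a ClusterRoleBinding instead, and the GKE role above needs its own binding as well.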

The agent lifecycle on Kubernetes:

  1. Start: scion start creates a Pod with the configured image, resources, and secrets.
  2. Sync: Workspace and agent home are transferred to the Pod via tar streaming over pods/exec.
  3. Ready: Pod readiness is polled with detailed error classification (image pull, scheduling, config errors).
  4. Attach: scion attach connects to the tmux session inside the Pod via pods/exec.
  5. Sync back: scion sync from <agent> retrieves workspace changes via tar streaming.
  6. Delete: scion rm <agent> deletes the Pod and associated Secrets/SecretProviderClasses.

Run scion doctor to verify your Kubernetes runtime configuration:

scion doctor

This checks:

  • Cluster connectivity and authentication
  • Namespace existence and access
  • Pod CRUD and exec permissions
  • Secret management permissions
  • (GKE mode) SecretProviderClass CRD availability
  • (GKE mode) Secrets Store CSI driver installation
  • (GKE mode) GCS FUSE CSI driver installation

Use scion doctor --format json for machine-readable output.

The Kubernetes runtime provides structured error messages with remediation hints:

| Error | Remediation |
| --- | --- |
| ImagePullBackOff / ErrImagePull | Verify image name and registry access; check imagePullPolicy |
| InvalidImageName | Check image name format |
| CreateContainerConfigError | Check secret references and volume mounts |
| CrashLoopBackOff | Check container logs with scion logs |
| Unschedulable | Check node selectors, tolerations, and resource availability |
| Invalid resource values | Error includes the field name and invalid value |

Limitations:

  • Workspace sync uses tar snapshots (not a live filesystem). Changes require an explicit scion sync.
  • Local/bind-mount volumes are not supported on remote clusters.
  • Pod networking depends on cluster CNI configuration.
  • Authentication credentials must be propagated via Secrets or Workload Identity.