Summary
When a dstack run provisions a new EC2 instance (no idle instance available in the fleet), the volume attachment fails because dstack attempts to call the AWS AttachVolume API before the instance reaches the running state. The AWS API returns Client.IncorrectState - Instance is not 'running', which is not handled by the volume attachment code, causing the job to immediately fail with VOLUME_ERROR (Failed to attach volume).
Environment
- dstack version: 0.20.11 (server and client)
- Backend: AWS (eu-central-1)
- Instance types tested: g4dn.xlarge, g4dn.2xlarge, g6.xlarge
- Volume: EBS gp3, 500GB, eu-central-1a
Steps to Reproduce
1. Create a volume in eu-central-1:

   type: volume
   name: my-data-volume
   backend: aws
   region: eu-central-1
   size: 500GB
2. Create a fleet with nodes: 0..2 (a minimum of zero, so no idle instance is guaranteed):

   type: fleet
   name: my-fleet
   nodes: 0..2
   idle_duration: 1h
   resources:
     gpu: 0..2
3. Ensure no idle instances exist in the fleet (wait for the idle timeout or start fresh).
4. Submit a dev-environment run that references the volume:

   type: dev-environment
   name: my-vscode-dev
   fleets:
     - my-fleet
   ide: vscode
   image: dstackai/dind
   privileged: true
   resources:
     disk:
       size: 150GB
   volumes:
     - name: my-data-volume
       path: /workspace/data
   backends:
     - aws
   regions:
     - eu-central-1
   instance_types:
     - g4dn.xlarge
5. The run fails immediately after instance creation with a volume error.
Actual behaviour
The job transitions through these states in rapid succession (~10 seconds total):
SUBMITTED -> PROVISIONING -> TERMINATING (VOLUME_ERROR: Failed to attach volume) -> FAILED
`job_runtime_data.volume_names` is `[]` (empty), confirming the volume was never attached.
AWS CloudTrail shows the AttachVolume API call failing:
Error: Client.IncorrectState - Instance 'i-0fe3b3b890bcee277' is not 'running'.
Timeline from a real failure:
10:25:42 - Job submitted
10:25:48 - AWS AttachVolume API call -> FAILS (instance not running)
10:25:52 - Job marked as failed
Inspecting the run shows the instance was correctly provisioned in eu-central-1a (the same AZ as the volume) and the volume is available (not attached elsewhere), yet the attachment fails every time.
Retrying does not help because each retry creates a new instance and hits the same race condition.
Expected behaviour
dstack should wait for the EC2 instance to reach the running state before attempting to attach the EBS volume, or retry the AttachVolume call when receiving Client.IncorrectState.
When an idle instance already exists in the fleet (already in running state), the same volume and configuration work without any issues.
dstack version
0.20.11
Server logs
dstack event log for the failing run:
[2026-02-24 10:25:42] [run markus-bauer-vscode-dev] Run submitted. Status: SUBMITTED
[2026-02-24 10:25:48] [job markus-bauer-vscode-dev-0-0] Job status changed SUBMITTED -> PROVISIONING
[2026-02-24 10:25:48] [job markus-bauer-vscode-dev-0-0, instance maximilian-zenk-dev-1] Instance created for job
[2026-02-24 10:25:48] [job markus-bauer-vscode-dev-0-0] Job status changed PROVISIONING -> TERMINATING. Termination reason: VOLUME_ERROR (Failed to attach volume)
[2026-02-24 10:25:52] [job markus-bauer-vscode-dev-0-0] Job status changed TERMINATING -> FAILED
[2026-02-24 10:25:52] [run markus-bauer-vscode-dev] Run status changed SUBMITTED -> TERMINATING. Termination reason: JOB_FAILED
AWS CloudTrail `AttachVolume` events (all failed with the same error):
Time: 2026-02-24T10:25:48 Instance: i-0fe3b3b890bcee277 Volume: vol-0dbfe3ed743a31ee9
Error: Client.IncorrectState - Instance 'i-0fe3b3b890bcee277' is not 'running'.
Time: 2026-02-24T10:24:08 Instance: i-034ea6a1a4140200c Volume: vol-0dbfe3ed743a31ee9
Error: Client.IncorrectState - Instance 'i-034ea6a1a4140200c' is not 'running'.
Time: 2026-02-24T10:29:13 Instance: i-0e1db3e0658ed6b62 Volume: vol-0dbfe3ed743a31ee9
Error: Client.IncorrectState - Instance 'i-0e1db3e0658ed6b62' is not 'running'.
Time: 2026-02-24T10:31:37 Instance: i-0279c74e3b848f081 Volume: vol-0dbfe3ed743a31ee9
Error: Client.IncorrectState - Instance 'i-0279c74e3b848f081' is not 'running'.
Additional information
Root cause in code
The issue is in dstack/_internal/core/backends/aws/compute.py in the attach_volume method. The method calls ec2_client.attach_volume() but only handles these specific ClientError codes:
- VolumeInUse
- InvalidVolume.ZoneMismatch
- InvalidVolume.NotFound
- InvalidParameterValue (for device-name conflicts)
It does not handle Client.IncorrectState, so the error propagates up as an unhandled botocore.exceptions.ClientError, which is caught by the generic exception handler in _attach_volumes() (submitted_jobs.py) and terminates the job.
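The dispatch described above can be sketched as follows. This is a paraphrase of the behaviour, not dstack's actual source; `classify_attach_error` and the message strings are hypothetical, only the error-code names come from the report:

```python
# Illustrative sketch of the handling described above -- NOT dstack's code.
# Known AttachVolume error codes get specific treatment; anything else
# (including IncorrectState) falls through as an unhandled error.
HANDLED_ATTACH_ERRORS = {
    "VolumeInUse": "volume is already attached to another instance",
    "InvalidVolume.ZoneMismatch": "volume is in a different availability zone",
    "InvalidVolume.NotFound": "volume does not exist",
    "InvalidParameterValue": "device name conflict",
}


def classify_attach_error(code: str) -> str:
    """Map a known AttachVolume error code to a message, or raise."""
    try:
        return HANDLED_ATTACH_ERRORS[code]
    except KeyError:
        # This is the path the bug takes: the error propagates up and the
        # generic handler upstream terminates the job with VOLUME_ERROR.
        raise RuntimeError(f"unhandled AttachVolume error: {code}")
```

This makes the gap visible: `IncorrectState` is a transient condition (the instance is still booting), but it is treated the same as a genuinely fatal, unknown error.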
Suggested fix
Either:
1. Add a waiter or polling loop in attach_volume to wait for the instance to reach the running state before calling ec2_client.attach_volume():

   ec2_client.get_waiter('instance_running').wait(InstanceIds=[instance_id])

2. Or handle Client.IncorrectState with a retry and backoff in the attach_volume method.
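The retry option could be sketched like this. This is a hypothetical helper, not the actual dstack implementation; note that botocore surfaces the error code in `e.response["Error"]["Code"]` as `IncorrectState`, without the `Client.` prefix seen in CloudTrail:

```python
import time


def attach_volume_with_retry(ec2_client, volume_id, instance_id, device,
                             max_attempts=6, base_delay=2.0):
    """Attach an EBS volume, retrying while the instance is still booting.

    Hypothetical sketch of the suggested fix; ec2_client is a boto3 EC2
    client (or anything with a compatible attach_volume method).
    """
    for attempt in range(max_attempts):
        try:
            return ec2_client.attach_volume(
                VolumeId=volume_id, InstanceId=instance_id, Device=device
            )
        except Exception as e:
            # botocore's ClientError carries the AWS error code in
            # e.response["Error"]["Code"]; duck-type it so this sketch
            # does not require botocore to be importable.
            code = getattr(e, "response", {}).get("Error", {}).get("Code")
            if code != "IncorrectState" or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

Combining both suggestions would likely be most robust: wait with `get_waiter('instance_running')` first, and keep the retry as a safety net against any remaining race between the waiter and the attach call.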