Summary
When a dstack run provisions a new EC2 instance (no idle instance available in the fleet), the volume attachment fails because dstack attempts to call the AWS AttachVolume API before the instance reaches the running state. The AWS API returns Client.IncorrectState - Instance is not 'running', which is not handled by the volume attachment code, causing the job to immediately fail with VOLUME_ERROR (Failed to attach volume).
Environment
- dstack version: 0.20.11 (server and client)
- Backend: AWS (eu-central-1)
- Instance types tested: g4dn.xlarge, g4dn.2xlarge, g6.xlarge
- Volume: EBS gp3, 500GB, eu-central-1a
Steps to Reproduce
1. Create a volume in eu-central-1:

   type: volume
   name: my-data-volume
   backend: aws
   region: eu-central-1
   size: 500GB
2. Create a fleet with nodes: 0..2 (a minimum of zero, so no idle instance is guaranteed):

   type: fleet
   name: my-fleet
   nodes: 0..2
   idle_duration: 1h
   resources:
     gpu: 0..2
3. Ensure no idle instances exist in the fleet (wait for the idle timeout or start fresh).
4. Submit a dev-environment run that references the volume:

   type: dev-environment
   name: my-vscode-dev
   fleets:
     - my-fleet
   ide: vscode
   image: dstackai/dind
   privileged: true
   resources:
     disk:
       size: 150GB
   volumes:
     - name: my-data-volume
       path: /workspace/data
   backends:
     - aws
   regions:
     - eu-central-1
   instance_types:
     - g4dn.xlarge
5. The run fails immediately after instance creation with a volume error.
Actual behaviour
The job transitions through these states in rapid succession (~10 seconds total):
SUBMITTED -> PROVISIONING -> TERMINATING (VOLUME_ERROR: Failed to attach volume) -> FAILED
`job_runtime_data.volume_names` is `[]` (empty), confirming the volume was never attached.
AWS CloudTrail shows the AttachVolume API call failing:
Error: Client.IncorrectState - Instance 'i-0fe3b3b890bcee277' is not 'running'.
Timeline from a real failure:
10:25:42 - Job submitted
10:25:48 - AWS AttachVolume API call -> FAILS (instance not running)
10:25:52 - Job marked as failed
Inspecting the run shows the instance was correctly provisioned in eu-central-1a (the same AZ as the volume) and the volume is available (not attached elsewhere), yet the attachment fails every time.
Retrying does not help because each retry creates a new instance and hits the same race condition.
Expected behaviour
dstack should wait for the EC2 instance to reach the running state before attempting to attach the EBS volume, or retry the AttachVolume call when receiving Client.IncorrectState.
When an idle instance already exists in the fleet (already in running state), the same volume and configuration work without any issues.
dstack version
0.20.11
Server logs
dstack event log for the failing run:
[2026-02-24 10:25:42] [run markus-bauer-vscode-dev] Run submitted. Status: SUBMITTED
[2026-02-24 10:25:48] [job markus-bauer-vscode-dev-0-0] Job status changed SUBMITTED -> PROVISIONING
[2026-02-24 10:25:48] [job markus-bauer-vscode-dev-0-0, instance maximilian-zenk-dev-1] Instance created for job
[2026-02-24 10:25:48] [job markus-bauer-vscode-dev-0-0] Job status changed PROVISIONING -> TERMINATING. Termination reason: VOLUME_ERROR (Failed to attach volume)
[2026-02-24 10:25:52] [job markus-bauer-vscode-dev-0-0] Job status changed TERMINATING -> FAILED
[2026-02-24 10:25:52] [run markus-bauer-vscode-dev] Run status changed SUBMITTED -> TERMINATING. Termination reason: JOB_FAILED
AWS CloudTrail `AttachVolume` events (all failed with the same error):
Time: 2026-02-24T10:25:48 Instance: i-0fe3b3b890bcee277 Volume: vol-0dbfe3ed743a31ee9
Error: Client.IncorrectState - Instance 'i-0fe3b3b890bcee277' is not 'running'.
Time: 2026-02-24T10:24:08 Instance: i-034ea6a1a4140200c Volume: vol-0dbfe3ed743a31ee9
Error: Client.IncorrectState - Instance 'i-034ea6a1a4140200c' is not 'running'.
Time: 2026-02-24T10:29:13 Instance: i-0e1db3e0658ed6b62 Volume: vol-0dbfe3ed743a31ee9
Error: Client.IncorrectState - Instance 'i-0e1db3e0658ed6b62' is not 'running'.
Time: 2026-02-24T10:31:37 Instance: i-0279c74e3b848f081 Volume: vol-0dbfe3ed743a31ee9
Error: Client.IncorrectState - Instance 'i-0279c74e3b848f081' is not 'running'.
Additional information
Root cause in code
The issue is in dstack/_internal/core/backends/aws/compute.py in the attach_volume method. The method calls ec2_client.attach_volume() but only handles these specific ClientError codes:
- VolumeInUse
- InvalidVolume.ZoneMismatch
- InvalidVolume.NotFound
- InvalidParameterValue (for device-name conflicts)
It does not handle Client.IncorrectState, so the error propagates up as an unhandled botocore.exceptions.ClientError, which is caught by the generic exception handler in _attach_volumes() (submitted_jobs.py) and terminates the job.
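The dispatch described above can be sketched as follows. This is a paraphrase of the behaviour, not dstack's actual source; `classify_attach_error` and the message strings are hypothetical, only the error-code names come from the report:

```python
# Illustrative sketch of the handling described above -- NOT dstack's code.
# Known AttachVolume error codes get specific treatment; anything else
# (including IncorrectState) falls through as an unhandled error.
HANDLED_ATTACH_ERRORS = {
    "VolumeInUse": "volume is already attached to another instance",
    "InvalidVolume.ZoneMismatch": "volume is in a different availability zone",
    "InvalidVolume.NotFound": "volume does not exist",
    "InvalidParameterValue": "device name conflict",
}


def classify_attach_error(code: str) -> str:
    """Map a known AttachVolume error code to a message, or raise."""
    try:
        return HANDLED_ATTACH_ERRORS[code]
    except KeyError:
        # This is the path the bug takes: the error propagates up and the
        # generic handler upstream terminates the job with VOLUME_ERROR.
        raise RuntimeError(f"unhandled AttachVolume error: {code}")
```

This makes the gap visible: `IncorrectState` is a transient condition (the instance is still booting), but it is treated the same as a genuinely fatal, unknown error.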
Suggested fix
Either:
1. Add a waiter or polling loop in attach_volume to wait for the instance to reach the running state before calling ec2_client.attach_volume():

   ec2_client.get_waiter('instance_running').wait(InstanceIds=[instance_id])

2. Or handle Client.IncorrectState with a retry and backoff in the attach_volume method.
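The retry option could be sketched like this. This is a hypothetical helper, not the actual dstack implementation; note that botocore surfaces the error code in `e.response["Error"]["Code"]` as `IncorrectState`, without the `Client.` prefix seen in CloudTrail:

```python
import time


def attach_volume_with_retry(ec2_client, volume_id, instance_id, device,
                             max_attempts=6, base_delay=2.0):
    """Attach an EBS volume, retrying while the instance is still booting.

    Hypothetical sketch of the suggested fix; ec2_client is a boto3 EC2
    client (or anything with a compatible attach_volume method).
    """
    for attempt in range(max_attempts):
        try:
            return ec2_client.attach_volume(
                VolumeId=volume_id, InstanceId=instance_id, Device=device
            )
        except Exception as e:
            # botocore's ClientError carries the AWS error code in
            # e.response["Error"]["Code"]; duck-type it so this sketch
            # does not require botocore to be importable.
            code = getattr(e, "response", {}).get("Error", {}).get("Code")
            if code != "IncorrectState" or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

Combining both suggestions would likely be most robust: wait with `get_waiter('instance_running')` first, and keep the retry as a safety net against any remaining race between the waiter and the attach call.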