I have configured an ECS service running on a g4dn.xlarge instance, which has a single GPU. Inside the task definition I specify the container definition's resource requirement to use one GPU, as follows:
"resourceRequirements": [
{
"type":"GPU",
"value": "1"
}
]
Running one task and one container on this instance works fine. When I set the service's desired task count to 2, I receive an event on the service that states:
service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance has insufficient GPU resource available.
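For reference, you can see what the scheduler is working with by inspecting the resources the container instance registers and what remains after placement; a quick check along these lines (the cluster name is a placeholder for your own) shows a single GPU registered and already consumed by the first task:
# Cluster name below is a placeholder; substitute your own.
CLUSTER=my-cluster
ARN=$(aws ecs list-container-instances --cluster "$CLUSTER" --query 'containerInstanceArns[0]' --output text)
# Compare the registered resources with what is left for new tasks
aws ecs describe-container-instances --cluster "$CLUSTER" --container-instances "$ARN" \
    --query 'containerInstances[].{registered:registeredResources,remaining:remainingResources}'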
According to the AWS docs:
Amazon ECS will schedule to available GPU-enabled container instances and pin physical GPUs to proper containers for optimal performance.
Is there any way to override this default behavior and force ECS to allow multiple containers to share a single GPU?
I don't believe we will run into performance issues with sharing, as we plan to use each container for H.264 encoding (NVENC), which does not use CUDA. If anyone can direct me to documentation concerning the performance of CUDA workloads in containers sharing a GPU, that would also be appreciated.
Deploy containers on the node: you can deploy up to one container per multi-instance GPU device on the node. In this example, with a partition size of 1g.5gb, there are seven multi-instance GPU partitions available on the node. As a result, you can deploy up to seven containers that request GPUs on this node.
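As a minimal sketch of what such a deployment looks like, each pod simply requests one GPU resource and the scheduler places it on one of the seven partitions. The pod name, image tag and the GKE partition-size node selector below are illustrative, not taken from the original answer:
# Each pod requests one nvidia.com/gpu, which maps to one 1g.5gb MIG partition.
# Pod name, image tag and node selector label are illustrative.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF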
You can run both containers using different host ports, and use haproxy/nginx/varnish (native or inside another container) listening on the host port and redirecting to the right container based on the URL. This is as much a question about the way TCP ports work as the way Docker works.
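A rough sketch of that setup, assuming both services listen on port 80 inside their containers (the image name is a placeholder):
# Two containers publishing the same internal port on different host ports
# ("app-image" is a placeholder). A proxy on the host, or in a third container,
# then routes requests to 8080 or 8081 based on the URL.
docker run -d --name app1 -p 8080:80 app-image
docker run -d --name app2 -p 8081:80 app-image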
You can connect multiple containers using user-defined networks and shared volumes. The container's main process is responsible for managing all processes that it starts.
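As a sketch (all names are placeholders), a user-defined bridge network lets the containers reach each other by container name, and a named volume gives them shared storage:
# Containers on the same user-defined network can resolve each other by name;
# the named volume "shared-data" is mounted into both. Names are placeholders.
docker network create appnet
docker volume create shared-data
docker run -d --name producer --network appnet -v shared-data:/data app-image
docker run -d --name consumer --network appnet -v shared-data:/data app-image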
Using an NVIDIA GPU inside a Docker container requires you to add the NVIDIA Container Toolkit to the host. This integrates the NVIDIA drivers with your container runtime. Calling docker run with the --gpus flag makes your hardware visible to the container.
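For example, once the toolkit is installed on the host (the CUDA image tag here is only an illustration):
# Expose all host GPUs to the container and confirm they are visible
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi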
GPU sharing across multiple containers is not supported at the moment, and it is unlikely to be supported anytime soon. You would need to have each virtual machine be a separate Kubernetes node, each with a separate GPU.
We give multiple GPUs to a pod and the pod runs Triton, which does the sharing. You have to define which GPU each model runs on, otherwise Triton runs the model on every GPU it can see (annoying behaviour). Triton is supposed to be incorporated into KFServing, which sits on Knative, but I haven't tried that.
To use multi-instance GPUs, you perform the following tasks (a verification example follows the list):
1. Create a cluster with multi-instance GPUs enabled.
2. Install drivers and configure GPU partitions.
3. Verify how many GPU resources are on the node.
4. Deploy containers on the node.
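For step 3, you can check how many nvidia.com/gpu resources each node advertises to the scheduler; with 1g.5gb partitions on a single A100 you should see seven:
# Shows nvidia.com/gpu under the node's Capacity, Allocatable and Allocated resources
kubectl describe nodes | grep -i "nvidia.com/gpu"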
The trick is to enable the NVIDIA Docker runtime by default for all containers, if that is suitable for your use case.
Based on the Amazon AMI amazon/amzn2-ami-ecs-gpu-hvm-2.0.20200218-x86_64-ebs, connect to the instance and add the configuration below:
sudo cat <<"EOF" > /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/etc/docker-runtimes.d/nvidia"
}
}
}
EOF
# Ask dockerd to reload its configuration, then check the log for reload errors
sudo pkill -SIGHUP dockerd
tail -10 /var/log/messages
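You can verify the reload took effect before baking the AMI (a quick check, not part of the original steps; the CUDA image tag is only an example):
# The default runtime should now report as "nvidia", and a plain container
# started without the --gpus flag should still see the GPU
docker info | grep -i "default runtime"
docker run --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi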
Create a new AMI from this instance, and don't specify any GPU resourceRequirements in the container definition.
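Baking the AMI can be done from the console or the CLI, for example (the instance ID and image name are placeholders):
# Create a new AMI from the modified instance
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "ecs-gpu-shared-runtime"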