AWS Batch: limit the number of containers on a single host

Question

I have some containers running GPU TensorFlow jobs. If two or more of them execute simultaneously on a single host, only one succeeds; the containers cannot share the GPU properly. The others fail with:

2018-05-11 13:02:19.147869: E tensorflow/core/common_runtime/direct_session.cc:171] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_ECC_UNCORRECTABLE

The ideal scenario would be the following: I have 10 GPU jobs and a maximum of 5 containers. The first 5 are executed while the other 5 wait (at the moment, they don't wait but try to execute and fail). When one finishes, the 6th immediately starts on the same host, then the 7th, 8th, 9th, and 10th.

I use p2.xlarge instances and set 4 vCPUs and 42000 MiB of memory for the GPU job. According to ec2instances.info, this machine has 61.0 GiB of memory and 4 vCPUs. Nevertheless, Batch seems to schedule several containers simultaneously, leading to the failure described above.
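To make the sizing argument concrete, here is a minimal sketch of the arithmetic, assuming the scheduler packs jobs purely by their requested vCPUs and memory. The function name `jobs_that_fit` is hypothetical, and the calculation ignores any memory the host reserves for the OS or the ECS agent.

```python
GIB_IN_MIB = 1024  # 1 GiB expressed in MiB


def jobs_that_fit(instance_vcpus, instance_memory_mib, job_vcpus, job_memory_mib):
    """Return how many copies of a job fit on one instance by request alone.

    Hypothetical helper: it only compares requested resources against the
    nominal instance size and does not model reserved host memory.
    """
    by_cpu = instance_vcpus // job_vcpus
    by_memory = instance_memory_mib // job_memory_mib
    return min(by_cpu, by_memory)


# p2.xlarge figures quoted in the question: 4 vCPUs, 61.0 GiB
p2_vcpus = 4
p2_memory_mib = 61 * GIB_IN_MIB

# The request from the question: 4 vCPUs, 42000 MiB
print(jobs_that_fit(p2_vcpus, p2_memory_mib, 4, 42000))  # 1

# A smaller request, for comparison: 2 vCPUs, 20000 MiB
print(jobs_that_fit(p2_vcpus, p2_memory_mib, 2, 20000))  # 2
```

By these numbers the request in the question should already exclude a second container, so seeing several run at once suggests the values in effect differ from the ones configured.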

So far I have tried adjusting the vCPU and memory parameters, but Batch's scheduler seems to ignore them.

Interestingly, the relevant ECS task definition shows 1/-- as the value for Hard/Soft memory limits (MiB), so it looks like the values from the Batch 'job definition' are not propagated to the ECS 'task definition'.

Another alternative is to set a very large number of attempts, but:

it's ugly
for long-running jobs, even a large number might get exhausted
I lose protection against forever-running jobs (e.g. misconfigured ones)
I'm not sure how that kind of interruption would affect already running TensorFlow jobs

Aswin · Accepted Answer

What are the vCPU and memory requirements of your jobs, and what instance types are in your compute environment?

If you update the vCPU and memory of your jobs so that only one job can fit on an instance, Batch will schedule your jobs one after the other and will not try to run two jobs at the same time.

For example, if your compute environment has p3.16xlarge (64 vCPUs, 488 GiB) instances and you want to ensure that only one job runs on an instance at a time, make sure that the job specifies vCPU > 32 and memory > 244 GiB, i.e. more than half of the instance's resources, so that two jobs cannot fit together.
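The "more than half" rule can be sketched as a small calculation. This is an illustration only: `exclusive_request` is a hypothetical helper, and the `vcpus` and `memory` keys are assumed to correspond to the fields of the same name in a Batch job definition's container properties, with memory in MiB. Strictly, exceeding half in either dimension is enough to prevent two jobs from fitting; the sketch sets both, as the answer suggests. It also assumes the full nominal instance memory is schedulable, which may not hold in practice.

```python
GIB_IN_MIB = 1024  # 1 GiB expressed in MiB


def exclusive_request(instance_vcpus, instance_memory_mib):
    """Smallest integer request that exceeds half of each instance resource.

    Hypothetical helper: two jobs with this request cannot both fit on one
    instance, because their combined demand exceeds the instance size.
    """
    return {
        "vcpus": instance_vcpus // 2 + 1,
        "memory": instance_memory_mib // 2 + 1,  # MiB
    }


# p3.16xlarge figures from the answer: 64 vCPUs, 488 GiB
request = exclusive_request(64, 488 * GIB_IN_MIB)
print(request)  # {'vcpus': 33, 'memory': 249857}

# Sanity check: two such jobs exceed the instance in both dimensions
print(2 * request["vcpus"] > 64)                     # True
print(2 * request["memory"] > 488 * GIB_IN_MIB)      # True
```

Any request between these minimums and the instance's actually available resources has the same effect; requesting more than is available would leave the job unable to be placed.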

Tags:

amazon-web-services

tensorflow

aws-batch

dveim
