
AWS Batch: limit the number of containers on a single host

I have some containers running GPU TensorFlow jobs. If two or more of them run simultaneously on a single host, only one succeeds, i.e. they cannot share the GPU properly. The others fail with:

2018-05-11 13:02:19.147869: E tensorflow/core/common_runtime/direct_session.cc:171] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_ECC_UNCORRECTABLE

The ideal scenario would be as follows: I have 10 GPU jobs and at most 5 containers. The first 5 jobs run while the other 5 wait (at the moment they don't wait; they try to run and fail). When one finishes, the 6th immediately starts on the same host, then the 7th, 8th, 9th, and 10th.

I use p2.xlarge instances and have set 4 vCPUs and 42000 MiB of memory for the GPU job. According to ec2instances.info, this instance type has 61.0 GiB of memory and 4 vCPUs. Nevertheless, Batch seems to schedule several containers simultaneously, leading to the failure described above.

So far I have tried adjusting the vCPU and memory parameters, but the Batch scheduler seems to ignore them.

Interestingly, the relevant ECS task definition shows 1/-- as the value for Hard/Soft memory limits (MiB), so it looks like the values from the Batch job definition are not propagated to the ECS task definition.

Another alternative is to set a very large number of attempts, but:

  • it's ugly
  • for long-running jobs, even a large number might get exhausted
  • I lose protection against jobs that run forever (e.g. misconfigured ones)
  • I'm not sure how that kind of interruption would affect TensorFlow jobs that are already running
dveim asked Dec 01 '25 06:12



1 Answer

What are the vCPU and memory requirements of your jobs, and what instance types are in your compute environment?

If you update the vCPUs and memory of your jobs so that only one job fits on an instance, Batch will schedule your jobs one after the other and will not try to run two jobs at the same time.

For example, if your compute environment has p3.16xlarge instances (64 vCPUs, 488 GiB) and you want to ensure that only one job runs on an instance at a time, make sure the job specifies more than 32 vCPUs and more than 244 GiB of memory.
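
The sizing rule can be checked with a little arithmetic. The following is a minimal sketch, not anything Batch itself exposes: the helper names `jobs_per_instance` and `exclusive_request` are hypothetical, and the calculation uses the nominal instance figures only. The memory actually available to containers on an instance is typically somewhat lower than the nominal value, so treat the results as an upper bound on how many jobs fit.

```python
GIB = 1024  # MiB per GiB; Batch job memory is specified in MiB


def jobs_per_instance(instance_vcpus, instance_mem_mib, job_vcpus, job_mem_mib):
    """Number of copies of a job that fit on one instance (nominal figures)."""
    return min(instance_vcpus // job_vcpus, instance_mem_mib // job_mem_mib)


def exclusive_request(instance_vcpus, instance_mem_mib):
    """Smallest (vcpus, memory_mib) request that is strictly above half
    of the instance, so that two such jobs can never fit together."""
    return instance_vcpus // 2 + 1, instance_mem_mib // 2 + 1


# p3.16xlarge from the example above: 64 vCPUs, 488 GiB.
print(exclusive_request(64, 488 * GIB))
# Exactly half of the instance still lets two jobs fit:
print(jobs_per_instance(64, 488 * GIB, 32, 244 * GIB))
# Just over half leaves room for only one:
print(jobs_per_instance(64, 488 * GIB, 33, 244 * GIB + 1))
# The question's setup: p2.xlarge (4 vCPUs, 61 GiB), job with 4 vCPUs, 42000 MiB.
print(jobs_per_instance(4, 61 * GIB, 4, 42000))
```

By this arithmetic the question's job definition should already allow only one job per p2.xlarge, so it is worth confirming that the compute environment contains only that instance type and that the job is actually submitted with those values.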

Aswin answered Dec 04 '25 12:12



