Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Memory-intense jobs scaling poorly on multi-core cloud instances (ec2, gce, rackspace)?

Has anyone else noticed terrible performance when scaling up to use all the cores on a cloud instance with somewhat memory intense jobs (2.5GB in my case)?

When I run jobs locally on my quad xeon chip, the difference between using 1 core and all 4 cores is about a 25% slowdown with all cores. This is to be expected from what I understand; a drop in clock rate as the cores get used up is part of the multi-core chip design.

But when I run the jobs on a multicore virtual instance, I am seeing a slowdown of like 2x - 4x in processing time between using 1 core and all cores. I've seen this on GCE, EC2, and Rackspace instances. And I have tested many difference instance types, mostly the fastest offered.

So has this behavior been seen by others with jobs about the same size in memory usage?

The jobs I am running are written in fortran. I did not write them, and I'm not really a fortran guy so my knowledge of them is limited. I know they have low I/O needs. They appear to be CPU-bound when I watch top as they run. They run without the need to communicate with each other, ie., embarrasingly parallel. They each take about 2.5GB in memory.

So my best guess so far is that jobs that use up this much memory take a big hit by the virtualization layer's memory management. It could also be that my jobs are competing for an I/O resource, but this seems highly unlikely according to an expert.

My workaround for now is to use GCE because they have single-core instance that actually runs the jobs as fast as my laptop's chip, and are priced almost proportionally by core.

like image 257
Jose Cortez Avatar asked Nov 13 '22 03:11

Jose Cortez


1 Answers

You might be running into memory bandwidth constraints, depending on your data access pattern.

The linux perf tool might give some insight into this, though I'll admit that I don't entirely understand your description of the problem. If I understand correctly:

  1. Running one copy of the single-threaded program on your laptop takes X minutes to complete.

  2. Running 4 copies of the single-threaded program on your laptop, each copy takes X * 1.25 minutes to complete.

  3. Running one copy of the single-threaded program on various cloud instances takes X minutes to complete.

  4. Running N copies of the single-threaded program on an N-core virtual cloud instances, each copy takes X * 2-4 minutes to complete.

If so, it sounds like you're either running into a kernel contention or contention for e.g. memory I/O. It would be interesting to see whether various fortran compiler options might help optimize memory access patterns; for example, enabling SSE2 load/store intrinsics or other optimizations. You might also compare results with gcc and intel's fortran compilers.

like image 187
E. Anderson Avatar answered Nov 15 '22 09:11

E. Anderson