
Speed variation between vCPUs on the same Amazon EC2 instance

I'm exploring the feasibility of running numerical computations on Amazon EC2. I currently have one c4.8xlarge instance running. It has 36 vCPUs, each of which is a hyperthread of a Haswell Xeon chip. The instance runs Ubuntu in HVM mode.

I have a GCC-optimised binary of a completely sequential (i.e. single-threaded) program. I launched 30 copies of it, each pinned to its own vCPU:

for i in `seq 0 29`; do
    nohup taskset -c $i $BINARY_PATH &> $i.out &
done

The 30 processes run almost identical calculations. There's very little disk activity (a few megabytes every 5 minutes), and there's no network activity or interprocess communication whatsoever. htop reports that all processes run constantly at 100%.

The whole thing has been running for about 4 hours at this point. Six processes (12-17) have already done their task, while processes 0, 20, 24 and 29 look as if they will require another 4 hours to complete. Other processes fall somewhere in between.

My questions are:

  1. Other than resource contention with other users, is there anything else that may be causing the significant variation in performance between the vCPUs within the same instance? As it stands, the instance would be rather unsuitable for any OpenMP or MPI jobs that synchronise between threads/ranks.
  2. Is there anything I can do to achieve a more uniform (hopefully higher) performance across the cores? I have basically excluded hyperthreading as a culprit here since the six "fast" processes are hyperthreads on the same physical cores. Perhaps there's some NUMA-related issue?
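One quick check for hypervisor-level contention is per-vCPU steal time. On Linux this is the 9th numeric field on each cpuN line of /proc/stat (the same quantity top shows as %st); uneven, growing steal values on specific vCPUs would point at the hypervisor starving those vCPUs rather than anything inside the instance. A minimal sketch:

```shell
# Print cumulative steal jiffies per vCPU (9th field after the name on each
# "cpuN" line of /proc/stat). Markedly uneven values across vCPUs suggest
# hypervisor-level contention on those particular vCPUs.
awk '/^cpu[0-9]/ { print $1, "steal =", $9 }' /proc/stat
```

Sampling this twice a few seconds apart and diffing gives the steal rate per vCPU over that interval.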
Asked Feb 19 '26 by Saran Tunyasuvunakool

1 Answer

My experience is on c3 instances. It's likely similar with c4.

For example, take a c3.2xlarge instance with 8 vCPUs (most of the explanation below is derived from a direct discussion with AWS support).

In fact, only the first 4 vCPUs map to distinct physical cores and are usable for heavy scientific calculations; the last 4 vCPUs are their hyperthread siblings. For scientific applications hyperthreading is often counterproductive: it causes extra context switching and halves the cache (and associated memory bandwidth) available per thread.

To find out the exact mapping between the vCPUs and the physical cores, look into /proc/cpuinfo:

  • "physical id" : the physical processor (socket) id; only one processor in a c3.2xlarge
  • "processor" : the vCPU number
  • "core id" : the physical core that each vCPU maps back to

If you put this in a table, you have:

 physical_id   processor    core_id
 0             0            0
 0             1            1
 0             2            2
 0             3            3
 0             4            0
 0             5            1
 0             6            2
 0             7            3
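The table above can be produced directly with a short awk pass over /proc/cpuinfo (a sketch; the field names are exactly as they appear in the file on x86 Linux):

```shell
# Emit one "processor -> physical_id, core_id" line per vCPU.
# Each per-CPU block in /proc/cpuinfo lists "processor", then "physical id",
# then "core id", so we print once the "core id" line is reached.
awk -F':' '
    /^processor/   { gsub(/[ \t]/, "", $2); cpu  = $2 }
    /^physical id/ { gsub(/[ \t]/, "", $2); phys = $2 }
    /^core id/     { gsub(/[ \t]/, "", $2);
                     printf "processor %s -> physical_id %s, core_id %s\n", cpu, phys, $2 }
' /proc/cpuinfo
```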

You can also get this from thread_siblings_list, the kernel's internal map of the hardware threads that share a core with cpuX (https://www.kernel.org/doc/Documentation/cputopology.txt):

cat /sys/devices/system/cpu/cpuX/topology/thread_siblings_list

When hyperthreading is enabled, each vCPU ("processor") is a logical core, and two logical cores are associated with each physical core.
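To see the pairing for every vCPU at once rather than one cpuX at a time, you can loop over sysfs (a sketch):

```shell
# For each CPU directory under sysfs, print the CPU name and its
# hyperthread siblings. vCPUs appearing on the same line share a
# physical core.
for d in /sys/devices/system/cpu/cpu[0-9]*; do
    printf '%s: %s\n' "${d##*/}" "$(cat "$d/topology/thread_siblings_list")"
done
```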

So, in your case, one solution is to disable hyperthreading with :

echo 0 > /sys/devices/system/cpu/cpuX/online

where X, for a c3.2xlarge, would be 4 through 7.
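Scripted for all four sibling vCPUs (needs root; the 4-7 range is specific to the c3.2xlarge and should be read off your own thread_siblings_list output first):

```shell
# Take the hyperthread siblings (vCPUs 4-7 on a c3.2xlarge) offline.
# Writing 0 to a CPU's "online" file removes it from the scheduler.
for i in 4 5 6 7; do
    echo 0 | sudo tee /sys/devices/system/cpu/cpu"$i"/online > /dev/null
done
```

Writing 1 back to the same files brings the vCPUs online again.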

EDIT: you can observe this behaviour only in HVM instances. In PV instances, this topology is hidden by the hypervisor: all core ids and processor ids in /proc/cpuinfo are '0'.

Answered Feb 22 '26 by Olivier Delrieu


