Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Application runs faster when not pegged to cores

I have an application with 4 threads, which I'm running on an 8-core Intel Process with CentOS. When I run two instances of it, the two instances combined run faster than 1 instance. Moreover, for two instances, they run faster when I do not bind each thread to a different core and left scheduling to the scheduler.

Going with that logic, perhaps, one instance is also running slow because scheduler had pinned all the 4 threads to 4 separate cores. When I pin each thread of 2 instances, that is, 8 threads pinned to 8 cores, the two instances combined also take approximately the same time as the one instance.

The application is lock intensive, as there are frequent shared memory acesses. Also note that the CPU overhead remains 0% otherwise, so there are almost no other processes consuming cycles.

Now my question is what is going on here? One explanation that comes to my mind is that when threads are pinned to different cores, they try to acquire the mutexes almost at the same time, and as a result, high contention occurs, but when they are scheduled by the scheduler, they might try to acquire mutexes at slightly different times, thus decreasing contention.

Any opinions! This is really weird, because left to the scheduler, one instance runs slower than two instances combined. How do I compile my benchmarking results. No on would believe me!

Processor Info as given by /proc/cpuinfo is the following.

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 1
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 16
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.15
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5319.96
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 1
siblings    : 8
core id     : 1
cpu cores   : 4
apicid      : 18
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.04
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 1
cpu cores   : 4
apicid      : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.05
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 4
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 1
siblings    : 8
core id     : 2
cpu cores   : 4
apicid      : 20
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.04
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 5
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 2
cpu cores   : 4
apicid      : 4
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.05
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 6
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 1
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 22
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.03
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 6
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.07
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 8
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 1
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 17
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.01
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 9
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.07
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 10
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 1
siblings    : 8
core id     : 1
cpu cores   : 4
apicid      : 19
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.04
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 11
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 1
cpu cores   : 4
apicid      : 3
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.00
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 12
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 1
siblings    : 8
core id     : 2
cpu cores   : 4
apicid      : 21
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.03
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 13
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 2
cpu cores   : 4
apicid      : 5
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.05
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 14
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 1
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 23
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.02
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]

processor   : 15
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping    : 5
cpu MHz     : 2660.076
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 7
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5320.04
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]
like image 489
MetallicPriest Avatar asked Sep 01 '25 10:09

MetallicPriest


2 Answers

One explanation is cache and memory architecture; and data locality.

According to Intel's spec finder, for your X5550 CPU there's 4 cores (and 8 logical CPUs) per physical chip. If you've got 16 logical CPUs then you've got 2 physical chips (which is entirely possible for this Xeon). More research reveals that each core has it's own L1 and L2 caches that are shared by both logical CPUs in that core; and each physical chip has 8 MiB of L3 cache that is shared by all 4 cores (8 logical CPUs) on that chip.

If logical CPUs 0 and 1 share an L2 cache and logical CPUs 2 and 3 share a separate L2 cache, then 2 threads modifying the same memory may run better on logical CPUs 0 and 1 (rather than logical CPUs 0 and 2) because the data remains in the "exclusive" state on one cache (and isn't bouncing between different L2 caches).

In the same way, if the first eight logical CPUs share an L3 cache and the other group of eight logical CPUs share an L3 cache, then 8 threads modifying the same memory may run better when restricted to one group of logical CPUs (and not spread across separate physical chips with separate L3 caches).

The other thing is that with a pair of "Xeon X550" CPUs it's extremely likely that your computer is ccNUMA. Two separate memory banks, where one memory bank is directly connected to the first physical CPU and the other memory bank directly connected to the second physical CPU. In this case, when a CPU tries to access RAM that isn't directly connected to that CPU, it needs to ask the other CPU (where the RAM is connected) to fetch it (rather than being able to fetch it itself), and this has a performance penalty (roughly "15% slower").

Linux does have special support ccNUMA systems. It will try to limit each process (and all its threads) to a specific group of CPUs and allocate memory that is directly connected to those CPUs; in an attempt to avoid that "roughly 15% slower" performance penalty.

If you combine all of this, a process with 8 threads that is CPU bound and is continually fighting for the same locks (modifying cache lines where those locks are) will probably run better on the first 8 logical CPUs or the last 8 logical CPUs; and there will be a relatively high performance problem (maybe as bad as "30% slower" once you combined the NUMA penalty and the cache bounce problem) if you tried to spread the process' threads evenly across logical CPUs (e.g. CPUs 0, 2, 4, 6, 8, 10, etc).

like image 198
Brendan Avatar answered Sep 03 '25 04:09

Brendan


Is your CPU Hyper-Threading enabled?

Try to bind processes to even-numbered cores: CPU0, CPU2, CPU6, CPU8 as it was suggested by Agner fox (there is a citation in Linux find out Hyper-threaded core id )

If you binded 4 threads to CPU0- CPU3, you will bind them to 2 physical cores and there will be contention for resources (HT-thread is not as fast physical thread IF there is another HT-thread on the same CPU).

In linux, HT cores are numbered in pairs: so CPU0-CPU1 are two HT-halves of first physical core, CPU2-CPU3 - of second and so on.

If you get a slowdown from using only even-numbered CPUs, I conclude you have a lot of syncronization between threads. HT-threads can be synchronized fasted than threads of different physical cores.

like image 22
osgx Avatar answered Sep 03 '25 04:09

osgx