
64-bit Linux performance issue with memset

I'm debugging an application that is running quite a bit slower when built as a 64-bit Linux ELF executable than as a 32-bit Linux ELF executable. Using Rational (IBM) Quantify, I tracked much of the performance difference down to (drum roll...) memset. Oddly, memset is taking a lot longer in the 64-bit executable.

I am even able to see this with a small, simple application:

#include <stdlib.h>
#include <string.h>

#define BUFFER_LENGTH 8000000

int main()
{
  unsigned char* buffer = malloc(BUFFER_LENGTH * sizeof(unsigned char));
  for(int i = 0; i < 10000; i++)
    memset(buffer, 0, BUFFER_LENGTH * sizeof(unsigned char));
}

I build like this:
$ gcc -m32 -std=gnu99 -g -O3 ms.c
and
$ gcc -m64 -std=gnu99 -g -O3 ms.c

The wall-clock time as reported by time is longer for the -m64 build and Quantify confirms that the extra time is being spent in memset.

So far I've tested in VirtualBox and VMware (but not on bare-metal Linux; I realize I need to do that next). The amount of extra time spent varies somewhat from one system to the next.

What's going on here? Is there a well-known issue that my Google-fu is not able to uncover?

EDIT: The disassembly (gcc ... -S) on my system shows that memset is being invoked as an external function:

32-bit:

.LBB2:
    .loc 1 14 0
    movl    $8000000, 8(%esp)
    .loc 1 12 0
    addl    $1, %ebx
    .loc 1 14 0
    movl    $0, 4(%esp)
    movl    %esi, (%esp)
    call    memset

64-bit:

.LBB2:
    .loc 1 14 0
    xorl    %esi, %esi
    movl    $8000000, %edx
    movq    %rbp, %rdi
.LVL1:
    .loc 1 12 0
    addl    $1, %ebx
    .loc 1 14 0
    call    memset

System:

  • CentOS 5.7 2.6.18-274.17.1.el5 x86_64
  • GCC 4.1.2
  • Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz / VirtualBox
    (discrepancy is worse on a Xeon E5620 @ 2.40GHz / VMware)

David Citron asked Jan 23 '12 07:01


2 Answers

I believe that virtualization is the culprit: I have been running some benchmarks of my own (random number generation in bulk, sequential searches; also 64-bit) and found that the code runs ~2x slower within Linux in VirtualBox than natively under Windows. The funny thing is, the code does no I/O (except a simple printf now and then, in between timings) and uses little memory (all data fits into L1 cache), so one would think that page table management and TLB overheads could be ruled out.

This is mysterious indeed. I have noticed that VirtualBox reports to the VM that SSE 4.1 and SSE 4.2 instructions are not supported, even though the CPU supports them, and the program using them runs fine(!) in a VM. I have no time to investigate the issue further, but you REALLY should time it on a real machine. Unfortunately, my program won't run on 32 bits, so I couldn't test for slowdown in 32-bit mode.

zvrba answered Nov 10 '22 09:11


I can confirm that on my non-virtualized Mandriva Linux system the x86_64 version is slightly (about 7%) slower. In both cases the memset() library function is called, regardless of the instruction set word size.

A casual look at the assembly code of both library implementations reveals that the x86_64 version is significantly more complex. I assume that this has mostly to do with the fact that the 32-bit version has to deal with only 4 possible alignment cases, versus the 8 possible alignment cases of the 64-bit version. It also seems that the x86_64 memset() loop has been more extensively unrolled, perhaps due to different compiler optimizations.

Another factor that could account for the slower operations is the increased I/O load associated with the use of a 64-bit word size. Both code and metadata (pointers etc.) generally get larger in 64-bit applications.

Also, keep in mind that the library implementations included in most distributions are targeted to whatever CPU the maintainers consider to be the current lowest common denominator for each processor family. This may leave the 64-bit processors at a disadvantage, since the 32-bit instruction set has been stable for some time now.

thkala answered Nov 10 '22 09:11