
C: memcpy speed on dynamically allocated arrays

I need help with the performance of the following code. It performs a memcpy between two dynamically allocated arrays of arbitrary size:

#include <stdio.h>    /* printf (used by the timing helpers) */
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* memcpy */
#include <strings.h>  /* bzero */

void tic(void);            /* timing helpers, defined elsewhere */
void toc(const char *);

int main()
{
  double *a, *b;
  unsigned n = 10000000, i;
  a = malloc(n*sizeof(double));
  b = malloc(n*sizeof(double));
  for(i=0; i<n; i++) {
    a[i] = 1.0;
    /* b[i] = 0.0; */
  }

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero1");

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero2");

  tic();
  memcpy(b, a, n*sizeof(double));
  toc("memcpy");

  free(a);
  free(b);
  return 0;
}

tic/toc measure the execution time.
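
For reference, a minimal pair of such helpers, sketched here with POSIX clock_gettime (the actual implementation may differ), could look like:

#include <stdio.h>
#include <time.h>

static struct timespec t0;

/* record the start time */
void tic(void)
{
    clock_gettime(CLOCK_MONOTONIC, &t0);
}

/* print the label and the elapsed time in seconds */
void toc(const char *label)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%s %f\n", label,
           (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}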

On my computer the memcpy takes 0.035 s (Linux, gcc 4.4.6). If I now uncomment the line that initializes the destination array b, the code is three times faster (!) - 0.011 s.

I have observed similar behavior when using a loop instead of memcpy. Usually I do not care about this, since it is enough to 'initialize' the memory before using it. Now, however, I need to perform a simple memory copy and do it as fast as possible. Initializing the data first would mean writing e.g. zeros to the memory, which is not necessary and takes time. I would like to perform the memory copy with all available memory bandwidth.

Is there a solution to this problem? Or is it connected to the way Linux handles dynamic memory (some sort of lazy page allocation?) and cannot be worked around? What about other systems?

Edit: The same results are obtained with gcc 4.6. I used -O3 to compile.

Edit: Thank you all for your comments. I do understand that memory mapping takes time. I guess I just have a hard time accepting that it takes so long, much longer than the actual memory access. The code has been modified to include a benchmark of the initialization of array b using two subsequent bzero calls. The timings now show

bzero1 0.273981
bzero2 0.056803
memcpy 0.117934

Clearly, the first bzero call does much more than just stream zeros to memory - it includes the memory mapping and the memory zeroing. The second bzero call, on the other hand, takes half the time required to do a memcpy, which is exactly as expected - write-only time vs. read-and-write time. I understand that the zeroing overhead of the first call has to be there for OS security reasons. What about the rest? Can I not decrease it somehow, e.g. by using larger memory pages? Different kernel settings?
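
(For scale: the buffer is 10^7 x 8 bytes = 80 MB, so bzero2 corresponds to about 80 MB / 0.0568 s ≈ 1.4 GB/s of write bandwidth, while memcpy moves 2 x 80 MB / 0.1179 s ≈ 1.36 GB/s of combined read+write traffic - consistent with the factor-of-two difference.)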

I should mention that I am running this on Ubuntu wheezy.

asked Sep 03 '12 by angainor

3 Answers

It's probably lazy page allocation: Linux only maps the pages on first access. IIRC, each page in a newly allocated block on Linux is a copy-on-write mapping of a single zeroed page, and your allocations are big enough to demand new blocks.

If you want to work around it, you could write one byte or word at 4 KB intervals (sketched below). That might get the virtual addresses mapped to RAM slightly faster than writing the whole of each page.
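
A minimal sketch of that pre-faulting pass, assuming the b and n from the question, and querying the page size at runtime instead of hard-coding 4 KB:

#include <unistd.h>   /* sysconf */

/* Touch one byte per page so the kernel maps every page of b
   before the timed copy. */
long page = sysconf(_SC_PAGESIZE);
char *p = (char *)b;
size_t bytes = n * sizeof(double), off;
for (off = 0; off < bytes; off += (size_t)page)
    p[off] = 0;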

I wouldn't expect (the most efficient workaround to force the lazy memory mapping to happen) plus (the copy) to be significantly faster than just (the copy) without the initialization of b, though. So unless there's a specific reason why you care about the performance of the copy alone, rather than of the whole operation, it's fairly futile. It's "pay now or pay later": Linux pays later, and you're only measuring the time for later.

answered Nov 15 '22 by Steve Jessop

Surely if you are comparing the speed of initialise-and-copy to the speed of just copy, then the initialisation should be included in the timed section. It appears to me you should actually be comparing this:

// Version 1
for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

memcpy(b, a, n*sizeof(double));

toc();

To this:

// Version 2
for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

for(i=0; i<n; i++)
    b[i] = 0.0;
memcpy(b, a, n*sizeof(double));

toc();

I expect this will see your 3x speed improvement drop sharply.

EDIT: As suggested by Steve Jessop, you may also want to test a third strategy of only touching one entry per page:

// Version 3
// (DOUBLES_PER_PAGE would be page size / sizeof(double),
//  e.g. 4096 / 8 = 512 on a 4 KB-page system)
for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

for(i=0; i<n; i+=DOUBLES_PER_PAGE)
    b[i] = 0.0;
memcpy(b, a, n*sizeof(double));

toc();

answered Nov 15 '22 by verdesmarald


The first bzero runs longer because of (1) lazy page allocation and (2) lazy page zero-initialization by the kernel. While the second is unavoidable for security reasons, lazy page allocation can be optimized by using larger ("huge") pages.

There are at least two ways to use huge pages on Linux. The hard way is hugetlbfs (a sketch follows below). The easy way is transparent huge pages.
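
For the hard way, one option is an anonymous mmap with MAP_HUGETLB. This is only a sketch: it assumes huge pages were reserved in advance (e.g. via /proc/sys/vm/nr_hugepages), and MAP_HUGETLB requires kernel 2.6.32 or newer:

#include <sys/mman.h>

/* Back the buffer with explicitly reserved huge pages;
   fails if none have been reserved. */
b = mmap(NULL, n * sizeof(double), PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (b == MAP_FAILED) {
    /* no huge pages available - fall back to malloc */
}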

Search for khugepaged in the list of processes on your system. If such a process exists, transparent huge pages are supported, and you can use them in your application if you change the malloc to this:

/* needs <stdlib.h> and <sys/mman.h> */
posix_memalign((void **)&b, 2*1024*1024, n*sizeof(double)); /* align to the 2 MB huge-page size */
madvise((void *)b, n*sizeof(double), MADV_HUGEPAGE);        /* ask the kernel to use huge pages */
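
One way to check that huge pages are actually being used is to watch the AnonHugePages counter in /proc/meminfo while the program runs.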
answered Nov 15 '22 by Evgeny Kluev