
Performance: memset

I have a simple C program that does this (pseudo code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 100000000                      /* buffer size in bytes (~95 MB) */

int *DataSrc  = (int *) malloc(N);
int *DataDest = (int *) malloc(N);

memset(DataSrc, 0, N);                   /* touch the source buffer once */

for (int i = 0; i < 4; i++) {
    StartTimer();                        /* timing routines elided */
    memcpy(DataDest, DataSrc, N);
    StopTimer();
}

printf("%d\n", DataDest[RandomInteger]); /* use the result so the copies aren't optimized away */

My PC: Intel Core i7-3930, with 4x4GB DDR3 1600 memory running RedHat 6.1 64-bit.

The first memcpy() runs at 1.9 GB/s, while the next three run at 6.2 GB/s. The buffer size (N = 100,000,000 bytes, roughly 95 MB) is far too big for this to be caused by cache effects. So, my first question:

  • Why is the first memcpy() so much slower? Maybe malloc() doesn't fully allocate the memory until you use it?

If I eliminate the memset(), then the first memcpy() runs at about 1.5 GB/s, but the next three run at 11.8 GB/s. Almost a 2x speedup. My second question:

  • Why is memcpy() 2x faster if I don't call memset()?
asked May 18 '14 by JB_User


2 Answers

As others have already pointed out, Linux uses an optimistic (lazy) memory allocation strategy: malloc() reserves address space, but physical pages are only assigned when they are first touched.

The difference between the first and the following memcpy calls is the initialization of DataDest.

As you have already seen, when you eliminate memset(DataSrc, 0, N), the first memcpy is even slower, because the pages for the source buffer must be allocated as well. When you initialize both DataSrc and DataDest, e.g.

memset(DataSrc, 0, N);
memset(DataDest, 0, N);

all memcpy calls will run at roughly the same speed.

For the second question: when you initialize the allocated memory with memset up front, each buffer's pages are faulted in one buffer at a time, so they tend to be laid out consecutively. On the other hand, when the memory is allocated lazily while you copy, the source and destination pages are allocated interleaved, which might make the difference.
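
A minimal way to verify this without relying on memset is to fault in every page of both buffers before the timed loop. The harness below is only a sketch, not the original test program; the prefault() and seconds() helpers are hypothetical names, and it assumes a POSIX system (sysconf(_SC_PAGESIZE), clock_gettime(CLOCK_MONOTONIC)).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define N 100000000  /* ~95 MB, same size as in the question */

/* Touch one byte per page so the kernel backs the whole range
   with physical memory before we start measuring. */
static void prefault(char *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < len; i += (size_t) page)
        buf[i] = 0;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    char *src = malloc(N);
    char *dst = malloc(N);
    if (!src || !dst) return 1;

    prefault(src, N);   /* fault in the source pages */
    prefault(dst, N);   /* fault in the destination pages */

    for (int i = 0; i < 4; i++) {
        double t0 = seconds();
        memcpy(dst, src, N);
        double t1 = seconds();
        printf("pass %d: %.2f GB/s\n", i, N / (t1 - t0) / 1e9);
    }

    /* Use the data so the copies are not optimized away. */
    printf("%d\n", dst[N / 2]);
    return 0;
}

With both buffers already backed by physical pages, all four passes should report roughly the same bandwidth.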

answered Oct 13 '22 by Olaf Dietsche


This is most likely due to lazy allocation in your VM subsystem. Typically, when you allocate a large amount of memory, only the first few pages are actually allocated and wired to physical memory. When you access beyond those pages, page faults are generated and further pages are allocated and wired in on demand.
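
One way to observe this demand paging is to count minor page faults around a first and second pass over a freshly allocated buffer. This is only an illustrative sketch (the minor_faults() helper is a made-up name); it assumes a POSIX system with getrusage().

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

#define N 100000000

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    char *buf = malloc(N);
    if (!buf) return 1;

    long before = minor_faults();
    memset(buf, 0, N);              /* first touch: pages are faulted in here */
    long after_first = minor_faults();
    memset(buf, 1, N);              /* second touch: pages already resident */
    long after_second = minor_faults();

    printf("faults on first pass:  %ld\n", after_first - before);
    printf("faults on second pass: %ld\n", after_second - after_first);
    return 0;
}

On a system with 4 KB pages, the first pass should report on the order of N/4096 minor faults, while the second pass should report close to none.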

As to the second part of the question, I believe some VM implementations actually track zeroed pages and handle them specially. Try initialising DataSrc to actual (e.g. random) values and repeat the test.
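
A sketch of that experiment, assuming the same buffer size as in the question: fill the source with pseudo-random bytes instead of zeros, then repeat the timed copies. On Linux, anonymous pages that have only ever been read are all backed by a single shared zero page, so non-zero data rules out any zero-page special casing.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 100000000

int main(void)
{
    char *src = malloc(N);
    char *dst = malloc(N);
    if (!src || !dst) return 1;

    /* Fill the source with non-zero, pseudo-random bytes instead of memset(..., 0, ...). */
    srand(12345);
    for (size_t i = 0; i < (size_t) N; i++)
        src[i] = (char) rand();

    for (int i = 0; i < 4; i++)
        memcpy(dst, src, N);        /* time these passes as in the question */

    printf("%d\n", dst[N / 2]);     /* keep the work from being optimized away */
    return 0;
}

If the fast passes drop back toward the zero-initialized speed, zero-page handling is plausibly part of the answer; if not, lazy allocation alone explains the difference.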

answered Oct 13 '22 by Paul R