I want to measure memory bandwidth using memcpy
. I modified the code from this answer:why vectorizing the loop does not have performance improvement which used memset
to measure the bandwidth. The problem is that memcpy
is only slighly slower than memset
when I expect it to be about two times slower since it operations on twice the memory.
More specifically, I run over 1 GB arrays a
and b
(allocated will calloc
) 100 times with the following operations.
operation time(s)
-----------------------------
memset(a,0xff,LEN) 3.7
memcpy(a,b,LEN) 3.9
a[j] += b[j] 9.4
memcpy(a,b,LEN) 3.8
Notice that memcpy
is only slightly slower then memset
. The operations a[j] += b[j]
(where j
goes over [0,LEN)
) should take three times longer than memcpy
because it operates on three times as much data. However it's only about 2.5 as slow as memset
.
Then I initialized b
to zero with memset(b,0,LEN)
and test again:
operation time(s)
-----------------------------
memcpy(a,b,LEN) 8.2
a[j] += b[j] 11.5
Now we see that memcpy
is about twice as slow as memset
and a[j] += b[j]
is about thrice as slow as memset
like I expect.
At the very least I would have expected that before memset(b,0,LEN)
that memcpy
would be slower because the of lazy allocation (first touch) on the first of the 100 iterations.
Why do I only get the time I expect after memset(b,0,LEN)
?
test.c
#include <time.h>
#include <string.h>
#include <stdio.h>
void tests(char *a, char *b, const int LEN){
clock_t time0, time1;
time0 = clock();
for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
time0 = clock();
for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
time0 = clock();
for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
time0 = clock();
for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
memset(b,0,LEN);
time0 = clock();
for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
time0 = clock();
for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}
main.c
#include <stdlib.h>
int tests(char *a, char *b, const int LEN);
int main(void) {
const int LEN = 1 << 30; // 1GB
char *a = (char*)calloc(LEN,1);
char *b = (char*)calloc(LEN,1);
tests(a, b, LEN);
}
Compile with (gcc 6.2) gcc -O3 test.c main.c
. Clang 3.8 gives essentially the same result.
Test system: [email protected] (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN)
i.e. I only see a problem on my Skylake system.
I first discovered this issue from the a[j] += b[k]
operations in this answer which was overestimating the bandwidth.
I came up with a simpler test
#include <time.h>
#include <string.h>
#include <stdio.h>
void __attribute__ ((noinline)) foo(char *a, char *b, const int LEN) {
for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}
void tests(char *a, char *b, const int LEN) {
foo(a, b, LEN);
memset(b,0,LEN);
foo(a, b, LEN);
}
This outputs.
9.472976
12.728426
However, if I do memset(b,1,LEN)
in main after calloc
(see below) then it outputs
12.5
12.5
This leads me to to think this is a OS allocation issue and not a compiler issue.
#include <stdlib.h>
int tests(char *a, char *b, const int LEN);
int main(void) {
const int LEN = 1 << 30; // 1GB
char *a = (char*)calloc(LEN,1);
char *b = (char*)calloc(LEN,1);
//GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
memset(b,1,LEN);
tests(a, b, LEN);
}
The point is that malloc
and calloc
on most platforms don't allocate memory; they allocate address space.
malloc
etc work by:
calloc
: the equivalent ofmemset(ptr, 0, size)
is issuedFor systems with demand paging (COW) (an MMU could help here), the second options winds downto:
/dev/zero
This will consume no physical memory, except only for the Page Tables.
/dev/zero
. The /dev/zero
device is a very special device, in this case mapped to every page of the new memory.Your b
array probably was not written after mmap
-ing (huge allocation requests with malloc/calloc are usually converted into mmap
). And whole array was mmaped to single read-only "zero page" (part of COW mechanism). Reading zeroes from single page is faster than reading from many pages, as single page will be kept in the cache and in TLB. This explains why test before memset(0) was faster:
This outputs. 9.472976 12.728426
However, if I do
memset(b,1,LEN)
in main aftercalloc
(see below) then it outputs: 12.5 12.5
And more about gcc's malloc+memset / calloc+memset optimization into calloc (expanded from my comment)
//GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
This optimization was proposed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742 (tree-optimization PR57742) at 2013-06-27 by Marc Glisse (https://stackoverflow.com/users/1918193?) as planned for 4.9/5.0 version of GCC:
memset(malloc(n),0,n) -> calloc(n,1)
calloc can sometimes be significantly faster than malloc+bzero because it has special knowledge that some memory is already zero. When other optimizations simplify some code to malloc+memset(0), it would thus be nice to replace it with calloc. Sadly, I don't think there is a way to do a similar optimization in C++ with new, which is where such code most easily appears (creating std::vector(10000) for instance). And there would also be the complication there that the size of the memset would be a bit smaller than that of the malloc (using calloc would still be fine, but it gets harder to know if it is an improvement).
Implemented at 2014-06-24 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742#c15) - https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=211956 (also https://patchwork.ozlabs.org/patch/325357/)
- tree-ssa-strlen.c ... (handle_builtin_malloc, handle_builtin_memset): New functions.
The current code in gcc/tree-ssa-strlen.c
https://github.com/gcc-mirror/gcc/blob/7a31ada4c400351a35ab65f8dc0357e7c88805d5/gcc/tree-ssa-strlen.c#L1889 - if memset(0)
get pointer from malloc
or calloc
, it will convert malloc
into calloc
and then memset(0)
will be removed:
/* Handle a call to memset.
After a call to calloc, memset(,0,) is unnecessary.
memset(malloc(n),0,n) is calloc(n,1). */
static bool
handle_builtin_memset (gimple_stmt_iterator *gsi)
...
if (code1 == BUILT_IN_CALLOC)
/* Not touching stmt1 */ ;
else if (code1 == BUILT_IN_MALLOC
&& operand_equal_p (gimple_call_arg (stmt1, 0), size, 0))
{
gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
update_gimple_call (&gsi1, builtin_decl_implicit (BUILT_IN_CALLOC), 2,
size, build_one_cst (size_type_node));
si1->length = build_int_cst (size_type_node, 0);
si1->stmt = gsi_stmt (gsi1);
}
This was discussed in gcc-patches mailing list in Mar 1, 2014 - Jul 15, 2014 with subject "calloc = malloc + memset"
with notable comment from Andi Kleen (http://halobates.de/blog/, https://github.com/andikleen): https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01818.html
FWIW i believe the transformation will break a large variety of micro benchmarks.
calloc
internally knows that memory fresh from the OS is zeroed. But the memory may not be faulted in yet.
memset
always faults in the memory.So if you have some test like
buf = malloc(...) memset(buf, ...) start = get_time(); ... do something with buf end = get_time()
Now the times will be completely off because the measured times includes the page faults.
Marc replied "Good point. I guess working around compiler optimizations is part of the game for micro benchmarks, and their authors would be disappointed if the compiler didn't mess it up regularly in new and entertaining ways ;-)" and Andi asked: "I would prefer to not do it. I'm not sure it has a lot of benefit. If you want to keep it please make sure there is an easy way to turn it off."
Marc shows how to turn this optimization off: https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01834.html
Any of these flags works:
-fdisable-tree-strlen
-fno-builtin-malloc
-fno-builtin-memset
(assuming you wrote 'memset' explicitly in your code)-fno-builtin
-ffreestanding
-O1
-Os
In the code, you can hide that the pointer passed to
memset
is the one returned bymalloc
by storing it in avolatile
variable, or any other trick to hide from the compiler that we are doingmemset(malloc(n),0,n)
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With