Why is memset slow?

Question

The spec for my CPU says it should get 5.336GB/s bandwidth to memory. To test this, I wrote a simple program that runs memset (or memcpy) on a big array and reports the timing. I'm showing 3.8GB/s on memset and 1.9GB/s on memcpy. http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) says my Q9400 should be getting 5.336MB/s. What's wrong?

I've tried replacing memset or memcpy with assignment loops. I've googled around to try to learn about memory alignment. I've tried different compiler flags. I've spent an embarrassing number of hours on this. Thanks for any help you can provide!

I'm using Ubuntu 12.04 with libc-dev version 2.15-0ubuntu10.5 and kernel 3.8.0-37-generic

The code:

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))

int main(int argc,char**argv){
    if(argc!=3){
        printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER
",argv[0]);
        return -1;
    }
    char*__restrict__ source=(char*)malloc(numBytes);
    char*__restrict__ dest=(char*)malloc(numBytes);
    struct timespec start,end;
    long totalTimeMs;
    int i;

    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memset(source,0,numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memcpy( dest, source, numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).
",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    free(source);
    free(dest);

    return EXIT_SUCCESS;
}

Compile commands:

gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.c.o -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.c.o -o memcpy -rdynamic -lrt

Sample outputs:

memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).

My hardware according to lshw:

  product: OptiPlex 960 ()
  vendor: Winbond Electronics
  width: 64 bits
*-core
     description: Motherboard
     product: 0Y958C
     vendor: Winbond Electronics
   *-firmware
        description: BIOS
        capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
   *-cpu
        product: Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
        physical id: 400
        size: 2666MHz
        width: 64 bits
        clock: 1333MHz
        capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
        configuration: cores=4 enabledcores=4 threads=4
      *-cache:0
           description: L1 cache
           physical id: 700
           size: 256KiB
           capacity: 256KiB
           capabilities: internal write-back unified
      *-cache:1
           description: L2 cache
           physical id: 701
           size: 6MiB
           capacity: 6MiB
           capabilities: internal varies unified
   *-memory
        description: System Memory
        physical id: 1000
        slot: System board or motherboard
        size: 4GiB
      *-bank:0
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
           product: CT51264AA667.M16FC
           vendor: 7F7F7F7F7F9B0000
           slot: DIMM_1
           size: 4GiB
           width: 64 bits
           clock: 667MHz (1.5ns)
      *-bank:1
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
      *-bank:2
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
      *-bank:3
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]

jthill · Accepted Answer

Memory addresses are "virtualized", the addresses your program uses are translated to real addresses. This translation makes it possible to allocate what your program sees as contiguous memory from whatever pieces are handy at the time. Every general-purpose CPU does this. The translation requires a table lookup, which requires memory access. The CPU's got caches for it, but long stretches of virtual addresses can easily blow its cache, the "TLB" ("translation lookaside buffer"). So every 4KB (2MB on a linux system that's figured out what you're doing) the CPU stalls hunting up where to really send your memory traffic. Those stalls can take quite a bit of time. You might try running two copies of your benchmark, it seems reasonable that the TLB misses won't coincide and you'll get aggregate bandwidth much closer to your rated capacity.

(edit: um, you might want to replace your #defines with

size_t numBytes=atoi(argv[1]);
size_t numTransfers=atoi(argv[2]);

in the main body ...)

Edit: by the way: the bandwidth I saw (and reported in comments) from this test on my box was so far below rated capacity for my cpu that it got me investigating my own system. My box builder had put really crap memory in those slots. I've long since replaced them with a known-good brand, more than doubled the reported throughput, and very visibly improved the performance of my machine.

Why is memset slow?

Tags:

optimization

memcpy

memset

memory-bandwidth

Jeff Guy

1 Answers

jthill

Recent Activity

Donate For Us

Why is memset slow?

Tags:

optimization

memcpy

memset

memory-bandwidth

Jeff Guy

1 Answers

jthill

Related questions

Recent Activity

Donate For Us