Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

My attempt to optimize memset on a 64bit machine takes more time than standard implementation. Can someone please explain why?

Tags:

c

64-bit

memset

(machine is x86 64 bit running SL6)

I was trying to see if I can optimize memset on my 64 bit machine. As per my understanding memset goes byte by byte and sets the value. I assumed that if I do in units of 64 bits, it would be faster. But somehow it takes more time. Can someone take a look at my code and suggest why ?

/* Code */
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>

void memset8(unsigned char *dest, unsigned char val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}
void memset32(uint32_t *dest, uint32_t val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}
void
memset64(uint64_t *dest, uint64_t val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}
#define CYCLES 1000000000
int main()
{
    clock_t start, end;
    double total;
    uint64_t loop;
    uint64_t val;

    /* memset 32 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        val = 0xDEADBEEFDEADBEEF;
        memset32((uint32_t*)&val, 0, 2);
    }
    end = clock();
    total = (double)(end-start)/CLOCKS_PER_SEC;
    printf("Timetaken memset32 %g\n", total);

    /* memset 64 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        val = 0xDEADBEEFDEADBEEF;
        memset64(&val, 0, 1);
    }
    end = clock();
    total = (double)(end-start)/CLOCKS_PER_SEC;
    printf("Timetaken memset64 %g\n", total);

    /* memset 8 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        val = 0xDEADBEEFDEADBEEF;
        memset8((unsigned char*)&val, 0, 8);
    }
    end = clock();
    total = (double)(end-start)/CLOCKS_PER_SEC;
    printf("Timetaken memset8 %g\n", total);

    /* memset */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        val = 0xDEADBEEFDEADBEEF;
        memset(&val, 0, 8);
    }
    end = clock();
    total = (double)(end-start)/CLOCKS_PER_SEC;
    printf("Timetaken memset %g\n", total);

    printf("-----------------------------------------\n");
}

/*Result*/
Timetaken memset32 12.46
Timetaken memset64 7.57
Timetaken memset8 37.12
Timetaken memset 6.03
-----------------------------------------

Looks like the standard memset is more optimized than my implementation. I tried looking into code and everywhere is see that implementation of memset is same as what I did for memset8. When I use memset8, the results are more like what I expect and very different from memset. Can someone suggest what am I doing wrong ?

like image 378
adikshit Avatar asked Jan 02 '14 22:01

adikshit


1 Answers

Actual memset implementations are typically hand-optimized in assembly, and use the widest aligned writes available on the targeted hardware. On x86_64 that will be at least 16B stores (movaps, for example). It may also take advantage of prefetching (this is less common recently, as most architectures have good automatic streaming prefetchers for regular access patterns), streaming stores or dedicated instructions (historically rep stos was unusably slow on x86, but it is quite fast on recent microarchitectures). Your implementation does none of these things. It should not be terribly surprising that the system implementation is faster.

As an example, consider the implementation used in OS X 10.8 (which has been superseded in 10.9). Here’s the core loop for modest-sized buffers:

.align 4,0x90
1:  movdqa %xmm0,   (%rdi,%rcx)
    movdqa %xmm0, 16(%rdi,%rcx)
    movdqa %xmm0, 32(%rdi,%rcx)
    movdqa %xmm0, 48(%rdi,%rcx)
    addq   $64,      %rcx
    jne    1b

This loop will saturate the LSU when hitting cache on pre-Haswell microarchitectures at 16B/cycle. An implementation based on 64-bit stores like your memset64 cannot exceed 8B/cycle (and may not even achieve that, depending on the microarchitecture in question and whether or not the compiler unrolls your loop). On Haswell, an implementation that uses AVX stores or rep stos can go even faster and achieve 32B/cycle.

like image 199
Stephen Canon Avatar answered Nov 15 '22 06:11

Stephen Canon