In the program below, I have two buffers: one that is 64-byte aligned (via posix_memalign) and another that I assume is 16-byte aligned, on my 64-bit Linux host running a 2.6.x kernel.
The cache line is 64 bytes long, so the program simply accesses one cache line at a time. I was expecting the posix_memalign buffer
to be at least as fast as the unaligned one, if not faster.
Here are some metrics:
./readMemory 10000000
time taken by posix_memaligned buffer: 293020299
time taken by standard buffer: 119724294
./readMemory 100000000
time taken by posix_memaligned buffer: 548849137
time taken by standard buffer: 211197082
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>   /* clock_gettime and struct timespec live here, not in linux/time.h */

void now(struct timespec *t);

int main(int argc, char **argv)
{
    char *buf;
    struct timespec st_time, end_time;
    int runs;

    if (argc != 2) {
        printf("Usage: ./readMemory <number of runs>\n");
        exit(1);
    }

    errno = 0;
    runs = strtol(argv[1], NULL, 10);
    if (errno != 0) {
        printf("Invalid number of runs: %s\n", argv[1]);
        exit(1);
    }

    int returnVal = posix_memalign((void **)&buf, 64, 1024);
    if (returnVal != 0) {
        printf("error in posix_memalign\n");
        exit(1);
    }

    char tempBuf[64];
    char *temp = buf;
    size_t cpyBytes = 64;

    now(&st_time);
    for (int x = 0; x < runs; x++) {
        temp = buf;
        /* walk the whole 1024-byte buffer one 64-byte cache line at a time */
        for (int i = 0; i < 1024; i += 64) {
            memcpy(tempBuf, temp, cpyBytes);
            temp += 64;
        }
    }
    now(&end_time);
    printf("time taken by posix_memaligned buffer: %ld\n",
           (end_time.tv_sec - st_time.tv_sec) * 1000000000L +
           (end_time.tv_nsec - st_time.tv_nsec));

    char buf1[1024];
    temp = buf1;

    now(&st_time);
    for (int x = 0; x < runs; x++) {
        temp = buf1;
        for (int i = 0; i < 1024; i += 64) {
            memcpy(tempBuf, temp, cpyBytes);
            temp += 64;
        }
    }
    now(&end_time);
    printf("time taken by standard buffer: %ld\n",
           (end_time.tv_sec - st_time.tv_sec) * 1000000000L +
           (end_time.tv_nsec - st_time.tv_nsec));

    return 0;
}

void now(struct timespec *tnow)
{
    if (clock_gettime(CLOCK_MONOTONIC_RAW, tnow) < 0) {
        printf("error getting time\n");
        exit(1);
    }
}
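For reference, the program can be built and run along these lines (an assumption on my part: -std=gnu99 is needed for the in-loop declarations on older gcc, and clock_gettime needs -lrt on glibc of that era):
gcc -std=gnu99 -o readMemory readMemory.c -lrt
./readMemory 10000000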
The disassembly of the first loop is:
movq -40(%rbp), %rdx
movq -48(%rbp), %rcx
leaq -176(%rbp), %rax
movq %rcx, %rsi
movq %rax, %rdi
call memcpy
addq $64, -48(%rbp)
addl $64, -20(%rbp)
The disassembly of the second loop is:
movq -40(%rbp), %rdx
movq -48(%rbp), %rcx
leaq -176(%rbp), %rax
movq %rcx, %rsi
movq %rax, %rdi
call memcpy
addq $64, -48(%rbp)
addl $64, -4(%rbp)
It's possible that the reason is the relative alignment of the buffers. memcpy works fastest when copying word-aligned data (32/64 bits). If both buffers are well-aligned, all is OK. If both buffers are misaligned the same way, memcpy handles it by copying a small prefix byte by byte, then running word by word on the remainder. But if one buffer is word-aligned and the other isn't, there's no way to have both the reads and the writes word-aligned, so memcpy still works word by word, but half of the memory accesses are badly aligned. If both of your stack buffers are misaligned the same way (e.g. both addresses are of the form 8*x + 2) while the buffer from posix_memalign is aligned, that could explain what you see.
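One way to test this hypothesis is to print each buffer's address modulo the word and cache-line sizes. A minimal sketch, assuming it is pasted into the program above (report_alignment is an illustrative helper name, not from the original post):

#include <stdint.h>
#include <stdio.h>

/* Print a pointer together with its offset from 8-byte (word) and
 * 64-byte (cache-line) boundaries; an aligned buffer prints 0. */
static void report_alignment(const char *name, const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    printf("%s: %p (mod 8 = %lu, mod 64 = %lu)\n",
           name, p, (unsigned long)(addr % 8), (unsigned long)(addr % 64));
}

/* e.g. report_alignment("buf", buf);
 *      report_alignment("buf1", buf1);
 *      report_alignment("tempBuf", tempBuf); */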
There are a few problems with your benchmark: judging by the disassembly, it was compiled without optimization, so you are largely timing the stack traffic around each memcpy call; the data copied into tempBuf is never read, so an optimizing compiler could remove the loops entirely; and the first timed loop also pays the one-time cost of faulting in and caching the freshly allocated heap buffer, a cost the second loop does not pay.
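A sketch of a timing harness that addresses the last two points (my own illustration, not the original poster's code; elapsed_ns, time_copies, and sink are hypothetical names): compute the full nanosecond difference from both timespec fields, touch the buffer before timing, and feed the copied bytes into a volatile sink so the work cannot be elided.

#include <stdint.h>
#include <string.h>
#include <time.h>

static volatile unsigned char sink;  /* defeats dead-code elimination */

/* Full nanosecond difference between two timespecs. */
static int64_t elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (b->tv_nsec - a->tv_nsec);
}

/* Copy a 1024-byte buffer line by line `runs` times and return ns taken.
 * The buffer is touched once beforehand so page faults are not measured. */
static int64_t time_copies(const char *src, long runs)
{
    char tempBuf[64];
    struct timespec st, end;

    sink = src[0];  /* warm-up touch: fault the page in before timing */
    clock_gettime(CLOCK_MONOTONIC_RAW, &st);
    for (long x = 0; x < runs; x++) {
        for (int i = 0; i < 1024; i += 64) {
            memcpy(tempBuf, src + i, 64);
            sink = tempBuf[0];  /* use the result so it cannot be elided */
        }
    }
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    return elapsed_ns(&st, &end);
}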