Why is accessing a memory aligned buffer more expensive in Linux?

In the program below, I have two buffers: one that is 64-byte aligned and another that I assume is 16-byte aligned, on my 64-bit Linux host running a 2.6.x kernel.

The cache line is 64 bytes long, so in this program I simply access one cache line at a time. I was expecting the posix_memalign buffer to be as fast as the non-aligned buffer, if not faster. Here are some measurements (differences of the tv_nsec fields, in nanoseconds):

./readMemory 10000000

time taken by posix_memaligned buffer: 293020299 
time taken by standard buffer: 119724294 

./readMemory 100000000

time taken by posix_memaligned buffer: 548849137 
time taken by standard buffer: 211197082 

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void now(struct timespec * t);

int main(int argc, char **argv)
{
    char *buf;
    struct timespec st_time, end_time;
    int runs;

    if (argc != 2)
    {
        printf("Usage: ./readMemory <number of runs>\n");
        exit(1);
    }
    errno = 0;
    runs = strtol(argv[1], NULL, 10);
    if (errno != 0)
    {
        printf("Invalid number of runs: %s \n", argv[1]);
        exit(1);
    }

    int returnVal = -1;

    returnVal = posix_memalign((void **)&buf, 64, 1024);
    if (returnVal != 0)
    {
        printf("error in posix_memalign\n");
        exit(1);
    }

    char tempBuf[64];
    char * temp = buf;

    size_t cpyBytes = 64;

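    now(&st_time);
    for(int x=0; x<runs; x++)
    {
        temp = buf;
        /* i advances by 64 while the bound is (1024/64)-1 = 15,
           so this body runs once per outer iteration */
        for(int i=0; i < ((1024/64) -1); i+=64)
        {
            memcpy(tempBuf, temp, cpyBytes);
            temp += 64;
        }
    }
    now(&end_time);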

    printf("time taken by posix_memaligned buffer: %ld \n", (end_time.tv_nsec - st_time.tv_nsec));

    char buf1[1024];
    temp = buf1;
    now(&st_time);
    for(int x=0; x<runs; x++)
    {
        temp = buf1;
        for(int i=0; i < ((1024/64) -1); i+=64)
        {
            memcpy(tempBuf, temp, cpyBytes);
            temp += 64;
        }
    }
    now(&end_time);
    printf("time taken by standard buffer: %ld \n", (end_time.tv_nsec - st_time.tv_nsec));
    return 0;
}

void now(struct timespec *tnow)
{
    if (clock_gettime(CLOCK_MONOTONIC_RAW, tnow) < 0)
    {
        printf("error getting time");
        exit(1);
    }
}

The disassembly for the first loop is:

    movq    -40(%rbp), %rdx        
    movq    -48(%rbp), %rcx        
    leaq    -176(%rbp), %rax
    movq    %rcx, %rsi
    movq    %rax, %rdi
    call    memcpy
    addq    $64, -48(%rbp)
    addl    $64, -20(%rbp)

The disassembly for the second loop is:

    movq    -40(%rbp), %rdx
    movq    -48(%rbp), %rcx
    leaq    -176(%rbp), %rax
    movq    %rcx, %rsi
    movq    %rax, %rdi
    call    memcpy
    addq    $64, -48(%rbp)
    addl    $64, -4(%rbp)

Jimm asked Nov 21 '12 06:11


2 Answers

It's possible that the reason is the relative alignment of the buffers.

memcpy works fastest when copying word-aligned data (32/64 bits).
If both buffers are well-aligned, all is OK.
If both buffers are misaligned in the same way, memcpy handles it by copying a small prefix byte by byte, then proceeding word by word over the remainder.

But if one buffer is word-aligned and the other isn't, there's no way to have both the reads and the writes word-aligned. So memcpy still works word by word, but half of the memory accesses (either the reads or the writes) are misaligned.
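
The strategy described above can be sketched roughly as follows. This is an illustration only, not glibc's actual memcpy (which typically uses vector instructions); the function name `sketch_copy` and the choice of an 8-byte word are assumptions made for the example. It aligns the destination with a byte-wise prefix, copies 8 bytes at a time, and returns how many of the word-sized source reads were misaligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch, NOT the real memcpy: copies n bytes from src to dst
   and returns the number of 8-byte source reads that were misaligned. */
size_t sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t misaligned_reads = 0;

    /* Prefix: byte by byte until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d % 8) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: one 8-byte word at a time. */
    while (n >= 8) {
        uint64_t word;
        if (((uintptr_t)s % 8) != 0)
            misaligned_reads++;           /* source did not end up aligned */
        memcpy(&word, s, sizeof word);    /* well-defined even if s is misaligned */
        memcpy(d, &word, sizeof word);
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Tail: remaining bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return misaligned_reads;
}
```

When source and destination share the same offset modulo 8, the prefix aligns both and the function returns 0; when only one of them is aligned, every word read in the body is misaligned.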

If both of your stack buffers are misaligned in the same way (e.g. both addresses are of the form 8*x+2), but the buffer from posix_memalign is aligned, that could explain what you see.
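
One way to test this hypothesis is to look at the actual addresses. The helpers below are a minimal sketch; the names `misalignment` and `same_relative_alignment` are made up for this example and are not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset of p past the previous multiple of `boundary` (0 means aligned).
   `boundary` must be nonzero. */
size_t misalignment(const void *p, size_t boundary)
{
    return (size_t)((uintptr_t)p % boundary);
}

/* Nonzero if a and b have the same offset relative to `boundary`,
   i.e. aligning one of them by skipping a prefix also aligns the other. */
int same_relative_alignment(const void *a, const void *b, size_t boundary)
{
    return misalignment(a, boundary) == misalignment(b, boundary);
}
```

In the program from the question, printing `misalignment(buf, 8)`, `misalignment(buf1, 8)` and `misalignment(tempBuf, 8)` with `%zu` (and again with a boundary of 16 or 64) would show whether the two stack buffers share an offset that the posix_memalign buffer does not.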

ugoren answered Nov 14 '22 22:11


There are a few problems with your benchmark:

  • Your run time is too short, so you may be seeing a lot of noise/jitter.
  • If you have CPU frequency scaling enabled, the first loop may be executing before the CPU switches to full/turbo frequency. You need to warm up the CPU first or, better, turn off frequency scaling during benchmarking.
  • You may be observing scheduling effects because you are not running with real-time priority.
  • Each run gives you only one sample. You'd need at the very least 30 runs to be in a position to make any kind of scientific judgment (a scientific study with one sample is commonly called an anecdote).
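
A minimal sketch of how the warm-up and multiple-sample points could be handled is shown below. The function names (`elapsed_ns`, `median_ns`, `collect_samples`) are hypothetical, and CLOCK_MONOTONIC is used here as an assumption; the question's CLOCK_MONOTONIC_RAW could be substituted. Note that `elapsed_ns` includes the tv_sec field, so a measurement that crosses a second boundary is still reported correctly.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <time.h>

/* Elapsed nanoseconds between two timestamps, including whole seconds. */
long long elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (long long)(end->tv_nsec - start->tv_nsec);
}

static int compare_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Sorts the n samples in place (n > 0) and returns the middle element
   (the upper middle when n is even). */
long long median_ns(long long *samples, int n)
{
    qsort(samples, (size_t)n, sizeof samples[0], compare_ll);
    return samples[n / 2];
}

/* Calls work(arg) `warmup` times untimed, then `nsamples` times timed.
   Stores one duration per timed call in `samples` and returns the median,
   or -1 if the clock cannot be read. */
long long collect_samples(void (*work)(void *), void *arg,
                          int warmup, int nsamples, long long *samples)
{
    struct timespec start, end;

    for (int i = 0; i < warmup; i++)
        work(arg);

    for (int i = 0; i < nsamples; i++) {
        if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
            return -1;
        work(arg);
        if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
            return -1;
        samples[i] = elapsed_ns(&start, &end);
    }
    return median_ns(samples, nsamples);
}
```

Each of the two copy loops would be wrapped in its own `work` function and sampled 30 or more times, so that the two distributions can be compared rather than two single numbers. This sketch does not address frequency scaling or real-time priority, which have to be configured outside the program.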
Maxim Egorushkin answered Nov 14 '22 23:11