 

C/C++ - posix_memalign()

I did some reading on cache-miss optimization and came across this stdlib function. It does some kind of memory alignment for optimization, but can anyone help me understand what this function really does? It takes 3 arguments: void **memptr, size_t alignment, size_t size

The part that I don't get is what the documentation means by

"allocated size byte aligned on a boundary specified by alignment..."

What I understood from the reading is that the function allocates a block of memory of size size, but after that I don't get what they mean by "boundary"... Is the memory block being dissected into smaller chunks of size alignment?

Here is the documentation: http://www.opengroup.org/onlinepubs/9699919799/functions/posix_memalign.html

asked Jul 30 '10 by user385261

1 Answer

I did some reading on cache-miss optimization and came across this stdlib function. It does some kind of memory alignment for optimization, but can anyone help me understand what this function really does?

The main purpose of the function is to allocate a buffer aligned to the page size. That is rarely done for performance - normally it is because a buffer suitable for a device driver or direct hardware access is required.
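
As a minimal sketch, a typical page-aligned allocation looks like this (the 2-page buffer size is an arbitrary example):

    #define _POSIX_C_SOURCE 200112L  /* for the posix_memalign() declaration */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf = NULL;
        size_t page = (size_t)sysconf(_SC_PAGESIZE); /* typically 4096 */

        /* alignment must be a power of two and a multiple of sizeof(void *);
           the page size satisfies both */
        int rc = posix_memalign(&buf, page, 2 * page);
        if (rc != 0) {   /* returns an error code, does NOT set errno */
            fprintf(stderr, "posix_memalign: error %d\n", rc);
            return 1;
        }
        printf("buffer at %p, page size %zu\n", buf, page);
        free(buf);       /* released with ordinary free() */
        return 0;
    }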

The lion's share of performance-vs.-alignment problems is already resolved by the compilers themselves. E.g. all basic types - char, short, int, long - are already positioned in memory (or inside a struct) on their natural alignment: the address of a variable (or of a struct's field) is divisible by the size of the variable. To achieve that, padding is used. (E.g. in char a; int b; padding of sizeof(int)-sizeof(char) bytes - 3 bytes on a typical platform - would be added after a to make sure that b's address is aligned on sizeof(b).)
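
The padding is easy to observe with offsetof(); the exact numbers below assume a typical platform with a 4-byte int:

    #include <stdio.h>
    #include <stddef.h>

    struct s {
        char a;  /* offset 0 */
        int  b;  /* offset 4 on a typical platform: 3 padding bytes follow a */
    };

    int main(void)
    {
        printf("offsetof(b)      = %zu\n", offsetof(struct s, b)); /* usually 4 */
        printf("sizeof(struct s) = %zu\n", sizeof(struct s));      /* usually 8 */
        return 0;
    }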

I don't get what they mean by "boundary"... Is the memory block being dissected into smaller chunks of size alignment?

H/W devices (esp. non-PCI ones) often see the memory as blocks of N bytes and can only access N bytes at a time. A boundary in this context is the start of such a block, as in "block boundary".
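
In other words, an address lies on an N-byte boundary (N a power of two) exactly when its low bits are zero - a sketch of the usual check:

    #include <stddef.h>
    #include <stdint.h>

    /* Returns non-zero if p is aligned on an n-byte boundary
       (n must be a power of two). */
    static int is_aligned(const void *p, size_t n)
    {
        return ((uintptr_t)p & (n - 1)) == 0;
    }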

Now, reluctantly, I mention the impact of alignment on performance. Remember, premature optimization is the root of all evil. The tricks are highly platform- and CPU-specific, and thus generally should not be used:

  • Page size alignment is desired in some cases when you want to improve the locality of your data. To translate virtual addresses into physical RAM locations, CPUs maintain caches. The fewer pages the code accesses, the less stress that puts on the CPU. (Most OSs already try to optimize the page layout of applications to minimize the overhead of virtual-to-physical address translation.) If you know that your very frequently accessed structure fits in a single page, then it might be advisable to put it into page-aligned storage to guarantee that it is contained within a single page. malloc() doesn't provide that guarantee and might place the structure so that it starts on one page and ends on another - crosses the page boundary - thus occupying two TLB entries instead of the desired single entry. (The page size can be found as in the sysconf(_SC_PAGESIZE) call in the example above.)

  • Cache line alignment. Though an application can address memory in bytes, the CPU can actually access physical RAM only in blocks, generally referred to as "cache lines"; it is the smallest unit in which physical RAM is accessed. By aligning a structure on a cache line, one aims to minimize the cache footprint and the cache misses of the code. The cache line size of DRAM/DDR is 16 bytes; it can be greater (32 or 64 bytes) if the platform's memory controller has a wider data bus and accesses several memory modules in parallel. The same logic as for page alignment applies here too: if you put, e.g., structure fields that are often accessed as a group together, aligned on the cache line size, you can drastically minimize the cache footprint of the data. The simplest example would be a std::map< struct aaa *, void * >. If struct aaa contains lots of fields, then to minimize the cache footprint one would put all fields used for comparison (the key fields) at the beginning of the struct. If the key fields are spread over the structure, a comparison would in the worst case touch one cache line per key field. If the key fields are grouped together at the beginning of the struct, then a comparison would likely touch far fewer cache lines. The fewer cache lines the data needs, the more cache is left for the rest of the application. The cache line size is generally not available to applications, though it can be found out by various tricks. (A sketch of this layout follows the list.)
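
As a sketch of the key-fields-first idea - struct aaa, its fields, and the 64-byte cache line size are all assumptions for illustration, not a fixed recipe:

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>

    #define CACHE_LINE 64  /* assumed; common on x86, not portable */

    struct aaa {
        /* hot key fields, compared on every lookup - kept together at
           the front so a comparison touches as few cache lines as possible */
        unsigned long key_hi;
        unsigned long key_lo;
        /* cold payload, touched only after a match */
        char payload[200];
    };

    int aaa_cmp(const struct aaa *x, const struct aaa *y)
    {
        if (x->key_hi != y->key_hi) return x->key_hi < y->key_hi ? -1 : 1;
        if (x->key_lo != y->key_lo) return x->key_lo < y->key_lo ? -1 : 1;
        return 0;
    }

    /* Allocating on a cache-line boundary keeps the key fields inside
       one line instead of letting them straddle two. */
    struct aaa *aaa_new(void)
    {
        void *p = NULL;
        if (posix_memalign(&p, CACHE_LINE, sizeof(struct aaa)) != 0)
            return NULL;
        return p;
    }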

I glossed over a lot of small details to keep this relatively short. If you want to know more, then reading a CPU manual is advised. E.g. Intel has rather good developer's manuals.

answered Sep 29 '22 by Dummy00001