I am running a memory access experiment that uses a 2D matrix in which each row is the size of a memory page. The experiment reads every element in row-major and column-major order, and then writes to every element in both orders. The matrix is declared at global scope to keep the programming simple.
The point of this question is that, with the test matrix declared statically, its values are zero-initialized, and the results I found were quite interesting. When I do the read operations first, i.e.
rowMajor_read();
colMajor_read();
rowMajor_write();
colMajor_write();
then my colMajor_read operation finishes very quickly.
However, if I do the write operations before the reads:
rowMajor_write();
colMajor_write();
rowMajor_read();
colMajor_read();
then the column-major read time increases by nearly an order of magnitude.
I figured it must have something to do with how the compiler optimizes the code. Since the global matrix is identically zero for every element, did the compiler remove the read operations entirely? Or is it somehow "easier" to read a value from memory that is identically zero?
I do not pass any optimization flags to the compiler, but I did declare my functions in this manner:
inline void colMajor_read() {
    register int row, col;
    register volatile char temp __attribute__((unused));

    for (col = 0; col < COL_COUNT; col++)
        for (row = 0; row < ROW_COUNT; row++)
            temp = testArray[row][col];
}
I declared temp this way because I was running into issues where the compiler completely removed it from the function, since it was never used. I think having both volatile and __attribute__((unused)) is redundant, but I included both nonetheless; I was under the impression that accesses to a volatile variable are never optimized away.
Any ideas?
I looked at the generated assembly, and the colMajor_read function is identical in both orderings. The (assembly) non-inline version: http://pastebin.com/C8062fYB
Check the memory usage of your process before and after writing values into the matrix. If the matrix is stored in the .bss section on Linux, for example, the zeroed pages will all be mapped to a single read-only physical page with copy-on-write semantics. So even though you are reading through a bunch of virtual addresses, you may be reading the same page of physical memory over and over, which is why the pre-write column-major read is so fast: that one page stays in cache.
This page http://madalanarayana.wordpress.com/2014/01/22/bss-segment/ has a good explanation.
If that's the case, zero out the matrix again after the writes and rerun your read test: it should no longer be so much faster, because writing (even writing zeros) replaces the shared zero page with private per-page copies.