I have:
final int ROWS = 100000;
final int COLS = 2000;
long[][] m = new long[COLS][ROWS];
and then:
public void xor(int row1, int row2) {
for (int col=0; col<COLS; col++) {
m[col][row1] ^= m[col][row2];
}
}
The above function is, simplified, what takes most of the time in a run. I was wondering if I should invest time to refactor my whole program to read "m = new long[ROWS][COLS]" (instead of the other way around) for better RAM access. Or won't I win much time with it?
I am aware that I could parallellize it with perhaps GPU's, but that's for a later stage.
In my opinion, it will definitely help to swap ROWS and COLS.
The layout of this arrays is (roughly) like this: [0][0], [0][1], [0][2],... [1][0], [1][1],... and so on. In your code, each column is a continuous chunk of memory, and a row is not.
Since each column is 800000 bytes, and in your xor method you access all of them, you are forcing more cache misses.
After transposing, each row becomes a continuous piece of memory, and since you tend to operate on rows, it should make it faster.
If you had long[][] m = new long[ROWS][COLS]; and for (int col=0; col<COLS; col++) m[row1][col] ^= m[row2][col];, you'd only need two 16000-byte-long rows to be in the cache during the execution of the xor method.
But since what I said is based mostly on theory, try to benchmark both variants and check which one is really faster.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With