Suppose we have an array of data and another array with indexes.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to build a new array by placing each element of data at the position given by the corresponding entry of index. The result should be
[4, 2, 5, 7, 3, 1]
The naive algorithm runs in O(N), but it performs random memory accesses. Can you suggest a CPU cache-friendly algorithm with the same complexity?
PS: In my particular case, all elements in the data array are integers.
PPS: The arrays might contain millions of elements.
PPPS: I'm OK with SSE/AVX or any other x64-specific optimizations.
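For reference, the naive algorithm I have in mind is a simple scatter loop, something like:

#include <stddef.h>

/* Naive reordering: sequential reads of data and index, random writes
 * into result. On the example arrays this produces [4, 2, 5, 7, 3, 1]. */
void reorder_naive(const int *data, const unsigned *index,
                   int *result, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        result[index[i]] = data[i];   /* the cache-unfriendly random store */
}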
Combine index and data into a single array. Then use some cache-friendly sorting algorithm to sort these pairs (by index). Then get rid of the indexes. (You could combine merging/removing the indexes with the first/last pass of the sorting algorithm to optimize this a little bit.)
For cache-friendly O(N) sorting, use radix sort with a small enough radix: at most half the number of cache lines in the CPU cache. For example, a 6 MB cache with 64-byte lines has 6*1024*1024/64 = 98 304 lines, so the radix should be at most roughly 49 000.
Here is a C implementation of the radix-sort-like algorithm:
/* Globals assumed by this snippet: the full benchmark code declares them
 * elsewhere; the values of kRadix and kMaxSize here are illustrative
 * stand-ins, with kRadix non-power-of-2 as discussed below. */
enum { kRadix = 2000, kMaxSize = 10000000 };

typedef struct { unsigned index; int value; } Pair;

static unsigned g_counters[kMaxSize / kRadix + kRadix]; /* bucket write cursors */
static unsigned g_index[kMaxSize];   /* the index array */
static int g_input[kMaxSize];        /* the data array */
static Pair g_sort[kMaxSize];        /* intermediate (index / kRadix, value) pairs */
static int g_output[kMaxSize];       /* the result array */

void reorder2(const unsigned size)
{
    /* Pass 1 bucket boundaries are known in advance: index is a permutation
     * of 0..size-1, so each residue class (mod kRadix) holds min_bucket
     * elements, plus one more for the first large_buckets classes. */
    const unsigned min_bucket = size / kRadix;
    const unsigned large_buckets = size % kRadix;
    g_counters[0] = 0;
    for (unsigned i = 1; i <= large_buckets; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket + 1;
    for (unsigned i = large_buckets + 1; i < kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket;

    /* Pass 1: distribute by the low digit (index % kRadix), carrying the
     * high digit (index / kRadix) along with the value. */
    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_index[i] % kRadix]++;
        g_sort[dst].index = g_index[i] / kRadix;
        g_sort[dst].value = g_input[i];
        __builtin_prefetch(&g_sort[dst + 1].value, 1); /* prefetch for write */
    }

    /* Pass 2 bucket boundaries: every high digit except possibly the last
     * owns exactly kRadix elements. */
    g_counters[0] = 0;
    for (unsigned i = 1; i < (size + kRadix - 1) / kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + kRadix;

    /* Pass 2: distribute by the high digit; values land in final order. */
    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_sort[i].index]++;
        g_output[dst] = g_sort[i].value;
        __builtin_prefetch(&g_output[dst + 1], 1); /* prefetch for write */
    }
}
It differs from radix sort in two aspects: (1) it does not do counting passes, because all counters are known in advance; (2) it avoids using power-of-2 values for the radix.
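For completeness, here is a minimal driver for the snippet above (it assumes the illustrative global declarations added there), run on the example arrays from the question:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned idx[6] = { 5, 1, 4, 0, 2, 3 };
    const int dat[6] = { 1, 2, 3, 4, 5, 7 };

    memcpy(g_index, idx, sizeof idx);
    memcpy(g_input, dat, sizeof dat);

    reorder2(6);

    for (unsigned i = 0; i < 6; ++i)
        printf("%d ", g_output[i]);   /* prints: 4 2 5 7 3 1 */
    printf("\n");
    return 0;
}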
This C++ code was used for benchmarking (if you want to run it on a 32-bit system, slightly decrease the kMaxSize constant).
Here are the benchmark results (on a Haswell CPU with a 6 MB cache):
[diagram: running time of the naive and sorting approaches across array sizes]
It is easy to see that small arrays (below ~2 000 000 elements) are cache-friendly even for the naive algorithm. You may also notice that the sorting approach starts to become cache-unfriendly at the last point on the diagram (where size/radix is near 0.75 of the cache lines in the L3 cache). Between these limits the sorting approach is more efficient than the naive algorithm.
In theory (comparing only the memory bandwidth needed by these algorithms, with 64-byte cache lines and 4-byte values) the sorting algorithm should be 3 times faster. In practice the difference is much smaller, about 20%. This could be improved by using smaller 16-bit values for the data array (in that case the sorting algorithm is about 1.5 times faster).
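A rough back-of-envelope version of that bandwidth comparison (my own numbers, assuming every random store misses and costs a full 64-byte line fetch plus write-back, while all sequential traffic is perfectly amortized):

naive:   4 (index) + 4 (data) + 64 (fetch target line) + 64 (write it back) = 136 bytes/element
sorting: pass 1 reads 4 + 4 and writes an 8-byte pair; pass 2 reads 8 and writes 4;
         all streamed (writes cost roughly double with write-allocate)   ~ 40 bytes/element

136 / 40 ~ 3.4, in line with the "about 3 times" estimate above.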
One more problem with the sorting approach is its worst-case behavior when size/radix is close to some power of 2. This may either be ignored (because there are not that many "bad" sizes) or fixed by making the algorithm slightly more complicated.
If we increase the number of passes to 3, all 3 passes use mostly the L1 cache, but memory bandwidth increases by 60%. I used this code to get experimental results. TL;DR: after determining (experimentally) the best radix value, I got somewhat better results for sizes greater than 4 000 000 (where the 2-pass algorithm uses the L3 cache for one pass) but somewhat worse results for smaller arrays (where the 2-pass algorithm uses the L2 cache for both passes). As might be expected, performance is better for 16-bit data.
Conclusion: the performance difference is much smaller than the difference in the complexity of the algorithms, so the naive approach is almost always better; if performance is very important and only 2- or 4-byte values are used, the sorting approach is preferable.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to build a new array by placing each element of data at the position given by index. The result should be
result -> [4, 2, 5, 7, 3, 1]
I think, for a few million elements and on a single thread, the naive approach might be the best here.
Both data and index are accessed (read) sequentially, which is already optimal for the CPU cache. That leaves the random writes to result, but writing to memory isn't as cache-friendly as reading from it anyway.
This would only need one sequential pass through data and index. And chances are some (sometimes many) of the writes will already be cache-friendly too.
result - multiple threads
We could allocate or use cache-friendly sized blocks for the result (blocks being regions in the result array), and loop through index and data multiple times (while they stay in the cache).
In each loop we then only write the elements to result that fit in the current result block. This would be 'cache friendly' for the writes too, but it needs multiple loops (the number of loops could even get rather high, i.e. size of data / size of result-block).
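A minimal single-threaded sketch of that idea (function and parameter names are mine; block would be chosen so one result block, plus the chunks of index and data being streamed, fits in the cache):

#include <stddef.h>

/* Write only those elements whose destination falls in [lo, hi). */
static void scatter_block(const int *data, const unsigned *index,
                          int *result, size_t size, size_t lo, size_t hi)
{
    for (size_t i = 0; i < size; ++i)       /* one sequential pass over input */
        if (index[i] >= lo && index[i] < hi)
            result[index[i]] = data[i];     /* write stays inside the block */
}

/* One full pass over index/data per result block: size/block passes total. */
void reorder_blocked(const int *data, const unsigned *index,
                     int *result, size_t size, size_t block)
{
    for (size_t lo = 0; lo < size; lo += block)
        scatter_block(data, index, result, size,
                      lo, lo + block < size ? lo + block : size);
}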
The above might be an option when using multiple threads: data and index, being read-only, would be shared by all cores at some level in the cache (depending on the cache architecture). The result blocks in each thread would be totally independent (one core never has to wait for the result of another core, or for a write in the same region). For example: with 10 million elements, each thread could be working on an independent result block of, say, 500 000 elements (the number should be a power of 2).
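One possible way to parallelize that sketch (OpenMP is my choice here, not something from the original answer), reusing scatter_block from above; each thread gets its own disjoint result blocks, so no synchronization on result is needed:

#include <stddef.h>

/* Compile with -fopenmp. data and index are read-only and shared; each
 * iteration owns a disjoint region of result, so iterations are independent. */
void reorder_blocked_mt(const int *data, const unsigned *index,
                        int *result, size_t size, size_t block)
{
    #pragma omp parallel for schedule(static)
    for (size_t lo = 0; lo < size; lo += block)
        scatter_block(data, index, result, size,
                      lo, lo + block < size ? lo + block : size);
}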
Combining the values as pairs and sorting them first would already take much more time than the naive option (and wouldn't be that cache-friendly either).
Also, if there are only a few million elements (integers), it won't make much of a difference. If we were talking about billions, or data that doesn't fit in memory, other strategies might be preferable (for example, memory-mapping the result set if it doesn't fit in memory).