What is a fast way to merge sorted subsets of an array of up to 4096 32-bit floating point numbers on a modern (SSE2+) x86 processor?
Please assume the following:
Main criterion of feasibility: faster than an in-L1 LSD radix sort.
I'd be very interested to see if someone knows of a reasonable way to do this given the above parameters! :)
Here's a very naive way to do it. (Please excuse any 4am delirium-induced pseudo-code bugs ;)
//4x sorted subsets, each padded with a +INF sentinel
data[4][4] = {
    {3, 4, 5, INF},
    {2, 7, 8, INF},
    {1, 4, 4, INF},
    {5, 8, 9, INF}
}
data_offset[4] = {0, 0, 0, 0}
n = 4*3   //total number of real (non-sentinel) values

for (i = 0; i < n; i++):
    //branchless scan for the subset with the smallest head
    sub = 0
    q = (data[sub][data_offset[sub]] > data[1][data_offset[1]]); sub = 1*q + sub*(1-q)
    q = (data[sub][data_offset[sub]] > data[2][data_offset[2]]); sub = 2*q + sub*(1-q)
    q = (data[sub][data_offset[sub]] > data[3][data_offset[3]]); sub = 3*q + sub*(1-q)
    out[i] = data[sub][data_offset[sub]]
    data_offset[sub]++
Edit:
With AVX2 and its gather support, we could compare up to 8 subsets at once.
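A hedged sketch of that idea (requires AVX2; names are illustrative): with the data[8][len] array flattened, and flat_idx[i] holding i*len + data_offset[i] (the flat index of subset i's head), a single gather fetches all eight heads:

#include <immintrin.h>
#include <stdint.h>

/* Sketch: fetch the current head value of each of eight sorted subsets
   with one gather. `data` is the flattened data[8][len] array from the
   pseudo-code above; flat_idx[i] = i*len + data_offset[i]. */
static __m256 gather_heads(const float *data, const int32_t flat_idx[8])
{
    __m256i idx = _mm256_loadu_si256((const __m256i *)flat_idx);
    return _mm256_i32gather_ps(data, idx, 4); /* scale 4 = sizeof(float) */
}

A horizontal min/argmin over the gathered vector then yields sub. Note that gathers on the first AVX2 parts were not particularly fast, so this is worth measuring.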
Edit 2:
Depending on type casting, it might be possible to shave off 3 extra clock cycles per iteration on a Nehalem (mul: 5, shift+sub: 4)
//Assuming 'sub' is uint32_t
sub = ... << ((data[sub][data_offset[sub]] > data[...][data_offset[...]]) - 1)
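One caveat on that shift: a shift count of (0 - 1) is undefined behavior in C (x86 happens to mask counts to 5 bits, and even then x << 31 only clears even x), so a mask is the safer way to remove the multiply. A minimal sketch, with select_u32 as a hypothetical helper:

#include <stdint.h>

/* Branchless cond ? a : b with no multiply and no shift-count pitfalls:
   0u - cond is 0x00000000 or 0xFFFFFFFF when cond is 0 or 1. */
static inline uint32_t select_u32(uint32_t a, uint32_t b, uint32_t cond)
{
    uint32_t m = 0u - cond;
    return (a & m) | (b & ~m);
}

Each selection step then becomes sub = select_u32(x, sub, q).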
Edit 3:
It may be possible to exploit out-of-order execution to some degree, especially as K gets larger, by using two or more candidate values:
max1 = 0
max2 = 1
q1 = (data[max1][data_offset[max1]] > data[2][data_offset[2]]); max1 = 2*q1 + max1*(1-q1)
q2 = (data[max2][data_offset[max2]] > data[3][data_offset[3]]); max2 = 3*q2 + max2*(1-q2)
...
q1 = (data[max1][data_offset[max1]] > data[6][data_offset[6]]); max1 = 6*q1 + max1*(1-q1)
q2 = (data[max2][data_offset[max2]] > data[7][data_offset[7]]); max2 = 7*q2 + max2*(1-q2)
q = (data[max1][data_offset[max1]] < data[max2][data_offset[max2]])
sub = max1*q + max2*(1-q)
Edit 4:
Depending on compiler intelligence, we can remove the multiplications altogether using the ternary operator (which typically compiles down to a branchless CMOV):
sub = (data[sub][data_offset[sub]] > data[x][data_offset[x]]) ? x : sub
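Putting the edits so far together, here is a minimal runnable C rendering of the scalar loop, assuming the flattened data[K][len] layout with each run padded by a +INF sentinel as in the example above (kway_merge_scalar and its parameters are illustrative names; n is the number of real, non-sentinel values):

#include <stddef.h>

static void kway_merge_scalar(const float *data, int K, int len,
                              size_t offset[], float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int sub = 0;
        for (int x = 1; x < K; x++)   /* unrolled in practice */
            sub = (data[sub * len + offset[sub]] > data[x * len + offset[x]])
                      ? x : sub;
        out[i] = data[sub * len + offset[sub]];
        offset[sub]++;
    }
}

With K = 4, len = 4, offset = {0, 0, 0, 0}, and n = 12, this reproduces the worked example; the ternary stays branch-free on typical x86 compilers.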
Edit 5:
In order to avoid costly floating point comparisons, we could simply reinterpret_cast<uint32_t*>() the data, turning each comparison into an integer compare. Another possibility is to utilize SSE registers, as these are not typed, and explicitly use integer comparison instructions. Either works because, for non-negative IEEE-754 floats, the operators <, >, and == yield the same results whether the bits are interpreted as float or as unsigned integer (negative values would first need a sign fix-up).
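A sketch of that reinterpretation in portable C (float_bits and float_key are hypothetical helpers; memcpy is the strict-aliasing-safe way to get at the bits, and it compiles to a plain move):

#include <stdint.h>
#include <string.h>

static inline uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* For non-negative floats, comparing float_bits() as unsigned integers
   already matches the float order. If negatives can occur, map to a
   totally ordered key first: flip all bits of negatives, flip only the
   sign bit of non-negatives. */
static inline uint32_t float_key(float f)
{
    uint32_t u = float_bits(f);
    return u ^ ((0u - (u >> 31)) | 0x80000000u);
}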
Edit 6:
If we unroll the loop far enough to match the number of values to the number of SSE registers, we could stage the data being compared in registers. At the end of an iteration we would then re-load the register that contained the selected maximum/minimum value and shift it. Although this requires reworking the indexing slightly, it may prove more efficient than littering the loop with LEA instructions.
This is more of a research topic, but I did find a paper which discusses minimizing branch mispredictions using d-way merge sort.
SIMD sorting algorithms have already been studied in detail. The paper Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture describes an efficient algorithm for doing what you describe (and much more).
The core idea is that you can reduce merging two arbitrarily long lists to merging blocks of k consecutive values (where k can range from 4 to 16): the first block is z[0] = merge(x[0], y[0]).lo. To obtain the second block, we know that the leftover merge(x[0], y[0]).hi contains nx elements from x and ny elements from y, with nx + ny == k. But z[1] cannot contain elements from both x[1] and y[1], because that would require z[1] to contain more than nx + ny elements: so we just have to find out which of x[1] and y[1] needs to be added. The one with the lower first element will necessarily appear first in z, so this is simply done by comparing their first elements. And we just repeat that until there is no more data to merge.
Pseudo-code, assuming the arrays end with a +inf value:
a := *x++
b := *y++
while not finished:
    lo, hi := merge(a, b)
    *z++ := lo
    a := hi
    if (*x)[0] <= (*y)[0]:
        b := *x++
    else:
        b := *y++
(note how similar this is to the usual scalar implementation of merging)
The conditional jump is of course not necessary in an actual implementation: for example, you could conditionally swap x and y with an xor trick, and then read unconditionally *x++.
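A hedged sketch of that loop in C, with the branch replaced by the xor pointer swap (merge_4x4 is the bitonic kernel sketched further below; each stream is assumed padded with a couple of +inf sentinel blocks so the trailing loads stay in bounds, and all names are illustrative):

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

static void merge_4x4(__m128 a, __m128 b, __m128 *lo, __m128 *hi); /* sketched below */

static void merge_streams(const __m128 *x, const __m128 *y, __m128 *z,
                          size_t nblocks /* total real blocks in x and y */)
{
    __m128 a = *x++;
    __m128 b = *y++;
    for (size_t i = 0; i < nblocks; i++) {
        __m128 lo, hi;
        merge_4x4(a, b, &lo, &hi);
        *z++ = lo;
        a = hi;
        /* swap_mask is all ones when x's head is larger than y's;
           after the conditional swap we always consume from x. */
        uintptr_t swap_mask = 0u - (uintptr_t)_mm_comigt_ss(*x, *y);
        uintptr_t diff = ((uintptr_t)x ^ (uintptr_t)y) & swap_mask;
        x = (const __m128 *)((uintptr_t)x ^ diff);
        y = (const __m128 *)((uintptr_t)y ^ diff);
        b = *x++;
    }
}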
merge itself can be implemented with a bitonic sort. But if k is low, there will be a lot of inter-instruction dependencies resulting in high latency. Depending on the number of arrays you have to merge, you can then either choose k high enough so that the latency of merge is masked, or, if this is possible, interleave several two-way merges. See the paper for more details.
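For k = 4, one possible SSE realization of the kernel is a three-stage bitonic merging network built from min/max and shuffles (a sketch of the technique the paper describes, not its exact code):

#include <xmmintrin.h>

/* a and b are each sorted ascending; *lo receives the 4 smallest of the
   8 values and *hi the 4 largest, both ascending. */
static void merge_4x4(__m128 a, __m128 b, __m128 *lo, __m128 *hi)
{
    /* Reverse b so that (a, reversed b) forms one bitonic sequence. */
    b = _mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 1, 2, 3));

    /* Stage 1: compare-exchange elements 4 apart. */
    __m128 l1 = _mm_min_ps(a, b);
    __m128 h1 = _mm_max_ps(a, b);

    /* Stage 2: compare-exchange elements 2 apart. */
    __m128 t1 = _mm_shuffle_ps(l1, h1, _MM_SHUFFLE(1, 0, 1, 0)); /* l1[0] l1[1] h1[0] h1[1] */
    __m128 t2 = _mm_shuffle_ps(l1, h1, _MM_SHUFFLE(3, 2, 3, 2)); /* l1[2] l1[3] h1[2] h1[3] */
    __m128 l2 = _mm_min_ps(t1, t2);
    __m128 h2 = _mm_max_ps(t1, t2);

    /* Stage 3: compare-exchange adjacent elements. */
    __m128 u1 = _mm_unpacklo_ps(l2, h2);
    __m128 u2 = _mm_unpackhi_ps(l2, h2);
    __m128 t3 = _mm_movelh_ps(u1, u2);
    __m128 t4 = _mm_movehl_ps(u2, u1);
    __m128 l3 = _mm_min_ps(t3, t4);
    __m128 h3 = _mm_max_ps(t3, t4);

    /* Interleave winners and losers back into ascending order. */
    *lo = _mm_unpacklo_ps(l3, h3);
    *hi = _mm_unpackhi_ps(l3, h3);
}

Note the long dependency chain through the three stages: this is exactly the latency the answer suggests hiding by raising k or by interleaving independent merges.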
Edit: Below is a diagram for the case k = 4 (the diagram itself is not reproduced here); all asymptotics assume that k is fixed.
The big gray box merges two arrays of size n = m * k (in the picture, m = 3).
Finally, to extend our 2-way merge to merge many arrays, we arrange the big gray boxes in classical divide-and-conquer fashion. Each level has complexity linear in the number of elements, so the total complexity is O(n log(n / n0)), where n0 is the initial size of the sorted arrays and n is the size of the final array.
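A sketch of that divide-and-conquer arrangement (merge2 is a plain scalar stand-in for the SIMD block merge above, kept scalar so the sketch stays self-contained; merge_all ping-pongs between two buffers, doubling the run length each pass, and assumes n is a multiple of the initial run length n0, as in the question):

#include <stddef.h>

static void merge2(const float *a, size_t na, const float *b, size_t nb,
                   float *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

static float *merge_all(float *src, float *tmp, size_t n, size_t run /* = n0 */)
{
    while (run < n) {
        for (size_t lo = 0; lo < n; lo += 2 * run) {
            size_t mid = lo + run < n ? lo + run : n;
            size_t hi  = lo + 2 * run < n ? lo + 2 * run : n;
            merge2(src + lo, mid - lo, src + mid, hi - mid, tmp + lo);
        }
        float *t = src; src = tmp; tmp = t; /* swap buffers for the next level */
        run *= 2;
    }
    return src; /* buffer holding the fully merged result */
}

Each pass is linear and there are log2(n / n0) passes, matching the O(n log(n / n0)) bound above.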
The most obvious answer that comes to mind is a standard k-way merge using a heap. That'll be O(N log k), where N is the total number of elements. The number of subsets k is between 16 and 256, so the worst-case behavior (256 subsets of 16 items each) would be about 8N comparisons.
Cache behavior should be ... reasonable, although not perfect. The heap, where most of the action is, will probably remain in the cache throughout. The part of the output array being written to will also most likely be in the cache.
What you have is 16K of data (the array with the sorted subsequences), the heap (1K, worst case), and the sorted output array (16K again), and you want it all to fit into a 32K L1 cache. Sounds like a problem, but perhaps it isn't. The data most likely to be evicted is the front of the output array after the insertion point has moved on. Assuming the sorted subsequences are fairly uniformly distributed, they should be accessed often enough to stay in the cache.
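For concreteness, a hedged sketch of such a heap-based merge: a binary min-heap holding one (value, run) pair per subset. The flattened data[k][len] layout and the 256-run upper bound come from the question's parameters; everything else is illustrative.

#include <stddef.h>

typedef struct { float val; int run; } hnode;

static void sift_down(hnode *h, int size, int i)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < size && h[l].val < h[m].val) m = l;
        if (r < size && h[r].val < h[m].val) m = r;
        if (m == i) return;
        hnode t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

static void heap_merge(const float *data, int k, int len, float *out)
{
    hnode heap[256];   /* k <= 256 per the question */
    int pos[256];
    for (int i = 0; i < k; i++) {
        heap[i].val = data[i * len]; /* head of run i */
        heap[i].run = i;
        pos[i] = 1;
    }
    for (int i = k / 2 - 1; i >= 0; i--) sift_down(heap, k, i); /* heapify */

    int size = k;
    for (size_t n = 0; size > 0; n++) {
        int r = heap[0].run;
        out[n] = heap[0].val;                        /* emit the smallest head */
        if (pos[r] < len)
            heap[0].val = data[r * len + pos[r]++];  /* refill from the same run */
        else
            heap[0] = heap[--size];                  /* run exhausted */
        sift_down(heap, size, 0);
    }
}

Each of the N outputs costs one sift_down of O(log k), giving the O(N log k) total; with k = 256 that is the 8N worst case mentioned above.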