Let's say I have an array
k = [1 2 0 0 5 4 0]
I can compute a mask as follows
m = k > 0 = [1 1 0 0 1 1 0]
Using only the mask m and elementwise operations (shifts, comparisons, AND, adds, multiplies),
I can compact k into the following
[1 2 5 4]
Here's how I currently do it (MATLAB pseudocode):
function out = compact( in )
    d = in;
    for i = 1:size(in, 2)              % do (# of items in in) passes
        m = d > 0;
        % shift left, pad with 0 on the right
        ml = [m(2:end) 0];             % shifted mask
        dl = [d(2:end) 0];             % shifted data
        % if the data originally has a gap, fill it in with the
        % left-shifted element
        use = (m == 0) & (ml == 1);    % 2 comparisons
        d = use .* dl + ~use .* d;
        % zero out elements that have been moved to the left
        use_r = [0 use(1:end-1)];
        d = d .* ~use_r;
    end
    out = d(1 : size(find(in > 0), 2));  % truncate the end
end
Intuition
On each iteration, we shift the mask left and compare it with the unshifted mask. We set an index to hold the left-shifted data if we find that, after this shift, an index that was originally void (mask[i] = 0) becomes valid (mask[i] = 1).
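For example, on k = [1 2 0 0 5 4 0] the data d evolves as follows (showing only the passes that change it):
d = [1 2 0 5 0 4 0]   % after pass 1
d = [1 2 5 0 4 0 0]   % after pass 2
d = [1 2 5 4 0 0 0]   % after pass 3; later passes change nothing
which is then truncated to [1 2 5 4].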
Question
The above algorithm costs O(N * (3 shifts + 2 comparisons + 1 AND + 1 add + 3 multiplies)). Is there a way to improve its efficiency?
The original code moves array elements only one step at a time. This can be improved: it is possible to group array elements and shift them by 2^k steps at once.
The first part of this algorithm computes how many steps each element should be shifted. The second part moves the elements, first by one step, then by 2, then by 4, and so on. This works correctly, and elements are not intermixed, because after each shift there is enough free space to perform a shift twice as large.
MATLAB, code not tested:
function out = compact( in )
    % first pass: s(i) = number of empty slots to the left of element i,
    % i.e. how many steps element i has to move
    m = in <= 0;
    s = zeros(size(in));
    for i = 1:size(in, 2)-1
        m = [0 m(1:end-1)];
        s = s + m;
    end
    % second pass: move elements by 1, 2, 4, ... steps,
    % following the binary digits of s
    d = in;
    shift = 1;
    for j = 1:ceil(log2(size(in, 2)))
        s1 = rem(s, 2);
        s = (s - s1) / 2;
        d = (d .* ~s1) + ([d(1+shift:end) zeros(1,shift)] .* [s1(1+shift:end) zeros(1,shift)]);
        shift = shift * 2;
    end
    out = d;
end
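As a check of the code above on the example input k = [1 2 0 0 5 4 0]: the first loop yields the shift counts s = [0 0 0 1 2 2 2] (the number of empty slots to the left of each position). In the second loop, the pass with shift = 1 only moves the empty slot at index 4 (the one position where s is odd), so nothing visible changes; the pass with shift = 2 then moves 5 and 4 two positions left, giving d = [1 2 5 4 0 0 0], and the final pass with shift = 4 does nothing since s is already zero.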
The above algorithm's complexity is O(N * (1 shift + 1 add) + log(N) * (1 rem + 2 adds + 3 muls + 2 shifts)).
So you need to figure out whether the extra parallelism and shifting/shuffling overhead are worth it for such a simple task:
for (int inIdx = 0, outIdx = 0; inIdx < inLength; inIdx++) {
    if (mask[inIdx] == 1) {
        out[outIdx] = in[inIdx];
        outIdx++;
    }
}
If you want to go the parallel SIMD route, your best bet is a switch with a case for each possible combination of the next 4 bits of the mask. Why not 8? Because the PSHUFD instruction can only shuffle within a 128-bit XMM register, not a 256-bit YMM register.
So you make 16 cases, one per 4-bit mask value.
Every case then needs only a minimal amount of processing (1 to 2 SIMD instructions plus 1 output-pointer addition). The loop surrounding the switch handles the constant input-pointer addition (by 4) and the MOVDQA to load the input.
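A minimal sketch of this idea in C, assuming SSSE3 is available: instead of spelling out a literal 16-way switch, the 16 cases are encoded as a small lookup table of shuffle controls fed to _mm_shuffle_epi8 (PSHUFB); the table and function names here are illustrative, and the tail of the array is assumed to be handled separately.

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */
#include <stddef.h>
#include <stdint.h>

/* one shuffle control per 4-bit mask value: byte indices of the kept
   lanes packed to the front, 0x80 in the unused bytes (PSHUFB zeroes them) */
static uint8_t shuf[16][16];

static void init_shuffle_table(void) {
    for (int m = 0; m < 16; m++) {
        int k = 0;
        for (int lane = 0; lane < 4; lane++) {
            if (m & (1 << lane)) {
                for (int b = 0; b < 4; b++)
                    shuf[m][4 * k + b] = (uint8_t)(4 * lane + b);
                k++;
            }
        }
        for (int b = 4 * k; b < 16; b++)
            shuf[m][b] = 0x80;
    }
}

/* compacts 4 floats per iteration; returns the number of kept elements.
   'out' needs room for n floats because every store writes a full vector. */
static size_t compact_sse(const float *in, float *out, size_t n) {
    size_t o = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);
        /* 4-bit mask: bit set where the element is > 0 */
        int m = _mm_movemask_ps(_mm_cmpgt_ps(v, _mm_setzero_ps()));
        __m128i ctrl = _mm_loadu_si128((const __m128i *)shuf[m]);
        __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(v), ctrl);
        _mm_storeu_ps(out + o, _mm_castsi128_ps(packed));
        o += (size_t)__builtin_popcount((unsigned)m);   /* GCC/Clang builtin */
    }
    return o;
}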
Reading the comments below the original question: in the actual problem the array contains 32-bit floating point numbers and the mask is (a single?) 32-bit integer, so I don't see why shifts etc. should be used for compacting the array. A simple compacting algorithm (in C) would be something like this:
float array[8];
unsigned int mask = ...;
int a = 0, b = 0;

while (mask) {
    if (mask & 1) { array[a++] = array[b]; }
    b++;
    mask >>= 1;
}

/* Size of compacted array is 'a' */
/* Optionally clear the rest: */
while (a < 8) array[a++] = 0.0;
Minor variations would be due to the bit order of the mask, but the only ALU operations needed are index-variable updates plus shifting and ANDing the mask. Because the original array is at least 256 bits wide, no ordinary CPU can shift the whole array around bit-wise anyway.