Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

efficient bitwise sum calculation

Is there an efficient way to calculate a bitwise sum of uint8_t buffers (assume number of buffers are <= 255, so that we can make the sum uint8)? Basically I want to know how many bits are set at the i'th position of each buffer.

Ex: For 2 buffers

uint8 buf1[k] -> 0011 0001 ...
uint8 buf2[k] -> 0101 1000 ...
uint8 sum[k*8]-> 0 1 1 2 1 0 0 1... 

are there any BLAS or boost routines for such a requirement?

This is a highly vectorizable operation IMO.

UPDATE: Following is a naive impl of the requirement

for (auto&& buf: buffers){
  for (int i = 0; i < buf_len; i++){
    for (int b = 0; b < 8; ++b) {
      sum[i*8 + b] += (buf[i] >> b) & 1;
    }
  }
}
like image 540
n1r44 Avatar asked Oct 07 '21 15:10

n1r44


2 Answers

An alternative to OP's naive code:

Perform 8 additions at once. Use a lookup table to expand the 8 bits to 8 bytes with each bit to a corresponding byte - see ones[].

void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
  static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001, 
      /* 249 more pre-computed constants */ 0x0101010101010101};

  uint64_t sum[k];
  memset(sum, 0, sizeof sum):

  for (size_t buf_index = 0; buf_index < number_of_buf;  buf_index++) {
    for (size_t int i = 0; i < k; i++) {
      sum[i] += ones(buf[buf_index][i]);
    }
  }

  for (size_t int i = 0; i < k; i++) {
    for (size_t bit = 0; bit < 8;  bit++) {
      printf("%llu ", 0xFF & (sum[i] >> (8*bit)));
    }
  }
}

See also @Eric Postpischil.

like image 102
chux - Reinstate Monica Avatar answered Sep 18 '22 12:09

chux - Reinstate Monica


As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.

#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec8x8 __attribute__((vector_size(8)));

void sumit(uint8_t number_of_buf,
           uint8_t k,
           const uint8_t buf[number_of_buf][k],
           vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};

    for (size_t i = 0; i < k; i++) {
        sums[i] = (vec8x8){0};
        for (size_t buf_index = 0; buf_index < number_of_buf;  buf_index++) {
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}

Try it on godbolt.

I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf in column-major order.

Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.

// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
        ld1r    {v0.8b}, [x1]         // replicate buf[buf_index][i] to all elements of v0
        add     x0, x0, 1             
        add     x1, x1, x20
        ushl    v0.8b, v0.8b, v2.8b   // shift
        and     v0.8b, v0.8b, v3.8b   // mask
        add     v1.8b, v1.8b, v0.8b   // accumulate
        cmp     x0, x19
        bne     .L4

I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)

It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.

like image 31
Nate Eldredge Avatar answered Sep 16 '22 12:09

Nate Eldredge