I need a fast way to get the position of all one bits in a 64-bit integer. For example, given `x = 123703`, I'd like to fill an array `idx[] = {0, 1, 2, 4, 5, 8, 9, 13, 14, 15, 16}`. We can assume we know the number of bits a priori. This will be called 10^12 to 10^15 times, so speed is of the essence. The fastest answer I've come up with so far is the following monstrosity, which uses each byte of the 64-bit integer as an index into tables that give the number of bits set in that byte and the positions of the ones:
```c
int64_t x;            // this is the input
unsigned char idx[K]; // this is the array of K bits that are set
unsigned char *dst = idx, *src;
unsigned char zero, one, two, three, four, five; // these hold the 0th-5th bytes
zero  =  x & 0x0000000000FFUL;
one   = (x & 0x00000000FF00UL) >> 8;
two   = (x & 0x000000FF0000UL) >> 16;
three = (x & 0x0000FF000000UL) >> 24;
four  = (x & 0x00FF00000000UL) >> 32;
five  = (x & 0xFF0000000000UL) >> 40;
src = tab0 + tabofs[zero ]; COPY(dst, src, n[zero ]);
src = tab1 + tabofs[one  ]; COPY(dst, src, n[one  ]);
src = tab2 + tabofs[two  ]; COPY(dst, src, n[two  ]);
src = tab3 + tabofs[three]; COPY(dst, src, n[three]);
src = tab4 + tabofs[four ]; COPY(dst, src, n[four ]);
src = tab5 + tabofs[five ]; COPY(dst, src, n[five ]);
```
where `COPY` is a switch statement to copy up to 8 bytes, `n` is an array of the number of bits set in a byte, and `tabofs` gives the offset into `tabX`, which holds the positions of the set bits in the X-th byte. This is about 3x faster than unrolled loop-based methods with `__builtin_ctz()` on my Xeon E5-2609 (see below). I am currently iterating `x` in lexicographical order for a given number of bits set.
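For completeness, here is one way the tables could be precomputed at startup. This is only a sketch with hypothetical layout choices (packed entries, 8 bytes of padding so a blind 8-byte copy never reads past the end); the actual tables are in the full code linked below.

```c
/* Sketch of table construction, mirroring the names used above:
 *   n[b]      -- number of bits set in byte value b
 *   tabofs[b] -- offset of b's entries within each tabX
 *   tab[X][.] -- bit positions for each byte value, pre-shifted by 8*X
 */
unsigned char  n[256];
unsigned short tabofs[256];    /* offsets reach 1024, so wider than 8 bits */
unsigned char  tab[6][1024 + 8]; /* tab[X] plays the role of tabX; padded
                                    so an 8-byte blind copy stays in bounds */

void buildTables(void) {
    unsigned int ofs = 0;
    for(unsigned int b = 0; b < 256; b++) {
        tabofs[b] = ofs;
        n[b] = 0;
        for(unsigned int bit = 0; bit < 8; bit++) {
            if(b & (1u << bit)) {
                for(unsigned int X = 0; X < 6; X++)
                    tab[X][ofs + n[b]] = (unsigned char)(bit + 8 * X);
                n[b]++;
            }
        }
        ofs += n[b];  /* entries for each byte value are packed back to back */
    }
}
```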
Is there a better way?
EDIT: Added an example (that I have subsequently fixed). Full code is available here: http://pastebin.com/79X8XL2P. Note: GCC with -O2 seems to optimize it away, but Intel's compiler (which I used to compose it) doesn't...
Also, let me give some additional background to address some of the comments below. The goal is to perform a statistical test on every possible subset of K variables out of a universe of N possible explanatory variables; the specific target right now is N=41, but I can see some projects needing N up to 45-50. The test basically involves factorizing the corresponding data submatrix. In pseudocode, something like this:
```c
double doTest(double *data, int64_t model) {
    int nidx, idx[];
    double submatrix[][];
    nidx = getIndices(model, idx);  // get the locations of ones in model
    // copy data into submatrix
    for(int i=0; i<nidx; i++) {
        for(int j=0; j<nidx; j++) {
            submatrix[i][j] = data[idx[i]][idx[j]];
        }
    }
    factorize(submatrix, nidx);
    return the_answer;
}
```
I coded up a version of this for an Intel Phi board that should complete the N=41 case in about 15 days, of which ~5-10% of the time is spent in a naive `getIndices()`, so right off the bat a faster version could save a day or more. I'm working on an implementation for NVidia Kepler too, but unfortunately the problem I have (ludicrous numbers of small matrix operations) is not ideally suited to the hardware (ludicrously large matrix operations). That said, this paper presents a solution that seems to achieve hundreds of GFLOPS/s on matrices of my size by aggressively unrolling loops and performing the entire factorization in registers, with the caveat that the dimensions of the matrix be defined at compile-time. (This loop unrolling should help reduce overhead and improve vectorization in the Phi version too, so `getIndices()` will become more important!) So now I'm thinking my kernel should look more like:
```cpp
double *data;  // move data to GPU/Phi once into shared memory
template<unsigned int K>
double doTestUnrolled(int *idx) {
    double submatrix[K][K];
    // copy data into submatrix
    #pragma unroll
    for(int i=0; i<K; i++) {
        #pragma unroll
        for(int j=0; j<K; j++) {
            submatrix[i][j] = data[idx[i]][idx[j]];
        }
    }
    factorizeUnrolled<K>(submatrix);
    return the_answer;
}
```
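Since K is a template parameter, the runtime bit count has to be mapped to a compile-time instantiation somewhere. A minimal sketch of such a dispatcher (hypothetical; not the actual driver code):

```cpp
// Hypothetical dispatch from a runtime bit count K to the
// compile-time-sized kernels above (one case per supported K).
double dispatchTest(int *idx, int K) {
    switch(K) {
    case 1:  return doTestUnrolled<1>(idx);
    case 2:  return doTestUnrolled<2>(idx);
    // ... one case per K up to the maximum ...
    case 41: return doTestUnrolled<41>(idx);
    default: return 0.0;  // unsupported K
    }
}
```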
The Phi version solves each model in a `cilk_for` loop from model = 0 to 2^N (or, rather, a subset for testing), but now in order to batch work for the GPU and amortize the kernel launch overhead I have to iterate model numbers in lexicographical order for each of K=1 to 41 bits set (as doynax noted).
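For reference, stepping to the lexicographically next word with the same number of set bits can be done with the standard bit-twiddling sequence (Gosper's hack). This is a sketch of the technique, not necessarily the iteration code used in the benchmarks here:

```c
#include <stdint.h>

// Gosper's hack: given v != 0 with K bits set, return the next larger
// value with exactly K bits set (standard bit-twiddling; sketch only).
uint64_t next_same_popcount(uint64_t v) {
    uint64_t t = v | (v - 1);   // set all bits below the lowest set bit
    return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctzll(v) + 1));
}
```

Starting from `(1ULL << K) - 1` and applying this C(N,K) - 1 times visits every N-bit word with K bits set, in increasing order.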
EDIT 2: Now that vacation is over, here are some results on my Xeon E5-2602 using icc version 15. The code that I used to benchmark is here: http://pastebin.com/XvrGQUat. I perform the bit extraction on integers that have exactly K bits set, so there is some overhead for the lexicographic iteration, measured in the "Base" column in the table below. These are performed 2^30 times with N=48 (repeating as necessary).

"CTZ" is a loop that uses the gcc intrinsic `__builtin_ctzll` to get the lowest order bit set:
```c
uint64_t tmp = x, lb;  // working copy of the input
for(int i=0; i<K; i++) {
    idx[i] = __builtin_ctzll(tmp);
    lb = tmp & -tmp;   // get lowest bit
    tmp ^= lb;         // remove lowest bit from tmp
}
```
Mark is Mark's branchless for loop:
```c
for(int i=0; i<K; i++) {
    *dst = i;
    dst += x & 1;
    x >>= 1;
}
```
Tab1 is my original table-based code with the following copy macro (note the cases intentionally fall through):
```c
#define COPY(d, s, n) \
    switch(n) { \
    case 8: *(d++) = *(s++); \
    case 7: *(d++) = *(s++); \
    case 6: *(d++) = *(s++); \
    case 5: *(d++) = *(s++); \
    case 4: *(d++) = *(s++); \
    case 3: *(d++) = *(s++); \
    case 2: *(d++) = *(s++); \
    case 1: *(d++) = *(s++); \
    case 0: break; \
    }
```
Tab2 is the same code as Tab1, but the copy macro just moves 8 bytes as a single copy (taking ideas from doynax and Lưu Vĩnh Phúc... but note this does not ensure alignment):
```c
#define COPY2(d, s, n) { *((uint64_t *)d) = *((uint64_t *)s); d += n; }
```
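(Aside: the unaligned `uint64_t` cast is technically undefined behavior. If that is a concern, a memcpy-based variant, sketched below, stays standard-conforming, and current compilers typically reduce the fixed-size memcpy to the same single 8-byte move.)

```c
#include <string.h>

// Same blind 8-byte move as COPY2, but via memcpy to avoid the
// alignment/strict-aliasing issues of the raw pointer cast.
#define COPY3(d, s, n) do { memcpy((d), (s), 8); (d) += (n); } while(0)
```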
Here are the results. I guess my initial claim that Tab1 is 3x faster than CTZ only holds for large K (where I was testing). Mark's loop is faster than my original code, but getting rid of the branch in the `COPY2` macro takes the cake for K > 8.
```
  K     Base      CTZ     Mark     Tab1     Tab2
001    4.97s    6.42s    6.66s   18.23s   12.77s
002    4.95s    8.49s    7.28s   19.50s   12.33s
004    4.95s    9.83s    8.68s   19.74s   11.92s
006    4.95s   16.86s    9.53s   20.48s   11.66s
008    4.95s   19.21s   13.87s   20.77s   11.92s
010    4.95s   21.53s   13.09s   21.02s   11.28s
015    4.95s   32.64s   17.75s   23.30s   10.98s
020    4.99s   42.00s   21.75s   27.15s   10.96s
030    5.00s  100.64s   35.48s   35.84s   11.07s
040    5.01s  131.96s   44.55s   44.51s   11.58s
```
I believe the key to performance here is to focus on the larger problem rather than on micro-optimizing the extraction of bit positions out of a random integer.
Judging by your sample code and previous SO question, you are enumerating all words with K bits set in order, and extracting the bit indices out of these. This greatly simplifies matters.

If so, then instead of rebuilding the bit positions from scratch on each iteration, try directly incrementing the positions in the bit array. Half of the time this will involve a single loop iteration and increment.
Something along these lines:
```c
#include <stdio.h>

void process(const unsigned int *bits, size_t num);

// Walk through all len-bit words with num bits set, in order
void enumerate(size_t num, size_t len) {
    size_t i;
    unsigned int bitpos[64 + 1];

    // Seed with the lowest word plus a sentinel
    for(i = 0; i < num; ++i)
        bitpos[i] = i;
    bitpos[i] = 0;

    // Here goes the main loop
    do {
        // Do something with the resulting data
        process(bitpos, num);

        // Increment the least-significant series of consecutive bits
        for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
            bitpos[i] = i;

        // Stop on reaching the top
    } while(++bitpos[i] != len);
}

// Test function
void process(const unsigned int *bits, size_t num) {
    do
        printf("%u ", bits[--num]);
    while(num);
    putchar('\n');
}
```
Not particularly optimized but you get the general idea.
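For example, a minimal driver for the two functions above prints all C(5,3) = 10 combinations, starting with "2 1 0":

```c
int main(void) {
    enumerate(3, 5);  // all 5-bit words with 3 bits set, lowest word first
    return 0;
}
```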