Let's say you have a uint64_t and care only about the high-order bit of each byte. Like so:
uint64_t: 0000 ... 1000 0000 1000 0000 1000 0000 1000 0000 ---> 0000 1111
Is there a faster way than:
return
(
((x >> 56) & 128)+
((x >> 49) & 64)+
((x >> 42) & 32)+
((x >> 35) & 16)+
((x >> 28) & 8)+
((x >> 21) & 4)+
((x >> 14) & 2)+
((x >> 7) & 1)
)
That is, shifting x, masking, and adding the correct bit for each byte. This compiles to a lot of assembly, and I'm looking for a quicker way. The machine I'm using only supports instructions up to SSE2, and I couldn't find any helpful SIMD ops.
Thanks for the help.
As I mentioned in a comment, pmovmskb does what you want. Here's how you could use it:
MMX + SSE1:
movq mm0, input ; input can be r/m
pmovmskb output, mm0 ; output must be r
SSE2:
movq xmm0, input
pmovmskb output, xmm0
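If you'd rather stay in C, the SSE2 form maps onto intrinsics. A minimal sketch, assuming a 64-bit x86 target with <emmintrin.h> available (the function name is mine):

#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Move the 64-bit value into the low half of an XMM register (movq),
   then let pmovmskb collect the sign bit of each byte. The upper eight
   bytes of the register are zero, so only the low 8 bits of the result
   can be set. */
static inline uint32_t high_bits_sse2(uint64_t x)
{
    __m128i v = _mm_cvtsi64_si128((int64_t)x);  /* movq xmm0, x */
    return (uint32_t)_mm_movemask_epi8(v);      /* pmovmskb eax, xmm0 */
}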
And here's the newer way:
BMI2:
mov rax, 0x8080808080808080
pext output, input, rax ; input must be r
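In C, the BMI2 version corresponds to the _pext_u64 intrinsic. A minimal sketch, assuming <immintrin.h> and a compiler with BMI2 enabled (e.g. -mbmi2); the function name is mine:

#include <stdint.h>
#include <immintrin.h>  /* _pext_u64 (BMI2) */

/* pext gathers the bits selected by the mask (here the top bit of each
   byte) and packs them contiguously into the low bits of the result. */
static inline uint32_t high_bits_bmi2(uint64_t x)
{
    return (uint32_t)_pext_u64(x, 0x8080808080808080ULL);
}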
A multiplication-based alternative also works:
return ((x & 0x8080808080808080) * 0x2040810204081) >> 56;
The & selects the bits you want to keep, the multiplication gathers them all into the most significant byte, and the shift moves them down to the least significant byte. Since multiplication is fast on most modern CPUs, this shouldn't be much slower than using assembly.
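As a sanity check, here is a small self-contained sketch (function names and test values are mine) comparing the multiply trick with the original shift-and-mask version:

#include <stdint.h>
#include <stdio.h>

/* Multiply trick: after masking, the high bit of byte k sits at bit 8*k + 7.
   The constant 0x0002040810204081 has bits set at positions 0, 7, 14, ..., 49,
   so the multiplication shifts each masked bit up into bits 56..63; the final
   >> 56 brings the packed 8-bit result down to the bottom byte. */
static uint32_t high_bits_mul(uint64_t x)
{
    return (uint32_t)(((x & 0x8080808080808080ULL) * 0x0002040810204081ULL) >> 56);
}

/* Original shift-and-mask version, for comparison. */
static uint32_t high_bits_naive(uint64_t x)
{
    return ((x >> 56) & 128) + ((x >> 49) & 64) + ((x >> 42) & 32) +
           ((x >> 35) & 16)  + ((x >> 28) & 8)  + ((x >> 21) & 4) +
           ((x >> 14) & 2)   + ((x >> 7)  & 1);
}

int main(void)
{
    uint64_t tests[] = { 0, 0x8080808080808080ULL, 0x0000000080808080ULL,
                         0xFF00FF00FF00FF00ULL, 0x123456789ABCDEF0ULL };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
        printf("%016llx -> %02x (naive: %02x)\n",
               (unsigned long long)tests[i],
               high_bits_mul(tests[i]), high_bits_naive(tests[i]));
    return 0;
}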