For simplicity, assume I'm using a 32 bit little-endian processor and have declared the following 4-byte buffer:
unsigned char buffer[] = { 0xab, 0xcd, 0xef, 0x46 };
Let's say my goal is to bit-wise left shift each byte in the buffer by 4 bits. That is, I want to transform the buffer values to be:
{ 0xbc, 0xde, 0xf4, 0x60 }
To perform such a transformation, one might write code like:
for (int i = 0; i < 3; ++i)
{
    buffer[i] <<= 4;
    buffer[i] |= (buffer[i + 1] >> 4);
}
buffer[3] <<= 4;
While this works, I'd much rather shift all 4 bytes simultaneously using my processor's native 32-bit registers:
unsigned char buffer[] = { 0xab, 0xcd, 0xef, 0x46 };
unsigned int *p = (unsigned int*)buffer; // unsigned int is 32 bit on my platform
*p <<= 4;
The above snippet successfully performs a shift, but not in the way I'm looking for. Since I'm casting buffer to an unsigned int, the register is loaded (little-endian) with the value 0x46efcdab instead of 0xabcdef46. Consequently, the 4-bit left shift leaves the buffer holding the byte sequence 0xb0dafc6e instead of 0xbcdef460.
Aside from swapping bytes before the shift (e.g. htonl et al.), are there any tricks for efficiently shifting bytes in the manner I am seeking?
Thank you in advance for your insights.
Use htonl/ntohl to flip between network (big-endian) byte order and native byte order:
uint32_t *p = (uint32_t*)buffer;
*p = htonl(ntohl(*p) << 4);
In effect, this loads the buffer contents as an integer in big-endian order, performs the shift, then writes it back in big-endian order.
This compiles into a couple of bswap instructions on x86, so it should be reasonably efficient (gcc -O3).
Here's some test code (buffer is global to avoid constant-folding, and the return prevents dead-code elimination):
#include <stdint.h>     // uint32_t
#include <arpa/inet.h>  // ntohl, htonl

unsigned char buffer[] = { 0xab, 0xcd, 0xef, 0x46 };

int main() {
    uint32_t *p = (uint32_t*)buffer;
    *p = htonl(ntohl(*p) << 4);
    return *p;
}
This compiles into the following fairly simple machine code (x86-64; LLVM 7.0.2; cc -O2):
0000000000000000 pushq %rbp ; frame setup
0000000000000001 movq %rsp, %rbp ; frame setup
0000000000000004 movl (%rip), %eax ; load buffer
000000000000000a bswapl %eax ; endian flip
000000000000000c shll $0x4, %eax ; shift
000000000000000f bswapl %eax ; endian flip
0000000000000011 movl %eax, (%rip) ; save buffer
0000000000000017 popq %rbp ; finish
0000000000000018 retq
Just for comparison, you can do this without the use of htonl/ntohl. This assumes a little-endian CPU:
#include <stdint.h>

void lshift(unsigned char* buf) {
    uint32_t* p = (uint32_t*)buf;
    uint32_t lo = *p & 0x0F0F0F0F;
    uint32_t hi = *p & 0xF0F0F000;
    *p = (lo << 4) | (hi >> 12);
}
And the generated assembly with gcc -O3:
pushq %rbp
movq %rsp, %rbp
movl (%rdi), %eax
movl %eax, %ecx
shll $4, %ecx
andl $-252645136, %ecx ## imm = 0xFFFFFFFFF0F0F0F0
shrl $12, %eax
andl $986895, %eax ## imm = 0xF0F0F
orl %ecx, %eax
movl %eax, (%rdi)
popq %rbp
retq
Depending on how many cycles bswapl costs on your CPU, this is likely the faster alternative.