In my application, 20% of CPU time is spent reading bits (in skip()) through my bit reader. Does anyone have an idea how to make the following code faster? At any given time I never need more than 20 valid bits (which is why, in some situations, I can use fast_skip()).
Bits are read in big-endian order, which is why the byte swap is needed.
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;intrin.h&gt; // _byteswap_uint64 (MSVC; target is x64)

class bit_reader
{
    std::uint32_t* m_data;  // read cursor, advanced 32 bits at a time
    std::size_t    m_pos;   // bit offset into the current 64-bit block
    std::uint64_t  m_block; // byte-swapped 64-bit window over m_data

public:
    bit_reader(void* data)
        : m_data(reinterpret_cast<std::uint32_t*>(data))
        , m_pos(0)
        , m_block(_byteswap_uint64(*reinterpret_cast<std::uint64_t*>(data)))
    {
    }

    std::uint64_t value(std::size_t n_bits = 64)
    {
        assert(m_pos + n_bits < 64);
        return (m_block << m_pos) >> (64 - n_bits);
    }

    void skip(std::size_t n_bits) // 20% of CPU time
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
        if (m_pos > 31)
        {
            m_block = _byteswap_uint64(reinterpret_cast<std::uint64_t*>(++m_data)[0]);
            m_pos -= 32;
        }
    }

    void fast_skip(std::size_t n_bits)
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
    }
};
Target hardware is x64.
I see from an earlier comment that you are unpacking Huffman/arithmetic-coded streams in JPEG.
skip() and value() are simple enough to be inlined, and there is a good chance the compiler will keep the shift register and buffer pointer in registers the whole time. Marking all the pointers here and in the caller with the restrict qualifier might help further, by telling the compiler that you won't be writing the results of Huffman decoding into the bit buffer, allowing additional optimisation.
The reason skip() is a hot path is that you are calling it so often. You are consuming an entire symbol at once rather than one bit at a time, aren't you? There are some clever tricks you can do by counting the leading 0s or 1s in a symbol and doing a table lookup.
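On the restrict point: standard C++ has no restrict keyword, but MSVC, GCC and Clang all accept __restrict. A minimal sketch of the idea, assuming a hypothetical decode loop (emit_symbols, its arguments, and the fixed 8-bit symbol width are illustrative, not from the question):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical decoder output loop. __restrict promises the compiler that
// `out` never aliases `bits`, so no store through `out` can invalidate a
// cached read of the bit buffer; the compiler is then free to keep the
// current block in a register across iterations.
void emit_symbols(const std::uint64_t* __restrict bits,
                  std::uint8_t* __restrict out,
                  std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = static_cast<std::uint8_t>(bits[i] >> 56); // top 8 bits of each word
}
```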
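One shape the leading-bit trick can take (the helper below and the table layout in the comment are hypothetical, not JPEG's actual tables): peek a fixed window with value(), count the run of leading one-bits, and use that count to index per-run tables, so the whole symbol is consumed with a single skip():

```cpp
#include <cstdint>

// Count the leading one-bits in a 16-bit window. This is a portable
// fallback loop; on x64 the same count can be obtained in one or two
// instructions via _BitScanReverse / __builtin_clz on the inverted window.
inline int leading_ones16(std::uint16_t window)
{
    int n = 0;
    while (n < 16 && (window & (0x8000u >> n)))
        ++n;
    return n;
}

// Hypothetical use: tables indexed by the run length k give the number of
// bits to consume and the symbol to emit, e.g.
//   int k = leading_ones16(static_cast<std::uint16_t>(reader.value(16)));
//   reader.skip(code_length[k]);
```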