I was thinking about solving this, but it's looking to be quite a task. If I take this one by myself, I'll likely write it several different ways and pick the best, so I thought I'd ask this question to see if there's a good library that solves this already or if anyone has thoughts/advice.
void OffsetMemCpy(u8* pDest, u8* pSrc, u8 srcBitOffset, size size)
{
// Or something along these lines. srcBitOffset is 0-7, so the pSrc buffer
// needs to be up to one byte longer than it would need to be in memcpy.
// Maybe explicitly providing the end of the buffer is best.
// Also note that pSrc has NO alignment assumptions at all.
}
My application is time critical so I want to nail this with minimal overhead. This is the source of the difficulty/complexity. In my case, the blocks are likely to be quite small, perhaps 4-12 bytes, so big-scale memcpy stuff (e.g. prefetch) isn't that important. The best result would be the one that benches fastest for constant 'size' input, between 4 and 12, for randomly unaligned src buffers.
Anyone have, or know of, a similar implemented thing? Or does anyone want to take a stab at writing this, getting it to be as clean and efficient as possible?
Edit: It seems people are voting this "close" for "too broad". A few narrowing details would be AMD64 is the preferred architecture, so lets assume that. This means little endian etc. The implementation would hopefully fit well within the size of an answer so I don't think this is too broad. I'm asking for answers that are a single implementation at a time, even though there are a few approaches.
Memcpy copies data bytes by byte from the source array to the destination array. This copying of data is threadsafe.
memmove() is similar to memcpy() as it also copies data from a source to destination.
The memcpy() function in C++ copies specified bytes of data from the source to the destination. It is defined in the cstring header file.
memcpy() itself doesn't do any memory allocations. You delete what you new , and delete[] what you new[] . You do neither new nor new[] . Both source and destination arrays are allocated on the stack and will be automatically deallocated when then go out of scope.
I would start with a simple implementation such as this:
inline void OffsetMemCpy(uint8_t* pDest, const uint8_t* pSrc, const uint8_t srcBitOffset, const size_t size)
{
if (srcBitOffset == 0)
{
for (size_t i = 0; i < size; ++i)
{
pDest[i] = pSrc[i];
}
}
else if (size > 0)
{
uint8_t v0 = pSrc[0];
for (size_t i = 0; i < size; ++i)
{
uint8_t v1 = pSrc[i + 1];
pDest[i] = (v0 << srcBitOffset) | (v1 >> (CHAR_BIT - srcBitOffset));
v0 = v1;
}
}
}
(warning: untested code!).
Once this is working then profile it in your application - you may find it's plenty fast enough for your needs and thereby avoid the pitfalls of premature optimisation. If not then you have a useful baseline reference implementation for further optimisation work.
Be aware that for small copies the overhead of testing for alignment and word-sized copies etc may well outweigh any benefits, so a simple byte by byte loop such as the above may well be close to optimal.
Note also that optimisations may well be architecture-dependent - micro-optimisations which give a benefit on one CPU may well be counter-productive on another.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With