'memcpy'-like function that supports offsets by individual bits?

Tags: c++, c, optimization, bit-manipulation, memcpy

I have been thinking about solving this, but it looks like quite a task. If I take it on myself, I'll likely write it several different ways and pick the best, so I'm asking first whether a good library already solves this, or whether anyone has thoughts or advice.

void OffsetMemCpy(u8* pDest, u8* pSrc, u8 srcBitOffset, size_t size)
{
    // Or something along these lines. srcBitOffset is 0-7, so the pSrc buffer 
    // needs to be up to one byte longer than it would need to be in memcpy.
    // Maybe explicitly providing the end of the buffer is best.
    // Also note that pSrc has NO alignment assumptions at all.
}

My application is time critical so I want to nail this with minimal overhead. This is the source of the difficulty/complexity. In my case, the blocks are likely to be quite small, perhaps 4-12 bytes, so big-scale memcpy stuff (e.g. prefetch) isn't that important. The best result would be the one that benches fastest for constant 'size' input, between 4 and 12, for randomly unaligned src buffers.

Memory should be moved in word-sized blocks whenever possible.
Alignment of these word-sized blocks is important. pSrc is unaligned, so we may need to read a few bytes off the front until it is aligned.

Anyone have, or know of, a similar implemented thing? Or does anyone want to take a stab at writing this, getting it to be as clean and efficient as possible?

Edit: It seems people are voting to close this as "too broad". To narrow it down: AMD64 is the preferred architecture, so let's assume that, which means little-endian and so on. An implementation should fit comfortably within a single answer, so I don't think the question is too broad. I'm asking for one implementation per answer, even though there are several possible approaches.

asked Aug 17 '15 06:08

VoidStar

1 Answer

I would start with a simple implementation such as this:

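#include <limits.h>   // CHAR_BIT
#include <stddef.h>   // size_t
#include <stdint.h>   // uint8_t

// Copies size bytes to pDest, starting srcBitOffset (0-7) bits into pSrc.
// Bits are counted from the most significant bit of each byte.
// For a non-zero offset, pSrc[size] is read as well.
inline void OffsetMemCpy(uint8_t* pDest, const uint8_t* pSrc, const uint8_t srcBitOffset, const size_t size)
{
    if (srcBitOffset == 0)
    {
        // Byte-aligned case: a plain copy.
        for (size_t i = 0; i < size; ++i)
        {
            pDest[i] = pSrc[i];
        }
    }
    else if (size > 0)
    {
        uint8_t v0 = pSrc[0];
        for (size_t i = 0; i < size; ++i)
        {
            uint8_t v1 = pSrc[i + 1];
            // The low bits of v0 become the high bits of the output byte;
            // the top srcBitOffset bits of v1 fill the rest.
            pDest[i] = (uint8_t)((v0 << srcBitOffset) | (v1 >> (CHAR_BIT - srcBitOffset)));
            v0 = v1;
        }
    }
}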

(warning: untested code!).
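Since the code above is untested, a deliberately slow bit-by-bit version can serve as a cross-check. The following is only a sketch: `OffsetMemCpyRef` is a hypothetical name, and it assumes bits are numbered from the most significant bit of each byte, which is what the shifts above imply.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: copies size * CHAR_BIT bits, starting srcBitOffset
   bits into pSrc, one bit at a time. Bit 0 of a byte is taken to be its most
   significant bit (an assumption matching the byte loop in the answer). */
void OffsetMemCpyRef(uint8_t* pDest, const uint8_t* pSrc,
                     uint8_t srcBitOffset, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        uint8_t out = 0;
        for (unsigned b = 0; b < CHAR_BIT; ++b) {
            /* Absolute index of the source bit feeding output bit b. */
            size_t bit = i * CHAR_BIT + b + srcBitOffset;
            unsigned v = (pSrc[bit / CHAR_BIT]
                          >> (CHAR_BIT - 1 - bit % CHAR_BIT)) & 1u;
            out = (uint8_t)((out << 1) | v);
        }
        pDest[i] = out;
    }
}
```

Running both functions over random inputs for every offset from 0 to 7 and comparing the outputs should catch shift-direction or off-by-one mistakes before any optimisation starts.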

Once this is working then profile it in your application - you may find it's plenty fast enough for your needs and thereby avoid the pitfalls of premature optimisation. If not then you have a useful baseline reference implementation for further optimisation work.

Be aware that for small copies the overhead of testing for alignment and word-sized copies etc may well outweigh any benefits, so a simple byte by byte loop such as the above may well be close to optimal.
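If profiling suggests trying a word-sized variant anyway, one candidate to benchmark against the byte loop gathers up to eight source bytes into a 64-bit word, shifts once, and writes out up to seven bytes per step. This is a sketch under assumptions, not a measured result: `OffsetMemCpyWord` is a hypothetical name, bits are counted from the most significant bit as in the loop above, and it may well turn out slower for 4-12 byte copies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word-at-a-time variant (unprofiled sketch).
   srcBitOffset must be 0-7. Like the byte loop, it reads pSrc[size]
   only when srcBitOffset is non-zero. */
void OffsetMemCpyWord(uint8_t* pDest, const uint8_t* pSrc,
                      uint8_t srcBitOffset, size_t size)
{
    while (size > 0) {
        /* 7 output bytes plus up to 7 skipped bits fit in 64 bits. */
        size_t n = size < 7 ? size : 7;
        size_t in = n + (srcBitOffset != 0 ? 1 : 0);  /* source bytes needed */

        /* Assemble the source bytes big-endian, first byte on top. */
        uint64_t w = 0;
        for (size_t k = 0; k < in; ++k)
            w |= (uint64_t)pSrc[k] << (56 - 8 * k);

        /* One shift discards the skipped leading bits. */
        w <<= srcBitOffset;

        /* Write the top n bytes of the shifted word. */
        for (size_t k = 0; k < n; ++k)
            pDest[k] = (uint8_t)(w >> (56 - 8 * k));

        pSrc += n;
        pDest += n;
        size -= n;
    }
}
```

The byte-wise assembly keeps the sketch independent of host endianness and avoids reading past the bytes the byte loop would touch. On AMD64, with a source buffer known to have enough padding, the assembly loop could be replaced by a single unaligned 8-byte load followed by a byte swap, which is the part worth measuring.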

Note also that optimisations may well be architecture-dependent - micro-optimisations which give a benefit on one CPU may well be counter-productive on another.

answered Nov 15 '22 12:11

Paul R
