I have a column vector A which is 10 elements long. I have a matrix B which is 10 by 10. The memory storage for B is column major. I would like to overwrite the first row in B with the column vector A.
Clearly, I can do:
for ( int i = 0; i < 10; i++ )
{
    B[0 + 10 * i] = A[i];
}
where I've left the zero in 0 + 10 * i to highlight that B uses column-major storage (zero is the row-index).
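As a rough sketch, the loop above can be generalized into a small strided-copy helper. The names strided_copy and set_row_column_major are hypothetical, made up for illustration and not part of any standard library. The sketch assumes B is a square n-by-n matrix stored column-major in one flat array.

```cpp
#include <cstddef>

// Hypothetical helper (not a standard function): copy `count` elements,
// advancing dst by dst_stride elements and src by src_stride elements.
template <typename T>
void strided_copy(T* dst, std::size_t dst_stride,
                  const T* src, std::size_t src_stride,
                  std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        dst[i * dst_stride] = src[i * src_stride];
    }
}

// Hypothetical wrapper: overwrite row `row` of an n-by-n column-major
// matrix B with the n-element vector A.
template <typename T>
void set_row_column_major(T* B, std::size_t n, std::size_t row, const T* A)
{
    // Element (row, col) lives at B[row + n * col], so consecutive
    // entries of one row are n elements apart in memory.
    strided_copy(B + row, n, A, 1, n);
}
```

With n = 10 and row = 0 this performs the same writes as the loop in the question: B[0], B[10], B[20], and so on.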
After some shenanigans in CUDA-land tonight, I wondered whether there is a CPU function that performs a strided memcpy. I guess that, at a low level, performance would depend on the existence of a strided load/store instruction, which I don't recall there being in x86 assembly.
memmove() is similar to memcpy() in that it also copies data from a source to a destination; unlike memcpy(), it is well defined when the two regions overlap.
The memcpy() function in C++ copies a specified number of bytes from the source to the destination. It is declared in the cstring header file.
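A minimal sketch of the two calls, assuming plain int buffers. The wrapper names copy_disjoint and shift_right_by_one are hypothetical and exist only for this example; std::memcpy and std::memmove are the standard functions from cstring.

```cpp
#include <cstddef>
#include <cstring>

// Copy n ints between two buffers that do not overlap.
// memcpy requires non-overlapping source and destination.
void copy_disjoint(int* dst, const int* src, std::size_t n)
{
    std::memcpy(dst, src, n * sizeof(int));
}

// Shift the first n ints of buf one position to the right.
// The source [buf, buf + n) and destination [buf + 1, buf + n + 1)
// overlap, so memmove is used. buf must hold at least n + 1 ints.
void shift_right_by_one(int* buf, std::size_t n)
{
    std::memmove(buf + 1, buf, n * sizeof(int));
}
```

Both functions take a byte count, which is why the element count is multiplied by sizeof(int).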
memcpy is only faster if both buffers, src and dst, are 4-byte aligned. If so, memcpy() can copy a 32-bit word at a time (inside its own loop over the length). If even one buffer is not 32-bit word aligned, memcpy() incurs overhead to detect that and ends up running a single-char copy loop. (This describes a simple word-at-a-time implementation; the details vary by library and platform.)
memcpy() itself doesn't do any memory allocations. You delete what you new, and delete[] what you new[]. You do neither new nor new[]. Both source and destination arrays are allocated on the stack and will be automatically deallocated when they go out of scope.
Short answer: The code you have written is as fast as it's going to get.
Long answer: The memcpy function is written using some complicated intrinsics or assembly because it operates on memory operands that have arbitrary size and alignment. If you are overwriting a column of a matrix, then your operands will have natural alignment, and you won't need to resort to the same tricks to get decent speed.
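For contrast with the strided row write in the question, here is a sketch of the contiguous case. In column-major storage, the entries of one column are adjacent in memory, so a single memcpy covers them. The function name set_column_column_major is hypothetical, and the sketch assumes an n-by-n matrix of double in one flat array.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: overwrite column `col` of an n-by-n column-major
// matrix B with the n-element vector A.
void set_column_column_major(double* B, std::size_t n, std::size_t col,
                             const double* A)
{
    // Column `col` occupies the contiguous range B[n * col] .. B[n * col + n - 1],
    // so one block copy is enough. A must not overlap that range.
    std::memcpy(B + n * col, A, n * sizeof(double));
}
```

A row of the same matrix has no such contiguous range, which is why the question's element-by-element loop is the natural way to write it.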