I'm trying to optimize my code to take advantage of multicore processors, both to copy and to manipulate large dense arrays.
For copying: I have a large dense array (approximately 6000x100000) from which I need to pull 15x100000 subarrays to do several computations down the pipe. The pipe consists of a lot of linear algebra functions that are being handled by BLAS, which is multithreaded. Whether or not the time to pull data will really matter compared to the linear algebra is an open question, but I'd like to err on the side of caution and make sure the data copying is optimized.
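One detail worth noting: if the 6000x100000 array is stored row-major (as doubles, say) and the 15 rows you need are consecutive, the whole 15x100000 block is itself contiguous, so it can be pulled with a single memcpy rather than a loop. A minimal sketch under those assumptions (the function name and fixed column count are illustrative):

```c
#include <string.h>
#include <stdlib.h>

#define NCOLS 100000

/* Copy `nrows` consecutive full rows, starting at `first_row`, out of a
 * row-major matrix. Because full rows are contiguous in row-major
 * storage, one memcpy covers the whole block -- no per-row loop. */
void pull_rows(const double *matrix, double *block,
               size_t first_row, size_t nrows)
{
    memcpy(block, matrix + first_row * (size_t)NCOLS,
           nrows * (size_t)NCOLS * sizeof(double));
}
```

If instead you needed a subset of columns, each row would have to be copied separately and the per-call overhead would start to matter.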
For manipulating: I have many different functions that manipulate arrays element-by-element or row-by-row. It would be best if each of these ran multicore.
My question is: is it best to use the right framework (OpenMP, OpenCL) and let the compiler work its magic, or are there good functions/libraries that do this faster?
Your starting point should be good old memcpy. Some tips from someone who has for a long time been obsessed by "copying performance":

1. Measure your current memcpy performance, e.g. the memcpy_bench function here.
2. Measure what happens to memcpy when it's run on multiple cores, e.g. the multi_memcpy_bench here. (Unless you're on some multi-socket NUMA HW, I think you won't see much benefit to multithreaded copying.)
3. The days of memcpy being implemented as a simple rep movsd are long gone; last time I looked, gcc's and the Intel compiler's CRTs both varied their strategy depending on the size of the copy relative to the CPU's cache size.
4. On Intel, look into non-temporal store instructions (e.g. movntps), as these can achieve significant throughput improvements vs. a conventional approach (you'll see these used in 1.)

But my expectation is that your copies will be pretty minor overhead compared with any linalg heavy lifting. It's good to be aware of what the numbers are, though. I wouldn't expect OpenCL or whatever for CPU to magically offer any improvements here (unless your system's memcpy is poorly implemented); IMHO it's better to dig into this stuff in more detail, getting down to the basics of what's actually happening at the level of instructions, registers, cache lines and pages, than to move away from that by layering another level of abstraction on top.
Of course if you're considering porting your code from whatever multicore BLAS library you're using currently to a GPU accelerated linear algebra version, this becomes a completely different (and much more complicated) question (see JayC's comment below). If you want substantial performance gains you should certainly be considering it though.