What's the fastest way to copy and manipulate large, dense 2D arrays in C++?

I'm trying to optimize my code, taking advantage of multicore processors, to both copy and manipulate large, dense arrays.

For copying: I have a large dense array (approximately 6000x100000) from which I need to pull 15x100000 subarrays to do several computations down the pipe. The pipe consists of a lot of linear algebra functions that are being handled by blas, which is multicore. Whether or not the time to pull data will really matter compared to the linear algebra is an open question, but I'd like to err on the side of caution and make sure the data copying is optimized.
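If the big array is stored row-major and the 15 rows are consecutive, the whole subarray is one contiguous block, so a single memcpy pulls it out. A minimal sketch (copy_rows is a hypothetical helper; the row-major std::vector layout is an assumption):

```cpp
#include <cstring>
#include <vector>

// Copy `nrows` consecutive rows out of a row-major array with `cols` columns.
// Because consecutive rows are contiguous in memory, one memcpy moves the
// whole 15 x 100000 block; non-consecutive rows would need one memcpy per row.
std::vector<double> copy_rows(const std::vector<double>& big,
                              std::size_t cols,
                              std::size_t first_row,
                              std::size_t nrows) {
    std::vector<double> sub(nrows * cols);
    std::memcpy(sub.data(), big.data() + first_row * cols,
                nrows * cols * sizeof(double));
    return sub;
}
```

If instead you need 15 *columns* (or scattered rows), the copies become strided and you lose the single-memcpy fast path, which is worth measuring separately.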

For manipulating: I have many different functions that manipulate arrays element-wise or row-wise. It would be best if each of these ran multicore as well.
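For row-wise manipulation, an OpenMP parallel loop over rows is the usual low-effort approach. A sketch under assumed names (scale_rows is hypothetical; compile with -fopenmp, and without it the pragma is simply ignored):

```cpp
#include <cstddef>

// Scale every element of a row-major rows x cols array by k, with the
// outer (row) loop split across cores by OpenMP. Rows are independent,
// so there are no data races to worry about.
void scale_rows(double* a, std::size_t rows, std::size_t cols, double k) {
    #pragma omp parallel for
    for (std::ptrdiff_t r = 0; r < static_cast<std::ptrdiff_t>(rows); ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            a[r * cols + c] *= k;
        }
    }
}
```

Note that for memory-bound operations like this, adding threads often saturates bandwidth long before it saturates cores, so measure before assuming a speedup.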

My question is: is it best to pick the right framework (OpenMP, OpenCL) and let the compiler work its magic, or are there good functions/libraries that do this faster?

asked Dec 23 '12 by Deverp

1 Answer

Your starting point should be good old memcpy. Some tips from someone who has long been obsessed with "copying performance":

  1. Read What Every Programmer Should Know About Memory.
  2. Benchmark your systems memcpy performance e.g memcpy_bench function here.
  3. Benchmark the scalability of memcpy when it's run on multiple cores e.g multi_memcpy_bench here. (Unless you're on some multi-socket NUMA HW, I think you won't see much benefit to multithreaded copying).
  4. Dig into your system's implementation of memcpy and understand them. The days you'd find most of the time spent in a solitary rep movsd are long gone; last time I looked at gcc and Intel compiler's CRTs they both varied their strategy depending on the size of the copy relative to the CPU's cache size.
  5. On Intel, understand the advantages of the non cache-polluting store instructions (e.g movntps) as these can achieve significant throughput improvements vs. a conventional approach (you'll see these used in 4.)
  6. Have access to and know how to use a sampling profiler to identify how much of your apps' time is spent in copying operations. There are also more advanced tools which can look at CPU performance counters and tell you all sorts of things about what the various caches are doing etc.
  7. (Advanced topic) Be aware of the TLB and when huge pages can help.
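The benchmarking in tips 2 and 3 doesn't need anything fancy; a rough throughput measurement like the following is enough to establish a baseline (memcpy_gbps is a hypothetical stand-in for the memcpy_bench linked above, not its actual code):

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Measure rough memcpy throughput in GB/s for a buffer of `bytes` bytes,
// averaged over `iters` repetitions. Results vary with buffer size because
// the copy strategy and cache behaviour change with size (see tip 4).
double memcpy_gbps(std::size_t bytes, int iters) {
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        std::memcpy(dst.data(), src.data(), bytes);
    }
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(bytes) * iters / secs / 1e9;
}
```

Run it across a range of sizes (well inside L1, around L2/L3 size, and much larger than last-level cache) to see where your system's memcpy changes behaviour.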

But my expectation is that your copies will be pretty minor overhead compared with any linalg heavy lifting. It's good to be aware of what the numbers are though. I wouldn't expect OpenCL or whatever for CPU to magically offer any improvements here (unless your system's memcpy is poorly implemented); IMHO it's better to dig into this stuff in more detail, getting down to the basics of what's actually happening at the level of instructions, registers, cache lines and pages, than it is to move away from that by layering another level of abstraction on top.
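For the non-temporal stores mentioned in tip 5, a minimal SSE sketch looks like this (assumes an x86 target, 16-byte-aligned pointers, and a count that's a multiple of 4; stream_copy is a hypothetical name):

```cpp
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>

// Copy floats using non-temporal (cache-bypassing) stores, so the
// destination data doesn't evict useful lines from the cache.
// Preconditions: dst and src are 16-byte aligned, n is a multiple of 4.
void stream_copy(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    }
    _mm_sfence();  // ensure the streaming stores are globally visible
}
```

This only pays off for large copies whose destination you won't read again soon; for data you're about to feed into BLAS, the conventional cached path may well be faster.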

Of course if you're considering porting your code from whatever multicore BLAS library you're using currently to a GPU accelerated linear algebra version, this becomes a completely different (and much more complicated) question (see JayC's comment below). If you want substantial performance gains you should certainly be considering it though.

answered Sep 18 '22 by timday