Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++ Classes for High Performance Computing

Tags:

c++

hpc

According to this Quora forum,

One of the simplest rules of thumb is to remember that hardware loves arrays, and is highly optimized for iteration over arrays. A simple optimization for many problems is just to stop using fancy data structures and just use plain arrays (or std::vectors in C++). This can take some getting used to.

Are C++ classes one of those "fancy data structures," i.e. a kind of data type that can be replaced by arrays to achieve a higher performance in a C++ program?

like image 849
NullPointerException Avatar asked Dec 14 '22 08:12

NullPointerException


1 Answers

If your class looks like this:

struct Person {
  double age;
  double income;
  size_t location;
};

then you might benefit from rearranging to

std::vector<double> ages;
std::vector<double> incomes;
std::vector<size_t> locations;

But it depends on your access patterns. If you frequently access multiple elements of a person at a time, then having the elements blocked together makes sense.

If your class looks like this:

struct Population {
  std::vector<double> many_ages;
  std::vector<double> many_incomes;
  std::vector<size_t> many_locations;
};

Then you're using the form your resource recommended. Using any one of these arrays individually is faster than using the first class, but using elements from all three arrays simultaneously is probably slower with the second class.

Ultimately, you should structure your code to be as clean and intuitive as a possible. The biggest source of speed will be a strong understanding and appropriate use of algorithms, not the memory layout. I recommend disregarding this unless you already have strong HPC skills and need to squeeze maximum performance from your machine. In almost every other case your development time and sanity is worth far more than saving a few clock cycles.

More broadly

  1. An interesting paper related to this is SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems. A lot of work has gone into mapping ML algorithms to GPUs and, for ML applications, getting the memory layout right does make a real difference since so much time is spent on training and GPUs are optimized specifically for contiguous-array processing. But, the authors of the paper contend that even here if you understand algorithms well you can beat specialized hardware with optimized memory layouts, and they demonstrate this by getting their CPU to train 3.5x faster than their GPU.

  2. More broadly, your question deals with the idea of cache misses. Since a cache miss is 200x more expensive than an L1 reference (link), if your data layout is optimized to your computation, then you can really save time. However, as the above suggests, it is rarely the case that simply rearranging your data magically makes everything faster. Consider matrix multiplication. It's the perfect example because the data is laid out in a single array, as requested by your resource. However, for a simple triple-loop matmult GEMM implementation there are still 6 ways to arrange your loops. Some of these ways are much more efficient than others, but none of them give you anywhere near peak performance. Read through this step-by-step explanation of matmult to get a better sense of all the algorithmic optimizations necessary to get good performance.

What the above should demonstrate is that even for situations in which we have only a few arrays laid out exactly as your resource suggests, the layout alone doesn't give us the speed. Good algorithms do. Data layout considerations, if any, flow from the algorithms we choose and higher-level hardware constraints.

If this is so for simple arrays and operations like matrix multiplication, by extension you should also expect it to be so for "fancy data structures" as well.

like image 189
Richard Avatar answered Jan 03 '23 05:01

Richard