I've been profiling a bottleneck in my code (the function shown below) that gets called several million times, and I could use tips on increasing its performance. The per-line timings (in seconds) were taken from Sleepy. Compiled with Visual Studio 2013, /O2 and other typical release settings. indices typically holds 0 to 20 values, and the other parameters are the same size (b.size() == indices.size() == temps.size() == temps[k].size()).
            double Object::gradient(const size_t j,
                                    const std::vector<double>& b,
                                    const std::vector<size_t>& indices,
                                    const std::vector<std::vector<double>>& temps) const
 23.27s     {
                double sum = 0;
192.16s         for (size_t k : indices)
 32.05s             if (k != j)
219.53s                 sum += temps[k][j]*b[k];

320.21s         return boost::math::isfinite(sum) ? sum : 0;
 22.86s     }
Any ideas?
Thanks for the tips, everyone. Here are the results I got from the suggestions. I found it interesting that switching to cbegin() and cend() had such a large impact; I guess the compiler isn't being as smart there as it could be. I'm happy with the bump, but I'm still curious whether there's more room through unrolling or vectorization.
For those interested, here is my benchmark for isfinite(x):
boost::isfinite(x):
------------------------
SPEED: 761.164 per ms
TIME:  0.001314 ms
  +/-  0.000023 ms

std::isfinite(x):
------------------------
SPEED: 266.835 per ms
TIME:  0.003748 ms
  +/-  0.000065 ms
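The harness is nothing fancy; below is a simplified sketch of that kind of timing loop (bench, N and the lambdas are placeholders rather than the exact code I ran, and the real run repeats the measurement to get the +/- figures):

#include <boost/math/special_functions/fpclassify.hpp>
#include <chrono>
#include <cmath>
#include <cstdio>

// Simplified sketch of the timing harness: call the function under test N
// times on a value the optimizer cannot fold away, then report calls per
// millisecond and time per call.
template <typename F>
void bench(const char* name, F f)
{
    volatile double x = 1.0;     // volatile so the call isn't optimized away
    volatile bool sink = false;  // consume the result
    const int N = 1000000;

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        sink = f(x);
    const auto t1 = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%s\n  SPEED: %.3f per ms\n  TIME:  %.6f ms\n", name, N / ms, ms / N);
}

int main()
{
    bench("boost::isfinite(x)", [](double v) { return boost::math::isfinite(v); });
    bench("std::isfinite(x)",   [](double v) { return std::isfinite(v); });
}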
If you know that the conditional will always be triggered on some iteration (that is, j always appears in indices, so exactly one iteration has k == j), you can eliminate the conditional from the loop and replace the return condition with a simple conditional store.
double sum = -(temps[j][j]*b[j]);  // cancels the k == j term added by the loop below
for (size_t k : indices)
    sum += temps[k][j]*b[k];
if (!std::isfinite(sum))           // needs <cmath>
    sum = 0.0;
return sum;
Range-based for is still new enough to not always get great optimization. You may also want to try:
const auto end = cend(indices);
for (auto it = cbegin(indices); it != end; ++it) {
    sum += temps[*it][j]*b[*it];
}
and see if the perf varies.
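If you want to experiment with the unrolling/vectorization angle, one more variant worth timing is a plain counted loop, which some optimizers handle better than either the range-based or the iterator form. A sketch, using the same variables as the original function:

// Sketch: same variables as the original function (sum, indices, temps, b, j).
// A plain counted loop gives the optimizer the most freedom to unroll.
const size_t n = indices.size();
for (size_t i = 0; i < n; ++i) {
    const size_t k = indices[i];
    if (k != j)
        sum += temps[k][j] * b[k];
}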
There are two points that stick out:
(a) boost::math::isfinite is taking relatively long. If possible, try to establish by other means that your results are within the valid range.
(b) Storing a 2D array as a nested vector is not the fastest layout. It would probably be faster to make temps a single 1D array and pass the row size as a parameter. You then have to do the index calculation for an element yourself, but since one of the indices (j) is constant over the loop, you can compute the start element's index before the loop and only adjust it cheaply inside the loop. That might get you a significant improvement.
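For illustration, a sketch of what (b) might look like; the names temps_flat and row_size are placeholders of mine, assuming a row-major layout so that the old temps[k][j] becomes temps_flat[k*row_size + j]:

// Sketch only: temps flattened row-major into a single contiguous buffer.
// The column offset j (the start element's index) is hoisted out of the loop.
double Object::gradient(const size_t j,
                        const std::vector<double>& b,
                        const std::vector<size_t>& indices,
                        const std::vector<double>& temps_flat,
                        const size_t row_size) const
{
    double sum = 0;
    const double* const base = temps_flat.data() + j;  // element (0, j)
    for (size_t k : indices)
        if (k != j)
            sum += base[k * row_size] * b[k];
    return std::isfinite(sum) ? sum : 0;  // needs <cmath>
}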