Here is a comparison I made. np.argsort
was timed on a float32 ndarray consisting of 1,000,000 elements.
In [1]: import numpy as np
In [2]: a = np.random.randn(1000000)
In [3]: a = a.astype(np.float32)
In [4]: %timeit np.argsort(a)
86.1 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And here is a C++ program that does the same procedure, but on a std::vector, following this answer.
#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
int main()
{
    // fill a vector with 1,000,000 random floats in [0, 1]
    std::vector<float> numbers;
    for (int i = 0; i != 1000000; ++i) {
        numbers.push_back((float)rand() / (RAND_MAX));
    }

    double e1 = (double)cv::getTickCount();
    // indirect sort: sort the indices by the values they refer to
    std::vector<size_t> idx(numbers.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
              { return numbers[a] < numbers[b]; });
    double e2 = (double)cv::getTickCount();

    std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
    return 0;
}
It prints Finished in 525.908 milliseconds, which is far slower than the NumPy version. So could anyone explain what makes np.argsort
so fast? Thanks.
Edit1: np.__version__
returns 1.15.0, running on Python 3.6.6 |Anaconda custom (64-bit), and g++ --version
prints 8.2.0. The operating system is Manjaro Linux.
Edit2: I tried to compile with the -O2
and -O3
flags in g++
and got results of 216.515 milliseconds and 205.017 milliseconds. That is an improvement, but still slower than the NumPy version. (Referring to this question.) I have since struck those numbers out, because I mistakenly ran that test with my laptop's DC adapter unplugged, which would cause it to slow down; in a fair comparison, the C-array and vector versions perform equally (both take about 100 ms).
Edit3: Another approach is to replace the vector with a C-style array: float numbers[1000000];
. After that, the running time is about 100 ms (±5 ms). Full code here:
#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
int main()
{
    // C-style array instead of std::vector<float> for the values
    float numbers[1000000];
    for (int i = 0; i != 1000000; ++i) {
        numbers[i] = ((float)rand() / (RAND_MAX));
    }

    double e1 = (double)cv::getTickCount();
    std::vector<size_t> idx(1000000);
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
              { return numbers[a] < numbers[b]; });
    double e2 = (double)cv::getTickCount();

    std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
    return 0;
}
NumPy's np.argsort can perform a stable sort by passing the kind='stable' argument. Also, np.argsort does not support reverse (descending) order directly.
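A minimal sketch of both points, assuming NumPy 1.15+ (where kind='stable' is available); the negation and reversed-slice tricks below are the usual workarounds for descending order, not a dedicated API:

import numpy as np

a = np.random.randn(1000000).astype(np.float32)

idx_stable = np.argsort(a, kind='stable')  # stable indirect sort
idx_desc = np.argsort(-a, kind='stable')   # descending order by negating the values
# idx_desc = np.argsort(a)[::-1]           # alternative: reverse the ascending result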
From the NumPy documentation: np.argsort returns the indices that would sort an array. It performs an indirect sort along the given axis using the algorithm specified by the kind keyword, and returns an array of indices of the same shape as a that index data along the given axis in sorted order.
np.sort() returns the sorted array, whereas np.argsort() returns an array of the corresponding indices. For example, sorting the unsorted array [10, 6, 8, 2, 5, 4, 9, 1] gives [1, 2, 4, 5, 6, 8, 9, 10], while np.argsort gives the positions those values came from.
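To make the distinction concrete, a small example with that array (expected output shown in the comments):

import numpy as np

x = np.array([10, 6, 8, 2, 5, 4, 9, 1])
print(np.sort(x))        # [ 1  2  4  5  6  8  9 10] -> the sorted values
print(np.argsort(x))     # [7 3 5 4 1 2 6 0]         -> indices that would sort x
print(x[np.argsort(x)])  # indexing with those indices reproduces np.sort(x)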
I took your implementation and measured it with 10000000
items. It took approximately 1.7 seconds.
Now I introduced a class
class valuePair {
public:
    valuePair(int idx, float value) : idx(idx), value(value) {}
    int idx;
    float value;
};
which is initialized as
std::vector<valuePair> pairs;
for (int i = 0; i != 10000000; ++i) {
    pairs.push_back(valuePair(i, (double)rand() / (RAND_MAX)));
}
and the sorting is then done with
std::sort(pairs.begin(), pairs.end(), [&](const valuePair &a, const valuePair &b) { return a.value < b.value; });
This code reduces the runtime down to 1.1 seconds. I think this is due to better cache locality, but it is still quite far away from the Python results.
Ideas:
- different underlying algorithm: np.argsort
uses quicksort by default, while the std::sort implementation in C++
may depend on your compiler's standard library (see the sketch after this list).
- function call overhead: I'm not sure this is the case, since according to this post C++
compilers inline your comparison function. If not, calling this function might also introduce some overhead.
- compiler flags?
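To probe the first idea, one can time np.argsort with each kind value on the same data. A rough sketch (absolute numbers are machine-dependent, and kind='stable' needs NumPy 1.15+):

import numpy as np
import timeit

a = np.random.randn(1000000).astype(np.float32)

for kind in ('quicksort', 'mergesort', 'heapsort', 'stable'):
    t = timeit.timeit(lambda: np.argsort(a, kind=kind), number=10) / 10
    print('{:10s} {:.1f} ms'.format(kind, t * 1000))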