 

Remainder function (%) runtime on numpy arrays is far longer than manual remainder calculation

In the past few days I've been working on improving the runtime of a Python function that makes heavy use of the remainder operation (%), among other things. My main test case is an 80,000-element NumPy array (monotonically increasing), with 10,000 iterations, although I've tried various other sizes as well.

Eventually I reached a point where the remainder operation was a major bottleneck, and tried various solutions. This is the behaviour I found when running the following code:

```python
import numpy as np
import time

a = np.random.rand(80000)
a = np.cumsum(a)
d = 3
start_time1 = time.time()
for i in range(10000):
    b = a % d
    d += 0.001
end_time1 = time.time()
d = 3
start_time2 = time.time()
for i in range(10000):
    b = a - (d * np.floor(a / d))
    d += 0.001
end_time2 = time.time()
print((end_time1 - start_time1) / 10000)
print((end_time2 - start_time2) / 10000)
```

The output is:

```
0.0031344462633132934
0.00022937238216400147
```

When increasing the array size to 800,000:

```
0.014903099656105041
0.010498356819152833
```

(For this post I ran the code only once for the actual output; while trying to understand the issue I got these results consistently.)

While this solves my runtime problem, I have a hard time understanding why. Am I missing something? The only difference I can think of is the overhead of an additional function call, but the first case is pretty extreme (and a 1.5x runtime difference isn't good enough either), and if that were the cause I would think the existence of the np.remainder function is pointless.
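(Editor's note, not from the original post: the two formulations do agree on strictly positive data like this, up to float rounding, though they are not interchangeable in general. A minimal sanity check, assuming the same array setup as the timing code above:)

```python
import numpy as np

# Sanity check (assumes the setup from the timing code): on strictly
# positive data the split form matches %, up to float rounding.
# In general the two differ: np.remainder follows the sign of the
# divisor and special-cases d == 0, while a - d*np.floor(a/d) does not.
a = np.cumsum(np.random.rand(80000))
d = 3.0
assert np.allclose(a % d, a - d * np.floor(a / d))
```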

Edit: I tried testing the same code with non-numpy loops:

```python
import numpy as np
import time


def pythonic_remainder(array, d):
    b = np.zeros(len(array))
    for i in range(len(array)):
        b[i] = array[i] % d

def split_pythonic_remainder(array, d):
    b = np.zeros(len(array))
    for i in range(len(array)):
        b[i] = array[i] - (d * np.floor(array[i] / d))

def split_remainder(a, d):
    return a - (d * np.floor(a / d))

def divide(array, iterations, action):
    d = 3
    for i in range(iterations):
        b = action(array, d)
        d += 0.001

a = np.random.rand(80000)
a = np.cumsum(a)
start_time = time.time()
divide(a, 10000, split_remainder)
print((time.time() - start_time) / 10000)

start_time = time.time()
divide(a, 10000, np.remainder)
print((time.time() - start_time) / 10000)

start_time = time.time()
divide(a, 10000, pythonic_remainder)
print((time.time() - start_time) / 10000)

start_time = time.time()
divide(a, 10000, split_pythonic_remainder)
print((time.time() - start_time) / 10000)
```

The result I get is:

```
0.0003770533800125122
0.003932329940795899
0.018835473942756652
0.10940513386726379
```

I find it interesting that the opposite is true in the non-numpy case.
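(Editor's note, a plausible explanation rather than something verified in the post: in the element-wise loops, CPython evaluates a scalar float % as a single C-level operation, while the split form performs a divide, a floor, a multiply, and a subtract, and each np.floor call on a Python float adds ufunc-dispatch overhead on top. A rough micro-benchmark sketch, using math.floor to isolate the arithmetic cost:)

```python
import math
import timeit

# Hypothetical micro-benchmark: scalar % is one C-level operation,
# while the split form does several, so the vectorized ranking
# inverts in the element-wise case.
x, d = 1234.567, 3.001
t_mod = timeit.timeit('x % d', globals={'x': x, 'd': d}, number=100_000)
t_split = timeit.timeit('x - d * math.floor(x / d)',
                        globals={'x': x, 'd': d, 'math': math},
                        number=100_000)
print(t_mod, t_split)
```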

asked Sep 09 '18 by shaul




1 Answer

My best hypothesis is that your NumPy install is using an unoptimized fmod inside the % calculation. Here's why.


First, I can't reproduce your results on a normal pip-installed NumPy 1.15.1. I get only about a 10% performance difference (asdf.py contains your timing code):

```
$ python3.6 asdf.py
0.0006543657302856445
0.0006025806903839111
```

I can reproduce a major performance discrepancy with a manual build (python3.6 setup.py build_ext --inplace -j 4) of v1.15.1 from a clone of the NumPy Git repository, though:

```
$ python3.6 asdf.py
0.00242799973487854
0.0006397026300430298
```

This suggests that my pip-installed build's % is better optimized than my manual build or what you have installed.
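(Editor's note, a diagnostic sketch not from the original answer: NumPy can report its own build configuration, which hints at the compiler settings and libraries each install was compiled against, and can help compare a pip wheel with a local source build.)

```python
import numpy as np

# Print version and build configuration; a pip wheel and a local
# `setup.py build_ext` build will typically show different settings here.
print(np.__version__)
np.show_config()
```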


Looking under the hood, it's tempting to look at the implementation of floating-point % in NumPy and blame the slowdown on the unnecessary floordiv calculation (npy_divmod@c@ calculates both // and %):

```c
NPY_NO_EXPORT void
@TYPE@_remainder(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    BINARY_LOOP {
        const @type@ in1 = *(@type@ *)ip1;
        const @type@ in2 = *(@type@ *)ip2;
        npy_divmod@c@(in1, in2, (@type@ *)op1);
    }
}
```

but in my experiments, removing the floordiv provided no benefit. It looks easy enough for a compiler to optimize out, so maybe it was optimized out, or maybe it was just a negligible fraction of the runtime in the first place.

Rather than the floordiv, let's focus on just one line in npy_divmod@c@, the fmod call:

```c
mod = npy_fmod@c@(a, b);
```

This is the initial remainder computation, before special-case handling and adjusting the result to match the sign of the right-hand operand. If we compare the performance of % with numpy.fmod on my manual build:

```
>>> import timeit
>>> import numpy
>>> a = numpy.arange(1, 8000, dtype=float)
>>> timeit.timeit('a % 3', globals=globals(), number=1000)
0.3510419335216284
>>> timeit.timeit('numpy.fmod(a, 3)', globals=globals(), number=1000)
0.33593094255775213
>>> timeit.timeit('a - 3*numpy.floor(a/3)', globals=globals(), number=1000)
0.07980139832943678
```

We see that fmod appears to be responsible for almost the entire runtime of %.
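(Editor's note: the sign-fixup that % applies on top of fmod is easy to see on mixed-sign inputs, consistent with the npy_divmod@c@ description above.)

```python
import numpy as np

# fmod keeps the sign of the dividend; % (np.remainder) is fmod plus
# an adjustment so the result takes the sign of the divisor.
a = np.array([7.0, -7.0])
print(np.fmod(a, 3.0))       # [ 1. -1.]
print(np.remainder(a, 3.0))  # [ 1.  2.]
```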


I haven't disassembled the generated binary or stepped through it in an instruction-level debugger to see exactly what gets executed, and of course, I don't have access to your machine or your copy of NumPy. Still, from the above evidence, fmod seems like a pretty likely culprit.

answered Oct 02 '22 by user2357112 supports Monica