Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NumPy performance: uint8 vs. float and multiplication vs. division?

I have just noticed that the execution time of a script of mine nearly halves by only changing a multiplication to a division.

To investigate this, I have written a small example:

import numpy as np                                                                                                                                                                                
import timeit

# uint8 array
arr1 = np.random.randint(0, high=256, size=(100, 100), dtype=np.uint8)

# float32 array
arr2 = np.random.rand(100, 100).astype(np.float32)
arr2 *= 255.0


def arrmult(a):
    """ 
    mult, read-write iterator
    """
    b = a.copy()
    for item in np.nditer(b, op_flags=["readwrite"]):
        item[...] = (item + 5) * 0.5

def arrmult2(a):
    """ 
    mult, index iterator
    """
    b = a.copy()
    for i, j in np.ndindex(b.shape):
        b[i, j] = (b[i, j] + 5) * 0.5

def arrmult3(a):
    """
    mult, vectorized
    """
    b = a.copy()
    b = (b + 5) * 0.5

def arrdiv(a):
    """ 
    div, read-write iterator 
    """
    b = a.copy()
    for item in np.nditer(b, op_flags=["readwrite"]):
        item[...] = (item + 5) / 2

def arrdiv2(a):
    """ 
    div, index iterator
    """
    b = a.copy()
    for i, j in np.ndindex(b.shape):
           b[i, j] = (b[i, j] + 5)  / 2                                                                                 

def arrdiv3(a):                                                                                                     
    """                                                                                                             
    div, vectorized                                                                                                 
    """                                                                                                             
    b = a.copy()                                                                                                    
    b = (b + 5) / 2                                                                                               




def print_time(name, t):                                                                                            
    print("{: <10}: {: >6.4f}s".format(name, t))                                                                    

timeit_iterations = 100                                                                                             

print("uint8 arrays")                                                                                               
print_time("arrmult", timeit.timeit("arrmult(arr1)", "from __main__ import arrmult, arr1", number=timeit_iterations))
print_time("arrmult2", timeit.timeit("arrmult2(arr1)", "from __main__ import arrmult2, arr1", number=timeit_iterations))
print_time("arrmult3", timeit.timeit("arrmult3(arr1)", "from __main__ import arrmult3, arr1", number=timeit_iterations))
print_time("arrdiv", timeit.timeit("arrdiv(arr1)", "from __main__ import arrdiv, arr1", number=timeit_iterations))  
print_time("arrdiv2", timeit.timeit("arrdiv2(arr1)", "from __main__ import arrdiv2, arr1", number=timeit_iterations))
print_time("arrdiv3", timeit.timeit("arrdiv3(arr1)", "from __main__ import arrdiv3, arr1", number=timeit_iterations))

print("\nfloat32 arrays")                                                                                           
print_time("arrmult", timeit.timeit("arrmult(arr2)", "from __main__ import arrmult, arr2", number=timeit_iterations))
print_time("arrmult2", timeit.timeit("arrmult2(arr2)", "from __main__ import arrmult2, arr2", number=timeit_iterations))
print_time("arrmult3", timeit.timeit("arrmult3(arr2)", "from __main__ import arrmult3, arr2", number=timeit_iterations))
print_time("arrdiv", timeit.timeit("arrdiv(arr2)", "from __main__ import arrdiv, arr2", number=timeit_iterations))  
print_time("arrdiv2", timeit.timeit("arrdiv2(arr2)", "from __main__ import arrdiv2, arr2", number=timeit_iterations))
print_time("arrdiv3", timeit.timeit("arrdiv3(arr2)", "from __main__ import arrdiv3, arr2", number=timeit_iterations))

This prints the following timings:

uint8 arrays
arrmult   : 2.2004s
arrmult2  : 3.0589s
arrmult3  : 0.0014s
arrdiv    : 1.1540s
arrdiv2   : 2.0780s
arrdiv3   : 0.0027s

float32 arrays
arrmult   : 1.2708s
arrmult2  : 2.4120s
arrmult3  : 0.0009s
arrdiv    : 1.5771s
arrdiv2   : 2.3843s
arrdiv3   : 0.0009s

I always thought a multiplication is computationally cheaper than a division. However, for uint8 a division seems to be nearly twice as effective. Does this somehow relate to the fact, that * 0.5 has to calculate the multiplication in a float and then casting the result back to to an integer?

At least for floats multiplications seem to be faster than divisions. Is this generally true?

Why is a multiplication in uint8 more expansive than in float32? I thought an 8-bit unsigned integer should be much faster to calculate than 32-bit floats?!

Can someone "demystify" this?

EDIT: to have more data, I've included vectorized functions (like suggested) and added index iterators as well. The vectorized functions are much faster, thus not really comparable. However, if timeit_iterations is set much higher for the vectorized functions, it turns out that multiplication is faster for both, uint8 and float32. I guess this confuses even more?!

Maybe multiplication is in fact always faster than division, but the main performance leaks in the for-loops is not the arithmetical operation, but the loop itself. Although this does not explain why the loops behave differently for different operations.

EDIT2: Like @jotasi already stated, we are looking for a full explanation of division vs. multiplication and int(or uint8) vs. float (or float32). Additionally, explaining the different trends of the vectorized approaches and the iterators would be interesting, as in the vectorized case, the division seems to be slower, whereas it is faster in the iterator case.

like image 384
daniel451 Avatar asked Aug 23 '16 14:08

daniel451


2 Answers

The problem is your assumption, that you measure the time needed for division or multiplication, which is not true. You are measuring the overhead needed for a division or multiplication.

One has really to look at the exact code to explain every effect, which can vary from version to version. This answer can only give an idea, what one has to consider.

The problem is that a simple int is not simple at all in python: it is a real object which must be registered in the garbage collector, it grows in size with its value - for all that you have to pay: for example for a 8bit integer 24 bytes memory are needed! similar goes for python-floats.

On the other hand, a numpy array consists of simple c-style integers/floats without overhead, you save a lot of memory, but pay for it during the access to an element of numpy-array. a[i] means: a python-integer must be constructed, registered in the garbage collector and only than it can be used - there is a lot of overhead.

Consider this code:

li1=[x%256 for x in xrange(10**4)]
arr1=np.array(li1, np.uint8)

def arrmult(a):    
    for i in xrange(len(a)):
        a[i]*=5;

arrmult(li1) is 25 faster than arrmult(arr1) because integers in the list are already python-ints and don't have to be created! The lion's share of the calculation time is needed for creation of the objects - everything else can be almost neglected.


Let's take a look at your code, first the multiplication:

def arrmult2(a):
    ...
    b[i, j] = (b[i, j] + 5) * 0.5

In the case of the uint8 the following must happen (I neglect +5 for simplicity):

  1. a python-int must be created
  2. it must be casted to a float (python-float creation), in order to be able to do float multiplication
  3. and casted back to a python-int or/and uint8

For float32, there is less work to do (multiplication does not cost much): 1. a python-float created 2. casted back float32.

So the float-version should be faster and it is.


Now let's take a look at the division:

def arrdiv2(a):
    ...
    b[i, j] = (b[i, j] + 5)  / 2 

The pitfall here: All operations are integer-operations. So compared to multiplication there is no need to cast to python-float, thus we have less overhead as in the case of multiplication. Division is "faster" for unint8 than multiplication in your case.

However, division and multiplication are equally fast/slow for float32, because almost nothing has changed in this case - we still need to create a python-float.


Now the vectorized versions: they work with c-style "raw" float32s/uint8s without conversion (and its cost!) to the corresponding python-objects under the hood. To get meaningful results you should increase the number of iteration (right now the running time is too small to say something with certainty).

  1. division and multiplication for float32 could have the same running time, because I would expect numpy to replace the division by 2 through multiplication by 0.5 (but to be sure one has to look into the code).

  2. multiplication for uint8 should be slower, because every uint8-integer must be casted to a float prior to multiplication with 0.5 and than casted back to uint8 afterwards.

  3. for the uint8 case, the numpy cannot replace the division by 2 through multiplication with 0.5 because it is an integer division. Integer division is slower than float-multiplication for a lot of architectures - this is the slowest vectorized operation.


PS: I would not dwell too much about costs multiplication vs. division - there are too many other things that can have a bigger hit on the performance. For example creating unnecessary temporary objects or if the numpy-array is large and does not fit into the cache, than the memory access will be the bottle-neck - you will see no difference between multiplication and division at all.

like image 169
ead Avatar answered Oct 09 '22 23:10

ead


This answer only looks at vectorised operations, as the reason for the other operations being slow has been answered by ead.

A lot of "optimisations" are based on old hardware. The assumptions that meant that optimisations held true on older hardware do not old true on newer hardware.

Pipelines and division

Division is slow. Division operations consist of several units that each have to perform one calculation one after another. This is what makes division slow.

However, in a floating-point processing unit (FPU) [common on most modern CPUs] there are dedicated units arranged in a "pipeline" for the division instruction. Once a unit is done, that unit isn't needed for the rest of the operation. If you have several division operations you can get these units with nothing to do started on the next division operation. So though each operation is slow, the FPU can actually achieve a high throughput of division operations. Pipeline-ing isn't the same as vectorisation, but the results are mostly the same -- higher throughput when you have lots of the same operations to do.

Think of pipeline-ing like traffic. Compare three lanes of traffic moving at 30 mph versus one lane of traffic moving at 90 mph. The slower traffic is definitely slower individually, but the three-lane-road still has the same throughput.

like image 5
Dunes Avatar answered Oct 09 '22 22:10

Dunes