
Comparing NumPy arange and custom range function for producing ranges with decimal increments

Here's a custom function that allows stepping through decimal increments:

def my_range(start, stop, step):
    i = start
    while i < stop:
        yield i
        i += step

It works like this:

out = list(my_range(0, 1, 0.1))
print(out)

[0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999, 0.8999999999999999, 0.9999999999999999]

Now, there's nothing surprising about this. It's understandable that this happens because of floating-point inaccuracy: 0.1 has no exact binary representation, so the tiny error compounds as the step is added repeatedly.
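To make that imprecision concrete, here's a quick check (plain Python, nothing NumPy-specific) showing that the double closest to 0.1 is not exactly 0.1, and that the error compounds under repeated addition:

```python
# The closest double to 0.1 is slightly larger than 0.1 itself.
print(f"{0.1:.20f}")           # 0.10000000000000000555

# Repeated addition accumulates that tiny error...
print(0.1 + 0.1 + 0.1 == 0.3)  # False
print(0.1 + 0.1 + 0.1)         # 0.30000000000000004
```

This is exactly the drift visible in the my_range output above.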

Take numpy on the other hand:

import numpy as np

out = np.arange(0, 1, 0.1)
print(out)
[ 0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9]

What's interesting is that there are no visible precision errors here. I thought this might just be down to what __repr__ shows, so to confirm, I tried this:

x = list(my_range(0, 1.1, 0.1))[-1]
print(x.is_integer())

False

x = list(np.arange(0, 1.1, 0.1))[-1]
print(x.is_integer())

True

So, my function returns an incorrect upper value (it should be 1.0 but it is actually 1.0999999999999999), but np.arange does it correctly.

I'm aware of Is floating point math broken? but the point of this question is:

How does numpy do this?

asked Aug 27 '17 by cs95

3 Answers

The difference in endpoints is because NumPy calculates the length up front instead of ad hoc, because it needs to preallocate the array. You can see this in the _calc_length helper. Instead of stopping when it hits the end argument, it stops when it hits the predetermined length.
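You can see the length-based stopping in action with the question's own arguments. The sketch below assumes the length is, roughly, ceil((stop - start) / step), which is what _calc_length effectively computes:

```python
import math
import numpy as np

start, stop, step = 0, 1.1, 0.1

# NumPy decides the element count up front:
n = int(math.ceil((stop - start) / step))
print(n)                  # 11

# arange produces exactly that many elements; the last is computed
# from the index (start + 10*step == 1.0), not by accumulating sums.
a = np.arange(start, stop, step)
print(len(a), a[-1])      # 11 1.0
```

The accumulating generator from the question, by contrast, keeps going as long as its running sum is below 1.1; after ten additions that sum is 0.9999999999999999, so it yields a 12th element, 1.0999999999999999.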

Calculating the length up front doesn't save you from the problems of a non-integer step, and you'll frequently get the "wrong" endpoint anyway, for example, with numpy.arange(0.0, 2.1, 0.3):

In [46]: numpy.arange(0.0, 2.1, 0.3)
Out[46]: array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8,  2.1])

It's much safer to use numpy.linspace, where instead of the step size, you say how many elements you want and whether you want to include the right endpoint.
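For example, asking linspace for 11 evenly spaced points over [0, 1] pins down both endpoints exactly, with no step arithmetic to go wrong:

```python
import numpy as np

# 11 points from 0 to 1 inclusive; the endpoint is part of the spec,
# not an accident of step rounding.
out = np.linspace(0, 1, 11)
print(len(out), out[-1])    # 11 1.0

# endpoint=False gives arange-like half-open behaviour instead:
print(np.linspace(0, 1, 11, endpoint=False)[-1] == 1.0)  # False
```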


It might look like NumPy has suffered no rounding error when calculating the elements, but that's just due to different display logic. NumPy is truncating the displayed precision more aggressively than float.__repr__ does. If you use tolist to get an ordinary list of ordinary Python scalars (and thus the ordinary float display logic), you can see that NumPy has also suffered rounding error:

In [47]: numpy.arange(0, 1, 0.1).tolist()
Out[47]: 
[0.0,
 0.1,
 0.2,
 0.30000000000000004,
 0.4,
 0.5,
 0.6000000000000001,
 0.7000000000000001,
 0.8,
 0.9]

It's suffered slightly different rounding error - for example, in .6 and .7 instead of .8 and .9 - because it also uses a different means of computing the elements, implemented in the fill function for the relevant dtype.

The fill function implementation has the advantage that it uses start + i*step instead of repeatedly adding the step, which avoids accumulating error on each addition. However, it has the disadvantage that (for no compelling reason I can see) it recomputes the step from the first two elements instead of taking the step as an argument, so it can lose a great deal of precision in the step up front.
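That up-front precision loss is easy to reproduce in plain Python. The sketch below mimics the recomputation ((start + step) - start) with a deliberately large start, chosen to make the rounding visible; the values are illustrative, not taken from NumPy's source:

```python
start, step = 1e16, 1.5

# Doubles near 1e16 are spaced 2.0 apart, so start + step rounds up,
# and the recomputed step comes back as 2.0 instead of 1.5.
recomputed = (start + step) - start
print(recomputed)   # 2.0
```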

answered Nov 15 '22 by user2357112 supports Monica


While arange does step through the range in a slightly different way, it still has the float representation issue:

In [1358]: np.arange(0,1,0.1)
Out[1358]: array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])

The print hides that; convert it to a list to see the gory details:

In [1359]: np.arange(0,1,0.1).tolist()
Out[1359]: 
[0.0,
 0.1,
 0.2,
 0.30000000000000004,
 0.4,
 0.5,
 0.6000000000000001,
 0.7000000000000001,
 0.8,
 0.9]

or with another iteration:

In [1360]: [i for i in np.arange(0,1,0.1)]  # e.g. list(np.arange(...))
Out[1360]: 
[0.0,
 0.10000000000000001,
 0.20000000000000001,
 0.30000000000000004,
 0.40000000000000002,
 0.5,
 0.60000000000000009,
 0.70000000000000007,
 0.80000000000000004,
 0.90000000000000002]

In this case each displayed item is a np.float64, whereas in the first each is a plain Python float.

answered Nov 15 '22 by hpaulj


Aside from the different representation of lists and arrays, NumPy's arange works by multiplying instead of repeatedly adding. It's more like:

def my_range2(start, stop, step):
    i = 0
    while start+(i*step) < stop:
        yield start+(i*step)
        i += 1

Then the output matches exactly:

>>> np.arange(0, 1, 0.1).tolist() == list(my_range2(0, 1, 0.1))
True

With repeated addition you would "accumulate" floating-point rounding errors. The multiplication is still affected by rounding, but the error doesn't accumulate.
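A minimal comparison of the two strategies, using ten steps of 0.1 in plain Python:

```python
# Repeated addition: the tiny error in 0.1 compounds at every step.
acc = 0.0
for _ in range(10):
    acc += 0.1
print(acc)              # 0.9999999999999999

# Multiplication: a single rounding per element, no accumulation.
print(10 * 0.1)         # 1.0
print(10 * 0.1 == 1.0)  # True
```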


As pointed out in the comments, that's not really what is happening. As far as I can see, it's more like:

import math

def my_range2(start, stop, step):
    length = math.ceil((stop - start) / step)
    # The next two lines are mostly so the function really behaves like NumPy does.
    # Remove them to get better accuracy...
    next = start + step
    step = next - start
    for i in range(length):
        yield start + i*step

But I'm not sure that's exactly right either, because there's a lot more going on in NumPy.

answered Nov 15 '22 by MSeifert