Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sum of consecutive integers in numpy is incorrect

In summing the first 100,000,000 positive integers using the following:

import numpy as np
np.arange(1,100000001).sum()

I return: 987459712, which does not match the formula: N(N+1)/2 for N=100000000. Namely, the formula returns 5000000050000000.

Before posting, I wrote the following, which returns True:

np.arange(1,65536).sum() == ((65535+1) * 65535)/2

However, the number 65536 seems to be a critical point, as

np.arange(1,65537).sum() == ((65536+1) * 65536)/2

returns False.

For integers greater than 65536 the code returns False, whereas integers below this threshold return True.

Could someone explain either what I've done wrong in calculating the sum, or what is going on with the code?

like image 668
sunspots Avatar asked Jan 03 '23 05:01

sunspots


2 Answers

Seems like numpy sometimes has a hard time guessing the correct datatype.

On my system, Win 10 64-bit, Python 3.4.4, numpy 1.13.1:

>>  np.arange(1, 100000001).sum()
987459712
>>  np.arange(1, 100000001).dtype
dtype('int32')

But, if we "help" numpy it gets the correct result:

>> np.arange(1, 100000001, dtype=np.int64).sum()
500000005000000

The wrong result is obviously due to 32-bit integer overflowing.

like image 127
DeepSpace Avatar answered Jan 11 '23 06:01

DeepSpace


It isn't really that numpy has a hard time guessing things, it's just that the default int type is the same as C long type:

int_: Default integer type (same as C long; normally either int64 or int32)

For windows systems, longs are 32bit, even on 64bit builds (see here for more) so that's what's used by default, int32.

As DeepSpace suggested, setting dtype to int64 does the trick. This can be done either in arange or in the sum method.


Additionally, you wrote:

Before posting, I wrote the following, which returns True:

np.arange(1,65536).sum() == ((65535+1) * 65535)/2

However, the number 65536 seems to be a critical point, as

np.arange(1,65537).sum() == ((65536+1) * 65536)/2

returns False.

and this is explained by the fact that the second sum exceeds int32's max value while the first doesn't:

>> np.arange(1,65536).sum() < np.iinfo(np.int32).max
True    
>>> np.arange(1,65537).sum() < np.iinfo(np.int32).max
False

of course the Python calculation is correct due to Python 3's arbitrary precision ints.


This is why many of us weren't able to reproduce. On most Unixes the default int size for 64bit machines is int64 (since the C long is 64bits) therefore the sum of those ints was equal to the expected value.

like image 31
Dimitris Fasarakis Hilliard Avatar answered Jan 11 '23 07:01

Dimitris Fasarakis Hilliard