
Error with numpy array calculations using int dtype (it fails to cast dtype to 64 bit automatically when needed)

I'm encountering a problem with incorrect numpy calculations when the input to a calculation is a numpy array with a 32-bit integer data type, but the output includes larger numbers that require 64-bit representation.

Here's a minimal working example:

import numpy as np

arr = np.ones(5, dtype=int) * (2**24 + 300)  # arr.dtype defaults to 'int32'

# Following comment from @hpaulj I changed the first line, which was originally:
# arr = np.zeros(5, dtype=int) 
# arr[:] = 2**24 + 300

single_value_calc = 2**8 * (2**24 + 300)
numpy_calc = 2**8 * arr

print(single_value_calc)
print(numpy_calc[0])

# RESULTS
4295044096
76800

The desired output is that the numpy array contains the correct value of 4295044096, which requires 64 bits to represent. That is, I would have expected numpy arrays to automatically upcast from int32 to int64 when the output requires it, rather than keeping a 32-bit output and wrapping around once the value of 2^32 is exceeded.

Of course, I can fix the problem manually by forcing int64 representation:

numpy_calc2 = 2**8 * arr.astype('int64')

but this is undesirable for general code, since the output will only need 64-bit representation (i.e. to hold large numbers) in some cases, not all. In my use case performance is critical, so forcing upcasting every time would be costly.

Is this the intended behaviour of numpy arrays? And if so, is there a clean, performant solution please?

asked Oct 15 '22 by SLhark


1 Answer

Type casting and promotion in numpy is fairly complicated and occasionally surprising. This recent unofficial write-up by Sebastian Berg explains some of the nuances of the subject (mostly concentrating on scalars and 0d arrays).

Quoting from this document:

Python Integers and Floats

Note that python integers are handled exactly like numpy ones. They are, however, special in that they do not have a dtype associated with them explicitly. Value based logic, as described here, seems useful for python integers and floats to allow:

arr = np.arange(10, dtype=np.int8)
arr += 1
# or:
res = arr + 1
res.dtype == np.int8

which ensures that no upcast (for example with higher memory usage) occurs.

(emphasis mine.)
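To make the quoted rule concrete, here is a small sketch (run under the value-based promotion rules that were current at the time of writing, i.e. before NumPy 2.0's NEP 50 changes): a Python int only forces an upcast when its value does not fit the array's dtype.

import numpy as np

arr = np.arange(10, dtype=np.int8)
print((arr + 1).dtype)     # int8  -- the value 1 fits in int8, so no upcast occurs
print((arr + 1000).dtype)  # int16 -- 1000 does not fit in int8, so the result is promoted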

See also Allan Haldane's gist suggesting C-style type coercion, linked from the previous document:

Currently, when two dtypes are involved in a binary operation numpy's principle is that "the output dtype's range covers the range of both input dtypes", and when a single dtype is involved there is never any cast.

(emphasis again mine.)
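Again as a rough illustration of those two rules (same NumPy-version caveat as above): when two array dtypes meet, the output dtype covers both input ranges, but an int32 array combined with a plain Python int that fits in int32 is never cast.

import numpy as np

a32 = np.ones(3, dtype=np.int32)
a64 = np.ones(3, dtype=np.int64)
print((a32 + a64).dtype)   # int64 -- the output range covers both input dtypes
print((a32 * 2**8).dtype)  # int32 -- a single dtype and a small Python int: no cast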

So my understanding is that the promotion rules for numpy scalars and arrays differ, primarily because it's not feasible to check every element inside an array to determine whether casting can be done safely. Again from the former document:

Scalar based rules

Unlike arrays, where inspection of all values is not feasible, for scalars (and 0-D arrays) the value is inspected.

This would mean that you can either use np.int64 from the start to be safe (and if you're on Linux then dtype=int will actually do this on its own), or check the maximum value of your arrays before suspect operations and promote the dtype yourself on a case-by-case basis. I understand that this might not be feasible if you are doing a lot of calculations, but I don't believe there is a way around it given numpy's current type promotion rules.
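For the second option, a minimal sketch of such a check might look like the following. The helper name mul_with_promotion is hypothetical (not a numpy function), and it assumes a non-empty integer array and an integer factor.

import numpy as np

def mul_with_promotion(arr, factor):
    # Promote to int64 only when the product could overflow the array's
    # current integer dtype; do the bound check in plain Python ints,
    # which cannot overflow.
    info = np.iinfo(arr.dtype)
    bounds = (int(arr.min()) * factor, int(arr.max()) * factor)
    if min(bounds) < info.min or max(bounds) > info.max:
        arr = arr.astype(np.int64)
    return arr * factor

arr = np.ones(5, dtype=np.int32) * (2**24 + 300)
print(mul_with_promotion(arr, 2**8)[0])  # 4295044096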
