Consider the numpy array a
a = np.array([1, 0, 2, 1, 1])
If I do a bin count, I get integers
np.bincount(a)
array([1, 3, 1])
But if I add weights to perform the equivalent bin count
np.bincount(a, np.ones_like(a))
array([ 1., 3., 1.])
Same values, but float. What is the smartest way to convert these to int? Why doesn't numpy assume the same dtype as what was passed as weights?
Why doesn't numpy assume the same dtype as what was passed as weights?
There are two reasons:
There are several ways to weight a count: either by multiplying each value by its weight, or by multiplying each value by its weight divided by the sum of the weights. In the latter case the result will always be a double (otherwise the division would be inaccurate).
In my experience weighting with the normalized weights (the second case) is more common. So it's actually reasonable (and definitely faster) to assume they are floats.
Overflow. The plain counts cannot exceed the integer limit, because the array can't hold more elements than this limit (it stands to reason: otherwise you couldn't index the array). But once you multiply by weights, it's not hard to make the counts "overflow".
I guess in this case it was probably the latter reason.
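The two weighting schemes from the first reason can be sketched as follows. This is only an illustration: the weight values in `w` are made up, and dividing by `w.sum()` is one assumed way of normalizing.

```python
import numpy as np

a = np.array([1, 0, 2, 1, 1])
w = np.array([2, 1, 1, 3, 3])  # hypothetical integer weights, one per element of a

# Case 1: sum the raw weights that fall into each bin
raw = np.bincount(a, weights=w)

# Case 2: normalize the weights first so that they sum to 1
normalized = np.bincount(a, weights=w / w.sum())

print(raw, raw.dtype)
print(normalized, normalized.dtype)
```

Both results come back as float64. In the normalized case an integer result could not represent the values at all.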
It's unlikely someone would use really large integer weights and lots of duplicate values, but consider what would happen with:
import numpy as np
n = 10000000  # values and weights must have the same length
np.bincount(np.ones(n, dtype=int), weights=np.ones(n, dtype=int) * 1000000000000)
If the result kept the integer dtype of the weights, this would return:
array([0, -8446744073709551616])
instead of the actual result:
array([ 0.00000000e+00, 1.00000000e+19])
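The same wraparound can be reproduced cheaply with a much smaller array. The sketch below assumes explicit int64 weights and uses `ndarray.sum()` as a stand-in for what an integer-accumulating bincount would do; `bincount` itself does not accumulate this way.

```python
import numpy as np

# Ten weights of 10**18 each: the true sum, 10**19, is larger than the
# int64 maximum (about 9.22e18).
w = np.full(10, 10**18, dtype=np.int64)
bins = np.ones(10, dtype=np.int64)  # every value falls into bin 1

int_sum = w.sum()                            # int64 accumulation wraps around
float_counts = np.bincount(bins, weights=w)  # accumulated as float64

print(int_sum)       # -8446744073709551616
print(float_counts)
```

The integer sum silently wraps to a negative number, while the float64 accumulation in `bincount` yields 1e19 for bin 1.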
That, combined with the first reason and the fact that it's very easy (personally I think it's trivial) to convert float arrays to integer arrays:
np.asarray(np.bincount(...), dtype=int)
probably made float the "actual" returned dtype of the weighted bincount.
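A slightly more defensive variant of that conversion, as a sketch: casting a float array to int truncates toward zero, so if the weighted sums might be off by a tiny floating-point error, rounding first with `np.rint` is safer. The choice of `np.int64` as the target dtype is an assumption here.

```python
import numpy as np

a = np.array([1, 0, 2, 1, 1])
weighted = np.bincount(a, weights=np.ones_like(a))  # float64 result

# Round to the nearest integer, then cast. A plain cast would truncate,
# turning e.g. 2.9999999 into 2.
as_int = np.rint(weighted).astype(np.int64)

print(as_int, as_int.dtype)
```

This only makes sense when the weights are integer-valued to begin with. For fractional weights the rounding discards real information.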
The numpy source actually mentions that the weights need to be convertible to double (float64):
/*
 * arr_bincount is registered as bincount.
 *
 * bincount accepts one, two or three arguments. The first is an array of
 * non-negative integers The second, if present, is an array of weights,
 * which must be promotable to double. Call these arguments list and
 * weight. Both must be one-dimensional with len(weight) == len(list). If
 * weight is not present then bincount(list)[i] is the number of occurrences
 * of i in list. If weight is present then bincount(self,list, weight)[i]
 * is the sum of all weight[j] where list [j] == i. Self is not used.
 * The third argument, if present, is a minimum length desired for the
 * output array.
 */
And well, they then just cast it to double in the function. That's the "literal" reason why you get a result of floating data type.
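The effect of that cast can be checked from Python. The sketch below tries a handful of weight dtypes (an arbitrary selection) and records the dtype of each result.

```python
import numpy as np

a = np.array([1, 0, 2, 1, 1])

# Result dtype of the weighted bincount for several weight dtypes
result_dtypes = {}
for dt in (np.int8, np.int64, np.float32, np.float64):
    weights = np.ones(a.shape, dtype=dt)
    result_dtypes[np.dtype(dt).name] = np.bincount(a, weights=weights).dtype

unweighted_dtype = np.bincount(a).dtype  # an integer dtype when no weights are given

print(result_dtypes)
print(unweighted_dtype)
```

Whatever dtype the weights have, the weighted result is float64. Only the unweighted call returns integers.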