
How to get an integer array from numpy.bincount when the weights are integers

Tags: python, numpy

Consider the numpy array a

a = np.array([1, 0, 2, 1, 1])

If I do a bin count, I get integers

np.bincount(a)

array([1, 3, 1])

But if I add weights to perform the equivalent bin count

np.bincount(a, np.ones_like(a))

array([ 1.,  3.,  1.])

Same values, but as floats. What is the smartest way to convert these to int? Why doesn't numpy return the same dtype as the one passed in as weights?

piRSquared asked Oct 17 '22 11:10


1 Answer

Why doesn't numpy assume the same dtype as what was passed as weights?

There are two reasons:

  • There are several ways to weight a count: either by multiplying each value by its weight, or by multiplying each value by its weight divided by the sum of the weights. In the latter case the result will always be a double (otherwise the division would be inaccurate).

    In my experience weighting with the normalized weights (the second case) is more common. So it's actually reasonable (and definitely faster) to assume they are floats.

  • Overflow. Unweighted counts cannot exceed the integer limit, because the array can't have more elements than that limit (it stands to reason; otherwise you couldn't index the array). But once you multiply the counts by weights, it's not hard to make them overflow.
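
As a rough illustration of the first bullet, here is a minimal sketch of the two weighting schemes. The weights array w is made up for this example and is not part of the question; the point is only that normalized weights cannot be represented as integers.

```python
import numpy as np

a = np.array([1, 0, 2, 1, 1])
w = np.array([2, 1, 1, 3, 1])  # hypothetical integer weights, for illustration only

# Case 1: weight each value directly; the sums are whole numbers (1, 6, 1)
raw = np.bincount(a, weights=w)

# Case 2: normalize the weights first; true division always yields floats
normalized = np.bincount(a, weights=w / w.sum())

print(raw, raw.dtype)
print(normalized, normalized.dtype)
```

Both calls return float64, but only in the second case would an integer result actually lose information.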

I guess in this case it was probably the latter reason.

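It's unlikely that someone would use really large integer weights together with lots of duplicate values, but consider what would happen if the result were accumulated as int64:

import numpy as np

n = 10000000
# 10**7 values in bin 1, each weighted by 10**12, sum to 10**19 (beyond the int64 maximum)
np.bincount(np.ones(n, dtype=np.int64), weights=np.ones(n, dtype=np.int64) * 1000000000000)

It would return: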

array([0, -8446744073709551616])

instead of the actual result:

array([  0.00000000e+00,   1.00000000e+19])
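
The same wraparound can be reproduced with a much smaller array. This is only a sketch under the assumption that an integer-typed bincount would accumulate the way an int64 sum does; the value 2**62 is chosen purely so that four of them add up to exactly 2**64.

```python
import numpy as np

big = 2**62  # hypothetical large weight, chosen so that 4 * big == 2**64
w = np.array([big, big, big, big], dtype=np.int64)
a = np.zeros(4, dtype=np.int64)  # every element falls into bin 0

# int64 accumulation wraps around silently: 2**64 modulo 2**64 is 0
int_sum = w.sum()

# bincount casts the weights to double, so the sum stays (approximately) right
float_res = np.bincount(a, weights=w)

print(int_sum)
print(float_res)
```

The float result is exact here only because powers of two are exactly representable; in general, float64 trades the wraparound for rounding error above 2**53.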

That overflow risk, combined with the first reason and the fact that it's very easy (personally I think it's trivial) to convert float arrays to integer arrays:

np.asarray(np.bincount(...), dtype=int)

probably made float the "actual" returned dtype of the weighted bincount.
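
For the arrays from the question, the conversion might look like the sketch below. The choice between plain astype (which truncates toward zero) and rounding first with np.rint is a judgment call that depends on your weights; neither is prescribed by numpy.

```python
import numpy as np

a = np.array([1, 0, 2, 1, 1])
w = np.ones_like(a)

counts = np.bincount(a, weights=w)  # float64 result

# Plain cast: fine when every sum is a whole number below 2**53
as_int = counts.astype(np.int64)

# Rounding first guards against sums like 2.9999999 being truncated to 2
rounded = np.rint(counts).astype(np.int64)

print(as_int, rounded)
```

With integer weights and sums below 2**53, float64 holds every intermediate value exactly, so both variants give the same answer.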

The "literal" reason:

The numpy source actually mentions that the weights need to be convertible to double (float64):

/*
 * arr_bincount is registered as bincount.
 *
 * bincount accepts one, two or three arguments. The first is an array of
 * non-negative integers The second, if present, is an array of weights,
 * which must be promotable to double. Call these arguments list and
 * weight. Both must be one-dimensional with len(weight) == len(list). If
 * weight is not present then bincount(list)[i] is the number of occurrences
 * of i in list.  If weight is present then bincount(self,list, weight)[i]
 * is the sum of all weight[j] where list [j] == i.  Self is not used.
 * The third argument, if present, is a minimum length desired for the
 * output array.
 */

And indeed, the function then simply casts the weights to double. That's the "literal" reason why you get a result with a floating-point dtype.
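
A quick way to see this from Python is to check the result dtype for a few weight dtypes. The second half of the sketch is an assumption on my part rather than anything the source comment suggests: if you want to avoid the float round trip altogether, np.add.at can accumulate into an integer array directly, at the cost of usually being slower than bincount.

```python
import numpy as np

a = np.array([1, 0, 2, 1, 1])

# bincount returns float64 whatever the dtype of the weights
for dt in (np.int8, np.int64, np.float32):
    res = np.bincount(a, weights=np.ones(a.size, dtype=dt))
    print(dt.__name__, "->", res.dtype)

# Possible alternative: accumulate manually so the weights' dtype is kept
w = np.ones_like(a)
out = np.zeros(a.max() + 1, dtype=w.dtype)
np.add.at(out, a, w)  # unbuffered in-place add, handles repeated indices

print(out, out.dtype)
```

Note that the integer version brings back the overflow risk described above, so it only makes sense when the sums are known to fit.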

MSeifert answered Oct 21 '22 01:10