I am seeing behaviour with numpy bincount that I cannot make sense of. I want to count the values in each row of a 2D array, and I get the behaviour below. Why does it work with dbArray but fail with simarray?
>>> dbArray
array([[1, 0, 1, 0, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 1, 0, 1, 0]])
>>> N.apply_along_axis(N.bincount,1,dbArray)
array([[2, 3],
       [0, 5],
       [1, 4],
       [4, 1],
       [3, 2],
       [3, 2]], dtype=int64)
>>> simarray
array([[2, 0, 2, 0, 2],
       [2, 1, 2, 1, 2],
       [2, 1, 1, 1, 2],
       [2, 0, 1, 0, 1],
       [1, 0, 1, 1, 2],
       [1, 1, 1, 1, 1]])
>>> N.apply_along_axis(N.bincount,1,simarray)
Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    N.apply_along_axis(N.bincount,1,simarray)
  File "C:\Python27\lib\site-packages\numpy\lib\shape_base.py", line 118, in apply_along_axis
    outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (3)
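bincount() counts the occurrences of each non-negative integer in a 1D array: element i of the result is the number of times i appears in the input. The result has one entry per value from 0 up to the largest value present, unless a larger length is requested with minlength.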
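A small illustration of that description, using made-up input values. It shows that the length of the result depends on the largest value present, and that minlength pads the result with zeros:

```python
import numpy as np

x = np.array([0, 1, 1, 3])

# One bin per integer from 0 to x.max(), so the result has 4 entries
counts = np.bincount(x)

# minlength guarantees at least 6 bins; the extra ones stay at zero
padded = np.bincount(x, minlength=6)

print(counts)  # [1 2 0 1]
print(padded)  # [1 2 0 1 0 0]
```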
The problem is that bincount doesn't always return arrays of the same shape, in particular when values are missing from a row. For example:
>>> m = np.array([[0,0,1],[1,1,0],[1,1,1]])
>>> np.apply_along_axis(np.bincount, 1, m)
array([[2, 1],
       [1, 2],
       [0, 3]])
>>> [np.bincount(m[i]) for i in range(m.shape[0])]
[array([2, 1]), array([1, 2]), array([0, 3])]
works, but:
>>> m = np.array([[0,0,0],[1,1,0],[1,1,0]])
>>> m
array([[0, 0, 0],
       [1, 1, 0],
       [1, 1, 0]])
>>> [np.bincount(m[i]) for i in range(m.shape[0])]
[array([3]), array([1, 2]), array([1, 2])]
>>> np.apply_along_axis(np.bincount, 1, m)
Traceback (most recent call last):
  File "<ipython-input-49-72e06e26a718>", line 1, in <module>
    np.apply_along_axis(np.bincount, 1, m)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/shape_base.py", line 117, in apply_along_axis
    outarr[tuple(i.tolist())] = res
ValueError: could not broadcast input array from shape (2) into shape (1)
won't.
You could use the minlength parameter and pass it using a lambda or partial or something:
>>> np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=1, arr=m)
array([[3, 0],
       [1, 2],
       [1, 2]])
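The same idea with functools.partial, as a minimal sketch. It assumes the number of bins can be taken from the largest value in the whole array (m.max() + 1), so that every row yields a result of the same length. The names nbins, count_row and row_counts are only illustrative.

```python
from functools import partial

import numpy as np

m = np.array([[0, 0, 0],
              [1, 1, 0],
              [1, 1, 0]])

# Assumption: the largest value anywhere in m fixes the number of bins
nbins = int(m.max()) + 1

# Bind minlength so every row returns an array of length nbins
count_row = partial(np.bincount, minlength=nbins)

row_counts = np.apply_along_axis(count_row, 1, m)
print(row_counts)
# [[3 0]
#  [1 2]
#  [1 2]]
```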
As @DSM has already mentioned, a row-wise bincount of a 2D array cannot be done without knowing the maximum value of the array, because otherwise the rows of the result would have inconsistent lengths.
But thanks to the power of numpy's indexing, it is fairly easy to write a faster 2D bincount that doesn't use concatenation or anything similar:
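import numpy as np

def bincount2d(arr, bins=None):
    # The number of bins defaults to the largest value in the whole array plus one
    if bins is None:
        bins = np.max(arr) + 1
    count = np.zeros(shape=[len(arr), bins], dtype=np.int64)
    indexing = np.arange(len(arr))
    # Each column holds one value per row, so the (row, value) pairs are unique and += is safe
    for col in arr.T:
        count[indexing, col] += 1
    return count

t = np.array([[1,2,3],[4,5,6],[3,2,2]], dtype=np.int64)
print(bincount2d(t))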
P.S.
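This:
import time

# np.zeros rather than np.empty: np.empty leaves the values uninitialized,
# so np.max(arr) + 1 could be arbitrarily large or even negative
t = np.zeros(shape=[10000, 100], dtype=np.int64)
s = time.time()
bincount2d(t)
e = time.time()
print(e - s)
runs about 2 times faster than this:
t = np.zeros(shape=[100, 10000], dtype=np.int64)
s = time.time()
bincount2d(t)
e = time.time()
print(e - s)
because the for loop iterates over columns, so the second case runs 10000 Python-level iterations instead of 100. So it's better to transpose your 2D array if shape[0] < shape[1] (bearing in mind that the counts are then taken along the other axis).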
UPD
I don't think this can be done much better (using Python and NumPy alone, I mean):
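def bincount2d(arr, bins=None):
    if bins is None:
        bins = np.max(arr) + 1
    count = np.zeros(shape=[len(arr), bins], dtype=np.int64)
    # indexing[i, j] == i: the row index of every element of arr
    indexing = (np.ones_like(arr).T * np.arange(len(arr))).T
    # Unbuffered add, so repeated (row, value) pairs are all counted
    np.add.at(count, (indexing, arr), 1)
    return count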
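The loop-free version relies on np.add.at rather than a plain += with fancy indexing. With +=, an index that appears several times is incremented only once, so repeated (row, value) pairs would be undercounted. A small sketch of the difference, using made-up 1D arrays:

```python
import numpy as np

idx = np.array([0, 0, 2])  # index 0 appears twice

# Fancy-indexed += is buffered: the repeated index 0 is incremented only once
buffered = np.zeros(3, dtype=np.int64)
buffered[idx] += 1

# np.add.at is unbuffered: every occurrence of an index is accumulated
unbuffered = np.zeros(3, dtype=np.int64)
np.add.at(unbuffered, idx, 1)

print(buffered)    # [1 0 1]
print(unbuffered)  # [2 0 1]
```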
This is a function that sums the rows of a matrix per group, generalizing np.bincount(partition, a), without any loops.
import numpy as np

def sub_sum_partition(a, partition):
    """
    Generalization of np.bincount(partition, a).
    Sums rows of a matrix for each value of array of non-negative ints.
    :param a: array_like
    :param partition: array_like, 1 dimension, nonnegative ints
    :return: matrix of shape ('one larger than the largest value in partition', a.shape[1:]). The i-th element is
    the sum of rows j in 'a' s.t. partition[j] == i
    """
    assert partition.shape == (len(a),)
    # Number of entries in one row of a
    n = np.prod(a.shape[1:], dtype=int)
    # Entry k of row j goes to flat bin partition[j] * n + k
    bins = ((np.tile(partition, (n, 1)) * n).T + np.arange(n, dtype=int)).reshape(-1)
    sums = np.bincount(bins, a.reshape(-1))
    if n > 1:
        sums = sums.reshape(-1, *a.shape[1:])
    return sums
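The same offset trick (shifting each group into its own block of bin indices so that a single flat np.bincount call suffices) can be sketched for the row-wise counting in the original question. This is an illustrative sketch, not part of the answers above. The name rowwise_bincount is hypothetical, and it assumes non-negative integers with every value below bins.

```python
import numpy as np

def rowwise_bincount(arr, bins=None):
    """Count the values in each row of a 2D array of non-negative ints."""
    arr = np.asarray(arr)
    if bins is None:
        bins = int(arr.max()) + 1
    n_rows = arr.shape[0]
    # Shift row i by i * bins so each row gets its own block of bin indices.
    # Assumption: every value is < bins, otherwise counts spill into the next row.
    offsets = np.arange(n_rows)[:, None] * bins
    flat = (arr + offsets).ravel()
    counts = np.bincount(flat, minlength=n_rows * bins)
    return counts.reshape(n_rows, bins)

simarray = np.array([[2, 0, 2, 0, 2],
                     [2, 1, 2, 1, 2],
                     [2, 1, 1, 1, 2],
                     [2, 0, 1, 0, 1],
                     [1, 0, 1, 1, 2],
                     [1, 1, 1, 1, 1]])

result = rowwise_bincount(simarray)
print(result)
# [[2 0 3]
#  [0 2 3]
#  [0 3 2]
#  [2 2 1]
#  [1 3 1]
#  [0 5 0]]
```

Every row of the result has the same length because the bin count is fixed once for the whole array, which is the requirement that apply_along_axis could not satisfy on its own.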