Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count unique elements along an axis of a NumPy array

I have a three-dimensional array like

A=np.array([[[1,1],
[1,0]],

[[1,2],
[1,0]],

[[1,0],
[0,0]]])

Now I would like to obtain an array that has a nonzero value in a given position if only a unique nonzero value (or zero) occurs in that position. It should have zero if only zeros or more than one nonzero value occur in that position. For the example above, I would like

[[1,0],
[1,0]]

since

  • in A[:,0,0] there are only 1s
  • in A[:,0,1] there are 0, 1 and 2, so more than one nonzero value
  • in A[:,1,0] there are 0 and 1, so 1 is retained
  • in A[:,1,1] there are only 0s

I can find how many nonzero elements there are with np.count_nonzero(A, axis=0), but I would like to keep 1s or 2s even if there are several of them. I looked at np.unique but it doesn't seem to support what I'd like to do.

Ideally, I'd like a function like np.count_unique(A, axis=0) which would return an array in the original shape, e.g. [[1, 3],[2, 1]], so I could check whether 3 or more occur and then ignore that position.


All I could come up with was a list comprehension iterating over the that I'd like to obtain

[[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]

Any other ideas?

like image 734
Raketenolli Avatar asked Oct 23 '17 15:10

Raketenolli


People also ask

How do you count the number of unique values in an array in Python?

To count each unique element's number of occurrences in the numpy array, we can use the numpy. unique() function. It takes the array as an input argument and returns all the unique elements inside the array in ascending order.

How do you find unique elements in an array?

By using hashmap's key In Java, the simplest way to get unique elements from the array is by putting all elements of the array into hashmap's key and then print the keySet(). The hashmap contains only unique keys, so it will automatically remove that duplicate element from the hashmap keySet.

What does unique () do in Python?

unique() function. The unique() function is used to find the unique elements of an array. Returns the sorted unique elements of an array.

What does NP unique () do in Python?

This function returns an array of unique elements in the input array. The function can be able to return a tuple of array of unique vales and an array of associated indices.


2 Answers

You can use np.diff to stay at numpy level for the second task.

def diffcount(A):
    B=A.copy()
    B.sort(axis=0)
    C=np.diff(B,axis=0)>0
    D=C.sum(axis=0)+1
    return D

# [[1 3]
#  [2 1]]

it's seems to be a little faster on big arrays:

In [62]: A=np.random.randint(0,100,(100,100,100))

In [63]: %timeit diffcount(A)
46.8 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [64]: timeit [[len(np.unique(A[:, i, j])) for j in range(A.shape[2])]\
for i in range(A.shape[1])]
149 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Finally counting unique is simpler than sorting, a ln(A.shape[0]) factor can be win.

A way to win this factor is to use the set mechanism :

In [81]: %timeit np.apply_along_axis(lambda a:len(set(a)),axis=0,A) 
183 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Unfortunately, this is not faster.

Another way is to do it by hand :

def countunique(A,Amax):
    res=np.empty(A.shape[1:],A.dtype)
    c=np.empty(Amax+1,A.dtype)
    for i in range(A.shape[1]):
        for j in range(A.shape[2]):
            T=A[:,i,j]
            for k in range(c.size): c[k]=0 
            for x in T:
                c[x]=1
            res[i,j]= c.sum()
    return res 

At python level:

In [70]: %timeit countunique(A,100)
429 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Which is not so bad for a pure python approach. Then just shift this code at low level with numba :

import numba    
countunique2=numba.jit(countunique)  

In [71]: %timeit countunique2(A,100)
3.63 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Which will be difficult to improve a lot.

like image 116
B. M. Avatar answered Oct 19 '22 06:10

B. M.


One approach would be to use A as first axis indices for setting a boolean array of the same lengths along the other two axes and then simply counting the non-zeros along the first axis of it. Two variants would be possible - One keeping it as 3D and another would be to reshape into 2D for some performance benefit as indexing into 2D would be faster. Thus, the two implementations would be -

def nunique_axis0_maskcount_app1(A):
    m,n = A.shape[1:]
    mask = np.zeros((A.max()+1,m,n),dtype=bool)
    mask[A,np.arange(m)[:,None],np.arange(n)] = 1
    return mask.sum(0)

def nunique_axis0_maskcount_app2(A):
    m,n = A.shape[1:]
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.zeros((maxn,N),dtype=bool)
    mask[A,np.arange(N)] = 1
    A.shape = (-1,m,n)
    return mask.sum(0).reshape(m,n)

Runtime test -

In [154]: A = np.random.randint(0,100,(100,100,100))

# @B. M.'s soln
In [155]: %timeit f(A)
10 loops, best of 3: 28.3 ms per loop

# @B. M.'s soln using slicing : (B[1:] != B[:-1]).sum(0)+1
In [156]: %timeit f2(A)
10 loops, best of 3: 26.2 ms per loop

In [157]: %timeit nunique_axis0_maskcount_app1(A)
100 loops, best of 3: 12 ms per loop

In [158]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 9.14 ms per loop

Numba method

Using the same strategy as used for nunique_axis0_maskcount_app2 with directly getting the counts at C-level with numba, we would have -

from numba import njit

@njit
def nunique_loopy_func(mask, N, A, p, count):
    for j in range(N):
        mask[:] = True
        mask[A[0,j]] = False
        c = 1
        for i in range(1,p):
            if mask[A[i,j]]:
                c += 1
            mask[A[i,j]] = False
        count[j] = c
    return count

def nunique_axis0_numba(A):
    p,m,n = A.shape
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.empty(maxn,dtype=bool)
    count = np.empty(N,dtype=int)
    out = nunique_loopy_func(mask, N, A, p, count).reshape(m,n)
    A.shape = (-1,m,n)
    return out

Runtime test -

In [328]: np.random.seed(0)

In [329]: A = np.random.randint(0,100,(100,100,100))

In [330]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 11.1 ms per loop

# @B.M.'s numba soln
In [331]: %timeit countunique2(A,A.max()+1)
100 loops, best of 3: 3.43 ms per loop

# Numba soln posted in this post
In [332]: %timeit nunique_axis0_numba(A)
100 loops, best of 3: 2.76 ms per loop
like image 36
Divakar Avatar answered Oct 19 '22 06:10

Divakar