For example, for
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
I want to get
[2, 2, 3]
Is there a way to do this without for loops or using np.vectorize?
Edit: Actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The ultimate goal is to determine the percentage of rows that have duplicates. This was a homework problem which I already solved (with a for loop), but I was just wondering if there was a better way to do it with numpy.
Approach #1
One vectorized approach with sorting -
In [8]: b = np.sort(a,axis=1)
In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
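Since the ultimate goal in the question is the percentage of rows that contain duplicates, note that a row has a duplicate exactly when its unique count is smaller than the row length. A minimal sketch on top of the sorted counts (the variable names are just illustrative) -
import numpy as np
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
b = np.sort(a, axis=1)
nuniq = (b[:,1:] != b[:,:-1]).sum(axis=1) + 1     # unique count per row
# a row has duplicates iff its unique count < number of columns
pct_with_dups = 100.0 * (nuniq < a.shape[1]).mean()
print(pct_with_dups)    # ~66.67 here: the first two rows contain duplicates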
Approach #2
Another method for ints that aren't very large: offset each row so that its elements can't collide with elements from any other row, then do a binned summation with np.bincount and count the number of non-zero bins per row -
n = a.max()+1                                   # offset step: larger than any element
a_off = a + (np.arange(a.shape[0])[:,None])*n   # shift each row into its own disjoint value range
M = a.shape[0]*n                                # total number of bins
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
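To make the offsetting concrete, here's what a_off looks like for the sample array (a sketch; the values follow directly from the formula above) -
import numpy as np
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
n = a.max() + 1                                   # 5
a_off = a + np.arange(a.shape[0])[:,None]*n       # rows mapped into disjoint ranges:
# [[ 1,  0,  0],
#  [ 6,  5,  5],
#  [12, 13, 14]]
# bincount the flattened array, reshape to (rows, n);
# each row's non-zero bins are exactly its distinct values
counts = np.bincount(a_off.ravel(), minlength=a.shape[0]*n).reshape(-1, n)
print((counts != 0).sum(1))                       # [2 2 3]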
Approaches as funcs -
def sorting(a):
    b = np.sort(a, axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1) + 1

def bincount(a):
    n = a.max() + 1
    a_off = a + (np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n) != 0).sum(1)

# From @wim's post (needs: import pandas as pd)
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln (needs: from toolz import compose)
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a)
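As a quick sanity check (assuming numpy, pandas and toolz are imported as noted above), all four functions agree on the 3x3 sample -
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
ref = sorting(a)
assert np.array_equal(ref, bincount(a))
assert np.array_equal(ref, np.asarray(pandas(a)))
assert np.array_equal(ref, numpy_apply(a))
print(ref)    # [2 2 3]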
Case #1 : Square-shaped array
In [164]: np.random.seed(0)
In [165]: a = np.random.randint(0,5,(10000,10000))
In [166]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop
Case #2 : Large number of rows
In [167]: np.random.seed(0)
In [168]: a = np.random.randint(0,5,(1000000,10))
In [169]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop
Extending to number of unique elements per column
To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -
def nunique_percol_sort(a):
    b = np.sort(a, axis=0)
    return (b[1:] != b[:-1]).sum(axis=0) + 1

def nunique_percol_bincount(a):
    n = a.max() + 1
    a_off = a + (np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n) != 0).sum(1)
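A quick way to sanity-check these (a sketch that reuses the row-wise sorting function from earlier on the transpose) -
import numpy as np
a = np.random.randint(0, 5, (6, 4))
assert np.array_equal(nunique_percol_sort(a), sorting(a.T))
assert np.array_equal(nunique_percol_bincount(a), sorting(a.T))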
Let's see how we can extend this to an ndarray of generic dimensions and get the number of unique counts along a generic axis. We will make use of np.diff with its axis param to get the consecutive differences and hence make it generic, like so -
def nunique(a, axis):
    return (np.diff(np.sort(a, axis=axis), axis=axis) != 0).sum(axis=axis) + 1
Sample runs -
In [77]: a
Out[77]:
array([[1, 0, 2, 2, 0],
[1, 0, 1, 2, 0],
[0, 0, 0, 0, 2],
[1, 2, 1, 0, 1],
[2, 0, 1, 0, 0]])
In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])
In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])
If you are working with floating-point numbers and want to base the uniqueness check on some tolerance value rather than an absolute match, we can use np.isclose. Two such options would be -
(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)
For a custom tolerance value, pass it to np.isclose (via its rtol/atol arguments).
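For example, wrapping the first option as a small helper with an explicit tolerance (the function name and the atol value are just illustrative) -
import numpy as np

def nunique_tol(a, axis, atol=1e-8):
    # values are treated as equal when, after sorting, they are within atol of each other
    d = np.diff(np.sort(a, axis=axis), axis=axis)
    return (~np.isclose(d, 0, atol=atol)).sum(axis) + 1

a = np.array([[0.1, 0.1 + 1e-10, 0.3],
              [0.5, 0.6,         0.7]])
print(nunique_tol(a, axis=1))    # [2 3]: the first two values in row 0 merge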
This solution via np.apply_along_axis isn't vectorised and involves a Python-level loop, but it is relatively intuitive, using the len and np.unique functions.
import numpy as np
from toolz import compose
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
np.apply_along_axis(compose(len, np.unique), 1, a) # [2, 2, 3]
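If you'd rather avoid the toolz dependency, the same idea works with a plain lambda -
np.apply_along_axis(lambda row: len(np.unique(row)), 1, a)    # array([2, 2, 3])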