When running np.unique(), it first flattens the array, sorts the array, then finds the unique values. When I have arrays with shape (10, 3000, 3000), it takes about a second to find the uniques, but this quickly adds up as I need to call np.unique() multiple times. Since I only care about the total number of unique numbers in an array, sorting seems like a waste of time.
Is there a faster method of find the total number of unique values in a large array other than np.unique()?
Here's a method that works for an array with dtype np.uint8
that is faster than np.unique
.
First, create an array to work with:
In [128]: a = np.random.randint(1, 128, size=(10, 3000, 3000)).astype(np.uint8)
For later comparison, find the unique values using np.unique
:
In [129]: u = np.unique(a)
Here's the faster method; v
will contain the result:
In [130]: q = np.zeros(256, dtype=int)
In [131]: q[a.ravel()] = 1
In [132]: v = np.nonzero(q)[0]
Verify that we got the same result:
In [133]: np.array_equal(u, v)
Out[133]: True
Timing:
In [134]: %timeit u = np.unique(a)
2.86 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [135]: %timeit q = np.zeros(256, dtype=int); q[a.ravel()] = 1; v = np.nonzero(q)
300 ms ± 5.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So 2.86 seconds for np.unique()
, and 0.3 seconds for the alternative method.
We could leverage the fact that the elements are restricted to uint8
range by binned-counting with np.bincount
and then simply count the number of non-zeros in it. Since, np.bincount
expects a 1D
array, we would flatten the input with np.ravel()
and then feed it to bincount
.
Hence, the implementation would be -
(np.bincount(a.ravel())!=0).sum()
Runtime test
Helper function to create input array with various number of unique numbers -
def create_input(n_unique):
unq_nums = np.random.choice(np.arange(256), n_unique,replace=0)
return np.random.choice(unq_nums, (10,3000,3000)).astype(np.uint8)
Other approach(es) :
# @Warren Weckesser's soln
def assign_method(a):
q = np.zeros(256, dtype=int)
q[a.ravel()] = 1
return len(np.nonzero(q)[0])
Verification of proposed method -
In [141]: a = create_input(n_unique=120)
In [142]: len(np.unique(a))
Out[142]: 120
In [143]: (np.bincount(a.ravel())!=0).sum()
Out[143]: 120
Timings -
In [124]: a = create_input(n_unique=128)
In [125]: %timeit len(np.unique(a)) # Original soln
...: %timeit assign_method(a) # @Warren Weckesser's soln
...: %timeit (np.bincount(a.ravel())!=0).sum()
...:
1 loop, best of 3: 3.09 s per loop
1 loop, best of 3: 394 ms per loop
1 loop, best of 3: 209 ms per loop
In [126]: a = create_input(n_unique=256)
In [127]: %timeit len(np.unique(a)) # Original soln
...: %timeit assign_method(a) # @Warren Weckesser's soln
...: %timeit (np.bincount(a.ravel())!=0).sum()
...:
1 loop, best of 3: 3.46 s per loop
1 loop, best of 3: 378 ms per loop
1 loop, best of 3: 212 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With