Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently counting number of unique elements - NumPy / Python

When running np.unique(), it first flattens the array, sorts the array, then finds the unique values. When I have arrays with shape (10, 3000, 3000), it takes about a second to find the uniques, but this quickly adds up as I need to call np.unique() multiple times. Since I only care about the total number of unique numbers in an array, sorting seems like a waste of time.

Is there a faster method of find the total number of unique values in a large array other than np.unique()?

like image 875
onepint16oz Avatar asked Oct 04 '17 22:10

onepint16oz


2 Answers

Here's a method that works for an array with dtype np.uint8 that is faster than np.unique.

First, create an array to work with:

In [128]: a = np.random.randint(1, 128, size=(10, 3000, 3000)).astype(np.uint8)

For later comparison, find the unique values using np.unique:

In [129]: u = np.unique(a)

Here's the faster method; v will contain the result:

In [130]: q = np.zeros(256, dtype=int)

In [131]: q[a.ravel()] = 1

In [132]: v = np.nonzero(q)[0]

Verify that we got the same result:

In [133]: np.array_equal(u, v)
Out[133]: True

Timing:

In [134]: %timeit u = np.unique(a)
2.86 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [135]: %timeit q = np.zeros(256, dtype=int); q[a.ravel()] = 1; v = np.nonzero(q)
300 ms ± 5.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So 2.86 seconds for np.unique(), and 0.3 seconds for the alternative method.

like image 190
Warren Weckesser Avatar answered Oct 27 '22 01:10

Warren Weckesser


We could leverage the fact that the elements are restricted to uint8 range by binned-counting with np.bincount and then simply count the number of non-zeros in it. Since, np.bincount expects a 1D array, we would flatten the input with np.ravel() and then feed it to bincount.

Hence, the implementation would be -

(np.bincount(a.ravel())!=0).sum()

Runtime test

Helper function to create input array with various number of unique numbers -

def create_input(n_unique):
    unq_nums = np.random.choice(np.arange(256), n_unique,replace=0)
    return np.random.choice(unq_nums, (10,3000,3000)).astype(np.uint8)

Other approach(es) :

# @Warren Weckesser's soln
def assign_method(a):
    q = np.zeros(256, dtype=int)
    q[a.ravel()] = 1
    return len(np.nonzero(q)[0])

Verification of proposed method -

In [141]: a = create_input(n_unique=120)

In [142]: len(np.unique(a))
Out[142]: 120

In [143]: (np.bincount(a.ravel())!=0).sum()
Out[143]: 120

Timings -

In [124]: a = create_input(n_unique=128)

In [125]: %timeit len(np.unique(a)) # Original soln
     ...: %timeit assign_method(a)  # @Warren Weckesser's soln
     ...: %timeit (np.bincount(a.ravel())!=0).sum()
     ...: 
1 loop, best of 3: 3.09 s per loop
1 loop, best of 3: 394 ms per loop
1 loop, best of 3: 209 ms per loop

In [126]: a = create_input(n_unique=256)

In [127]: %timeit len(np.unique(a)) # Original soln
     ...: %timeit assign_method(a)  # @Warren Weckesser's soln
     ...: %timeit (np.bincount(a.ravel())!=0).sum()
     ...: 
1 loop, best of 3: 3.46 s per loop
1 loop, best of 3: 378 ms per loop
1 loop, best of 3: 212 ms per loop
like image 30
Divakar Avatar answered Oct 27 '22 00:10

Divakar