Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find unique rows in numpy.array

People also ask

How do I find unique rows in Numpy?

To find unique rows in a NumPy array we are using numpy. unique() function of NumPy library.

How do I find unique elements in a Numpy array in Python?

unique() function. The unique() function is used to find the unique elements of an array. Returns the sorted unique elements of an array.

How do you find unique rows in Python?

Get the unique values (distinct rows) of the dataframe in python pandas. drop_duplicates() function is used to get the unique values (rows) of the dataframe in python pandas. The above drop_duplicates() function removes all the duplicate rows and returns only unique rows.

How do I find unique values in an array?

You can find the distinct values in an array using the Distinct function. The Distinct function takes the array as an input parameter and returns another array that consists only of the unique, or non-duplicate, elements.


As of NumPy 1.13, one can simply choose the axis for selection of unique values in any N-dim array. To get unique rows, one can do:

unique_rows = np.unique(original_array, axis=0)


Yet another possible solution

np.vstack({tuple(row) for row in a})

Another option to the use of structured arrays is using a view of a void type that joins the whole row into a single item:

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_a = a[idx]

>>> unique_a
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

EDIT Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.

EDIT The above can be slightly sped up, perhaps at the cost of clarity, by doing:

unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])

Also, at least on my system, performance wise it is on par, or even better, than the lexsort method:

a = np.random.randint(2, size=(10000, 6))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop

a = np.random.randint(2, size=(10000, 100))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop

If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.

The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.

As a quick example:

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)

uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print uniq

To understand what's going on, have a look at the intermediary results.

Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)

In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(1, 1, 1, 0, 0, 0)],
       [(1, 1, 1, 1, 1, 0)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

Once we run numpy.unique, we'll get a structured array back:

In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

That we then need to view as a "normal" array (_ stores the result of the last calculation in ipython, which is why you're seeing _.view...):

In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

And then reshape back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, give the number of columns):

In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

Obviously, if you wanted to be more concise, you could write it as:

import numpy as np

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
print unique_rows(data)

Which results in:

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]

np.unique when I run it on np.random.random(100).reshape(10,10) returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:

array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)

That is the only way I see you changing the types to do what you want, and I am not sure if the list iteration to change to tuples is okay with your "not looping through"


np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:

ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]

This method does not use tuples, and should be much faster and simpler than other methods given here.

NOTE: A previous version of this did not have the ind right after a[, which mean that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:

b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]

This is faster and uses less memory.

Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:

b = a[lexsort(a.reshape((a.shape[0],-1)).T)];
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]

An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.

Edit:

To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster:

In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop

In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop

In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop

With a larger a, however, this version ends up being much, much faster:

In [96]: a = np.random.randint(0,2,size=(10000,6))

In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop

In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop

In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop