What is a faster way to get the location of unique rows in NumPy?

I have a list of unique rows and another, larger array of data (called test_rows in the example). I was wondering if there is a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...

import numpy


uniq_rows = numpy.array([[0, 1, 0],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]])

test_rows = numpy.array([[0, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0],
                         [0, 1, 0],
                         [0, 1, 1],
                         [0, 1, 1],
                         [1, 1, 1],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0]])

# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])

This prints...

[0, 1, 0] [ 1  4 10]
[1, 1, 0] [ 3  8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]

Is there a better or more numpythonic (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find it. Basically for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always have every unique row or the same number.

EDIT: This is just a simple example. In my application the numbers would not be just zeros and ones; they could be anywhere from 0 to 32000. The size of uniq_rows could be between 4 and 128 rows, and the size of test_rows could be in the hundreds of thousands.

Numpy

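From version 1.13 of numpy you can use numpy.unique with the axis argument, like np.unique(test_rows, return_counts=True, return_index=True, axis=0). Use axis=0 to compare whole rows; axis=1 would look for unique columns instead.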
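
To see what such a call returns on the question's data, here is a small sketch. It assumes NumPy 1.13 or later and passes axis=0 so that whole rows are compared; the variable names (rows, first_index, counts) are only for illustration.

```python
import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# axis=0 treats every row as a single item to deduplicate
rows, first_index, counts = np.unique(
    test_rows, return_index=True, return_counts=True, axis=0)

print(rows)         # the distinct rows, sorted lexicographically
print(first_index)  # position of the first occurrence of each distinct row
print(counts)       # how many times each distinct row occurs
```

Note that return_index only reports the first occurrence of each row. To get every location, return_inverse is the more useful flag.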

Pandas

import pandas as pd
df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)

uniq

    0   1   2
0   0   1   0
1   1   1   0
2   1   1   1
3   0   1   1

Or you could generate the unique rows automatically from the incoming DataFrame

uniq_generated = df.drop_duplicates().reset_index(drop=True)

yields

    0   1   2
0   0   1   1
1   0   1   0
2   0   0   0
3   1   1   0
4   1   1   1

and then look up the locations of each unique row

d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values

This is about the same as your where method

d

{0: array([ 1,  4, 10], dtype=int64),
 1: array([ 3,  8, 12], dtype=int64),
 2: array([7, 9], dtype=int64),
 3: array([0, 5, 6], dtype=int64)}
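
If the list of unique rows does not have to be supplied up front, a groupby over all columns may avoid the Python loop. The following is only a sketch: the column names "a", "b", "c" and the variable names are made up for the example, and it relies on the `indices` attribute of a pandas GroupBy object, which maps each group key to the positional indices of its rows.

```python
import numpy as np
import pandas as pd

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# Hypothetical column names, chosen only for this example
df_named = pd.DataFrame(test_rows, columns=["a", "b", "c"])

# One group per distinct (a, b, c) combination;
# .indices gives the row positions belonging to each group
groups = df_named.groupby(["a", "b", "c"]).indices

for key, positions in groups.items():
    print(key, positions)
```

Rows that are not of interest (such as (0, 0, 0) here) show up as extra keys and can simply be ignored when looking up the rows from uniq.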

With np.unique from v1.13 (downloaded from the source link in the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247, and used below as aset):

In [157]: aset.unique(test_rows, axis=0,return_inverse=True,return_index=True)
Out[157]: 
(array([[0, 0, 0],
        [0, 1, 0],
        [0, 1, 1],
        [1, 1, 0],
        [1, 1, 1]]),
 array([2, 1, 0, 3, 7], dtype=int32),
 array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))

In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
     ...:     dd[v].append(i)
     ...:     
In [167]: dd
Out[167]: 
defaultdict(list,
            {0: [2, 11],
             1: [1, 4, 10],
             2: [0, 5, 6],
             3: [3, 8, 12],
             4: [7, 9]})

or indexing the dictionary with the unique rows (as hashable tuple):

In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
     ...:     dd[tuple(a[v])].append(i)
     ...:     
In [172]: dd
Out[172]: 
defaultdict(list,
            {(0, 0, 0): [2, 11],
             (0, 1, 0): [1, 4, 10],
             (0, 1, 1): [0, 5, 6],
             (1, 1, 0): [3, 8, 12],
             (1, 1, 1): [7, 9]})
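
For hundreds of thousands of rows, the Python-level loop over the inverse array can be replaced by a sort and a split. This is a sketch of that idea rather than a benchmarked solution; it assumes a NumPy version whose np.unique accepts axis, and the names order, counts, groups and locations are illustrative.

```python
import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

rows, inverse = np.unique(test_rows, axis=0, return_inverse=True)
inverse = inverse.ravel()  # make sure the group ids are 1-D

# A stable sort keeps the original row positions ascending within each group
order = np.argsort(inverse, kind="stable")
counts = np.bincount(inverse, minlength=len(rows))

# Cut the sorted positions at the group boundaries
groups = np.split(order, np.cumsum(counts)[:-1])

# Key each group by its row, as a hashable tuple
locations = {tuple(row): group for row, group in zip(rows.tolist(), groups)}
print(locations[(0, 1, 0)])
```

The result has the same content as the defaultdict above, with NumPy arrays instead of lists as values.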

There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although broadcasting builds an intermediate boolean array of shape (len(uniq_rows), len(test_rows), number of columns), so memory may become an issue if large arrays are used.

np.where((uniq_rows[:, None, :] == test_rows).all(2))

Wonderfully simple, eh? This returns a tuple of two arrays: the unique row indices and the corresponding test row indices.

 (array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
  array([ 1,  4, 10,  3,  8, 12,  7,  9,  0,  5,  6]))

How it works:

(uniq_rows[:, None, :] == test_rows)

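This uses array broadcasting to compare every row of test_rows with every row of uniq_rows, element by element, which results in a 4x13x3 boolean array. all(2) then reduces over the last axis to mark which pairs of rows are equal (all three comparisons returned True), giving a 4x13 array. Finally, where returns the indices of the True entries: the first array indexes uniq_rows and the second indexes test_rows.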
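
If the intermediate array gets too large, the same comparison can be run over slices of test_rows. The function below is a minimal sketch of that idea, not part of the original answer: the name match_rows_chunked and the chunk_size parameter are made up, and a sensible chunk size depends on the available memory.

```python
import numpy as np

def match_rows_chunked(uniq_rows, test_rows, chunk_size=50_000):
    """Return (uniq_idx, test_idx) with uniq_rows[uniq_idx] == test_rows[test_idx]."""
    uniq_parts, test_parts = [], []
    for start in range(0, len(test_rows), chunk_size):
        chunk = test_rows[start:start + chunk_size]
        # Shape (n_uniq, len(chunk)): True where a whole row matches
        hits = (uniq_rows[:, None, :] == chunk).all(axis=2)
        u, t = np.nonzero(hits)
        uniq_parts.append(u)
        test_parts.append(t + start)  # shift back to positions in test_rows
    if not uniq_parts:  # empty input: nothing to match
        empty = np.empty(0, dtype=np.intp)
        return empty, empty.copy()
    uniq_idx = np.concatenate(uniq_parts)
    test_idx = np.concatenate(test_parts)
    # Group the matches by unique row; the stable sort keeps positions ascending
    order = np.argsort(uniq_idx, kind="stable")
    return uniq_idx[order], test_idx[order]

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# A tiny chunk size, only to exercise the chunking on this small example
uniq_idx, test_idx = match_rows_chunked(uniq_rows, test_rows, chunk_size=5)
print(uniq_idx)
print(test_idx)
```

Each chunk only needs a boolean array of shape (len(uniq_rows), chunk_size, number of columns), so peak memory is bounded by the chunk size rather than by the full length of test_rows.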

Tags: python, numpy, scipy

Asked by b10hazard. 3 answers, from Maarten Fabré, hpaulj, and user2699.
