 

Numpy remove duplicate column values

I have a numpy array as follows:

array([[ 6,  5],
   [ 6,  9],
   [ 7,  5],
   [ 7,  9],
   [ 8, 10],
   [ 9, 10],
   [ 9, 11],
   [10, 10]])

I want to pick elements such that the y coordinates are unique. If two y coordinates are the same, I want to pick the element with the smaller x coordinate.

Expected output

array([[ 6,  5],
   [ 6,  9],
   [ 8, 10],
   [ 9, 11]])

Explanation

pick [6,5] over [7,5]

pick [8,10] over [9,10] and [10,10]

pick [9, 11]

Thanks

asked Jul 08 '18 by user009122


3 Answers

First, sort by the first column:

a = a[a[:, 0].argsort()]

Then index with the first-occurrence positions from np.unique, using the return_index flag (after the sort, the first occurrence of each y is the row with the smallest x):

a[np.unique(a[:, 1], return_index=True)[1]]

array([[ 6,  5],
       [ 6,  9],
       [ 8, 10],
       [ 9, 11]])
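Putting the two steps above together, a minimal runnable version looks like this:

```python
import numpy as np

a = np.array([[6, 5], [6, 9], [7, 5], [7, 9],
              [8, 10], [9, 10], [9, 11], [10, 10]])

# Sort rows by x so that, within each group of equal y, the smallest x comes first
a = a[a[:, 0].argsort()]

# np.unique with return_index gives the index of the FIRST occurrence of each
# unique y, which after the sort is the row with the smallest x
result = a[np.unique(a[:, 1], return_index=True)[1]]
print(result)
```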

Some timings:

a = np.random.randint(1, 10, 10000).reshape(-1, 2)

In [45]: %timeit rows_by_unique_y(a)
3.83 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %timeit argsort_unique(a)
370 µs ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Yes, my approach uses an initial sort, but vectorized operations in numpy beat iteration in Python.

answered Oct 21 '22 by user3483203


If you are open to using another library, I would suggest numpy_indexed for an efficient and compact solution:

import numpy as np
import numpy_indexed as npi

a = np.array([[6, 5], [6, 9], [7, 5], [7, 9], [8, 10], [9, 10], [9, 11], [10, 10]])

column_to_groupby = 1
groups, reduced = npi.group_by(a[:,column_to_groupby]).min(a)
print(reduced)

It gives the following output

[[ 6  5]
 [ 6  9]
 [ 8 10]
 [ 9 11]]

Here is the timeit result

In [5]: a = np.random.randint(1, 10, 10000).reshape(-1, 2)

In [6]: %timeit npi.group_by(a[:,1]).min(a)
354 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
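For readers who want to stay in plain numpy, the same group-by-minimum can be sketched with a stable sort plus np.minimum.reduceat. This is a hypothetical alternative, not part of the original answer:

```python
import numpy as np

a = np.array([[6, 5], [6, 9], [7, 5], [7, 9],
              [8, 10], [9, 10], [9, 11], [10, 10]])

# Stable sort by y so rows with equal y are contiguous
order = np.argsort(a[:, 1], kind="stable")
s = a[order]

# Index where each run of equal y values starts
starts = np.flatnonzero(np.r_[True, s[1:, 1] != s[:-1, 1]])

# Minimum x within each run, paired with that run's y value
result = np.column_stack([np.minimum.reduceat(s[:, 0], starts), s[starts, 1]])
print(result)
```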
answered Oct 21 '22 by lakshayg


One approach is to loop through the array, keeping track of the best value seen for each y, then reconstruct the array at the end:

import numpy as np
from collections import defaultdict

def rows_by_unique_y(arr):
  # Map each y to the smallest x seen so far
  best_for_y = defaultdict(lambda: float('inf'))
  for row in arr:
    x, y = row[0], row[1]
    best_for_y[y] = min(x, best_for_y[y])
  return np.array([[x, y] for y, x in best_for_y.items()])

arr = np.array([[6,  5], [6,  9], [7,  5], [7,  9], [8, 10], [9, 10], [9, 11], [10, 10]])
print(rows_by_unique_y(arr))

No need to sort, just keep track of the minimums. This outputs:

[[ 6  5]
 [ 6  9]
 [ 8 10]
 [ 9 11]]

While this answer is asymptotically faster, user3483203's answer is much better in practice. This is because it calls out to optimized C code rather than staying inside of Python's surprisingly slow interpreter. However, if your arrays are huge (several gigabytes), then the O(n log n) sorting cost will start to lose to this O(n) approach.

At the same time, if your arrays are that large, you should probably be using a MapReduce framework like Spark instead. The algorithm I gave above is easily parallelized.
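As an illustration of that claim, the map-reduce structure might be sketched in plain Python as below. The helper names are hypothetical, and a real deployment would use Spark's own API rather than functools:

```python
import numpy as np
from functools import reduce

def local_min(chunk):
    # "Map" step: per-chunk dict of y -> smallest x
    best = {}
    for x, y in chunk:
        if y not in best or x < best[y]:
            best[y] = x
    return best

def merge(d1, d2):
    # "Reduce" step: combine two partial results
    for y, x in d2.items():
        if y not in d1 or x < d1[y]:
            d1[y] = x
    return d1

arr = np.array([[6, 5], [6, 9], [7, 5], [7, 9],
                [8, 10], [9, 10], [9, 11], [10, 10]])

# Process independent chunks, then merge the partial minima
parts = [local_min(c) for c in np.array_split(arr, 3)]
best = reduce(merge, parts)
result = np.array(sorted([x, y] for y, x in best.items()))
print(result)
```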


If you don't need the minimum x values, then the following one-liner using np.unique works:

arr[np.unique(arr[:,1], return_index=True)[1]]

but this returns

array([[ 6,  5],
       [ 6,  9],
       [10, 10],
       [ 9, 11]])

if the rows [8, 10] and [10, 10] are swapped in the input, because np.unique keeps the first occurrence of each y rather than the one with the smallest x.
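The order dependence can be seen directly (a minimal sketch with those two rows swapped):

```python
import numpy as np

# Same input, but with the rows [8, 10] and [10, 10] swapped
arr = np.array([[6, 5], [6, 9], [7, 5], [7, 9],
                [10, 10], [9, 10], [9, 11], [8, 10]])

# np.unique keeps the index of the first occurrence of each y,
# so [10, 10] is selected instead of [8, 10]
out = arr[np.unique(arr[:, 1], return_index=True)[1]]
print(out)
```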

answered Oct 21 '22 by Alex Reinking