Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to get union of set of vectors in Numpy

I'm trying to implement a specific binary search algorithm. "Results" should be an empty set in the beginning, and during the search, Results variable will become a union with the new results that we get.

Basically:

results = set()
for result in search():
  results = results.union(result)

But such code won't really work with Numpy arrays, so we use np.union1d for this purpose:

results = np.array([])
for result in search():
    result = np.union1d(results, result)

The code above doesn't really work either, since if we have for example two vectors a = [1,2,3] and b=[3,4,5], np.union1d(a, b) will return:

[1, 2, 3, 4, 5]

But I want it to return:

[[1, 2, 3], [3,4,5]]

Since there are no duplicate vectors, if we had for example union([[1, 2, 3], [3,4,5]], [1,2,3]), return value shall remain:

[[1, 2, 3], [3,4,5]]


So I would say that I require a numpy array based union.

I also considered using np.append(a, b) and then np.unique(x), but both of the functions project lower dimensional array to higher dimensional one. np.append also has axis=0 property, which retains dimension of all arrays inserted, but I couldn't efficiently implement it without getting dimension error.

Question:

How can I efficiently implement a vector based set? So that points in the union will be considered as vectors instead of scalars, and will retain their vector form and dimension.

like image 251
ShellRox Avatar asked Nov 07 '22 22:11

ShellRox


1 Answers

Here's some basic set operations.

Define a pair of lists (they could be np.array([1,2,3]), but that's not what you show.

In [261]: a = [1,2,3]; b=[3,4,5]

A list of several of those:

In [263]: alist = [a, b, a]
In [264]: alist
Out[264]: [[1, 2, 3], [3, 4, 5], [1, 2, 3]]

I can get the unique values by converting to tuples and putting them in a set.

In [265]: set([tuple(i) for i in alist])
Out[265]: {(1, 2, 3), (3, 4, 5)}

I can also convert that list into a 2d array:

In [266]: arr = np.array(alist)
In [267]: arr
Out[267]: 
array([[1, 2, 3],
       [3, 4, 5],
       [1, 2, 3]])

and get the unique rows with unique and an axis parameter:

In [269]: np.unique(arr, axis=0)
Out[269]: 
array([[1, 2, 3],
       [3, 4, 5]])

Compare the times

In [270]: timeit np.unique(arr, axis=0)
46.5 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [271]: timeit set([tuple(i) for i in alist])
1.01 µs ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Converting array to list or list to array adds some time, but the basic pattern remains.

In [272]: timeit set([tuple(i) for i in arr.tolist()])
1.53 µs ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [273]: timeit np.unique(alist, axis=0)
53.3 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For larger, realistic sources relative timings may change a bit, but I expect that the set of tuples will be remain the best. Set operations are not a numpy strong point. unique does a sort, followed by a elimination of duplicates. set uses a hashing method, similar to what Python uses for dictionaries.

If you must collect values iteratively from a source, I'd suggest building a list, and doing the set/unique once.

alist = []
for x in source():
    alist.append(x)

or one of:

alist = [x for x in source()]
alist = list(source())
alist = [tuple(x) for x in source()]
alist = [tuple(x.tolist()) for x in source()]
like image 58
hpaulj Avatar answered Nov 13 '22 17:11

hpaulj