Is there any fast way to obtain unique elements in NumPy? I have code similar to this (the last line is the relevant one):
tab = numpy.arange(100000000)
indices1 = numpy.random.permutation(10000)
indices2 = indices1.copy()
indices3 = indices1.copy()
indices4 = indices1.copy()
result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))
This is just an example; in my situation indices1, indices2, ..., indices4 contain different sets of indices and have various sizes. The last line is executed many times, and I noticed that it is actually the bottleneck in my code ({numpy.core.multiarray.arange} to be precise). Besides, ordering is not important and the elements of the indices arrays are of int32 type. I was thinking about using a hash table with the element value as key, and tried:
seq = itertools.chain(tab[indices1].flatten(), tab[indices2].flatten(), tab[indices3].flatten(), tab[indices4].flatten())
myset = {}
map(myset.__setitem__, seq, [])
result = numpy.array(myset.keys())
but it was even worse.
Is there any way to speed this up? I guess the performance penalty comes from the 'fancy indexing' that copies the array, but I only need to read the resulting elements (I don't modify anything).
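As an aside on the fancy-indexing point (a small illustrative check, not part of the original question): advanced indexing always returns a copy, while basic slicing returns a view, which can be verified with np.shares_memory:
import numpy as np

tab = np.arange(10)
view = tab[2:6]           # basic slicing: returns a view of tab
fancy = tab[[2, 3, 4]]    # fancy (advanced) indexing: returns a copy
print(np.shares_memory(tab, view))   # True
print(np.shares_memory(tab, fancy))  # False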
[What follows is actually partially incorrect (see the PS):]
The following way of obtaining the unique elements in all sub-arrays is very fast:
seq = itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat)
result = set(seq)
Note that flat (which returns an iterator) is used instead of flatten() (which returns a full array), and that set() can be called directly (instead of using map() and a dictionary, as in your second method).
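A quick way to see the difference between the two (a minimal illustration with a small array):
import numpy as np

a = np.arange(6).reshape(2, 3)
print(type(a.flat))       # <class 'numpy.flatiter'>: an iterator, no copy made
print(type(a.flatten()))  # <class 'numpy.ndarray'>: a new, flattened copy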
Here are the timing results (obtained in the IPython shell):
>>> %timeit result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))
100 loops, best of 3: 8.04 ms per loop
>>> seq = itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat)
>>> %timeit set(seq)
1000000 loops, best of 3: 223 ns per loop
The set/flat method thus appears to be dramatically faster on this example (but see the PS below).
PS: The timing of set(seq) is actually not representative. In fact, the first loop of the timing empties the seq iterator, and the subsequent set() evaluations return an empty set! The correct timing test is the following:
>>> %timeit set(itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat))
100 loops, best of 3: 9.12 ms per loop
which shows that the set/flat method is actually not faster.
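The pitfall can be reproduced in isolation (a minimal sketch with a tiny array, just to illustrate the iterator exhaustion):
import itertools
import numpy as np

a = np.arange(5)
seq = itertools.chain(a.flat, a.flat)
print(set(seq))  # {0, 1, 2, 3, 4}: the first call consumes the iterator
print(set(seq))  # set(): the iterator is now exhausted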
PPS: here is an (unsuccessful) exploration of mtrw's suggestion; finding the unique indices beforehand might be a good idea, but I can't find a way to implement it which is faster than the above approach:
>>> %timeit set(indices1).union(indices2).union(indices3).union(indices4)
100 loops, best of 3: 11.9 ms per loop
>>> %timeit set(itertools.chain(indices1.flat, indices2.flat, indices3.flat, indices4.flat))
100 loops, best of 3: 10.8 ms per loop
Thus, finding the set of all distinct indices is itself quite slow.
PPPS: numpy.unique(<concatenated array of indices>) is actually 2-3 times faster than set(<concatenated array of indices>). This is key to the speed-up obtained in Bago's answer (unique(concatenate((…)))). The reason is probably that letting NumPy handle its arrays by itself is generally faster than interfacing pure Python (set) with NumPy arrays.
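For reference, here is a hedged sketch of the two approaches being compared (the array sizes are illustrative, not the original benchmark):
import numpy as np

tab = np.arange(10000000)
indices = np.random.randint(0, tab.size, size=40000)
values = tab[indices]

u_numpy = np.unique(values)               # stays entirely inside NumPy
u_python = np.array(sorted(set(values)))  # round-trips through Python objects

assert np.array_equal(u_numpy, u_python)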
Conclusion: this answer therefore only documents failed attempts that should not be fully followed, as well as a possibly useful remark about timing code with iterators…
Sorry I don't completely understand your question, but I'll do my best to help.
First, {numpy.core.multiarray.arange} is numpy.arange, not fancy indexing; unfortunately, fancy indexing does not show up as a separate line item in the profiler. If you're calling np.arange in the loop, you should see if you can move it outside.
In [27]: prun tab[tab]
2 function calls in 1.551 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.551 1.551 1.551 1.551 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
In [28]: prun numpy.arange(10000000)
3 function calls in 0.051 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.047 0.047 0.047 0.047 {numpy.core.multiarray.arange}
1 0.003 0.003 0.051 0.051 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
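The same check can be reproduced outside IPython with the standard-library profiler (a sketch; note that on newer NumPy versions the entry appears as numpy.arange rather than numpy.core.multiarray.arange):
import cProfile
import numpy as np

tab = np.arange(10000000)
cProfile.run('tab[tab]')             # fancy indexing: no separate profiler entry
cProfile.run('np.arange(10000000)')  # arange shows up as its own line item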
Second, I assume that tab is not np.arange(a, b) in your code, because if it is, then tab[index] == index + a; I assume that was just for your example.
Third, np.concatenate is about 10 times faster than np.array
In [47]: timeit numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
100 loops, best of 3: 5.11 ms per loop
In [48]: timeit numpy.concatenate([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
1000 loops, best of 3: 544 us per loop
(Also, np.concatenate gives a (4*n,) array and np.array gives a (4, n) array, where n is the length of indices1-4. The latter will only work if indices1-4 are all the same length.)
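A tiny illustration of that shape difference (hypothetical small arrays):
import numpy as np

parts = [np.arange(3), np.arange(3), np.arange(3), np.arange(3)]
print(np.concatenate(parts).shape)  # (12,): one flat array
print(np.array(parts).shape)        # (4, 3): stacks the pieces into a 2-D array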
And last, you could also save even more time if you can do the following:
indices = np.unique(np.concatenate((indices1, indices2, indices3, indices4)))
result = tab[indices]
Doing it in this order is faster because you reduce the number of indices you need to look up in tab, but it'll only work if you know that the elements of tab are unique (otherwise you could get repeats in result even if the indices are unique).
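A small example of that caveat (hypothetical values, just to show the failure mode when tab contains duplicates):
import numpy as np

tab = np.array([5, 7, 5, 9])
indices = np.unique(np.array([0, 2, 3]))
print(tab[indices])             # [5 5 9]: 5 appears twice even though the indices are unique
print(np.unique(tab[indices]))  # [5 9]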
Hope that helps