
Fast duplicates removal in numpy and python

Is there any fast way to obtain unique elements in numpy? I have code similar to this (the relevant part is the last line):

import numpy

tab = numpy.arange(100000000)

indices1 = numpy.random.permutation(10000)
indices2 = indices1.copy()
indices3 = indices1.copy()
indices4 = indices1.copy()

result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))

This is just an example; in my situation indices1, indices2, ..., indices4 contain different sets of indices and have various sizes. The last line is executed many times, and I noticed that it is actually the bottleneck in my code ({numpy.core.multiarray.arange}, to be precise). Besides, ordering is not important and the elements in the indices arrays are of int32 type. I was thinking about using a hashtable with the element value as key, and tried:

import itertools

seq = itertools.chain(tab[indices1].flatten(), tab[indices2].flatten(), tab[indices3].flatten(), tab[indices4].flatten())
myset = {}
map(myset.__setitem__, seq, [])
result = numpy.array(myset.keys())

but it was even worse.

Is there any way to speed this up? I guess the performance penalty comes from 'fancy indexing', which copies the array, but I only need to read the resulting elements (I don't modify anything).
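As a quick sanity check of that guess (a minimal sketch, assuming a NumPy version that provides numpy.shares_memory): basic slicing returns a view into the original array, while fancy indexing always returns a fresh copy:

import numpy

tab = numpy.arange(10)

view = tab[2:6]            # basic slice: a view, shares memory with tab
copy = tab[[2, 3, 4, 5]]   # fancy indexing: always a fresh copy

print(numpy.shares_memory(tab, view))  # True
print(numpy.shares_memory(tab, copy))  # False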

asked Dec 23 '11 by pzo



2 Answers

[What follows is actually partially incorrect (see the PS):]

The following way of obtaining the unique elements in all sub-arrays is very fast:

seq = itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat)
result = set(seq)

Note that flat (which returns an iterator) is used instead of flatten() (which returns a full array), and that set() can be called directly (instead of using map() and a dictionary, as in your second method).
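A minimal sketch of that difference (not from the original answer): flat yields the elements lazily, while flatten() allocates a new array:

import numpy

a = numpy.arange(6).reshape(2, 3)

it = a.flat          # numpy.flatiter: lazy, element by element, no copy
arr = a.flatten()    # a new 1-D array holding a full copy of the data

print(next(it))      # 0
print(arr)           # [0 1 2 3 4 5]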

Here are the timing results (obtained in the IPython shell):

>>> %timeit result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))
100 loops, best of 3: 8.04 ms per loop
>>> seq = itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat)
>>> %timeit set(seq)
1000000 loops, best of 3: 223 ns per loop

The set/flat method thus appears to be about 36,000 times faster on this example (8.04 ms vs. 223 ns).

PS: The timing of set(seq) is actually not representative. In fact, the first timing loop empties the seq iterator, and the subsequent set() evaluations return an empty set! The correct timing test is the following:

>>> %timeit set(itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat))
100 loops, best of 3: 9.12 ms per loop

which shows that the set/flat method is actually not faster.
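The underlying gotcha is easy to reproduce in isolation (a minimal sketch): an exhausted iterator yields nothing on later passes, so only the first timing loop did any real work:

import itertools

seq = itertools.chain([1, 2], [3, 4])

print(set(seq))  # {1, 2, 3, 4} -- the first pass consumes the iterator
print(set(seq))  # set() -- the chain is now exhausted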

PPS: here is an (unsuccessful) exploration of mtrw's suggestion; finding the unique indices beforehand might be a good idea, but I can't find a way to implement it which is faster than the above approach:

>>> %timeit set(indices1).union(indices2).union(indices3).union(indices4)
100 loops, best of 3: 11.9 ms per loop
>>> %timeit set(itertools.chain(indices1.flat, indices2.flat, indices3.flat, indices4.flat))
100 loops, best of 3: 10.8 ms per loop

Thus, finding the set of all distinct indices is itself quite slow.

PPPS: numpy.unique(<concatenated array of indices>) is actually 2-3 times faster than set(<concatenated array of indices>). This is key to the speed up obtained in Bago's answer (unique(concatenate((…)))). The reason is probably that letting NumPy handle its arrays by itself is generally faster than interfacing pure Python (set) with NumPy arrays.
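A sketch for checking this on your own data (the array sizes below are chosen arbitrarily, not taken from the question):

import numpy
import timeit

indices = numpy.concatenate([numpy.random.randint(0, 100000, 10000) for _ in range(4)])

# Pure-NumPy path: sorting and de-duplication stay in compiled code.
t_unique = timeit.timeit(lambda: numpy.unique(indices), number=100)

# Python-set path: every element crosses the NumPy/Python boundary.
t_set = timeit.timeit(lambda: set(indices), number=100)

print(t_unique, t_set)  # numpy.unique is typically a few times faster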

Conclusion: this answer therefore only documents failed attempts that should not be fully followed, as well as a possibly useful remark about timing code with iterators…

answered by Eric O Lebigot


Sorry I don't completely understand your question, but I'll do my best to help.

First, {numpy.core.multiarray.arange} is numpy.arange, not fancy indexing; unfortunately, fancy indexing does not show up as a separate line item in the profiler. If you're calling np.arange inside the loop, you should see whether you can move it outside (see the sketch after the profiler output below):

In [27]: prun tab[tab]
     2 function calls in 1.551 CPU seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    1.551    1.551    1.551    1.551 <string>:1(<module>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [28]: prun numpy.arange(10000000)
     3 function calls in 0.051 CPU seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.047    0.047    0.047    0.047 {numpy.core.multiarray.arange}
    1    0.003    0.003    0.051    0.051 <string>:1(<module>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
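To make the first point concrete, here is a sketch of hoisting the call out of a loop (the loop itself is hypothetical, not taken from the question):

import numpy

# Slow: the expensive arange call is paid on every iteration.
# for _ in range(100):
#     tab = numpy.arange(10000000)
#     ... use tab ...

# Fast: build tab once, reuse it on every iteration.
tab = numpy.arange(10000000)
for _ in range(100):
    pass  # ... use tab here without rebuilding it ...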

Second, I assume that tab is not np.arange(a, b) in your real code, because if it were, then tab[index] == index + a and you wouldn't need the lookup at all; I assume that was just for your example.

Third, np.concatenate is about 10 times faster than np.array:

In [47]: timeit numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
100 loops, best of 3: 5.11 ms per loop

In [48]: timeit numpy.concatenate([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
1000 loops, best of 3: 544 us per loop

(Also, np.concatenate gives a (4*n,) array while np.array gives a (4, n) array, where n is the length of indices1-4. The latter will only work if indices1-4 are all the same length.)
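A quick shape check of that remark (a minimal sketch with equal-length arrays, since np.array requires them):

import numpy

n = 5
parts = [numpy.zeros(n)] * 4

print(numpy.array(parts).shape)        # (4, 5): rows stacked into a 2-D array
print(numpy.concatenate(parts).shape)  # (20,): one flat 1-D array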

And last, you could also save even more time if you can do the following:

indices = np.unique(np.concatenate((indices1, indices2, indices3, indices4)))
result = tab[indices]

Doing it in this order is faster because you reduce the number of indices you need to look up in tab, but it will only work if you know that the elements of tab are unique (otherwise you can get repeats in result even if the indices are unique).
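The caveat is easy to demonstrate (a contrived sketch, not from the answer): if tab itself contains duplicate values, unique indices still produce repeats:

import numpy

tab = numpy.array([7, 7, 9])     # tab contains a duplicated value
indices = numpy.array([0, 1])    # the indices themselves are unique

print(tab[indices])  # [7 7] -- repeats, because tab[0] == tab[1]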

Hope that helps

answered by Bi Rico