I have an ordinary Python list that contains (multidimensional) NumPy arrays, all of the same shape and with the same number of values. Some of the arrays in the list are duplicates of earlier ones.
I want to remove all the duplicates, but the fact that the elements are NumPy arrays complicates this a bit...
• I can't use set(), as NumPy arrays are not hashable.
• I can't check for duplicates during insertion, as the arrays are generated in batches by a function and added to the list with .extend().
• NumPy arrays aren't directly comparable without resorting to one of NumPy's own functions, so I can't just use something like "if x in list"...
• The contents of the list need to remain NumPy arrays at the end of the process; I could compare copies of the arrays converted to nested lists, but I can't convert the arrays to plain Python lists permanently.
Any suggestions on how I can remove duplicates efficiently here?
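To illustrate the two failure modes (the array contents below are just placeholders, not my actual data):
import numpy as np

batch = [np.array([1, 2, 3]), np.array([1, 2, 3])]  # second entry duplicates the first

try:
    set(batch)                        # ndarrays are unhashable
except TypeError as e:
    print(e)                          # unhashable type: 'numpy.ndarray'

try:
    np.array([1, 2, 3]) in batch      # "in" compares element-wise, so bool() is ambiguous
except ValueError as e:
    print(e)                          # truth value of an array with more than one element is ambiguous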
Using the solutions here: Most efficient property to hash for numpy array, we see that hashing works best with a.tostring() if a is a NumPy array. So:
import numpy as np
arraylist = [np.array([1,2,3,4]), np.array([1,2,3,4]), np.array([1,3,2,4])]
L = {array.tostring(): array for array in arraylist}
list(L.values())  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]
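Two caveats worth noting: tostring() is deprecated in recent NumPy releases in favour of tobytes(), and arrays that happen to share the same raw bytes but have different shapes or dtypes would collide on a bytes-only key. A sketch that guards against both (the variable names here are my own):
import numpy as np

arraylist = [np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4])]

# Key on shape, dtype and raw bytes so that, e.g., a (2, 2) and a (4,) array
# holding identical bytes are not collapsed into one entry.
unique = {(a.shape, a.dtype.str, a.tobytes()): a for a in arraylist}
deduped = list(unique.values())  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]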
Depending on the structure of your data, it may be quicker to directly compare all the arrays rather than finding some way to hash them. The algorithm is O(n^2), but each individual comparison will be much quicker than creating strings or Python lists of your arrays. So it depends how many arrays you have to check.
e.g.
import numpy

uniques = []
for arr in possible_duplicates:
    if not any(numpy.array_equal(arr, unique_arr) for unique_arr in uniques):
        uniques.append(arr)
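Since the question says all the arrays share the same shape, a further option (not part of the answer above; it assumes NumPy >= 1.13 for the axis argument) is to stack the list into a single array and let np.unique de-duplicate along the first axis. Unlike the loop above, this returns the unique arrays in sorted order rather than insertion order:
import numpy as np

possible_duplicates = [np.array([1, 2, 3, 4]),
                       np.array([1, 2, 3, 4]),
                       np.array([1, 3, 2, 4])]

# Stack into one (n, 4) array and drop duplicate rows; the result comes back sorted.
stacked = np.stack(possible_duplicates)
uniques = list(np.unique(stacked, axis=0))  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]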
Here is one way using tuple:
>>> import numpy as np
>>> t = [np.asarray([1, 2, 3, 4]),
...      np.asarray([1, 2, 3, 4]),
...      np.asarray([1, 1, 3, 4])]
>>> list(map(np.asarray, set(map(tuple, t))))
[array([1, 1, 3, 4]), array([1, 2, 3, 4])]
If your arrays are multidimensional, then first flatten them to a 1-by-whatever array, then use the same idea, and reshape them at the end:
def to_tuple(arr):
    return tuple(arr.reshape((arr.size,)))

def from_tuple(tup, original_shape):
    return np.asarray(tup).reshape(original_shape)
Example:
In [64]: t = np.asarray([[[1,2,3],[4,5,6]],
    ...:                 [[1,1,3],[4,5,6]],
    ...:                 [[1,2,3],[4,5,6]]])
In [65]: list(map(lambda x: from_tuple(x, t[0].shape), set(map(to_tuple, t))))
Out[65]:
[array([[1, 2, 3],
[4, 5, 6]]),
array([[1, 1, 3],
[4, 5, 6]])]
Another option is to create a pandas.DataFrame out of your ndarrays (treating them as rows, reshaping if needed) and using the pandas built-ins for uniquifying the rows.
In [34]: t
Out[34]: [array([1, 2, 3, 4]), array([1, 2, 3, 4]), array([1, 1, 3, 4])]
In [35]: pandas.DataFrame(t).drop_duplicates().values
Out[35]:
array([[1, 2, 3, 4],
[1, 1, 3, 4]])
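For multidimensional arrays the same idea works if you flatten each array into a row first and reshape afterwards; a minimal sketch (the sample data and variable names are my own, not from the question):
import numpy as np
import pandas as pd

t = [np.array([[1, 2, 3], [4, 5, 6]]),
     np.array([[1, 1, 3], [4, 5, 6]]),
     np.array([[1, 2, 3], [4, 5, 6]])]

shape = t[0].shape                            # all arrays share this shape
flat = pd.DataFrame([a.ravel() for a in t])   # one flattened array per row
unique_rows = flat.drop_duplicates().values   # keeps the first occurrence of each row
uniques = [row.reshape(shape) for row in unique_rows]
# [array([[1, 2, 3], [4, 5, 6]]), array([[1, 1, 3], [4, 5, 6]])]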
Overall, I think it's a bad idea to try to use tostring() as a quasi-hash function, because you'll need more boilerplate code than in my approach just to protect against the possibility that some of the contents are mutated after they've been assigned their "hash" key in some dict.
If the reshaping and converting to tuple is too slow given the size of the data, my feeling is that this reveals a more fundamental problem: the application isn't designed well around its needs (like de-duping), and trying to cram them into some Python process running in memory is probably not the right way. At that point, I would stop to consider whether something like Cassandra, which can easily build database indices on top of large columns (or multidimensional arrays) of floating-point or other data, isn't the saner approach.