
Removing duplicates from a list of numPy arrays

I have an ordinary Python list that contains (multidimensional) NumPy arrays, all of the same shape and with the same number of values. Some of the arrays in the list are duplicates of earlier ones.

I want to remove all the duplicates, but the fact that the elements are NumPy arrays complicates this a bit...

• I can't use set(), as NumPy arrays are not hashable.
• I can't check for duplicates during insertion, as the arrays are generated in batches by a function and added to the list with .extend().
• NumPy arrays aren't directly comparable without resorting to one of NumPy's own functions, so I can't just use something like "if x in list"...
• The contents of the list need to remain NumPy arrays at the end of the process; I could compare copies of the arrays converted to nested lists, but I can't convert the arrays to plain Python lists permanently.

Any suggestions on how I can remove duplicates efficiently here?

asked Jan 03 '15 by SoItBegins

People also ask

How do you remove duplicates from a list in Python?

Create a dictionary, using the list items as keys. This automatically removes any duplicates, because dictionaries cannot have duplicate keys.

How do you remove duplicate values from an array in Python?

You can remove duplicates from a Python list using dict.fromkeys(), which generates a dictionary whose keys are the list items and therefore discards any duplicate values.
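
As a quick illustration of the dict.fromkeys() idea (a minimal sketch for plain, hashable list items; it does not apply directly to NumPy arrays, which are unhashable and would raise a TypeError as dict keys):

nums = [1, 2, 3, 2, 1, 4]
# dict keys are unique, so duplicates collapse; insertion order is preserved.
unique_nums = list(dict.fromkeys(nums))
print(unique_nums)  # [1, 2, 3, 4]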

Can you remove duplicates from an array?

We can remove duplicate elements from an array in two ways: using a temporary array or using a separate index. To remove duplicates this way, the array must be in sorted order; if it is not sorted, you can sort it first by calling Arrays.sort(arr).
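
The sort-then-deduplicate idea translates directly to NumPy for a single 1-D array (a minimal sketch; note that, unlike the original question, this sorts the values and works on one array rather than a list of arrays):

import numpy as np

arr = np.array([3, 1, 2, 3, 1, 4])
arr_sorted = np.sort(arr)
# Keep the first element, then every element that differs from its predecessor.
deduped = arr_sorted[np.concatenate(([True], arr_sorted[1:] != arr_sorted[:-1]))]
print(deduped)  # [1 2 3 4]

(np.unique(arr) does the same thing in one call.)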


3 Answers

Using the solutions here: Most efficient property to hash for numpy array, we see that hashing works best with a.tostring() if a is a NumPy array. So:

import numpy as np

arraylist = [np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4])]
# The raw bytes of each array act as a hashable dict key, so duplicate
# arrays collapse onto a single entry.
L = {array.tostring(): array for array in arraylist}
L.values()  # [array([1, 3, 2, 4]), array([1, 2, 3, 4])] (wrap in list() on Python 3)
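
One caveat: two arrays with the same underlying bytes but different shapes or dtypes would collide under a pure-bytes key. A slightly more defensive sketch, assuming a modern NumPy where .tobytes() is the non-deprecated spelling of .tostring():

import numpy as np

def dedupe_arrays(arrays):
    """Remove exact duplicates from a list of arrays, preserving order.

    The dict key combines shape, dtype and raw bytes so that arrays that
    merely share the same byte content are not conflated.
    """
    seen = {}
    for a in arrays:
        key = (a.shape, a.dtype.str, a.tobytes())
        seen.setdefault(key, a)
    return list(seen.values())

arraylist = [np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4])]
print(dedupe_arrays(arraylist))  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]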
answered by Joel


Depending on the structure of your data, it may be quicker to compare all the arrays directly rather than finding some way to hash them. The algorithm is O(n^2), but each individual comparison will be much quicker than creating strings or Python lists from your arrays. So it depends on how many arrays you have to check.

e.g.

import numpy

# Keep an array only if it is not equal to any array we have already accepted.
uniques = []
for arr in possible_duplicates:
    if not any(numpy.array_equal(arr, unique_arr) for unique_arr in uniques):
        uniques.append(arr)
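
For instance, on the same sample list used above (a minimal usage sketch; possible_duplicates stands in for whatever your batch-generating function produced):

import numpy

possible_duplicates = [numpy.array([1, 2, 3, 4]),
                       numpy.array([1, 2, 3, 4]),
                       numpy.array([1, 3, 2, 4])]

uniques = []
for arr in possible_duplicates:
    # numpy.array_equal checks both shape and element-wise equality.
    if not any(numpy.array_equal(arr, unique_arr) for unique_arr in uniques):
        uniques.append(arr)

print(uniques)  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]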
answered by Dunes


Here is one way using tuple:

>>> import numpy as np
>>> t = [np.asarray([1, 2, 3, 4]), 
         np.asarray([1, 2, 3, 4]), 
         np.asarray([1, 1, 3, 4])]

>>> map(np.asarray, set(map(tuple, t)))  # wrap in list() on Python 3
[array([1, 1, 3, 4]), array([1, 2, 3, 4])]

If your arrays are multidimensional, then first flatten them to a 1-by-whatever array, then use the same idea, and reshape them at the end:

def to_tuple(arr):
    # Flatten to 1-D and convert to a hashable tuple.
    return tuple(arr.reshape((arr.size,)))

def from_tuple(tup, original_shape):
    # Rebuild an array of the original shape from the flat tuple.
    return np.asarray(tup).reshape(original_shape)

Example:

In [64]: t = np.asarray([[[1,2,3],[4,5,6]],
                         [[1,1,3],[4,5,6]],
                         [[1,2,3],[4,5,6]]])

In [65]: map(lambda x: from_tuple(x, t[0].shape), set(map(to_tuple, t)))
Out[65]: 
[array([[1, 2, 3],
        [4, 5, 6]]), 
 array([[1, 1, 3],
        [4, 5, 6]])]

Another option is to create a pandas.DataFrame out of your ndarrays (treating them as rows by reshaping if needed) and using the pandas built-ins for uniquifying the rows.

In [34]: t
Out[34]: [array([1, 2, 3, 4]), array([1, 2, 3, 4]), array([1, 1, 3, 4])]

In [35]: pandas.DataFrame(t).drop_duplicates().values
Out[35]: 
array([[1, 2, 3, 4],
       [1, 1, 3, 4]])
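
For multidimensional arrays, the same idea works if you flatten each array to a single row first and restore the shape afterwards (a minimal sketch; the round trip through the DataFrame assumes all arrays share one shape and dtype):

import numpy as np
import pandas as pd

t = [np.array([[1, 2, 3], [4, 5, 6]]),
     np.array([[1, 1, 3], [4, 5, 6]]),
     np.array([[1, 2, 3], [4, 5, 6]])]

shape = t[0].shape
# Flatten each array to one row, drop duplicate rows, then restore the shape.
flat = pd.DataFrame([a.ravel() for a in t]).drop_duplicates().values
uniques = [row.reshape(shape) for row in flat]
# uniques == [array([[1, 2, 3], [4, 5, 6]]), array([[1, 1, 3], [4, 5, 6]])]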

Overall, I think it's a bad idea to try to use tostring() as a quasi hash function, because you'll need more boilerplate code than in my approach just to protect against the possibility that some of the contents are mutated after they've been assigned their "hash" key in some dict.

If the reshaping and converting to tuple is too slow given the size of the data, my feeling is that this reveals a more fundamental problem: the application isn't designed well around its needs (like de-duplication), and trying to cram them into some Python process running in memory is probably not the right way. At that point, I would stop and consider whether something like Cassandra, which can easily build database indices on top of large columns (or multidimensional arrays) of floating-point (or other) data, isn't the saner approach.

answered by ely