Python: Remove duplicates for a specific item from list

I have a list of items, from which I want to remove all duplicates of one particular item while keeping any duplicates of the rest. That is, I start with the following list:

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]

I want to remove any duplicates of 0 but keep the duplicates of 1 and 9. My current solution is the following:

mylist = [i for i in mylist if i != 0]
mylist.append(0)

Is there a nicer way of keeping one occurrence of 0 than the following?

for i in mylist:
    if mylist.count(0) > 1:
        mylist.remove(0)

The second approach takes more than double the time for this example.
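For reference, a difference like this can be measured with timeit. A sketch of such a comparison (the count/remove version is rewritten here as a while loop over a copy, so it doesn't mutate the list it iterates over; exact numbers depend on the list and machine):

```python
import timeit

setup = "mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]"

# first approach: rebuild the list without zeros, then append one 0
listcomp = "new = [i for i in mylist if i != 0]; new.append(0)"

# second approach: repeatedly count and remove until one 0 remains
countloop = """\
new = mylist[:]
while new.count(0) > 1:
    new.remove(0)
"""

t1 = timeit.timeit(listcomp, setup, number=100_000)
t2 = timeit.timeit(countloop, setup, number=100_000)
print(f"list comprehension: {t1:.3f}s, count/remove loop: {t2:.3f}s")
```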

Clarification:

currently, I don't care about the order of items in the list, since I sort it after it has been created and cleaned, but that might change later.

  • currently, I only need to remove duplicates for one specific item (that is 0 in my example)

Cryn asked Apr 07 '18 12:04

3 Answers

The solution:

[0] + [i for i in mylist if i]

looks good enough, except that if 0 is not in mylist, you're wrongly adding a 0.

Besides, adding 2 lists like this isn't very good performance-wise. I'd do:

newlist = [i for i in mylist if i]
if len(newlist) != len(mylist):  # 0 was removed, add it back
    newlist.append(0)

(or, using filter: newlist = list(filter(None, mylist)), which could be slightly faster because there are no native Python loops)

Appending to a list at the last position is very efficient (the list object uses pre-allocation, and most of the time no memory is copied). The length-test trick is O(1) and avoids testing 0 in mylist.
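As a quick sanity check of that length test, here is the snippet wrapped in a hypothetical helper (the name squeeze_zeros is mine), exercised both with and without a 0 present:

```python
def squeeze_zeros(mylist):
    # the answer's snippet: drop all falsy items, then restore one 0
    # only if the length changed (i.e. at least one 0 was removed)
    newlist = [i for i in mylist if i]
    if len(newlist) != len(mylist):  # 0 was removed, add it back
        newlist.append(0)
    return newlist

print(squeeze_zeros([4, 1, 2, 6, 1, 0, 9, 8, 0, 9]))  # [4, 1, 2, 6, 1, 9, 8, 9, 0]
print(squeeze_zeros([4, 1, 2]))                        # [4, 1, 2] -- no spurious 0
```

Note that the single kept 0 ends up at the end of the list, which is fine here since the asker sorts the list afterwards.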

Jean-François Fabre answered Nov 12 '22 06:11

It sounds like a better data structure for you to use would be collections.Counter (which is in the standard library):

import collections

counts = collections.Counter(mylist)
counts[0] = 1  # keep exactly one 0
mylist = list(counts.elements())
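One thing to note: elements() groups equal values together, so the original relative order is not preserved (which is fine per the asker's clarification). A quick demonstration on the question's list:

```python
import collections

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]
counts = collections.Counter(mylist)
counts[0] = 1                     # keep exactly one 0
result = list(counts.elements())  # duplicates come out grouped together
print(result)                     # [4, 1, 1, 2, 6, 0, 9, 9, 8]
```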
Daniel Pryden answered Nov 12 '22 06:11


Here is a generator-based approach with O(n) complexity that also preserves the order of the original list:

In [62]: def remove_dup(lst, item):
    ...:     temp = [item]
    ...:     for i in lst:
    ...:         if i != item:
    ...:             yield i
    ...:         elif i == item and temp:
    ...:             yield temp.pop()
    ...:             

In [63]: list(remove_dup(mylist, 0))
Out[63]: [4, 1, 2, 6, 1, 0, 9, 8, 9]
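Since the clarification notes that more than one item may need deduplicating later, the generator can be generalized. This is a sketch (the remove_dups name and the seen-set bookkeeping are my additions) that keeps only the first occurrence of each value in a given set:

```python
def remove_dups(lst, items):
    """Yield lst in order, keeping only the first occurrence of each
    value listed in items; all other values pass through unchanged."""
    items = set(items)
    seen = set()  # values from `items` already emitted once
    for i in lst:
        if i not in items:
            yield i
        elif i not in seen:
            seen.add(i)
            yield i

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]
print(list(remove_dups(mylist, {0, 1})))  # [4, 1, 2, 6, 0, 9, 8, 9]
```

With a single-element set it matches the original remove_dup above.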

Also, if you are dealing with larger lists, you can use the following vectorized and optimized approach using NumPy:

In [79]: import numpy as np

In [80]: arr = np.array([4, 1, 2, 6, 1, 0, 9, 8, 0, 9])

In [81]: mask = arr == 0

In [82]: first_ind = np.where(mask)[0][0]

In [83]: mask[first_ind] = False

In [84]: arr[~mask]
Out[84]: array([4, 1, 2, 6, 1, 0, 9, 8, 9])
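The same masking steps can be packaged as a standalone function. A sketch (the drop_extra name is mine) that also handles the case where the item is absent, which would otherwise make the `np.where(mask)[0][0]` lookup raise an IndexError:

```python
import numpy as np

def drop_extra(arr, item):
    """Return a copy of arr keeping only the first occurrence of item."""
    mask = arr == item             # True wherever item appears
    hits = np.flatnonzero(mask)    # indices of all occurrences
    if hits.size:                  # un-mask the first one, if any
        mask[hits[0]] = False
    return arr[~mask]              # drop the remaining occurrences

arr = np.array([4, 1, 2, 6, 1, 0, 9, 8, 0, 9])
print(drop_extra(arr, 0))  # [4 1 2 6 1 0 9 8 9]
```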
Mazdak answered Nov 12 '22 05:11