
How to remove case-insensitive duplicates from a list, while maintaining the original list order?

Tags: python, list

I have a list of strings such as:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

I want this outcome (and this is the only acceptable outcome):

myList = ["paper", "Plastic", "aluminum", "tin", "glass", "Polypropylene Plastic"]

Note that if an item ("Polypropylene Plastic") happens to contain another item ("Plastic"), I would still like to retain both items. So, the cases can be different, but the item must be a letter-for-letter match, for it to be removed.

The original list order must be retained. All duplicates after the first instance of that item should be removed. The original case of that first instance should be preserved, as well as the original cases of all non-duplicate items.

I've searched and only found questions that address one need or the other, not both.

asked Jan 16 '18 14:01 by Crickets



2 Answers

This is hard to write as a list comprehension without sacrificing clarity, because filtering out duplicates requires remembering which items have already been seen.

A set comprehension won't work either, because a set does not preserve the original order.

The classic approach is a loop with an auxiliary set that stores the lowercase version of each string encountered. Append a string to the result list only if its lowercased version isn't already in the set:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
result = []

marker = set()

for item in myList:
    lowered = item.lower()
    if lowered not in marker:   # test presence
        marker.add(lowered)
        result.append(item)     # preserve order

print(result)

result:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

Using .casefold() instead of .lower() also handles subtle casing differences that .lower() misses in some languages (for example the German ß, so that "Strasse" and "Straße" compare as equal).
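To make the difference concrete, here is a minimal sketch that wraps the loop above in a function and compares .lower() with .casefold() on the German example. The function name dedupe_ci is made up for this illustration; it is not part of any library.

```python
def dedupe_ci(items):
    """Drop case-insensitive duplicates, keeping the first
    occurrence of each item in its original case."""
    seen = set()
    result = []
    for item in items:
        key = item.casefold()   # more aggressive than lower()
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result


words = ["Strasse", "Straße", "STRASSE"]
print([w.lower() for w in words])     # ['strasse', 'straße', 'strasse']
print([w.casefold() for w in words])  # ['strasse', 'strasse', 'strasse']
print(dedupe_ci(words))               # ['Strasse']
```

With .lower() as the key, "Straße" would have been kept as a separate item.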

Edit: it is possible to do that with a list comprehension, but it's really hacky:

marker = set()
result = [not marker.add(x.casefold()) and x for x in myList if x.casefold() not in marker]

It applies not to the None returned by set.add, so that not marker.add(...) is always True and the and expression evaluates to x. The add call is made purely for its side effect (rarely a good thing in a list comprehension). The main disadvantages are:

  • readability
  • the fact that casefold() is called twice, once for testing, once for storing in the marker set
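If a compact form is wanted without those two drawbacks, one alternative (a different technique from the comprehension above, and it assumes Python 3.7+, where dicts preserve insertion order) is to key a dict on the case-folded string and let setdefault keep only the first spelling seen. The name first_seen is just illustrative.

```python
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass",
          "tin", "PAPER", "Polypropylene Plastic"]

first_seen = {}
for item in myList:
    # setdefault stores item only when the case-folded key is new,
    # so later duplicates never overwrite the first spelling
    first_seen.setdefault(item.casefold(), item)

result = list(first_seen.values())
print(result)
# ['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
```

Here casefold() is called once per item, and the order of first appearance comes from the dict itself.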
answered Oct 18 '22 19:10 by Jean-François Fabre


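import pandas as pd

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
df = pd.DataFrame(myList)
df['lower'] = df[0].str.lower()   # case-insensitive grouping key
# sort=False keeps the groups in order of first appearance
print(df.groupby('lower', sort=False)[0].first().tolist())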

output:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
answered Oct 18 '22 19:10 by Binyamin Even