How to remove case-insensitive duplicates from a list, while maintaining the original list order?

Tags: python, list

I have a list of strings such as:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

I want this outcome (and this is the only acceptable outcome):

myList = ["paper", "Plastic", "aluminum", "tin", "glass", "Polypropylene Plastic"]

Note that if an item ("Polypropylene Plastic") happens to contain another item ("Plastic"), I would still like to retain both items. In other words, the case may differ, but an item must otherwise be a letter-for-letter match for it to be removed.

The original list order must be retained. All duplicates after the first instance of that item should be removed. The original case of that first instance should be preserved, as well as the original cases of all non-duplicate items.

I've searched and only found questions that address one need or the other, not both.

807

asked Jan 16 '18 14:01

Crickets

2 Answers

A list comprehension is a poor fit here (at least without sacrificing clarity), because filtering out duplicates requires remembering what has already been seen.

A set comprehension doesn't work either, because a set does not preserve the original order.

The classic approach is a loop with an auxiliary set that stores the lowercase version of each string encountered. A string is appended to the result list only if its lowercased version isn't already in the set:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
result = []
marker = set()   # lowercased strings seen so far

for item in myList:
    lowered = item.lower()
    if lowered not in marker:   # test presence
        marker.add(lowered)
        result.append(item)     # preserve order and original case

print(result)

result:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
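
For reuse, the same loop can be wrapped in a function. This is only a minimal sketch; the name unique_ignore_case is hypothetical and not part of the original answer:

```python
def unique_ignore_case(items):
    """Drop case-insensitive duplicates, keeping the first occurrence and the order."""
    seen = set()        # lowercased strings already encountered
    result = []
    for item in items:
        key = item.lower()
        if key not in seen:
            seen.add(key)
            result.append(item)   # the original casing is kept
    return result


my_list = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass",
           "tin", "PAPER", "Polypropylene Plastic"]
print(unique_ignore_case(my_list))
```

The function leaves the input list untouched and returns a new one, so the original is still available for comparison.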

Using .casefold() instead of .lower() also handles subtle "casing" differences in some languages (like the German double "s" in Strasse/Straße).
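
To illustrate the difference, here is a small sketch comparing the keys each method produces; the sample words are chosen for illustration only:

```python
words = ["Straße", "STRASSE", "strasse"]

# lower() leaves 'ß' untouched, so "Straße" gets a different key
lower_keys = {w.lower() for w in words}

# casefold() maps 'ß' to "ss", so all three words share one key
fold_keys = {w.casefold() for w in words}

print(lower_keys)
print(fold_keys)
```

With .lower() as the key, "Straße" would survive as a separate item next to "STRASSE"; with .casefold() the three spellings count as duplicates.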

Edit: it is possible to do that with a list comprehension, but it's really hacky:

marker = set()
result = [not marker.add(x.casefold()) and x for x in myList if x.casefold() not in marker]

It applies not to the None returned by set.add, so the call happens as a side effect and the left operand is always True; the and then evaluates to x in every case (side effects in a list comprehension are rarely a good thing). The main disadvantages are:

- readability
- casefold() is called twice, once for testing and once for storing in the marker set
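
One possible way around both drawbacks, offered as a sketch rather than as part of the original answer, is to use a dict keyed by the casefolded string. It relies on dicts preserving insertion order (guaranteed from Python 3.7), and on dict.setdefault keeping the first value stored for a key. The name first_seen is just illustrative:

```python
my_list = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass",
           "tin", "PAPER", "Polypropylene Plastic"]

first_seen = {}
for item in my_list:
    # setdefault only stores the item if its key is not present yet,
    # so later duplicates never overwrite the first occurrence
    first_seen.setdefault(item.casefold(), item)

result = list(first_seen.values())
print(result)
```

Here casefold() is called once per item, and the loop has no hidden side effects inside an expression.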

170

answered Oct 18 '22 19:10

Jean-François Fabre

import pandas as pd

# myList is the list from the question
df = pd.DataFrame(myList)
df['lower'] = df[0].str.lower()
# sort=False keeps the groups in order of first appearance
df.groupby('lower', sort=False)[0].first().tolist()

output:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

answered Oct 18 '22 19:10

Binyamin Even
