I'm trying to clean a list by removing duplicates. For example:
bb = ['Gppe (Aspirin Combined)',
'Gppe Cap (Migraine)',
'Gppe Tab',
'Abilify',
'Abilify Maintena',
'Abstem',
'Abstral']
Ideally, I need to get the following list:
bb = ['Gppe',
'Abilify',
'Abstem',
'Abstral']
What I tried:
Split the list and remove duplicates (a naive approach)
list(set(sorted([j for bb_i in bb for j in bb_i.split(' ')])))
which leaves a lot of 'rubbish':
['(Aspirin',
'(Migraine)',
'Abilify',
'Abstem',
'Abstral',
'Cap',
'Combined)',
'Gppe',
'Maintena',
'Tab']
Take the most common entry with collections.Counter:
Counter(['Gppe (Aspirin Combined)', 'Gppe Cap (Migraine)', 'Gppe Tab']).most_common(1)[0][0]
But I'm not sure how to find similar words (a group).
I am wondering whether one can use a kind of 'groupby()': first group by name, then remove duplicates within each group.
Assuming you want the unique first word of each string, you could do:
bb = ['Gppe (Aspirin Combined)',
'Gppe Cap (Migraine)',
'Gppe Tab',
'Abilify',
'Abilify Maintena',
'Abstem',
'Abstral']
result = set(map(lambda x: x.split()[0], bb))
print(result)
Output
{'Gppe', 'Abstral', 'Abilify', 'Abstem'}
If you want a list of unique elements in the order of appearance, you could do:
bb = ['Gppe (Aspirin Combined)',
'Gppe Cap (Migraine)',
'Gppe Tab',
'Abilify',
'Abilify Maintena',
'Abstem',
'Abstral']
seen = set()
result = []
for e in bb:
    key = e.split()[0]
    if key not in seen:
        result.append(key)
        seen.add(key)
print(result)
Output
['Gppe', 'Abilify', 'Abstem', 'Abstral']
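The loop above can also be written more compactly. This is only a sketch and assumes Python 3.7+, where dict keys are guaranteed to keep insertion order; it also assumes every string contains at least one word, since `split()[0]` raises `IndexError` on an empty string.

```python
bb = ['Gppe (Aspirin Combined)',
      'Gppe Cap (Migraine)',
      'Gppe Tab',
      'Abilify',
      'Abilify Maintena',
      'Abstem',
      'Abstral']

# dict.fromkeys keeps only the first occurrence of each key, and dicts
# preserve insertion order (Python 3.7+), so the order of appearance is kept.
result = list(dict.fromkeys(name.split()[0] for name in bb))
print(result)
```

This should print the same list as the explicit loop, without a separate `seen` set.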
As an alternative to the first solution, you could use a set comprehension or pass a generator expression to set():
{x.split()[0] for x in bb}
set(x.split()[0] for x in bb)
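The groupby() idea from the question is also workable. Below is a rough sketch using `itertools.groupby`, with the first word as the grouping key; the helper name `first_word` is made up for illustration. Note that `groupby` only merges adjacent items, so the list is sorted by the same key first, which means the result is in sorted order rather than order of appearance.

```python
from itertools import groupby

bb = ['Gppe (Aspirin Combined)',
      'Gppe Cap (Migraine)',
      'Gppe Tab',
      'Abilify',
      'Abilify Maintena',
      'Abstem',
      'Abstral']


def first_word(name):
    # Hypothetical helper: the grouping key is the first whitespace-separated word.
    return name.split()[0]


# groupby only groups consecutive items with equal keys, so sort by the key first.
# sorted() is stable, so items within a group keep their original relative order.
groups = {key: list(items)
          for key, items in groupby(sorted(bb, key=first_word), key=first_word)}

print(list(groups))      # the unique names
print(groups['Gppe'])    # every original entry that belongs to that name
```

Compared with the set-based versions, this keeps the full entries of each group around, which may help if you later need to pick one representative per name.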