I have updated my question to provide a clearer example.
Is it possible to use the drop_duplicates method in pandas to remove duplicate rows based on a column whose values contain lists? Consider column 'three', where some of the values are a list of two items. Is there a way to drop the duplicate rows without iterating over them (which is my current workaround)?
I have outlined my problem by providing the following example:
import pandas as pd
data = [
    {'one': 50, 'two': '5:00', 'three': 'february'},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 90, 'two': '9:00', 'three': 'january'}
]
df = pd.DataFrame(data)
print(df)
   one                three   two
0   50             february  5:00
1   25  [february, january]  6:00
2   25  [february, january]  6:00
3   25  [february, january]  6:00
4   90              january  9:00
df.drop_duplicates(['three'])
Results in the following error:
TypeError: type object argument after * must be a sequence, not map
I think it's because lists aren't hashable, and that breaks the hashing that the duplicate detection relies on. As a workaround, you could convert the lists to tuples like so:
df['four'] = df['three'].apply(lambda x : tuple(x) if type(x) is list else x)
df.drop_duplicates('four')
   one                three   two                 four
0   50             february  5:00             february
1   25  [february, january]  6:00  (february, january)
4   90              january  9:00              january
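If you would rather not keep a helper column around, the same idea can be applied through a boolean mask. The following is only a sketch under a few assumptions: `to_hashable` is a hypothetical helper name (not part of pandas), the data is the example from the question, and lists are compared in order, so ['february', 'january'] and ['january', 'february'] would count as different values.

```python
import pandas as pd

data = [
    {'one': 50, 'two': '5:00', 'three': 'february'},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 90, 'two': '9:00', 'three': 'january'},
]
df = pd.DataFrame(data)

# Lists cannot be hashed, which is the likely root of the error.
try:
    hash(['february', 'january'])
except TypeError as exc:
    print('hashing a list fails:', exc)


def to_hashable(value):
    # Hypothetical helper: turn lists into tuples, leave other values alone.
    return tuple(value) if isinstance(value, list) else value


# Build a temporary key Series; df itself is not modified.
key = df['three'].map(to_hashable)

# Keep the first row for each distinct key, drop the later repeats.
deduped = df[~key.duplicated()]
print(deduped)
```

This keeps rows 0, 1 and 4, the same rows as the 'four' column approach, but leaves the original columns unchanged.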