I am trying to use the drop_duplicates method on my dataframe, but I am getting an error. See the following:
error: TypeError: unhashable type: 'list'
The code I am using:
df = db.drop_duplicates()
My DB is huge and contains strings, floats, dates, NaN's, booleans, integers... Any help is appreciated.
Pandas DataFrame drop_duplicates() Method
The drop_duplicates() method removes duplicate rows. Use the subset parameter if only some columns should be considered when looking for duplicates.
drop_duplicates won't work with lists in your dataframe, as the error message implies: duplicate detection hashes the cell values, and lists are unhashable. However, you can drop duplicates on the dataframe cast to str and then select the matching rows from the original df using the index of the result.
Setup
import pandas as pd
df = pd.DataFrame({'Keyword': {0: 'apply', 1: 'apply', 2: 'apply', 3: 'terms', 4: 'terms'},
                   'X': {0: [1, 2], 1: [1, 2], 2: 'xy', 3: 'xx', 4: 'yy'},
                   'Y': {0: 'yy', 1: 'yy', 2: 'yx', 3: 'ix', 4: 'xi'}})
# Dropping duplicates directly raises the same error
df.drop_duplicates()
Traceback (most recent call last):
...
TypeError: unhashable type: 'list'
Solution
# Convert the df to str type, drop duplicates, and then select the surviving rows from the original df.
df.loc[df.astype(str).drop_duplicates().index]
Out[205]:
Keyword X Y
0 apply [1, 2] yy
2 apply xy yx
3 terms xx ix
4 terms yy xi
# The list elements are still lists in the final result.
df.loc[df.astype(str).drop_duplicates().index].loc[0,'X']
Out[207]: [1, 2]
Edit: replaced iloc with loc. In this particular case both work because the index labels match the positional index, but that does not hold in general.
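One caveat with the str cast is that values which merely print the same (for example the integer 1 and the string '1') compare as equal. A possible alternative, not part of the answer above, is to turn the lists into tuples only for the duplicate check and then filter the original df with the resulting boolean mask. This is a minimal sketch: to_hashable is a hypothetical helper, and it only handles flat lists (a list nested inside a list would still be unhashable).

```python
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['apply', 'apply', 'apply', 'terms', 'terms'],
    'X': [[1, 2], [1, 2], 'xy', 'xx', 'yy'],
    'Y': ['yy', 'yy', 'yx', 'ix', 'xi'],
})

def to_hashable(value):
    # Hypothetical helper: lists become tuples, everything else is left as-is.
    return tuple(value) if isinstance(value, list) else value

# Hashable copy used only for duplicate detection; df itself is not modified.
hashable = df.apply(lambda col: col.map(to_hashable))

# duplicated() returns a boolean mask aligned with df's index,
# so neither loc nor iloc is needed to select the rows.
deduped = df[~hashable.duplicated()]
print(deduped)
```

Because the mask is aligned by index label, this also works when the index is not 0..n-1, and the X values in deduped are still the original lists.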
@Allen's answer is great, but it has a small problem.
df.iloc[df.astype(str).drop_duplicates().index]
It should be loc, not iloc. Look at the example.
a = pd.DataFrame([['a',18],['b',11],['a',18]],index=[4,6,8])
Out[52]:
0 1
4 a 18
6 b 11
8 a 18
a.iloc[a.astype(str).drop_duplicates().index]
Out[53]:
...
IndexError: positional indexers are out-of-bounds
a.loc[a.astype(str).drop_duplicates().index]
Out[54]:
0 1
4 a 18
6 b 11
I also just want to mention (in case someone else is as stupid as I was), that you will get the same error if you mistakenly give a list of lists as the 'subset' argument for the drop_duplicates function.
Turns out I spent hours looking for a list that wasn't in my dataframe, all because I put one too many brackets in my parameters.
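To make the bracket mistake concrete, here is a small sketch with a made-up dataframe (the column names a, b, c are purely illustrative). The exact exception may depend on the pandas version, so the except clause is deliberately broad; on recent versions I would expect the same TypeError as in the question.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y'], 'c': [10, 20, 30]})

# Correct: subset is a flat list of column labels.
ok = df.drop_duplicates(subset=['a', 'b'])
print(ok)

# Mistake: one pair of brackets too many turns subset into a list of lists.
try:
    df.drop_duplicates(subset=[['a', 'b']])
except (TypeError, KeyError, ValueError) as exc:
    # Typically "TypeError unhashable type: 'list'", even though
    # the dataframe itself contains no lists at all.
    print(type(exc).__name__, exc)
```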
Overview: you can see which rows are duplicated
Method 1:
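df2 = df.copy()
# Rows 0 and 1 hold lists in column X (position 1); join each into a string so it becomes hashable.
mylist = df2.iloc[0, 1]
df2.iloc[0, 1] = ' '.join(map(str, mylist))
mylist = df2.iloc[1, 1]
df2.iloc[1, 1] = ' '.join(map(str, mylist))
# keep=False marks every row of a duplicate group, not only the later occurrences.
duplicates = df2.duplicated(keep=False)
print(df2[duplicates])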
Method 2:
print(df.astype(str).duplicated(keep=False))
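On a large dataframe it may help to find out which columns actually contain lists before converting anything. The following is a rough sketch under the assumption that the offending values are plain Python lists; has_list is a hypothetical helper name. It then applies the str cast from Method 2 to those columns only, leaving the other columns with their original types for the comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'Keyword': ['apply', 'apply', 'apply', 'terms', 'terms'],
    'X': [[1, 2], [1, 2], 'xy', 'xx', 'yy'],
    'Y': ['yy', 'yy', 'yx', 'ix', 'xi'],
})

def has_list(column):
    # Hypothetical helper: True if any cell in the column is a Python list.
    return column.map(lambda value: isinstance(value, list)).any()

list_cols = [name for name in df.columns if has_list(df[name])]
print(list_cols)

# Cast only the list-bearing columns to str in a working copy.
key = df.copy()
key[list_cols] = key[list_cols].astype(str)

# Same idea as Method 2: flag every row that belongs to a duplicate group.
dup = key.duplicated(keep=False)
print(df[dup])
```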