 

Pandas Drop Duplicates Series Hashing Error

Tags: python, pandas

I have created a pandas dataframe but when dropping duplicate rows I am given the error:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

This happens when I run:

print(type(data)) # <class 'pandas.core.frame.DataFrame'> check that it's not a series
data.drop_duplicates(subset=['statement'], inplace=True)
print(data.info())

Info returns this:

> <class 'pandas.core.frame.DataFrame'>
> Int64Index: 39671 entries, 0 to 39670
> Data columns (total 4 columns):
> statement          39671 non-null object
> topic_direction    39671 non-null object
> topic              39671 non-null object
> direction          39671 non-null object
> dtypes: object(4)
> memory usage: 1.5+ MB
> None
Jacob B asked Aug 27 '18 20:08


People also ask

How do you drop duplicates in pandas with conditions?

The keep parameter of the drop_duplicates() function accepts {'first', 'last', False} and defaults to 'first'. With 'first', all duplicate rows except the first occurrence are deleted; with 'last', all except the last occurrence are deleted; with False, every duplicated row is deleted.
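A tiny illustrative sketch of the three keep values (the frame and column names are invented for the example):

import pandas as pd

df = pd.DataFrame({'statement': ['a', 'a', 'b'], 'n': [1, 2, 3]})

print(df.drop_duplicates(subset=['statement']))               # default keep='first': rows 0 and 2
print(df.drop_duplicates(subset=['statement'], keep='last'))  # rows 1 and 2
print(df.drop_duplicates(subset=['statement'], keep=False))   # row 2 only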

How do I reset index after dropping duplicates?

When rows are dropped from a DataFrame, the original row index is kept as-is by default. To renumber the index of the resulting DataFrame, use the ignore_index parameter of DataFrame.drop_duplicates().
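A minimal sketch, assuming pandas 1.0 or newer (the version where drop_duplicates() gained ignore_index):

import pandas as pd

df = pd.DataFrame({'statement': ['a', 'a', 'b']})

deduped = df.drop_duplicates(subset=['statement'], ignore_index=True)
print(deduped.index.tolist())  # [0, 1] -- renumbered, instead of the original [0, 2]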

What does drop duplicates do in pandas?

The drop_duplicates() method removes duplicate rows. Use the subset parameter if only some specified columns should be considered when looking for duplicates.
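A short sketch of subset in action, again with an invented frame:

import pandas as pd

df = pd.DataFrame({'statement': ['a', 'a'], 'topic': ['x', 'y']})

print(df.drop_duplicates())                      # keeps both rows: the full rows differ in 'topic'
print(df.drop_duplicates(subset=['statement']))  # keeps only the first row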


1 Answer

The individual elements in your 'statement' column are themselves pandas.Series objects. That is a clear sign that things have gone astray. You can verify this by running data['statement'].apply(type); you should see <class 'pandas.core.series.Series'> (or something similar) for every row.
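For illustration, here is one hypothetical way such a frame can arise and how the check exposes it (the object-array construction below is an assumption for the sketch, not taken from the question):

import numpy as np
import pandas as pd

# Build a column whose cells each hold a whole Series instead of a scalar
cells = np.empty(2, dtype=object)
cells[0] = pd.Series([1, 2])
cells[1] = pd.Series([1, 2])
data = pd.DataFrame({'statement': cells})

print(data['statement'].apply(type))          # <class 'pandas.core.series.Series'> on every row
# data.drop_duplicates(subset=['statement'])  # would raise: 'Series' objects are mutable, ...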

If you're stuck with the situation, try

# Convert each Series cell to a hashable tuple, then keep rows whose tuple is not a repeat
data[~data['statement'].apply(tuple).duplicated()]

This forces each element of the 'statement' column to be a tuple, which is hashable. The duplicated rows can then be identified and filtered out.
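Continuing the hypothetical setup from the sketch above, the workaround end to end:

# tuple(pd.Series([1, 2])) == (1, 2), which is hashable, so duplicated() can hash it
deduped = data[~data['statement'].apply(tuple).duplicated()]
print(len(deduped))  # 1 -- the second row, holding an identical Series, is dropped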

piRSquared answered Oct 10 '22 00:10