I am trying to use filter on a pandas DataFrame to drop every row whose value is duplicated (I need to remove ALL the rows that share a duplicated value, not just the first or last occurrence).
This is what I have, and it works in the editor:
df = df.groupby("student_id").filter(lambda x: x.count() == 1)
But when I run my script with this code in it, I get the error:
TypeError: filter function returned a Series, but expected a scalar bool
I am creating the dataframe by concatenating two other frames immediately before trying to apply the filter.
It should be:
In [32]: grouped = df.groupby("student_id")
In [33]: grouped.filter(lambda x: x["student_id"].count()==1)
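As a minimal sketch of the fix above, the following uses a small made-up frame (only the "student_id" column name comes from the question; the "score" column and all values are assumptions for illustration). It also shows drop_duplicates with keep=False, which is a different technique from groupby/filter but should give the same rows in this simple case.

```python
import pandas as pd

# Hypothetical sample data; only the "student_id" column name comes from the question.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3, 4, 4],
    "score": [90, 85, 70, 60, 75, 80],
})

# The filter callback must return a single bool per group,
# so count one column instead of the whole sub-frame.
unique_only = df.groupby("student_id").filter(
    lambda x: x["student_id"].count() == 1
)

# Alternative technique: keep=False drops every row whose student_id repeats.
same_result = df.drop_duplicates(subset="student_id", keep=False)

print(unique_only)
print(same_result)
```

If the goal is only to discard rows with a repeated key, the drop_duplicates form avoids the per-group lambda entirely.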
Updates:
I'm not sure about the issue you mentioned regarding the interactive console. In this particular case the console (such as IPython) should behave the same as any other environment (the plain Python interpreter or one embedded in an IDE), although there may be other situations, such as the intricacies of "import", in which environments behave differently.
An intuitive way to understand pandas groupby is to treat the object returned by DataFrame.groupby() as a list of DataFrames. So when you use filter to apply the lambda function to x, x is actually one of those DataFrames, which is why x.count() returns a Series (one count per column) rather than a single value:
In[25]: df = pd.DataFrame(data,columns=year)
In[26]: df
Out[26]:
   2013  2014
0     0     1
1     2     3
2     4     5
3     6     7
4     0     1
5     2     3
6     4     5
7     6     7
In[27]: grouped = df.groupby(2013)
In[28]: grouped.count()
Out[28]:
      2014
2013
0        2
2        2
4        2
6        2
In this example, the first DataFrame in the grouped object would be (using .loc, since .ix has been removed from recent pandas releases):
In[33]: df1 = df.loc[[0,4]]
In[34]: df1
Out[34]:
   2013  2014
0     0     1
4     0     1
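The "list of DataFrames" picture can be checked directly. The sketch below is only an illustration: the original session never defines `data` and `year`, so the values here are an assumed reconstruction that reproduces the frame printed above.

```python
import pandas as pd

# Assumed reconstruction: `data` and `year` are not shown in the original session.
year = [2013, 2014]
data = [[0, 1], [2, 3], [4, 5], [6, 7]] * 2
df = pd.DataFrame(data, columns=year)

grouped = df.groupby(2013)

# get_group returns the sub-DataFrame for one key, i.e. one "x" seen by filter.
first = grouped.get_group(0)
print(first)

# count() on a sub-DataFrame yields a Series with one entry per column,
# which is what filter rejects when it expects a scalar bool.
per_column = first.count()
print(type(per_column).__name__)

# Selecting a single column first reduces count() to one number.
single = first[2013].count()
print(single)
```

Comparing `single` with 1 gives the one-bool-per-group result that filter needs.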