pandas: drop duplicates in groupby 'date'

In the dataframe below, I would like to eliminate the duplicate cid values so the output from df.groupby('date').cid.size() matches the output from df.groupby('date').cid.nunique().

I have looked at this post but it does not seem to have a solid solution to the problem.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df')

df.groupby('date').cid.size()

date
2005       7
2006     237
2007    3610
2008    1318
2009    2664
2010     997
2011    6390
2012    2904
2013    7875
2014    3979

df.groupby('date').cid.nunique()

date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
Name: cid, dtype: int64

Things I tried:

  1. df.groupby([df['date']]).drop_duplicates(cols='cid') gives this error: AttributeError: Cannot access callable attribute 'drop_duplicates' of 'DataFrameGroupBy' objects, try using the 'apply' method
  2. df.groupby(('date').drop_duplicates('cid')) gives this error: AttributeError: 'str' object has no attribute 'drop_duplicates'
asked May 08 '16 22:05 by Collective Action



1 Answer

You don't need groupby to drop duplicates based on a few columns. drop_duplicates accepts a subset of columns, so you can pass both date and cid to it directly:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
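
Since the CSV above may not always be reachable, here is a minimal self-contained sketch of the same idea. The toy DataFrame and its amount column are made up for illustration and are not taken from the original data; only the date and cid column names come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the original CSV
df = pd.DataFrame({
    "date":   [2005, 2005, 2005, 2006, 2006, 2006, 2006],
    "cid":    ["a", "a", "b", "c", "c", "c", "d"],
    "amount": [1, 2, 3, 4, 5, 6, 7],
})

# Keep only the first row for each (date, cid) pair
df2 = df.drop_duplicates(subset=["date", "cid"])

# Row count per date after de-duplication
sizes = df2.groupby("date")["cid"].size()

# Distinct cid count per date in the original frame
uniques = df.groupby("date")["cid"].nunique()

print(sizes.tolist())    # [2, 2]
print(uniques.tolist())  # [2, 2]
```

One caveat, assuming cid can contain missing values: nunique() skips NaN by default, while drop_duplicates keeps one NaN row per date, so the two counts can differ by one for such dates unless the NaN rows are dropped first.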
answered Oct 04 '22 15:10 by ayhan