
Pandas: How can I remove duplicate rows from DataFrame and calculate their frequency?

I have created a DataFrame:

import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

It looks like this:

    key    year
0    b    2000  
1    b    2001  
2    a    1998  
3    c    1999  
4    a    1998  
5    a    1998  
6    b    2000  

I want to get the number of occurrences of each row in the fastest possible way:

key  year    frequency  
b    2000    2  
b    2001    1  
a    1998    3  
c    1999    1        
Laura asked Feb 04 '14 17:02



People also ask

How do I find and remove duplicate rows in pandas?

By default, drop_duplicates() keeps the first copy of each duplicated row. Passing keep=False instead removes every row that has a duplicate, so only rows that occur exactly once remain. For example: df.drop_duplicates(keep=False).
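
As a minimal sketch of the difference, using the DataFrame from the question (the names deduplicated and unique_only are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

# Default keep='first': one copy of each distinct row survives.
deduplicated = df1.drop_duplicates()

# keep=False: every row that has a duplicate is dropped,
# so only rows that occur exactly once remain.
unique_only = df1.drop_duplicates(keep=False)

print(deduplicated)
print(unique_only)
```

With this data, deduplicated keeps the four distinct (key, year) pairs, while unique_only keeps only (b, 2001) and (c, 1999).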

How do you count repeated rows in pandas?

You can count duplicate rows by summing the boolean Series returned by duplicated(): each True counts as 1, so sum() gives the number of rows flagged as duplicates. By default the first occurrence of a row is not flagged, only its later repeats. To count the rows that are not flagged instead, invert the Series with ~ and then call sum().
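
A short sketch of that counting, again on the question's data (variable names are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

# True for every row that repeats an earlier row (default keep='first').
dup_mask = df1.duplicated()

n_duplicates = int(dup_mask.sum())         # rows flagged as repeats
n_non_duplicates = int((~dup_mask).sum())  # rows not flagged

print(n_duplicates, n_non_duplicates)
```

Here rows 4, 5 and 6 repeat earlier rows, so the two counts are 3 and 4.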

How do you get the frequency count in pandas?

To get the frequency of each value in a single DataFrame column, use Series.value_counts(). Alternatively, if you come from a SQL background, you can use groupby() together with count().
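
Both approaches can be sketched on the key column of the question's DataFrame. Note that this counts values of one column only, not whole rows, and that count() skips missing values (there are none here):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

# Frequency of each distinct value in the 'key' column.
key_counts = df1['key'].value_counts()

# SQL-style equivalent: group by 'key' and count the non-null
# 'year' entries in each group.
key_counts_groupby = df1.groupby('key')['year'].count()

print(key_counts)
print(key_counts_groupby)
```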


1 Answer

By doing

df1.groupby(['key','year']).size().reset_index()

you get...

  key  year  0
0   a  1998  3
1   b  2000  2
2   b  2001  1
3   c  1999  1

As you can see, the count column has not been given a name (it is labelled 0), so you can do something like

mydf = df1.groupby(['key','year']).size().reset_index()
mydf.rename(columns = {0: 'frequency'}, inplace = True)

mydf

  key  year  frequency
0   a  1998          3
1   b  2000          2
2   b  2001          1
3   c  1999          1

(You can omit the .reset_index() if you want, but then the result is a Series indexed by key and year rather than a DataFrame. In that case you'll need to convert it first, like so: mydf = pd.DataFrame(mydf), and only then rename the column.)
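
As a possible variation on the above (not part of the original answer), the count column can be named in the same call by passing name= to reset_index(), and sort=False keeps the groups in order of first appearance, which matches the row order of the desired output in the question. A minimal sketch, with freq as an illustrative name:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

# size() counts the rows in each (key, year) group;
# sort=False keeps groups in order of first appearance;
# reset_index(name=...) names the count column directly.
freq = (df1.groupby(['key', 'year'], sort=False)
           .size()
           .reset_index(name='frequency'))

print(freq)
```

This avoids the separate rename() step; whether it is faster than the original approach is not something this sketch measures.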

mkln answered Nov 03 '22 03:11
