I have created a dataframe:
import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})
That is as follows:
  key  year
0   b  2000
1   b  2001
2   a  1998
3   c  1999
4   a  1998
5   a  1998
6   b  2000
I want to get the number of occurrences of each line in the fastest possible way:
key  year  frequency
b    2000  2
b    2001  1
a    1998  3
c    1999  1
You can set keep=False in the drop_duplicates() function to remove every row that has a duplicate, so that only rows occurring exactly once remain. For example, df.drop_duplicates(keep=False).
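As a rough illustration of what keep=False does, here is a small sketch that reuses the df1 from the question. The variable names unique_only and first_of_each are made up for this example. Note that this removes rows rather than counting them, so on its own it does not produce the frequency column.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

# keep=False drops every row that has a duplicate anywhere in the frame,
# so only the rows that occur exactly once survive.
unique_only = df1.drop_duplicates(keep=False)

# The default (keep='first') keeps one representative per group of duplicates.
first_of_each = df1.drop_duplicates()

print(unique_only)
print(first_of_each)
```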
You can count the number of duplicate rows by counting the True values in the pandas.Series returned by duplicated(). The number of True values can be counted with the sum() method. If you want to count the number of False values (= the number of non-duplicate rows), you can invert the Series with the negation operator ~ and then count True with sum().
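A minimal sketch of that counting approach on the question's df1 might look like this. The names dup_mask, n_duplicates and n_non_duplicates are only illustrative. With the default keep='first', the first occurrence of each row is not flagged as a duplicate.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

# Boolean Series: True for each row that repeats an earlier row.
dup_mask = df1.duplicated()

# True counts as 1 when summed, so sum() gives the number of flagged rows.
n_duplicates = int(dup_mask.sum())

# ~ flips True/False, so this counts the rows that were not flagged.
n_non_duplicates = int((~dup_mask).sum())

print(n_duplicates, n_non_duplicates)
```

This gives overall totals only. It does not say how often each individual row occurs, which is what the groupby approach further down provides.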
In pandas you can get the frequency of each value in a DataFrame column by using the Series.value_counts() method. Alternatively, if you have a SQL background, you can also get it using the groupby() and count() methods.
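To make the two options concrete, here is a short sketch on the question's df1. The names key_counts, per_key and pair_counts are invented for the example, and the last call assumes pandas 1.1 or newer, where DataFrame.value_counts() is available.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

# Frequency of each value in a single column.
key_counts = df1['key'].value_counts()

# SQL-style equivalent: GROUP BY key, then COUNT(year).
per_key = df1.groupby('key')['year'].count()

# Assumption: pandas >= 1.1. Counts each distinct (key, year) combination.
pair_counts = df1.value_counts(['key', 'year'])

print(key_counts)
print(per_key)
print(pair_counts)
```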
By doing
df1.groupby(['key','year']).size().reset_index()
you get...
  key  year  0
0   a  1998  3
1   b  2000  2
2   b  2001  1
3   c  1999  1
As you can see, the count column has not been named, so you can do something like
mydf = df1.groupby(['key','year']).size().reset_index()
mydf.rename(columns = {0: 'frequency'}, inplace = True)
mydf
  key  year  frequency
0   a  1998          3
1   b  2000          2
2   b  2001          1
3   c  1999          1
(You can omit the .reset_index() if you want, but in that case you'll need to transform mydf into a dataframe, like so: mydf = pd.DataFrame(mydf), and only then rename the column.)
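As a possible shortcut, the rename step can be folded into reset_index() by passing name=. The sketch below also passes sort=False to groupby(), which keeps the groups in order of first appearance and so matches the row order asked for in the question. The variable name freq is just for illustration, and no claim is made here about which variant is fastest.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'year': [2000, 2001, 1998, 1999, 1998, 1998, 2000]})

# size() counts the rows in each (key, year) group;
# reset_index(name=...) turns the result into a DataFrame and names the count column.
freq = (df1.groupby(['key', 'year'], sort=False)
           .size()
           .reset_index(name='frequency'))

print(freq)
```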