How to count duplicate rows in pandas dataframe?

Tags:

python

pandas

I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

I get a df that looks like this:

    one two
0   1   1
1   1   2
2   1   1

I imagine the first step is to find all the different unique rows, which I do by:

df.drop_duplicates() 

This gives me the following df:

    one two
0   1   1
1   1   2

Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each is in the initial df. My result would look something like this:

Row     Count
[1 1]     2
[1 2]     1

How should I go about doing this last step?

Edit:

Here's a larger example to make it more clear:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

gives me:

    one     three   two
0   True    True    True
1   True    False   False
2   True    False   False
3   False   False   True

I want a result that tells me:

Row                    Count
[True True True]       1
[True False False]     2
[False False True]     1
jss367 asked Feb 23 '16 17:02


People also ask

How can I count duplicate rows in Pandas?

You can count the number of duplicate rows by counting the True values in the pandas.Series returned by duplicated(). The number of True values can be counted with the sum() method. If you want to count the number of False values (that is, the number of non-duplicate rows), invert the Series with the negation operator ~ and then count True with sum().
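As a minimal sketch of that counting approach, here it is applied to the small frame from the question. The variable names (dup_mask, n_duplicates, n_non_duplicates) are illustrative only.

```python
import pandas as pd

# Sample frame from the question: rows 0 and 2 are identical.
df = pd.DataFrame({'one': [1., 1, 1], 'two': [1., 2., 1]})

# True for every row that repeats an earlier row (the first occurrence stays False).
dup_mask = df.duplicated()

# Summing a boolean Series counts its True values.
n_duplicates = int(dup_mask.sum())

# Invert the mask with ~ to count the rows that are not flagged as duplicates.
n_non_duplicates = int((~dup_mask).sum())

print(n_duplicates, n_non_duplicates)
```

Note that this counts repeated rows overall; it does not say how many times each distinct row occurs, which is what the question asks for.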

How do you count duplicates in a DataFrame column?

You can use groupby with the size function, then call reset_index and rename the resulting column 0 to count.
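A short sketch of the single-column case, assuming the column of interest is 'two' from the question's frame. It passes name='count' to reset_index instead of renaming column 0 afterwards, which should give the same table in one step.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1], 'two': [1., 2., 1]})

# Group by one column, count the rows in each group, and turn the
# resulting Series into a two-column DataFrame with a 'count' column.
counts = df.groupby('two').size().reset_index(name='count')

print(counts)
```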

How do you check if there are duplicates in Pandas DataFrame?

To check for duplication in the DataFrame as a whole, call the duplicated() method on the DataFrame. It returns True for each row that is identical to a previous row.
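The sketch below shows one way to turn that check into a yes/no answer, and how the keep parameter changes which rows are flagged. The names has_duplicates and all_copies are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1], 'two': [1., 2., 1]})

# any() answers "is there at least one repeated row?"
has_duplicates = bool(df.duplicated().any())

# keep=False flags every member of a duplicated group,
# including the first occurrence.
all_copies = df.duplicated(keep=False)

print(has_duplicates)
print(df[all_copies])
```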


2 Answers

You can groupby on all the columns and call size. The index of the result holds the distinct rows, and the values are the number of times each one occurs:

In [28]: df.groupby(df.columns.tolist(),as_index=False).size()

Out[28]:
one    three  two
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
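For reference, here is a self-contained sketch of the same idea on the larger example from the question. It omits as_index=False: as far as I know, in recent pandas releases (1.1 and later) that flag makes size() return a DataFrame with a size column rather than the Series shown above, so the exact output depends on the installed version. Newer versions also keep the columns in insertion order (one, two, three) instead of sorting them.

```python
import pandas as pd

df = pd.DataFrame({'one': [True, True, True, False],
                   'two': [True, False, False, True],
                   'three': [True, False, False, False]})

# Group on every column; each distinct row becomes one group,
# and size() reports how many rows fell into it.
counts = df.groupby(df.columns.tolist()).size()
print(counts)

# Optional: flatten the MultiIndex into ordinary columns plus a 'count' column.
table = counts.reset_index(name='count')
print(table)
```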
EdChum answered Oct 02 '22 12:10


df.groupby(df.columns.tolist()).size().reset_index().\
    rename(columns={0:'records'})

   one  two  records
0    1    1        2
1    1    2        1
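As a possible alternative (not part of the answer above), newer pandas versions provide DataFrame.value_counts, which counts distinct rows directly. This is a sketch that assumes pandas 1.1 or later; the column name 'records' is reused from the snippet above.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1], 'two': [1., 2., 1]})

# value_counts() on a DataFrame counts each distinct row and
# sorts the result by frequency, most common first.
records = df.value_counts().reset_index(name='records')

print(records)
```

Unlike the groupby version, the rows come back ordered by count rather than by the grouping keys.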
Denis answered Oct 02 '22 11:10