
Grouping by multiple columns to find duplicate rows pandas


I have a DataFrame:

id    val1    val2
1     1.1     2.2
1     1.1     2.2
2     2.1     5.5
3     8.8     6.2
4     1.1     2.2
5     8.8     6.2

I want to group by val1 and val2 and get a similar DataFrame containing only the rows where the same val1/val2 combination occurs more than once.

Final df:

id    val1    val2
1     1.1     2.2
4     1.1     2.2
3     8.8     6.2
5     8.8     6.2
Shubham R asked Oct 09 '17


People also ask

How do you find duplicate rows in pandas based on multiple columns?

To find and select duplicate rows based on all columns, call DataFrame.duplicated() without any subset argument. It will return a Boolean Series with True at the position of each duplicated row except its first occurrence (the default value of the keep argument is 'first').

How do I find duplicate rows in pandas?

The pandas.DataFrame.duplicated() method is used to find duplicate rows in a DataFrame. It returns a Boolean Series which identifies whether each row is duplicated or unique.
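As a quick illustration of the default behavior described above, here is a minimal sketch using made-up data modeled on the question (column names are illustrative):

```python
import pandas as pd

# Hypothetical frame with repeated (val1, val2) pairs.
df = pd.DataFrame({
    "val1": [1.1, 1.1, 2.1, 8.8, 1.1, 8.8],
    "val2": [2.2, 2.2, 5.5, 6.2, 2.2, 6.2],
})

# keep='first' (the default) flags every repeat except the first occurrence.
mask = df.duplicated()
print(mask.tolist())
# [False, True, False, False, True, True]
```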

Can you group by multiple columns in pandas?

Pandas comes with a whole host of SQL-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.
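For instance, grouping on two columns and counting the rows per combination can be sketched like this (data modeled on the question; names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "val1": [1.1, 1.1, 2.1, 8.8, 1.1, 8.8],
    "val2": [2.2, 2.2, 5.5, 6.2, 2.2, 6.2],
})

# Group on both columns at once; size() counts rows per (val1, val2) pair.
counts = df.groupby(["val1", "val2"]).size()
print(counts)
```

A count greater than 1 marks a combination that appears in multiple rows, which is exactly the condition the question asks about.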


1 Answer

You need duplicated with the subset parameter to specify the columns to check, and keep=False to mark all duplicates; then filter with the resulting mask by boolean indexing:

df = df[df.duplicated(subset=['val1','val2'], keep=False)]
print (df)
   id  val1  val2
0   1   1.1   2.2
1   1   1.1   2.2
3   3   8.8   6.2
4   4   1.1   2.2
5   5   8.8   6.2

Detail:

print (df.duplicated(subset=['val1','val2'], keep=False))
0     True
1     True
2    False
3     True
4     True
5     True
dtype: bool
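The same result can also be reached with groupby plus transform, which counts the rows sharing each (val1, val2) pair and aligns those counts back to the original index; a sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 2, 3, 4, 5],
    "val1": [1.1, 1.1, 2.1, 8.8, 1.1, 8.8],
    "val2": [2.2, 2.2, 5.5, 6.2, 2.2, 6.2],
})

# transform('size') broadcasts each group's row count back to every
# member row, so the result has the same length as df.
sizes = df.groupby(["val1", "val2"])["val1"].transform("size")

# Keep only rows whose (val1, val2) combination occurs more than once.
out = df[sizes > 1]
print(out)
```

This is an alternative, not the answer's method; duplicated(..., keep=False) is shorter and avoids materializing the counts.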
jezrael answered Oct 10 '22