Pandas search for duplicate rows in one column which have different values in another column

Tags: python, pandas

I have a Pandas DataFrame df in which I want to find all rows where the value of column A is the same but the value of column B differs, e.g.:

       | A | B
    ---|---|---
     0 | 2 | x 
     1 | 2 | y 

I know I can use pd.concat(g for _, g in df.groupby('A') if len(g) > 1) to get the rows with duplicate values of A, but how do I add the second constraint?

marianne asked Jan 19 '17 15:01


2 Answers

One approach is to call unique on the grouped column, which lists the distinct values of B for each value of A:

In [213]:
import pandas as pd
df = pd.DataFrame({'A': 2, 'B': list('xxyzz')})
df

Out[213]:
   A  B
0  2  x
1  2  x
2  2  y
3  2  z
4  2  z

In [229]:
df.groupby('A')['B'].apply(lambda x: x.unique()).reset_index()

Out[229]:
   A          B
0  2  [x, y, z]

EdChum answered Oct 14 '22 03:10
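
The unique-per-group result above lists the distinct values of B for each A, but it does not yet select the rows the question asks for. A minimal sketch of one way to get from there to the rows themselves follows. The sample data and the names distinct_b, conflicting_keys and rows are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: only A == 2 has more than one distinct B.
df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': list('xxyzz')})

# Distinct values of B for each value of A (a Series of arrays indexed by A).
distinct_b = df.groupby('A')['B'].unique()

# Keep only the A values whose group holds more than one distinct B.
conflicting_keys = distinct_b[distinct_b.map(len) > 1].index

# Select the original rows that belong to those A values.
rows = df[df['A'].isin(conflicting_keys)]
print(rows)
```

With this sample data, only the two rows where A is 2 should be selected.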


df.groupby('A').filter(lambda x: len(x['B'].unique()) > 1)

Hezi Zisman answered Oct 14 '22 04:10
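
As a possible alternative to the filter call above, the same selection can probably be written as a boolean mask built with transform('nunique'), which avoids calling a Python lambda once per group. This is a sketch under assumptions: the sample data and the names mask and result are illustrative. Note also that nunique ignores NaN by default while unique() keeps it, so the two approaches can differ when B contains missing values.

```python
import pandas as pd

# Hypothetical sample data: only A == 2 has more than one distinct B.
df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': list('xxyzz')})

# For every row, the number of distinct B values within that row's A group.
mask = df.groupby('A')['B'].transform('nunique') > 1

# Rows whose A value is paired with more than one distinct B.
result = df[mask]
print(result)
```

The mask has the same index as df, so it can also be combined with other row-level conditions using & or |.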