I have a Pandas dataframe df for which I want to find all rows where the value of column A is the same but the value of column B is different, e.g.:
|   | A | B |
|---|---|---|
| 0 | 2 | x |
| 1 | 2 | y |
I know I can use pd.concat(g for _, g in df.groupby('A') if len(g) > 1) to get the rows with duplicate values of A, but how do I add the second constraint?
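For concreteness, here is a runnable version of that first step; the small frame below just mirrors the table above:

import pandas as pd

# Example frame from the table above: A repeats, B differs.
df = pd.DataFrame({'A': [2, 2], 'B': ['x', 'y']})

# Keeps every group of rows that share a value of A, regardless of B.
print(pd.concat(g for _, g in df.groupby('A') if len(g) > 1))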
You can use df[df.duplicated()] without any arguments to select rows that repeat an earlier row across all columns; it uses the defaults subset=None and keep='first'. The pandas.DataFrame.duplicated() method returns a boolean Series in which True marks a row identical to a previous one. To find duplicates on a specific column, call duplicated() on that column instead. Relatedly, the equals() method tests whether two Series or DataFrames contain the same elements: the same shape and values, with NaNs in the same location considered equal.
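A minimal sketch of those methods side by side (the example frame here is an assumption for illustration):

import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 3], 'B': ['x', 'x', 'y', 'z']})

# Rows that repeat an earlier row across all columns
# (defaults subset=None, keep='first'): row 1 duplicates row 0.
print(df[df.duplicated()])

# Duplicates on a single column: True marks entries identical
# to an earlier entry in that column.
print(df['A'].duplicated())   # False, True, True, False

# equals() compares shape and elements; NaNs in the same
# location are considered equal.
print(df['B'].equals(df['B'].copy()))   # True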
Thinking about this, it makes sense to call unique on the groupby:
In [213]:
import pandas as pd
df = pd.DataFrame({'A': 2, 'B': list('xxyzz')})
df

Out[213]:
   A  B
0  2  x
1  2  x
2  2  y
3  2  z
4  2  z

In [229]:
df.groupby('A')['B'].apply(lambda x: x.unique()).reset_index()

Out[229]:
   A          B
0  2  [x, y, z]
You can then filter the groupby, keeping all rows of each group whose B values are not all identical:

df.groupby('A').filter(lambda x: len(x['B'].unique()) > 1)
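Applied to the two-row frame from the question, this keeps both rows, since their shared A group contains two distinct B values (len(x['B'].unique()) can equivalently be written x['B'].nunique()):

import pandas as pd

df = pd.DataFrame({'A': [2, 2], 'B': ['x', 'y']})

# Keep all rows of every A-group in which B takes more than one value.
print(df.groupby('A').filter(lambda x: len(x['B'].unique()) > 1))
#    A  B
# 0  2  x
# 1  2  y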