This is rather simple but I can't get my head around it. Let's say for the following data frame, I want to keep only the rows with duplicated values in column y:
>>> df
x y
0 1 1
1 2 2
2 3 2
3 4 3
4 5 3
5 6 3
6 7 5
7 8 2
The desired output looks like:
>>> df
x y
1 2 2
2 3 2
3 4 3
4 5 3
5 6 3
7 8 2
I tried this:
df[~df.duplicated('y')]
but I get this:
x y
0 1 1
1 2 2
3 4 3
6 7 5
As an aside, the related method drop_duplicates() does the opposite of what is asked here: by default it returns a new DataFrame with the duplicate rows removed, keeping the first occurrence of each value (set inplace=True to modify the original DataFrame instead). Passing keep=False to drop_duplicates() removes every row whose value is duplicated, e.g. df.drop_duplicates(keep=False). You can also restrict the duplicate check to specific columns: df = df.drop_duplicates(subset=['col1', 'col2', ...]).
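For contrast, here is a small sketch of how drop_duplicates() behaves on a toy frame (not the question's data); the variable name toy is just illustrative:

```python
import pandas as pd

# Toy frame: the value 2 appears twice, 1 and 3 appear once
toy = pd.DataFrame({"y": [1, 2, 2, 3]})

# Default keep='first': one representative of each value survives
print(toy.drop_duplicates('y')['y'].tolist())             # [1, 2, 3]

# keep=False: duplicated values are dropped entirely
print(toy.drop_duplicates('y', keep=False)['y'].tolist())  # [1, 3]
```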
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
keep : {‘first’, ‘last’, False}, default ‘first’
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
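The three keep options above can be seen side by side on a small Series:

```python
import pandas as pd

s = pd.Series([2, 2, 3, 3, 3, 5])

# keep='first' (default): only later occurrences are flagged
print(s.duplicated().tolist())             # [False, True, False, True, True, False]

# keep='last': only earlier occurrences are flagged
print(s.duplicated(keep='last').tolist())  # [True, False, True, True, False, False]

# keep=False: every member of a duplicated group is flagged
print(s.duplicated(keep=False).tolist())   # [True, True, True, True, True, False]
```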
Meaning you are looking for:
df[df.duplicated('y', keep=False)]
Output:
x y
1 2 2
2 3 2
3 4 3
4 5 3
5 6 3
7 8 2
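Putting it all together, a runnable reproduction of the question's frame and the answer:

```python
import pandas as pd

# Reproduce the example frame from the question
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8],
                   "y": [1, 2, 2, 3, 3, 3, 5, 2]})

# keep=False marks every member of a duplicated group as True,
# so the mask selects all rows whose y value appears more than once
dupes = df[df.duplicated('y', keep=False)]
print(dupes)
```

The rows with y equal to 1 and 5 (each occurring only once) are the only ones dropped.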