This is rather simple, but I can't get my head around it. Say that for the following data frame, I want to keep only the rows with duplicated values in column y:
>>> df
    x   y
0   1   1
1   2   2
2   3   2
3   4   3
4   5   3
5   6   3
6   7   5
7   8   2
The desired output looks like:
>>> df
    x   y
1   2   2
2   3   2
3   4   3
4   5   3
5   6   3
7   8   2
I tried this:
df[~df.duplicated('y')]
but I get this:
    x   y
0   1   1
1   2   2
3   4   3
6   7   5
As an aside, the opposite operation is Pandas' built-in drop_duplicates(), which by default returns a new DataFrame with duplicate rows removed (set inplace=True to modify the original DataFrame instead). Passing keep=False removes all duplicated rows, e.g. df.drop_duplicates(keep=False), and you can restrict the duplicate check to specific columns with df.drop_duplicates(subset=['col1', 'col2', ...]).
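A minimal sketch of those two drop_duplicates() variants, using a small made-up frame (column names a and b are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical frame just to illustrate the two calls above.
df = pd.DataFrame({'a': [1, 1, 2, 3],
                   'b': ['x', 'x', 'y', 'z']})

# keep=False drops every row that occurs more than once,
# leaving only the rows that were unique to begin with.
no_dupes = df.drop_duplicates(keep=False)

# subset= restricts the duplicate check to the listed columns;
# the first occurrence of each value in 'a' is kept.
by_a = df.drop_duplicates(subset=['a'])

print(no_dupes)
print(by_a)
```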
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
keep : {‘first’, ‘last’, False}, default ‘first’
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
Meaning you are looking for:
df[df.duplicated('y', keep=False)]
Output:
    x   y
1   2   2
2   3   2
3   4   3
4   5   3
5   6   3
7   8   2
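Putting it together, here is a self-contained sketch that rebuilds the example frame from the question and applies the mask:

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8],
                   'y': [1, 2, 2, 3, 3, 3, 5, 2]})

# duplicated('y', keep=False) marks every row whose y value occurs
# more than once, so the boolean mask selects all of those rows.
result = df[df.duplicated('y', keep=False)]
print(result)
```

Note that with keep=False no occurrence is treated as "the original": every row that shares its y value with another row is marked True, which is exactly what the question asks for.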