How to get all the existing duplicated sets of records(based on a column) from a dataframe?
I got a dataframe as follows:
flight_id | from_location | to_location | schedule |
1 | Vancouver | Toronto | 3-Jan |
2 | Amsterdam | Tokyo | 15-Feb |
4 | Fairbanks | Glasgow | 12-Jan |
9 | Halmstad | Athens | 21-Jan |
3 | Brisbane | Lisbon | 4-Feb |
4 | Johannesburg | Venice | 23-Jan |
9 | LosAngeles | Perth | 3-Mar |
Here flight_id is the column on which I need to check duplicates. And there are 2 sets of duplicates.
Output for this specific example should look like--[(2,5),(3,6)]
. List of tuples of record index values
We can use Pandas built-in method drop_duplicates() to drop duplicate rows. Note that we started out as 80 rows, now it's 77. By default, this method returns a new DataFrame with duplicate rows removed. We can set the argument inplace=True to remove duplicates from the original DataFrame.
Here is how to do it: Select the data. Go to Home –> Conditional Formatting –> Highlight Cell Rules –> Duplicate Values. In the Duplicate Values dialog box, select Duplicate in the drop down on the left, and specify the format in which you want to highlight the duplicate values.
Remove All Duplicate Rows from Pandas DataFrame You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows. For E.x, df. drop_duplicates(keep=False) .
DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique. In this article, you will learn how to use this method to identify the duplicate rows in a DataFrame.
Is this what you need ? duplicated
+groupby
(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple)
Out[510]:
flight_id
4 (2, 5)
9 (3, 6)
Name: index, dtype: object
Adding tolist
at the end
(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple).tolist()
Out[511]: [(2, 5), (3, 6)]
And another solution ... for fun only
s=df['flight_id'].value_counts()
list(map(lambda x : tuple(df[df['flight_id']==x].index.tolist()), s[s.gt(1)].index))
Out[519]: [(2, 5), (3, 6)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With