Multiple sets of duplicate records from a pandas dataframe

How do I get all the existing sets of duplicated records (based on a column) from a dataframe?

I have a dataframe as follows:

flight_id | from_location  | to_location |  schedule |  
1         |   Vancouver    |   Toronto   |   3-Jan   |  
2         |   Amsterdam    |   Tokyo     |   15-Feb  |  
4         |   Fairbanks    |   Glasgow   |   12-Jan  |  
9         |   Halmstad     |   Athens    |   21-Jan  |  
3         |   Brisbane     |   Lisbon    |   4-Feb   |  
4         | Johannesburg   |   Venice    |   23-Jan  |
9         | LosAngeles     |  Perth      |   3-Mar   |

Here flight_id is the column on which I need to check for duplicates, and there are 2 sets of duplicates.

The output for this specific example should be [(2, 5), (3, 6)]: a list of tuples of record index values.
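For reference, the example frame can be reconstructed from the table above like this (a minimal setup sketch; the duplicated-row check at the end is just to confirm which index labels are involved):

```python
import pandas as pd

# Rebuild the example dataframe from the question
df = pd.DataFrame({
    'flight_id':     [1, 2, 4, 9, 3, 4, 9],
    'from_location': ['Vancouver', 'Amsterdam', 'Fairbanks', 'Halmstad',
                      'Brisbane', 'Johannesburg', 'LosAngeles'],
    'to_location':   ['Toronto', 'Tokyo', 'Glasgow', 'Athens',
                      'Lisbon', 'Venice', 'Perth'],
    'schedule':      ['3-Jan', '15-Feb', '12-Jan', '21-Jan',
                      '4-Feb', '23-Jan', '3-Mar'],
})

# flight_id 4 appears at index 2 and 5, flight_id 9 at index 3 and 6
mask = df['flight_id'].duplicated(keep=False)
print(df.index[mask].tolist())  # [2, 3, 5, 6]
```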

Kingz asked Mar 23 '18



1 Answer

Is this what you need? duplicated + groupby:

(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple)
Out[510]: 
flight_id
4    (2, 5)
9    (3, 6)
Name: index, dtype: object

Adding tolist at the end gives a plain list:

(df.loc[df['flight_id'].duplicated(keep=False)].reset_index()).groupby('flight_id')['index'].apply(tuple).tolist()
Out[511]: [(2, 5), (3, 6)]

And another solution, for fun only:

s=df['flight_id'].value_counts()
list(map(lambda x : tuple(df[df['flight_id']==x].index.tolist()), s[s.gt(1)].index))
Out[519]: [(2, 5), (3, 6)]
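The same list can also be built without reset_index, by filtering on duplicated and collecting each group's index labels directly (a sketch; only the flight_id column is reproduced here since the other columns don't affect the result):

```python
import pandas as pd

# Only flight_id matters for this grouping
df = pd.DataFrame({'flight_id': [1, 2, 4, 9, 3, 4, 9]})

# Keep only rows whose flight_id occurs more than once
dupes = df[df['flight_id'].duplicated(keep=False)]

# One tuple of original index labels per duplicated flight_id
result = [tuple(g.index) for _, g in dupes.groupby('flight_id')]
print(result)  # [(2, 5), (3, 6)]
```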
BENY answered Oct 22 '22