Finding identical rows from 3 different pandas dataframes where rows are not similarly ordered

Question

I have three different pandas dataframes. The row orders in one dataframe is not related to that of other dataframes. I want to find a dataframe with rows, where each row will exist in three original dataframes at some index (which can differ across dataframes).

Let's see an example:

import pandas as pd 

# initialize list of lists 
data1 = [['tom', 10], ['mike', 15], ['lucy', 20], ['jerry', 5]] 
data2 = [['david', 5], ['bran', 6], ['tom', 10], ['mike', 15], ['jimmy', 20], ['kevin', 5], ['juli', 6]]
data3 = [['mike', 15], ['tom', 10], ['lucy', 20], ['ron', 5]]

# Create the pandas DataFrame 
df1 = pd.DataFrame(data1, columns = ['Name', 'Age']) 
df2 = pd.DataFrame(data2, columns = ['Name', 'Age']) 
df3 = pd.DataFrame(data3, columns = ['Name', 'Age'])

If we print these dataframes, we get

>> df1

    Name    Age
0   tom     10
1   mike    15
2   lucy    20
3   jerry   5

>> df2
    Name    Age
0   david   5
1   bran    6
2   tom     10
3   mike    15
4   jimmy   20
5   kevin   5
6   juli    6

>> df3
    Name    Age
0   mike    15
1   tom     10
2   lucy    20
3   ron     5

Note that the rows with values (tom, 10) and (mike, 15) occur in all 3 dataframes. So, the dataframe, I want to create would look like the following (the order of rows is not relevant):

    Name    Age
  0 mike    15
  1 tom     10

How can I do it in efficient manner? In my actual problem, I have 6 columns and number of rows in each dataframe can be in between 30 and 200.

oppressionslayer · Accepted Answer

You can use pivot then use aggfunc='size' to do this:

df = pd.concat([df1,df2,df3]) 
maxlen = df.pivot_table(index=['Name', 'Age'], aggfunc='size').max()
new_df = (df.pivot_table(index=['Name', 'Age'], aggfunc='size') == maxlen).reset_index() 
new_df[new_df[0] == True].reset_index(drop=True).drop(columns=0)


   Name  Age
0  mike   15
1   tom   10

You can also use groupby ( a simpler solution) :

new_df = pd.concat([df1,df2,df3]).groupby(['Name','Age']).size() 
new_df[new_df == new_df.max()] 

Name  Age
mike  15     3
tom   10     3
dtype: int64

and if you don't want the extra column:

new_df[new_df == new_df.max()].reset_index().drop(columns=0)

   Name  Age
0  mike   15
1   tom   10

hemanta · Answer

I used merge in pandas Dataframe.

pd.merge(pd.merge(df1, df2, how = 'inner', on='Name'), df3, how='inner', on='Name')

Experts, could you please comment on this approach? I am sure it is best for two Dataframes.

Finding identical rows from 3 different pandas dataframes where rows are not similarly ordered

Tags:

python

sql

pandas

dataframe

Atia Amin

2 Answers

oppressionslayer

hemanta

Recent Activity

Donate For Us

Finding identical rows from 3 different pandas dataframes where rows are not similarly ordered

Tags:

python

sql

pandas

dataframe

Atia Amin

2 Answers

oppressionslayer

hemanta

Related questions

Recent Activity

Donate For Us