I have three different pandas dataframes. The row orders in one dataframe is not related to that of other dataframes. I want to find a dataframe with rows, where each row will exist in three original dataframes at some index (which can differ across dataframes).
Let's see an example:
import pandas as pd
# initialize list of lists
data1 = [['tom', 10], ['mike', 15], ['lucy', 20], ['jerry', 5]]
data2 = [['david', 5], ['bran', 6], ['tom', 10], ['mike', 15], ['jimmy', 20], ['kevin', 5], ['juli', 6]]
data3 = [['mike', 15], ['tom', 10], ['lucy', 20], ['ron', 5]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data1, columns = ['Name', 'Age'])
df2 = pd.DataFrame(data2, columns = ['Name', 'Age'])
df3 = pd.DataFrame(data3, columns = ['Name', 'Age'])
If we print these dataframes, we get
>> df1
Name Age
0 tom 10
1 mike 15
2 lucy 20
3 jerry 5
>> df2
Name Age
0 david 5
1 bran 6
2 tom 10
3 mike 15
4 jimmy 20
5 kevin 5
6 juli 6
>> df3
Name Age
0 mike 15
1 tom 10
2 lucy 20
3 ron 5
Note that the rows with values (tom, 10) and (mike, 15) occur in all 3 dataframes. So, the dataframe, I want to create would look like the following (the order of rows is not relevant):
Name Age
0 mike 15
1 tom 10
How can I do it in efficient manner? In my actual problem, I have 6 columns and number of rows in each dataframe can be in between 30 and 200.
You can use pivot then use aggfunc='size' to do this:
df = pd.concat([df1,df2,df3])
maxlen = df.pivot_table(index=['Name', 'Age'], aggfunc='size').max()
new_df = (df.pivot_table(index=['Name', 'Age'], aggfunc='size') == maxlen).reset_index()
new_df[new_df[0] == True].reset_index(drop=True).drop(columns=0)
Name Age
0 mike 15
1 tom 10
You can also use groupby ( a simpler solution) :
new_df = pd.concat([df1,df2,df3]).groupby(['Name','Age']).size()
new_df[new_df == new_df.max()]
Name Age
mike 15 3
tom 10 3
dtype: int64
and if you don't want the extra column:
new_df[new_df == new_df.max()].reset_index().drop(columns=0)
Name Age
0 mike 15
1 tom 10
I used merge in pandas Dataframe.
pd.merge(pd.merge(df1, df2, how = 'inner', on='Name'), df3, how='inner', on='Name')
Experts, could you please comment on this approach? I am sure it is best for two Dataframes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With