I have 3 different DataFrames (1 master DataFrame and 2 additional DataFrames). I am trying to add a column to my master DataFrame, with the elements of the column being different cell values in the other two DataFrames. I am using two columns of the master DataFrame to figure out which of the 2 DataFrames I need to get data from, and two more columns to act as indexes to a particular cell in the selected DataFrame.
master_df = pd.DataFrame({
'col1': ['M', 'F', 'F', 'M'],
'col2': [0, 1, 2, 3],
'col3': ['X', 'Z', 'Z', 'X'],
'col4': [2021, 2022, 2023, 2024]
})
df1 = pd.DataFrame({
2021: [.632, .214, .987, .555],
2022: [.602, .232, .287, .552],
2023: [.932, .209, .347, .725],
2024: [.123, .234, .9873, .5005]
})
df2 = pd.DataFrame({
2021: [.6123, .2214, .4987, .555],
2022: [.6702, .232, .2897, .552],
2023: [.9372, .2, .37, .725],
2024: [.23, .24, .873, .005]
})
For each row of the master_df
, if the col1
value is 'M'
and the col3
value is 'X'
, I want to choose df1
. If the col1
value is 'F'
and the col3
value is 'Z'
, I want to choose df2
. Once I have selected the appropriate DataFrame
, I want to use col2
of the master_df
as a row index and col4
of the master_df
as a column index. Finally, I will get the selected cell value and put it into the new column to be added to the master_df
.
In this example, master_df
should look like this at the end:
master_df = pd.DataFrame({
'col1': ['M', 'F', 'F', 'M'],
'col2': [0, 1, 2, 3],
'col3': ['X', 'Z', 'Z', 'X'],
'col4': [2021, 2022, 2023, 2024],
'col5': [.632, .232, .37, .5005]
})
I have tried using a for loop to iterate through the master_df
, but it is extremely slow since the DataFrame
s that I'm working with have millions of rows each. Any efficient pandas solutions for this?
concat function is 50 times faster than using the DataFrame. append version. With multiple append , a new DataFrame is created at each iteration, and the underlying data is copied each time.
Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.
The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.
Your master_df
has only 2 combinations of value for master_df.col1
and master_df.col3
. Therefore, a simple .lookup
and np.where
will yield your desired output
df1_val = df1.lookup(master_df.col2, master_df.col4)
df2_val = df2.lookup(master_df.col2, master_df.col4)
master_df['col5'] = np.where(master_df.col1.eq('M') & master_df.col3.eq('X'), df1_val, df2_val)
Out[595]:
col1 col2 col3 col4 col5
0 M 0 X 2021 0.6320
1 F 1 Z 2022 0.2320
2 F 2 Z 2023 0.3700
3 M 3 X 2024 0.5005
Note: if master_df.col1
and master_df.col3
have more than 2 combinations of values, you just need np.select
instead of np.where
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With