Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently get cell values from multiple DataFrames to insert into a master DataFrame

I have 3 different DataFrames (1 master DataFrame and 2 additional DataFrames). I am trying to add a column to my master DataFrame, with the elements of the column being different cell values in the other two DataFrames. I am using two columns of the master DataFrame to figure out which of the 2 DataFrames I need to get data from, and two more columns to act as indexes to a particular cell in the selected DataFrame.


master_df = pd.DataFrame({
    'col1': ['M', 'F', 'F', 'M'],
    'col2': [0, 1, 2, 3],
    'col3': ['X', 'Z', 'Z', 'X'],
    'col4': [2021, 2022, 2023, 2024]
})

df1 = pd.DataFrame({
    2021: [.632, .214, .987, .555],
    2022: [.602, .232, .287, .552],
    2023: [.932, .209, .347, .725],
    2024: [.123, .234, .9873, .5005]
})

df2 = pd.DataFrame({
    2021: [.6123, .2214, .4987, .555],
    2022: [.6702, .232, .2897, .552],
    2023: [.9372, .2, .37, .725],
    2024: [.23, .24, .873, .005]
})

For each row of the master_df, if the col1 value is 'M' and the col3 value is 'X', I want to choose df1. If the col1 value is 'F' and the col3 value is 'Z', I want to choose df2. Once I have selected the appropriate DataFrame, I want to use col2 of the master_df as a row index and col4 of the master_df as a column index. Finally, I will get the selected cell value and put it into the new column to be added to the master_df.

In this example, master_df should look like this at the end:

master_df = pd.DataFrame({
    'col1': ['M', 'F', 'F', 'M'],
    'col2': [0, 1, 2, 3],
    'col3': ['X', 'Z', 'Z', 'X'],
    'col4': [2021, 2022, 2023, 2024],
    'col5': [.632, .232, .37, .5005]
})

I have tried using a for loop to iterate through the master_df, but it is extremely slow since the DataFrames that I'm working with have millions of rows each. Any efficient pandas solutions for this?

like image 680
hbdch Avatar asked Jul 24 '19 22:07

hbdch


People also ask

Which is faster append or concat?

concat function is 50 times faster than using the DataFrame. append version. With multiple append , a new DataFrame is created at each iteration, and the underlying data is copied each time.

What is the most efficient way to loop through DataFrames with pandas?

Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.

How do I combine multiple data frames into one?

The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.


1 Answers

Your master_df has only 2 combinations of value for master_df.col1 and master_df.col3. Therefore, a simple .lookup and np.where will yield your desired output

df1_val = df1.lookup(master_df.col2, master_df.col4)
df2_val = df2.lookup(master_df.col2, master_df.col4)
master_df['col5'] = np.where(master_df.col1.eq('M') & master_df.col3.eq('X'), df1_val, df2_val)

Out[595]:
  col1  col2 col3  col4    col5
0  M    0     X    2021  0.6320
1  F    1     Z    2022  0.2320
2  F    2     Z    2023  0.3700
3  M    3     X    2024  0.5005

Note: if master_df.col1 and master_df.col3 have more than 2 combinations of values, you just need np.select instead of np.where

like image 95
Andy L. Avatar answered Oct 23 '22 04:10

Andy L.