How to efficiently get cell values from multiple DataFrames to insert into a master DataFrame

Tags: python, pandas, dataframe

I have 3 different DataFrames (1 master DataFrame and 2 additional DataFrames). I am trying to add a column to my master DataFrame, with the elements of the column being different cell values in the other two DataFrames. I am using two columns of the master DataFrame to figure out which of the 2 DataFrames I need to get data from, and two more columns to act as indexes to a particular cell in the selected DataFrame.


import pandas as pd

master_df = pd.DataFrame({
    'col1': ['M', 'F', 'F', 'M'],
    'col2': [0, 1, 2, 3],
    'col3': ['X', 'Z', 'Z', 'X'],
    'col4': [2021, 2022, 2023, 2024]
})

df1 = pd.DataFrame({
    2021: [.632, .214, .987, .555],
    2022: [.602, .232, .287, .552],
    2023: [.932, .209, .347, .725],
    2024: [.123, .234, .9873, .5005]
})

df2 = pd.DataFrame({
    2021: [.6123, .2214, .4987, .555],
    2022: [.6702, .232, .2897, .552],
    2023: [.9372, .2, .37, .725],
    2024: [.23, .24, .873, .005]
})

For each row of the master_df, if the col1 value is 'M' and the col3 value is 'X', I want to choose df1. If the col1 value is 'F' and the col3 value is 'Z', I want to choose df2. Once I have selected the appropriate DataFrame, I want to use col2 of the master_df as a row index and col4 of the master_df as a column index. Finally, I will get the selected cell value and put it into the new column to be added to the master_df.

In this example, master_df should look like this at the end:

master_df = pd.DataFrame({
    'col1': ['M', 'F', 'F', 'M'],
    'col2': [0, 1, 2, 3],
    'col3': ['X', 'Z', 'Z', 'X'],
    'col4': [2021, 2022, 2023, 2024],
    'col5': [.632, .232, .37, .5005]
})

I have tried using a for loop to iterate through the master_df, but it is extremely slow since the DataFrames that I'm working with have millions of rows each. Any efficient pandas solutions for this?

asked Jul 24 '19 22:07

hbdch

1 Answer

Your master_df has only 2 combinations of values for master_df.col1 and master_df.col3, so a simple .lookup on each DataFrame followed by np.where will yield your desired output:

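import numpy as np

# Note: DataFrame.lookup was deprecated in pandas 1.2.0 and removed in 2.0.0, so this needs pandas < 2.0.
df1_val = df1.lookup(master_df.col2, master_df.col4)
df2_val = df2.lookup(master_df.col2, master_df.col4)
master_df['col5'] = np.where(master_df.col1.eq('M') & master_df.col3.eq('X'), df1_val, df2_val)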

Out[595]:
  col1  col2 col3  col4    col5
0    M     0    X  2021  0.6320
1    F     1    Z  2022  0.2320
2    F     2    Z  2023  0.3700
3    M     3    X  2024  0.5005
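
On pandas 2.0 and later, DataFrame.lookup is no longer available. The following is a minimal sketch of one way to get the same result with NumPy integer indexing. The helper name lookup_cells is hypothetical (it is not a pandas function), and the sketch assumes every value in col2 and col4 exists as a row label or column label in the frame being read.

```python
import numpy as np
import pandas as pd

master_df = pd.DataFrame({
    'col1': ['M', 'F', 'F', 'M'],
    'col2': [0, 1, 2, 3],
    'col3': ['X', 'Z', 'Z', 'X'],
    'col4': [2021, 2022, 2023, 2024],
})
df1 = pd.DataFrame({
    2021: [.632, .214, .987, .555],
    2022: [.602, .232, .287, .552],
    2023: [.932, .209, .347, .725],
    2024: [.123, .234, .9873, .5005],
})
df2 = pd.DataFrame({
    2021: [.6123, .2214, .4987, .555],
    2022: [.6702, .232, .2897, .552],
    2023: [.9372, .2, .37, .725],
    2024: [.23, .24, .873, .005],
})


def lookup_cells(df, row_labels, col_labels):
    """Hypothetical helper: return df.at[r, c] for each (r, c) pair, vectorized."""
    # Translate labels into integer positions (-1 marks a missing label).
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    if (rows < 0).any() or (cols < 0).any():
        raise KeyError("some row or column labels were not found")
    # Paired integer indexing picks one cell per (row, col) pair.
    return df.to_numpy()[rows, cols]


df1_val = lookup_cells(df1, master_df['col2'], master_df['col4'])
df2_val = lookup_cells(df2, master_df['col2'], master_df['col4'])

# Rows matching ('M', 'X') read from df1; all other rows read from df2.
use_df1 = master_df['col1'].eq('M') & master_df['col3'].eq('X')
master_df['col5'] = np.where(use_df1, df1_val, df2_val)
```

Both frames are read in full for every row and np.where then keeps one value per row, so the cost is two vectorized passes rather than a Python-level loop.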

Note: if master_df.col1 and master_df.col3 have more than 2 combinations of values, use np.select instead of np.where.
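
As a rough illustration of the np.select variant, the sketch below assumes a third source and a third combination ('M', 'Z') that do not appear in the question. The arrays df1_val, df2_val and df3_val stand in for per-DataFrame lookup results already aligned with the rows of master_df, and their values are made up for the example.

```python
import numpy as np
import pandas as pd

master_df = pd.DataFrame({
    'col1': ['M', 'F', 'M', 'F'],
    'col3': ['X', 'Z', 'Z', 'X'],
})

# Placeholder lookup results, one value per master_df row (illustrative only).
df1_val = np.array([0.1, 0.2, 0.3, 0.4])
df2_val = np.array([1.1, 1.2, 1.3, 1.4])
df3_val = np.array([2.1, 2.2, 2.3, 2.4])  # hypothetical third DataFrame

# One boolean mask per combination, in the same order as the choices.
conditions = [
    master_df['col1'].eq('M') & master_df['col3'].eq('X'),
    master_df['col1'].eq('F') & master_df['col3'].eq('Z'),
    master_df['col1'].eq('M') & master_df['col3'].eq('Z'),  # assumed combination
]
choices = [df1_val, df2_val, df3_val]

# Rows matching none of the conditions get NaN instead of a silent fallback.
master_df['col5'] = np.select(conditions, choices, default=np.nan)
```

Unlike np.where, which sends every non-matching row to the second source, np.select with default=np.nan makes unmatched combinations visible as missing values.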

answered Oct 23 '22 04:10

Andy L.
