Adding calculated columns and then just new data to a Pandas dataframe iteratively (python 3.7.1)

I have one initial dataframe df1:

    df1 = pd.DataFrame(np.array([[1, 'B', 'C', 'D', 'E'], [2, 'B', 'C', 'D', 'E'], [3, 'B', 'C', 'D', 'E'], [4, 'B', 'C', 'D', 'E'], [5, 'B', 'C', 'D', 'E']]), columns=['a', 'b', 'c', 'd', 'e'])

        a   b   c   d   e
    0   1   B   C   D   E
    1   2   B   C   D   E
    2   3   B   C   D   E
    3   4   B   C   D   E
    4   5   B   C   D   E

Then I compute some new parameters based on df1 column values, create a new df2 and merge with df1 on column name "a".

    df2 = pd.DataFrame(np.array([[1, 'F', 'G'], [2, 'F', 'G']]), columns=['a', 'f', 'g'])

        a   f   g
    0   1   F   G
    1   2   F   G

    df1 = pd.merge(df1, df2,  how='left', left_on=['a'], right_on = ['a'])

        a   b   c   d   e   f   g
    0   1   B   C   D   E   F   G
    1   2   B   C   D   E   F   G
    2   3   B   C   D   E   NaN NaN
    3   4   B   C   D   E   NaN NaN
    4   5   B   C   D   E   NaN NaN

This works perfectly fine. In a later loop iteration, however, I create a df3 with the same columns as df2, and this time the merge does not work: it does not take into account that the same columns already exist in df1.

IMPORTANT REMARK: This is for illustration purposes only; there are thousands of new dataframes to be added, one per loop step.

    df3 = pd.DataFrame(np.array([[3, 'F', 'G']]), columns=['a', 'f', 'g'])

        a   f   g
    0   3   F   G

    df1 = pd.merge(df1, df3,  how='left', left_on=['a'], right_on = ['a'])

        a   b   c   d   e   f_x g_x f_y g_y
    0   1   B   C   D   E   F   G   NaN NaN
    1   2   B   C   D   E   F   G   NaN NaN
    2   3   B   C   D   E   NaN NaN F   G
    3   4   B   C   D   E   NaN NaN NaN NaN
    4   5   B   C   D   E   NaN NaN NaN NaN

I just want to fill the missing gaps using the already existing columns. This approach creates new columns instead (f_x, g_x, f_y, g_y).

Append and concat do not work either, as they repeat information (duplicate rows for the same "a").

Any advice on how to solve this? The final result after merging df1 with df2, and then with df3, should be:

        a   b   c   d   e   f   g
    0   1   B   C   D   E   F   G
    1   2   B   C   D   E   F   G
    2   3   B   C   D   E   F   G
    3   4   B   C   D   E   NaN NaN
    4   5   B   C   D   E   NaN NaN

Eventually all the columns will be filled during the loop: the first dataframe added (df2) creates the new columns, and from df3 onwards each one only supplies new data to fill the remaining NaN. The loop looks like this:

    df1 = pd.DataFrame(np.array([[1, 'B', 'C', 'D', 'E'], [2, 'B', 'C', 'D', 'E'], [3, 'B', 'C', 'D', 'E'], [4, 'B', 'C', 'D', 'E'], [5, 'B', 'C', 'D', 'E']]), columns=['a', 'b', 'c', 'd', 'e'])

    for num, item in enumerate(df1['a']):
        # compute df[num] (based on values in df1)
        df1 = pd.merge(df1, df[num],  how='left', left_on=['a'], right_on = ['a'])

asked Mar 14 '19 11:03

juanman

2 Answers

One possible solution is to concat all the small DataFrames first and then merge only once:

    df4 = pd.concat([df2, df3])
    print (df4)
       a  f  g
    0  1  F  G
    1  2  F  G
    0  3  F  G

    df1 = pd.merge(df1, df4,  how='left', on = 'a')
    print (df1)
       a  b  c  d  e    f    g
    0  1  B  C  D  E    F    G
    1  2  B  C  D  E    F    G
    2  3  B  C  D  E    F    G
    3  4  B  C  D  E  NaN  NaN
    4  5  B  C  D  E  NaN  NaN
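For the loop described in the question, the same idea could be sketched as follows: collect each step's small DataFrame in a list, concatenate once after the loop, and merge once. The function `compute_piece` is a hypothetical stand-in for the per-step computation, and the data is rebuilt here with string keys to mirror what `np.array` produces in the question; neither comes from the original post.

```python
import pandas as pd

# Same shape as df1 in the question; 'a' holds strings, as np.array would produce
df1 = pd.DataFrame({'a': ['1', '2', '3', '4', '5'],
                    'b': 'B', 'c': 'C', 'd': 'D', 'e': 'E'})

def compute_piece(key):
    # Hypothetical placeholder for the real per-step computation
    return pd.DataFrame({'a': [key], 'f': ['F'], 'g': ['G']})

pieces = []
for key in df1['a'].iloc[:3]:      # only the first three keys, for illustration
    pieces.append(compute_piece(key))

# One concat and one merge, instead of one merge per loop step
new_cols = pd.concat(pieces, ignore_index=True)
result = df1.merge(new_cols, how='left', on='a')
print(result)
```

This assumes the computation of each piece does not depend on the columns added in earlier steps; if it does, the pieces cannot be deferred to a single merge at the end.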

Another possible solution is to use DataFrame.combine_first with DataFrame.set_index:

    df1 = (df1.set_index('a')
             .combine_first(df2.set_index('a'))
             .combine_first(df3.set_index('a')))
    print (df1)
       b  c  d  e    f    g
    a
    1  B  C  D  E    F    G
    2  B  C  D  E    F    G
    3  B  C  D  E    F    G
    4  B  C  D  E  NaN  NaN
    5  B  C  D  E  NaN  NaN
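If "a" should end up as an ordinary column again, and the pieces arrive one at a time, the combine_first approach could be written as a loop followed by reset_index. This is only a sketch: the data is rebuilt here with string keys to mirror the question, and the tuple `(df2, df3)` stands in for whatever the real loop produces.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['1', '2', '3', '4', '5'],
                    'b': 'B', 'c': 'C', 'd': 'D', 'e': 'E'})
df2 = pd.DataFrame({'a': ['1', '2'], 'f': 'F', 'g': 'G'})
df3 = pd.DataFrame({'a': ['3'], 'f': ['F'], 'g': ['G']})

result = df1.set_index('a')
for piece in (df2, df3):
    # Existing values are kept; only missing cells (or missing columns) are taken from piece
    result = result.combine_first(piece.set_index('a'))

# Turn the index back into a regular column 'a'
result = result.reset_index()
print(result)
```

Note that combine_first returns the union of both indexes and columns in sorted order, so the row and column order may differ from the input when the labels are not already sorted.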

answered Nov 15 '22 04:11

jezrael

Another way is to use fillna and then drop the extra columns you don't need anymore:

    # Fill NaN in the original columns with the values from the extra columns
    # (assigning the result back avoids relying on inplace=True on a column selection)
    df1['f_x'] = df1['f_x'].fillna(df1['f_y'])
    df1['g_x'] = df1['g_x'].fillna(df1['g_y'])

       a  b  c  d  e  f_x  g_x  f_y  g_y
    0  1  B  C  D  E    F    G  NaN  NaN
    1  2  B  C  D  E    F    G  NaN  NaN
    2  3  B  C  D  E    F    G    F    G
    3  4  B  C  D  E  NaN  NaN  NaN  NaN
    4  5  B  C  D  E  NaN  NaN  NaN  NaN

    # Slice off the last two columns (f_y, g_y)
    df1 = df1.iloc[:, :-2]
    # Rename the remaining columns back to f and g
    df1.columns = df1.columns.str.replace('_x', '')

Output

       a  b  c  d  e    f    g
    0  1  B  C  D  E    F    G
    1  2  B  C  D  E    F    G
    2  3  B  C  D  E    F    G
    3  4  B  C  D  E  NaN  NaN
    4  5  B  C  D  E  NaN  NaN
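The fill-then-drop steps above could be wrapped in a small helper so they work for any set of overlapping columns and can be called once per loop step. The helper name `merge_fill` and the `_new` suffix are made up for this sketch, and the data is rebuilt with string keys to mirror the question.

```python
import pandas as pd

def merge_fill(left, right, on):
    """Left-merge right into left; where a column already exists, only fill its NaN."""
    # Columns present in both frames (other than the key) would otherwise get suffixes
    overlap = [c for c in right.columns if c != on and c in left.columns]
    # Keep the left names unchanged and mark the incoming duplicates with '_new'
    merged = left.merge(right, how='left', on=on, suffixes=('', '_new'))
    for col in overlap:
        merged[col] = merged[col].fillna(merged[col + '_new'])
    return merged.drop(columns=[c + '_new' for c in overlap])

df1 = pd.DataFrame({'a': ['1', '2', '3', '4', '5'],
                    'b': 'B', 'c': 'C', 'd': 'D', 'e': 'E'})
df2 = pd.DataFrame({'a': ['1', '2'], 'f': 'F', 'g': 'G'})
df3 = pd.DataFrame({'a': ['3'], 'f': ['F'], 'g': ['G']})

result = merge_fill(df1, df2, on='a')     # first call adds columns f and g
result = merge_fill(result, df3, on='a')  # later calls only fill the gaps
print(result)
```

Existing non-NaN values are never overwritten here, and the sketch assumes no column in the input already ends in `_new`.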

answered Nov 15 '22 05:11

Erfan
