Adding calculated columns and then just new data to a Pandas dataframe iteratively (python 3.7.1)

I have one initial dataframe df1:

    df1 = pd.DataFrame(np.array([[1, 'B', 'C', 'D', 'E'], [2, 'B', 'C', 'D', 'E'], [3, 'B', 'C', 'D', 'E'], [4, 'B', 'C', 'D', 'E'], [5, 'B', 'C', 'D', 'E']]), columns=['a', 'b', 'c', 'd', 'e'])

        a   b   c   d   e
    0   1   B   C   D   E
    1   2   B   C   D   E
    2   3   B   C   D   E
    3   4   B   C   D   E
    4   5   B   C   D   E

Then I compute some new parameters from the column values of df1, store them in a new df2, and merge it into df1 on column "a".

    df2 = pd.DataFrame(np.array([[1, 'F', 'G'], [2, 'F', 'G']]), columns=['a', 'f', 'g'])

        a   f   g
    0   1   F   G
    1   2   F   G

    df1 = pd.merge(df1, df2,  how='left', left_on=['a'], right_on = ['a'])

        a   b   c   d   e   f   g
    0   1   B   C   D   E   F   G
    1   2   B   C   D   E   F   G
    2   3   B   C   D   E   NaN NaN
    3   4   B   C   D   E   NaN NaN
    4   5   B   C   D   E   NaN NaN

This works fine. In a later loop step, however, I create a df3 with the same columns as df2, and this time the merge does not do what I want: it does not take into account that columns f and g already exist in df1.

IMPORTANT REMARK: This is for illustration purposes only; there are thousands of new dataframes to be added, one per loop step.

    df3 = pd.DataFrame(np.array([[3, 'F', 'G']]), columns=['a', 'f', 'g'])

        a   f   g
    0   3   F   G

    df1 = pd.merge(df1, df3,  how='left', left_on=['a'], right_on = ['a'])

        a   b   c   d   e   f_x g_x f_y g_y
    0   1   B   C   D   E   F   G   NaN NaN
    1   2   B   C   D   E   F   G   NaN NaN
    2   3   B   C   D   E   NaN NaN F   G
    3   4   B   C   D   E   NaN NaN NaN NaN
    4   5   B   C   D   E   NaN NaN NaN NaN

I just want to fill the missing gaps in the already existing columns. Instead, this approach creates new columns (f_x, g_x, f_y, g_y).

append and concat do not work either, as they repeat information (duplicate rows for "a").

Any advice on how to solve this? The final result after merging df1 with df2, and then with df3, should be:

        a   b   c   d   e   f   g
    0   1   B   C   D   E   F   G
    1   2   B   C   D   E   F   G
    2   3   B   C   D   E   F   G
    3   4   B   C   D   E   NaN NaN
    4   5   B   C   D   E   NaN NaN

Eventually all the columns will be filled during the loop: the first dataframe added (df2) introduces the new columns, and from df3 onwards each one only supplies new data to fill the NaN values. The loop looks like this:

    df1 = pd.DataFrame(np.array([[1, 'B', 'C', 'D', 'E'], [2, 'B', 'C', 'D', 'E'], [3, 'B', 'C', 'D', 'E'], [4, 'B', 'C', 'D', 'E'], [5, 'B', 'C', 'D', 'E']]), columns=['a', 'b', 'c', 'd', 'e'])
    for num, item in enumerate(df1['a']):
        # compute df[num] (based on values in df1)
        df1 = pd.merge(df1, df[num],  how='left', left_on=['a'], right_on = ['a'])

juanman asked Mar 14 '19 11:03



2 Answers

One possible solution is to concat all the small DataFrames first and then merge only once:

df4 = pd.concat([df2, df3])
print (df4)
   a  f  g
0  1  F  G
1  2  F  G
0  3  F  G

df1 = pd.merge(df1, df4,  how='left', on = 'a')
print (df1)
   a  b  c  d  e    f    g
0  1  B  C  D  E    F    G
1  2  B  C  D  E    F    G
2  3  B  C  D  E    F    G
3  4  B  C  D  E  NaN  NaN
4  5  B  C  D  E  NaN  NaN
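
For the looped case described in the question (one small DataFrame per step), the same idea can be sketched by collecting the pieces in a list and merging once after the loop. This is only a sketch: `compute_step` is a hypothetical stand-in for the per-step computation, which the question does not show, and `pieces`, `new_cols` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same df1 as in the question (np.array turns every value into a string).
df1 = pd.DataFrame(
    np.array([[i, 'B', 'C', 'D', 'E'] for i in range(1, 6)]),
    columns=['a', 'b', 'c', 'd', 'e'],
)


def compute_step(key):
    # Hypothetical placeholder: returns a one-row frame for the given key.
    return pd.DataFrame({'a': [key], 'f': ['F'], 'g': ['G']})


# Collect the small frames in a list instead of merging inside the loop.
# Only the first three keys are processed here, to leave some NaN rows.
pieces = [compute_step(key) for key in df1['a'].iloc[:3]]

# One concat and one merge, however many pieces there are.
new_cols = pd.concat(pieces, ignore_index=True)
result = pd.merge(df1, new_cols, how='left', on='a')
print(result)
```

Because the merge happens only once, no `_x`/`_y` columns appear, and the cost of the merge should not grow with the number of loop steps.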

Another possible solution is to use DataFrame.combine_first with DataFrame.set_index:

df1 = (df1.set_index('a')
         .combine_first(df2.set_index('a'))
         .combine_first(df3.set_index('a')))
print (df1)
   b  c  d  e    f    g
a                      
1  B  C  D  E    F    G
2  B  C  D  E    F    G
3  B  C  D  E    F    G
4  B  C  D  E  NaN  NaN
5  B  C  D  E  NaN  NaN
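
If the frames arrive one per loop step, the chained calls can be written as a loop, with `reset_index` at the end to turn "a" back into an ordinary column. A minimal sketch, assuming the frames from the question; the names `result` and `piece` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.array([[i, 'B', 'C', 'D', 'E'] for i in range(1, 6)]),
    columns=['a', 'b', 'c', 'd', 'e'],
)
df2 = pd.DataFrame(np.array([[1, 'F', 'G'], [2, 'F', 'G']]), columns=['a', 'f', 'g'])
df3 = pd.DataFrame(np.array([[3, 'F', 'G']]), columns=['a', 'f', 'g'])

# Index by the key once; each new frame then only fills existing gaps
# (or adds columns that are not there yet).
result = df1.set_index('a')
for piece in (df2, df3):
    result = result.combine_first(piece.set_index('a'))

# Turn the key back into a regular column.
result = result.reset_index()
print(result)
```

Note that combine_first builds the union of both indexes and both column sets, so the row and column order may differ from the original when the keys or column names are not already sorted.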

jezrael answered Nov 15 '22 04:11


Another way is to use fillna and then drop the extra columns you don't need anymore:

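# Fill NaN in the _x columns with the values from the _y columns
# (assigning the result back avoids relying on inplace=True through attribute access)
df1['f_x'] = df1['f_x'].fillna(df1['f_y'])
df1['g_x'] = df1['g_x'].fillna(df1['g_y'])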

   a  b  c  d  e  f_x  g_x  f_y  g_y
0  1  B  C  D  E    F    G  NaN  NaN
1  2  B  C  D  E    F    G  NaN  NaN
2  3  B  C  D  E    F    G    F    G
3  4  B  C  D  E  NaN  NaN  NaN  NaN
4  5  B  C  D  E  NaN  NaN  NaN  NaN

# Slice off the last two columns (f_y and g_y)
df1 = df1.iloc[:, :-2]
# Rename the columns back to f and g
df1.columns = df1.columns.str.replace('_x', '')

Output

   a  b  c  d  e    f    g
0  1  B  C  D  E    F    G
1  2  B  C  D  E    F    G
2  3  B  C  D  E    F    G
3  4  B  C  D  E  NaN  NaN
4  5  B  C  D  E  NaN  NaN
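
The fill-and-drop steps can also be wrapped in a small helper so they do not depend on the column names or positions. This is a sketch, not a pandas feature: `merge_fill` and the `_new` suffix are made-up names, and it assumes the frames from the question.

```python
import numpy as np
import pandas as pd


def merge_fill(left, right, key='a'):
    """Left-merge `right` into `left`, filling gaps in columns both frames share."""
    # Shared columns from `right` get the '_new' suffix; `left` keeps its names.
    merged = pd.merge(left, right, how='left', on=key, suffixes=('', '_new'))
    for col in right.columns:
        new_col = col + '_new'
        if new_col in merged.columns:
            # Keep existing values; take the new value only where there is a NaN.
            merged[col] = merged[col].fillna(merged[new_col])
            merged = merged.drop(columns=new_col)
    return merged


df1 = pd.DataFrame(
    np.array([[i, 'B', 'C', 'D', 'E'] for i in range(1, 6)]),
    columns=['a', 'b', 'c', 'd', 'e'],
)
df2 = pd.DataFrame(np.array([[1, 'F', 'G'], [2, 'F', 'G']]), columns=['a', 'f', 'g'])
df3 = pd.DataFrame(np.array([[3, 'F', 'G']]), columns=['a', 'f', 'g'])

df1 = merge_fill(df1, df2)  # first call adds the new columns f and g
df1 = merge_fill(df1, df3)  # later calls only fill NaN in f and g
print(df1)
```

This still runs one merge per step, so with thousands of frames it will probably be slower than concatenating everything first and merging once.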

Erfan answered Nov 15 '22 05:11