I have one initial dataframe df1:
df1 = pd.DataFrame(np.array([[1, 'B', 'C', 'D', 'E'], [2, 'B', 'C', 'D', 'E'], [3, 'B', 'C', 'D', 'E'], [4, 'B', 'C', 'D', 'E'], [5, 'B', 'C', 'D', 'E']]), columns=['a', 'b', 'c', 'd', 'e'])
a b c d e
0 1 B C D E
1 2 B C D E
2 3 B C D E
3 4 B C D E
4 5 B C D E
Then I compute some new parameters based on df1 column values, create a new df2 and merge with df1 on column name "a".
df2 = pd.DataFrame(np.array([[1, 'F', 'G'], [2, 'F', 'G']]), columns=['a', 'f', 'g'])
a f g
0 1 F G
1 2 F G
df1 = pd.merge(df1, df2, how='left', left_on=['a'], right_on = ['a'])
a b c d e f g
0 1 B C D E F G
1 2 B C D E F G
2 3 B C D E NaN NaN
3 4 B C D E NaN NaN
4 5 B C D E NaN NaN
This works perfectly fine, but in another loop event, I create a df3 with same columns as df2 but merge in this case does not work, it doesn't take into account that the same columns are already in df1.
IMPORTANT REMARK: This is for illustration purpose only, there are thousands of new dataframes to be added, one per loop step.
df3 = pd.DataFrame(np.array([[3, 'F', 'G']]), columns=['a', 'f', 'g'])
a f g
0 3 F G
df1 = pd.merge(df1, df3, how='left', left_on=['a'], right_on = ['a'])
a b c d e f_x g_x f_y g_y
0 1 B C D E F G NaN NaN
1 2 B C D E F G NaN NaN
2 3 B C D E NaN NaN F G
3 4 B C D E NaN NaN NaN NaN
4 5 B C D E NaN NaN NaN NaN
I just one to fill missing gaps using the already existing columns. This approach creates new columns (f_x, g_x, f_y, g_y)
.
Append and contact also does not work as they repeats information (repeated rows on "a").
Any advice on how to solve this? Final result after merging df1
with df2
, and after with df3
should be:
a b c d e f g
0 1 B C D E F G
1 2 B C D E F G
2 3 B C D E F G
3 4 B C D E NaN NaN
4 5 B C D E NaN NaN
Eventually all the columns will be filled during the loop, so the first added (df2) will add new columns, and from df3 onwards just new data to fill all NaN. The loop looks like this:
df1 = pd.DataFrame(np.array([[1, 'B', 'C', 'D', 'E'], [2, 'B', 'C', 'D', 'E'], [3, 'B', 'C', 'D', 'E'], [4, 'B', 'C', 'D', 'E'], [5, 'B', 'C', 'D', 'E']]), columns=['a', 'b', 'c', 'd', 'e'])
for num, item in enumerate(df1['a']):
#compute df[num] (based on values on df1)
df1 = pd.merge(df1, df[num], how='left', left_on=['a'], right_on = ['a'])
append() function is used to append rows of other dataframe to the end of the given dataframe, returning a new dataframe object. Columns not in the original dataframes are added as new columns and the new cells are populated with NaN value. Parameters: other : DataFrame or Series/dict-like object, or list of these.
Using DataFrame. insert() method, we can add new columns at specific position of the column name sequence. Although insert takes single column name, value as input, but we can use it repeatedly to add multiple columns to the DataFrame.
Combine Two Columns Using + Operator By use + operator simply you can combine/merge two or multiple text/string columns in pandas DataFrame. Note that when you apply + operator on numeric columns it actually does addition instead of concatenation.
One possible solution is concat
all small DataFrame
s and then only once merge
:
df4 = pd.concat([df2, df3])
print (df4)
a f g
0 1 F G
1 2 F G
0 3 F G
df1 = pd.merge(df1, df4, how='left', on = 'a')
print (df1)
a b c d e f g
0 1 B C D E F G
1 2 B C D E F G
2 3 B C D E F G
3 4 B C D E NaN NaN
4 5 B C D E NaN NaN
Another possible solution is use DataFrame.combine_first
with DataFrame.set_index
:
df1 = (df1.set_index('a')
.combine_first(df2.set_index('a'))
.combine_first(df3.set_index('a')))
print (df1)
b c d e f g
a
1 B C D E F G
2 B C D E F G
3 B C D E F G
4 B C D E NaN NaN
5 B C D E NaN NaN
Another way is too use fillna
then drop the extra columns you dont need anymore:
# Fill NaN with the extra columns value
df1.f_x.fillna(df1.f_y, inplace=True)
df1.g_x.fillna(df1.g_y, inplace=True)
a b c d e f_x g_x f_y g_y
0 1 B C D E F G NaN NaN
1 2 B C D E F G NaN NaN
2 3 B C D E F G F G
3 4 B C D E NaN NaN NaN NaN
4 5 B C D E NaN NaN NaN NaN
# Slice of the last two columns
df1 = df1.iloc[:, :-2]
# Rename the columns correctly
df1.columns = df1.columns.str.replace('_x', '')
Output
a b c d e f g
0 1 B C D E F G
1 2 B C D E F G
2 3 B C D E F G
3 4 B C D E NaN NaN
4 5 B C D E NaN NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With