My question is very similar to "Merge pandas dataframe, with column operation", but that question doesn't answer my needs.
Let's say I have two dataframes such as (note that the dataframe content could be float numbers instead of booleans):
left = pd.DataFrame({0: [True, True, False], 0.5: [False, True, True]}, index=[12.5, 14, 15.5])
right = pd.DataFrame({0.7: [True, False, False], 0.5: [True, False, True]}, index=[12.5, 14, 15.5])
right:
      0.5    0.7
12.5  True   True
14.0  False  False
15.5  True   False
left:
      0.0    0.5
12.5  True   False
14.0  True   True
15.5  False  True
As you can see, they have the same index and one of the columns is common. In real life there might be more common columns (one more at 1.0, say, or at other values not yet defined) and more unique columns on each side. I need to combine the two dataframes such that all unique columns are kept and the common columns are combined using a specific function, e.g. a boolean OR in this example. The indexes are always identical for both dataframes.
So the result should be:
0.0 0.5 0.7
12.5 True True True
14.0 True True False
15.5 False True False
In real life there will be more than two dataframes that need to be combined, but they can be combined sequentially one after the other to an empty first dataframe.
I feel pandas.combine might do the trick, but I can't figure it out from the documentation. Does anybody have a suggestion on how to do it in one or more steps?
You can concatenate the dataframes and then group by the column names to apply an operation to the identically named columns. In this case you can get away with taking the sum and then casting back to bool to get the OR operation.
import pandas as pd
# Recent pandas versions require axis as a keyword in concat and deprecate
# groupby(..., axis=1), so group the transposed frame by its index instead.
df = pd.concat([left, right], axis=1)
df.T.groupby(level=0).sum().T.astype(bool)
0.0 0.5 0.7
12.5 True True True
14.0 True True False
15.5 False True False
If you need to do this in a less case-specific manner, again group by the column names and apply a function to each group of same-named columns, reducing it along axis=1:
df = pd.concat([left, right], axis=1)
# each group g holds the same-named columns as rows, so transpose it back before reducing
df.T.groupby(level=0).apply(lambda g: g.T.any(axis=1)).T
# 0.0 0.5 0.7
#12.5 True True True
#14.0 True True False
#15.5 False True False
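For completeness, here is a self-contained sketch of the boolean case using the frames from the question. The helper name combine_columns is made up for illustration, and it groups the transposed frame rather than passing axis=1 to groupby, on the assumption that a recent pandas version is installed.

```python
import pandas as pd

left = pd.DataFrame({0: [True, True, False], 0.5: [False, True, True]},
                    index=[12.5, 14, 15.5])
right = pd.DataFrame({0.7: [True, False, False], 0.5: [True, False, True]},
                     index=[12.5, 14, 15.5])


def combine_columns(frames, func):
    """Hypothetical helper: merge same-named columns of `frames` with `func`.

    `func` receives a DataFrame holding every column that shares one name
    and must return a Series indexed like the input frames.
    """
    df = pd.concat(frames, axis=1)
    # Transpose so the column names become index labels, group on them,
    # hand each group back to `func` in its original orientation, and
    # transpose the stacked result again.
    return df.T.groupby(level=0).apply(lambda g: func(g.T)).T


# Boolean OR across same-named columns.
result = combine_columns([left, right], lambda x: x.any(axis=1))
print(result)
```

Any other row-wise reduction, such as lambda x: x.all(axis=1) or lambda x: x.max(axis=1), can be passed in the same way.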
Further, you can define a custom combining function. Here's one which adds twice the left Frame to 4 times the right Frame. If there is only one column, it returns 2x the left frame.
left:
0.0 0.5
12.5 1 11
14.0 2 17
15.5 3 17
right:
0.7 0.5
12.5 4 2
14.0 4 -1
15.5 5 5
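def my_func(x):
    # x holds all columns that share one name: left first, then right
    try:
        res = x.iloc[:, 0]*2 + x.iloc[:, 1]*4
    except IndexError:
        # only one column with this name
        res = x.iloc[:, 0]*2
    return res

df = pd.concat([left, right], axis=1)
# group the transposed frame, since groupby(..., axis=1) is deprecated in recent pandas
df.T.groupby(level=0).apply(lambda g: my_func(g.T)).T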
0.0 0.5 0.7
12.5 2 30 8
14.0 4 30 8
15.5 6 54 10
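The integer frames are only shown as printed tables above, so here is a runnable sketch that rebuilds them and repeats the calculation. The constructor calls are reconstructed from those tables, and the grouping goes through a transpose on the assumption that groupby(..., axis=1) is deprecated or unavailable in the installed pandas.

```python
import pandas as pd

idx = [12.5, 14.0, 15.5]
# Rebuilt from the printed tables above.
left = pd.DataFrame({0.0: [1, 2, 3], 0.5: [11, 17, 17]}, index=idx)
right = pd.DataFrame({0.7: [4, 4, 5], 0.5: [2, -1, 5]}, index=idx)


def my_func(x):
    # x holds every column sharing one name, left frame first.
    try:
        res = x.iloc[:, 0] * 2 + x.iloc[:, 1] * 4
    except IndexError:
        # The name occurs in only one frame: just double that column.
        res = x.iloc[:, 0] * 2
    return res


df = pd.concat([left, right], axis=1)
combined = df.T.groupby(level=0).apply(lambda g: my_func(g.T)).T
print(combined)
```

As a hand check, the first row of column 0.5 is 2*11 + 4*2 = 30, and the first row of column 0.7, which exists only in right, is 2*4 = 8.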
Finally, if you want to do this consecutively, make use of reduce. Here I'll combine 5 DataFrames with the above function. (I'll just repeat the right frame 4 times for the example.)
from functools import reduce

def my_comb(df_l, df_r, func):
    """ Concatenate df_l and df_r along axis=1. Apply the
    specified function.
    """
    df = pd.concat([df_l, df_r], axis=1)
    # group the transposed frame, since groupby(..., axis=1) is deprecated in recent pandas
    return df.T.groupby(level=0).apply(lambda g: func(g.T)).T

reduce(lambda dfl, dfr: my_comb(dfl, dfr, func=my_func), [left, right, right, right, right])
# 0.0 0.5 0.7
#12.5 16 296 176
#14.0 32 212 176
#15.5 48 572 220
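For the boolean OR in the original question, the pairwise reduce is not strictly needed: OR is associative and commutative, so all frames can be concatenated once and grouped once. A minimal sketch, where third is a made-up extra frame added only for illustration:

```python
import pandas as pd

idx = [12.5, 14, 15.5]
left = pd.DataFrame({0: [True, True, False], 0.5: [False, True, True]}, index=idx)
right = pd.DataFrame({0.7: [True, False, False], 0.5: [True, False, True]}, index=idx)
# Hypothetical third frame, not part of the original question.
third = pd.DataFrame({0.7: [False, True, False], 1.0: [False, False, True]}, index=idx)

frames = [left, right, third]
df = pd.concat(frames, axis=1)
# Group the transposed frame by the former column names and OR each group.
combined = df.T.groupby(level=0).any().T
print(combined)
```

This shortcut does not carry over to my_func, whose result depends on the order and number of pairwise steps, so the reduce version remains the one to use there.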