I'm running into a strange issue where the combine_first method upcasts values stored as bool to float64. Example:
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame({"a": [True]})
In [3]: df2 = pd.DataFrame({"b": ['test']})
In [4]: df2.combine_first(df1)
Out[4]:
     a     b
0  1.0  test
This problem was already reported in a post from 3 years ago: pandas DataFrame combine_first and update methods have strange behavior. That issue was said to be solved, but I still see this behaviour under pandas 0.18.1.
Thank you for your help.
Somewhere along the chain of events that produces the combined dataframe, potential missing values have to be addressed. I'm aware that nothing is missing in your example. None and np.nan are neither int nor bool, so a column that has to hold a bool alongside a None or np.nan needs a common dtype, which means casting it to either object or float. As float, a large number of operations are far more efficient, so it is a decent choice. It obviously isn't the best choice all of the time, but a choice has to be made nonetheless, and pandas tries to infer the best one.
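To make the upcasting concrete, here is a small sketch of what pandas does when a missing value is introduced into an int or bool column. It uses reindex only as a convenient way to create a gap; it is an illustration of the general promotion rule, not of the internals of combine_first (which, in the question, ended up with float64 rather than object for the bool column). The variable names are made up for the example.

```python
import pandas as pd

ints = pd.Series([1, 2])
bools = pd.Series([True, False])

# Reindexing to a label that does not exist (2) introduces a missing value.
ints_with_gap = ints.reindex([0, 1, 2])
bools_with_gap = bools.reindex([0, 1, 2])

# Neither int nor bool can hold NaN, so pandas promotes the dtype:
# integers become float64, booleans become object.
print(ints.dtype, "->", ints_with_gap.dtype)
print(bools.dtype, "->", bools_with_gap.dtype)
```

Either way the original dtype is lost as soon as a gap might exist, which is why the dtypes have to be restored afterwards if you want them back.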
A workaround:
Setup
df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ['test']})
df3 = df2.combine_first(df1)
df3

Solution
dtypes = df1.dtypes.combine_first(df2.dtypes)
for k, v in dtypes.items():  # .iteritems() was removed in pandas 2.0; .items() works in old and new versions
    df3[k] = df3[k].astype(v)
df3
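If you need this in more than one place, the same idea can be wrapped in a function. This is only a sketch: combine_first_keep_dtypes is a hypothetical helper name, not a pandas API, and it assumes you want the calling frame's dtype to win when both frames share a column. It skips the cast for any column that really does contain missing values, since NaN cannot be stored in a bool or int column.

```python
import pandas as pd

def combine_first_keep_dtypes(left, right):
    """Hypothetical helper: combine_first, then restore original dtypes where safe."""
    combined = left.combine_first(right)
    # Original dtype per column; left's dtype is preferred for shared columns.
    dtypes = left.dtypes.combine_first(right.dtypes)
    for col, dtype in dtypes.items():
        # Only cast back when the column has no missing values;
        # otherwise leave whatever dtype pandas chose.
        if not combined[col].isnull().any():
            combined[col] = combined[col].astype(dtype)
    return combined

df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ['test']})
df3 = combine_first_keep_dtypes(df2, df1)
print(df3.dtypes)
```

With the frames from the question, column a comes back as bool instead of float64.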
