When you merge two indexed dataframes on certain values using an 'outer' merge, pandas automatically adds null (NaN) values to the fields it could not match on. This is normal behaviour, but it changes the data type of the affected columns, and you have to restate what data types the columns should have.
fillna() or dropna() do not seem to preserve data types immediately after the merge. Do I need a table structure in place?
Typically I would run numpy np.where(field.isnull() etc), but that means running it for all columns.
Is there a workaround to this?
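For reference, the behaviour described in the question can be reproduced with a small sketch like the one below. The frames and column names (left, right, key, n, flag) are made up for illustration and are not from the original post.

```python
import pandas as pd

# Two small frames that only partly overlap on 'key' (illustrative data)
left = pd.DataFrame({'key': [1, 2, 3], 'n': [10, 20, 30]})
right = pd.DataFrame({'key': [2, 3, 4], 'flag': [True, False, True]})

# Before the merge: 'n' is int64 and 'flag' is bool
print(left.dtypes['n'], right.dtypes['flag'])

# The outer merge has to insert NaN where a key is missing on one side
merged = left.merge(right, on='key', how='outer')

# 'n' is now float64 and 'flag' is object, because plain int64/bool
# columns cannot hold NaN
print(merged.dtypes)
```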
Answer. Yes. The order of the merged dataframes will affect the order of the rows and columns of the merged dataframe. When using the merge() method, it will preserve the order of the left keys.
Pandas. DataFrame doesn't preserve the column order when converting from a DataFrames.
Answer. Yes, by default, concatenating dataframes will preserve their row order. The order of the dataframes to concatenate will be the order of the result dataframe.
This should really only be an issue with bool or int dtypes. float, object and datetime64[ns] can already hold NaN or NaT without changing the type.
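A quick way to check this claim is to join a short frame onto a longer one and compare the dtypes before and after. This is only a sketch; the column names (f, o, t, i, b) are placeholders chosen here, one per dtype.

```python
import pandas as pd

df = pd.DataFrame({'a': range(4)})

# One column per dtype; only two rows, so the join below leaves gaps
df2 = pd.DataFrame({
    'f': [0.5, 1.5],                                    # float
    'o': pd.Series(['x', 'y'], dtype=object),           # object
    't': pd.to_datetime(['2020-01-01', '2020-01-02']),  # datetime
    'i': [1, 2],                                        # int
    'b': [True, False],                                 # bool
})

before = df2.dtypes
after = df.join(df2).dtypes[before.index]

# Side-by-side view of each column's dtype before and after the join
print(pd.DataFrame({'before': before, 'after': after}))
```

Only the int and bool columns should come back with a different dtype; the float, object and datetime columns keep theirs.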
Because of this, I'd recommend using the new nullable dtypes. You can use Int64 for your integer columns and 'boolean' for your Boolean columns. Both of these now support missing values with <NA> (pandas._libs.missing.NAType):
import pandas as pd
df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1, 2], 'e': [True, False]})
df2['d'] = df2['d'].astype('Int64')
df2['e'] = df2['e'].astype('boolean')
df2.dtypes
#d      Int64
#e    boolean
#dtype: object
df.join(df2)
#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>
df.join(df2).dtypes
#a      int64
#b      int64
#c      int64
#d      Int64  <- dtype preserved
#e    boolean  <- dtype preserved
With Int64/boolean, the fill value remains true to what you specify, and the column is only upcast if you fill with a value that cannot fit in the current dtype.
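As a small sketch of the first half of that statement: filling a nullable column with a value that fits keeps the nullable dtype, and once no <NA> is left you can cast to the plain NumPy dtype if you need it. The series names below are illustrative only.

```python
import pandas as pd

# Nullable columns with a missing value each
s_int = pd.Series([1, 2, None], dtype='Int64')
s_bool = pd.Series([True, None, False], dtype='boolean')

# Filling with a value that fits keeps the nullable dtype
filled_int = s_int.fillna(-1)
filled_bool = s_bool.fillna(False)
print(filled_int.dtype, filled_bool.dtype)

# With no <NA> left, a cast to the plain NumPy dtypes is possible
plain_int = filled_int.astype('int64')
plain_bool = filled_bool.astype('bool')
print(plain_int.dtype, plain_bool.dtype)
```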
I don't think there's any really elegant/efficient way to do it. You could do it by tracking the original datatypes and then casting the columns after the merge, like this:
import pandas as pd
# all types are originally ints
df = pd.DataFrame({'a': [1]*10, 'b': [1, 2] * 5, 'c': range(10)})
df2 = pd.DataFrame({'e': [1, 1], 'd': [1, 2]})
# track the original dtypes
orig = df.dtypes.to_dict()
orig.update(df2.dtypes.to_dict())
# join the dataframe
joined = df.join(df2, how='outer')
# columns with nans are now float dtype
print(joined.dtypes)
# replace nans with suitable int value
joined.fillna(-1, inplace=True)
# re-cast the columns as their original dtype
joined_orig_types = joined.apply(lambda x: x.astype(orig[x.name]))
print(joined_orig_types.dtypes)
As of pandas 1.0.0 I believe you have another option, which is to first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issues with NaN. This preserves the bool values as well unlike this answer.
...
df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})
df = df.convert_dtypes()
df2 = df2.convert_dtypes()
print(df.join(df2))
#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>
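To confirm that the dtypes really survive the join (and not just the printed values), something like the following self-contained check should work on pandas 1.0 or later. It reuses the same two frames as above.

```python
import pandas as pd

# Same frames as above, converted to the nullable dtypes up front
df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)}).convert_dtypes()
df2 = pd.DataFrame({'d': [1, 2], 'e': [True, False]}).convert_dtypes()

joined = df.join(df2)

# Every integer column stays Int64 and 'e' stays boolean
print(joined.dtypes)

# The unmatched rows are marked with <NA> rather than NaN
print(joined['d'].isna().sum())
```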
Or you can just do a concat/append on the dtypes of both dfs and apply astype():
joined = df.join(df2, how='outer').fillna(-1).astype(pd.concat([df.dtypes,df2.dtypes]))
#or, on pandas < 2.0 only (Series.append was removed in 2.0):
#joined = df.join(df2, how='outer').fillna(-1).astype(df.dtypes.append(df2.dtypes))
print(joined)
   a  b  c  e  d
0  1  1  0  1  1
1  1  2  1  1  2
2  1  1  2 -1 -1
3  1  2  3 -1 -1
4  1  1  4 -1 -1
5  1  2  5 -1 -1
6  1  1  6 -1 -1
7  1  2  7 -1 -1
8  1  1  8 -1 -1
9  1  2  9 -1 -1
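One caveat with a single fillna(-1): if one of the original columns is bool, casting -1 back to bool gives True, which is probably not what you want. A rough sketch of a variant that takes a fill value per column is below. The helper name join_keep_dtypes and its fill_values argument are made up here and are not part of pandas; it also assumes the two frames have no column names in common.

```python
import pandas as pd

def join_keep_dtypes(left, right, fill_values, how='outer'):
    """Hypothetical helper: join, fill per column, then restore dtypes.

    fill_values maps column name -> value used for the gaps. Any column
    that gains NaN but has no fill value will fail the cast back to int.
    """
    # Remember the dtypes of both frames before the join changes them
    dtypes = pd.concat([left.dtypes, right.dtypes]).to_dict()
    joined = left.join(right, how=how)
    # Fill each column with its own value, then cast back
    return joined.fillna(fill_values).astype(dtypes)

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1, 2], 'e': [True, False]})

# -1 marks a missing integer, False marks a missing flag
result = join_keep_dtypes(df, df2, fill_values={'d': -1, 'e': False})
print(result.dtypes)
```

Whether -1 and False are acceptable stand-ins depends on your data; if they are not, the nullable dtypes from the other answers avoid the need for a sentinel altogether.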