Preserve Dataframe column data type after outer merge

Tags:

python

pandas

dataframe

pandas-merge

When you merge two indexed dataframes on certain values using 'outer' merge, python/pandas automatically adds Null (NaN) values to the fields it could not match on. This is normal behaviour, but it changes the data type and you have to restate what data types the columns should have.

fillna() or dropna() do not seem to preserve data types immediately after the merge. Do I need a table structure in place?

Typically I would run numpy np.where(field.isnull() etc) but that means running for all columns.

Is there a workaround to this?

asked Apr 20 '16 12:04

Jeff

4 Answers

This should really only be an issue with bool or int dtypes. float, object and datetime64[ns] can already hold NaN or NaT without changing the type.
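To make that concrete, here is a minimal sketch that checks which columns actually change dtype after an outer join. The frames and column names (left, right, i, b, f, t) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical frames: 'right' is one row shorter, so the join must add missing values.
left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({
    "i": [1, 2],                                        # int
    "b": [True, False],                                 # bool
    "f": [0.5, 1.5],                                    # float
    "t": pd.to_datetime(["2020-01-01", "2020-01-02"]),  # datetime
})

joined = left.join(right, how="outer")

# Compare each column's dtype before and after the join.
before = right.dtypes
after = joined.dtypes[right.columns]
changed = [col for col in right.columns if before[col] != after[col]]

print(pd.DataFrame({"before": before, "after": after}))
print(changed)  # only the int and bool columns are affected
```

The int column is upcast to float64 and the bool column to object, while the float and datetime columns keep their dtype because they can represent NaN/NaT natively.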

Because of this, I'd recommend using the new nullable dtypes. You can use Int64 for your integer columns and 'boolean' for your Boolean columns. Both support missing values via pd.NA, which displays as <NA> (type pandas._libs.missing.NAType).

import pandas as pd

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1, 2], 'e': [True, False]})

df2['d'] = df2['d'].astype('Int64')
df2['e'] = df2['e'].astype('boolean')
df2.dtypes
#d      Int64
#e    boolean
#dtype: object

df.join(df2)
#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>

df.join(df2).dtypes
#a      int64
#b      int64
#c      int64
#d      Int64    <- dtype preserved
#e    boolean    <- dtype preserved

With Int64/boolean, the fill value stays true to what you specify, and the column is only upcast if you fill with a value that cannot fit in the current dtype.
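As a rough illustration of the fill behaviour, the sketch below fills the missing values per column and checks that the nullable dtypes survive. The fill values (0 and False) are arbitrary choices for the example, not something the question prescribes.

```python
import pandas as pd

df = pd.DataFrame({"a": [1] * 4})
df2 = pd.DataFrame({"d": [1, 2], "e": [True, False]}).astype(
    {"d": "Int64", "e": "boolean"}
)

joined = df.join(df2)

# Fill each column with a value that fits its dtype (values chosen for illustration).
filled = joined.fillna({"d": 0, "e": False})
print(filled.dtypes)

# Once nothing is missing, the columns can go back to plain NumPy dtypes if needed.
plain = filled.astype({"d": "int64", "e": "bool"})
print(plain.dtypes)
```

The final astype is optional; it only matters if downstream code expects int64/bool rather than the nullable Int64/boolean.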

answered Oct 21 '22 07:10

ALollz

I don't think there's any really elegant/efficient way to do it. You could do it by tracking the original datatypes and then casting the columns after the merge, like this:

import pandas as pd

# all types are originally ints
df = pd.DataFrame({'a': [1]*10, 'b': [1, 2] * 5, 'c': range(10)})
df2 = pd.DataFrame({'e': [1, 1], 'd': [1, 2]})

# track the original dtypes
orig = df.dtypes.to_dict()
orig.update(df2.dtypes.to_dict())

# join the dataframe
joined = df.join(df2, how='outer')

# columns with nans are now float dtype
print(joined.dtypes)

# replace nans with suitable int value
joined.fillna(-1, inplace=True)

# re-cast the columns as their original dtype
joined_orig_types = joined.apply(lambda x: x.astype(orig[x.name]))

print(joined_orig_types.dtypes)
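If you do this in several places, the same steps can be wrapped in a small function. This is only a sketch: join_keep_dtypes is a hypothetical helper name, and it assumes the two frames have no overlapping column names and that a single fill value is acceptable for every column.

```python
import pandas as pd

def join_keep_dtypes(left, right, fill_value, **join_kwargs):
    """Hypothetical helper: join, fill the gaps, then restore the original dtypes.

    Assumes left and right share no column names, so no suffixes are added.
    """
    # Record the dtypes of both inputs before the join changes them.
    orig = {**left.dtypes.to_dict(), **right.dtypes.to_dict()}
    joined = left.join(right, **join_kwargs).fillna(fill_value)
    # astype accepts a column -> dtype mapping, so no per-column apply is needed.
    return joined.astype(orig)

df = pd.DataFrame({'a': [1]*10, 'b': [1, 2] * 5, 'c': range(10)})
df2 = pd.DataFrame({'e': [1, 1], 'd': [1, 2]})

result = join_keep_dtypes(df, df2, fill_value=-1, how='outer')
print(result.dtypes)
```

Passing the dict straight to astype does the same job as the apply/lambda above in a single call.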

answered Oct 21 '22 07:10

hume

As of pandas 1.0.0, I believe you have another option: call convert_dtypes first. This converts the DataFrame columns to dtypes that support pd.NA, avoiding the issues with NaN. This preserves the bool values as well, unlike this answer.

...

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})
df = df.convert_dtypes()
df2 = df2.convert_dtypes()
print(df.join(df2))

#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>
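The same idea should carry over to an outer merge on a key column, which is closer to what the question describes. A minimal sketch, with made-up column names (key, x, flag):

```python
import pandas as pd

# Hypothetical frames that only partially overlap on 'key'.
left = pd.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]}).convert_dtypes()
right = pd.DataFrame({"key": [2, 3, 4], "flag": [True, False, True]}).convert_dtypes()

merged = left.merge(right, on="key", how="outer")

# Unmatched rows hold <NA>, and the columns keep their nullable dtypes.
print(merged)
print(merged.dtypes)
```

Here key 1 has no flag and key 4 has no x, yet x stays Int64 and flag stays boolean instead of becoming float64 and object.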

answered Oct 21 '22 07:10

totalhack

Or you can just do a concat/append on the dtypes of both dfs and apply astype():

joined = df.join(df2, how='outer').fillna(-1).astype(pd.concat([df.dtypes,df2.dtypes]))
#or, on pandas < 2.0 only (Series.append was removed in 2.0):
#joined = df.join(df2, how='outer').fillna(-1).astype(df.dtypes.append(df2.dtypes))
print(joined)

   a  b  c  e  d
0  1  1  0  1  1
1  1  2  1  1  2
2  1  1  2 -1 -1
3  1  2  3 -1 -1
4  1  1  4 -1 -1
5  1  2  5 -1 -1
6  1  1  6 -1 -1
7  1  2  7 -1 -1
8  1  1  8 -1 -1
9  1  2  9 -1 -1
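One thing to watch with a single fill value: if one of the columns is bool, filling with -1 and casting back turns every gap into True, because -1 is truthy. The sketch below contrasts that with per-column fill values; the frames and the choice of False as the bool fill are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1] * 4})
df2 = pd.DataFrame({"d": [1, 2], "e": [True, False]})
dtypes = pd.concat([df.dtypes, df2.dtypes])

joined = df.join(df2, how="outer")

# One fill value for everything: the gaps in the bool column become True.
naive = joined.fillna(-1).astype(dtypes)
print(naive["e"].tolist())    # [True, False, True, True]

# A fill value per column keeps the bool column meaningful.
per_col = joined.fillna({"d": -1, "e": False}).astype(dtypes)
print(per_col["e"].tolist())  # [True, False, False, False]
```

Depending on the pandas version, fillna on the object column may emit a FutureWarning about downcasting; the resulting values are the same either way.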

answered Oct 21 '22 06:10

anky
