
How do I prevent pandas from converting my integers to floats when I merge two DataFrames?

Tags: python, pandas

Here is my code:

    import pandas as pd

    left = pd.DataFrame({'AID': [1, 2, 3, 4],
                         'D': [2011, 2011, 0, 2011],
                         'R1': [0, 1, 0, 0],
                         'R2': [1, 0, 0, 0]})

    right = pd.DataFrame({'AID': [1, 2, 3, 4],
                          'D': [2012, 0, 0, 2012],
                          'R1': [0, 1, 0, 0],
                          'R2': [1, 0, 0, 0]})

    result = left.merge(right, how='outer')

When I print my result DataFrame, the integer values are now floats:

       AID       D   R1   R2
    0  1.0  2011.0  0.0  1.0
    1  2.0  2011.0  1.0  0.0
    2  3.0     0.0  0.0  0.0
    3  4.0  2011.0  0.0  0.0
    4  1.0  2012.0  0.0  1.0
    5  2.0     0.0  1.0  0.0
    6  4.0  2012.0  0.0  0.0

How do I prevent this?


asked Jul 18 '16 19:07

Rakesh Adhikesavan

2 Answers

This bug was fixed in pandas v0.19.0:

Merging will now preserve the dtype of the join keys

Note that you can also convert all columns in a DataFrame to int dtype with:

result = result.astype(int)

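This behaviour still occurs if there are unmatched records in the join, and hence NaNs in the result: a NumPy integer column cannot hold NaN, so pandas upcasts it to float. In this case, change the dtype to the nullable extension type 'Int64' (capital "I", available since pandas 0.24) to handle the NaNs: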

result = result.astype('Int64')
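To make the difference between the two casts concrete, here is a minimal sketch. The two small frames are made up for illustration (they are not the question's data) and are chosen so that the outer join leaves unmatched keys on both sides:

```python
import pandas as pd

# Hypothetical frames: AID 1 exists only on the left, AID 4 only on the right.
left = pd.DataFrame({"AID": [1, 2, 3], "R1": [0, 1, 0]})
right = pd.DataFrame({"AID": [2, 3, 4], "R2": [1, 0, 1]})

merged = left.merge(right, on="AID", how="outer")
# R1 and R2 now contain NaN for the unmatched keys, so they are float64.

# A plain int cast cannot represent NaN and raises a ValueError.
try:
    merged.astype(int)
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable extension dtype keeps integers and stores the gaps as <NA>.
nullable = merged.astype("Int64")
print(nullable.dtypes)
```

So `astype(int)` is only an option when the merged result has no missing values; otherwise `'Int64'` is the cast that goes through.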


answered Oct 08 '22 09:10

iacob

You can cast the floats back to integers using the nullable 'Int64' dtype, one column at a time:

    result = left.merge(right, on='AID', how='outer')
    result['D_x'] = result['D_x'].astype('Int64')
    result['R1_x'] = result['R1_x'].astype('Int64')
    result['R2_x'] = result['R2_x'].astype('Int64')
    result['D_y'] = result['D_y'].astype('Int64')
    result['R1_y'] = result['R1_y'].astype('Int64')
    result['R2_y'] = result['R2_y'].astype('Int64')
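If there are many columns, the same cast can be written without naming each one. This is only a sketch under the assumption that every float64 column in the merged frame is an upcast integer column. The frames below are illustrative, not the question's data:

```python
import pandas as pd

# Hypothetical frames with partially overlapping keys.
left = pd.DataFrame({"AID": [1, 2, 3], "D": [2011, 2011, 0]})
right = pd.DataFrame({"AID": [2, 3, 4], "D": [0, 2012, 2012]})

# Overlapping non-key columns get the default suffixes: D_x and D_y.
result = left.merge(right, on="AID", how="outer")

# Pick out the columns that were upcast to float64 by the merge...
float_cols = result.select_dtypes(include="float64").columns

# ...and cast only those to the nullable integer dtype.
result = result.astype({col: "Int64" for col in float_cols})
print(result.dtypes)
```

Columns that hold genuine floats would be caught by this selection too, so check the list in `float_cols` before casting real data.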

If the data has null or missing values, you can instead convert each value to int and replace anything missing with 0:

    import math
    import numbers

    import pandas as pd

    left = pd.DataFrame({'AID': [1, 2, 3, 4],
                         'D': [2011, 2011, 0, 2011],
                         'R1': [0, 1, 0, 0],
                         'R2': [1, 0, 0, 0]})

    right = pd.DataFrame({'AID': [1, 2, 3, 4],
                          'D': [2012, 0, 0, 2012],
                          'R1': [0, 1, 0, 0],
                          'R2': [1, 0, 0, 0]})

    result = left.merge(right, how='outer')

    # Convert every value to int; non-numeric or NaN values become 0.
    for col in ['AID', 'D', 'R1', 'R2']:
        result[col] = [int(val) if isinstance(val, numbers.Number) and not math.isnan(val) else 0
                       for val in result[col]]

    print(result)
    print(result.isna())

Output:

       AID     D  R1  R2
    0    1  2011   0   1
    1    2  2011   1   0
    2    3     0   0   0
    3    4  2011   0   0
    4    1  2012   0   1
    5    2     0   1   0
    6    4  2012   0   0

         AID      D     R1     R2
    0  False  False  False  False
    1  False  False  False  False
    2  False  False  False  False
    3  False  False  False  False
    4  False  False  False  False
    5  False  False  False  False
    6  False  False  False  False

You can then replace the missing values with the mean, 0, or an interpolated value.
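As a rough illustration of the first two options, here is a sketch on a made-up column (the values are hypothetical, chosen so the mean is a whole number):

```python
import numpy as np
import pandas as pd

# Hypothetical float column with two missing entries.
d = pd.Series([2010.0, np.nan, 2012.0, np.nan, 2014.0], name="D")

# Option 1: replace missing values with 0, then cast to int.
filled_zero = d.fillna(0).astype(int)

# Option 2: replace missing values with the column mean
# (mean() skips NaN by default; here it is exactly 2012).
filled_mean = d.fillna(d.mean()).astype(int)

print(filled_zero.tolist())  # [2010, 0, 2012, 0, 2014]
print(filled_mean.tolist())  # [2010, 2012, 2012, 2012, 2014]
```

Once the NaNs are gone, the plain `astype(int)` cast works. With a non-integer mean, `astype(int)` would truncate, so rounding first may be preferable.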

Fixing column D by interpolating over the zeros:

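    import numpy as np
    from scipy.interpolate import interp1d

    def interpolate_list(y):
        # Treat zeros as missing and linearly interpolate them from the nonzero neighbours.
        # interp1d raises ValueError if the first or last value is zero,
        # because that position lies outside the known range.
        idx = np.nonzero(y)
        x = np.arange(len(y))
        interp = interp1d(x[idx], y[idx])
        return interp(x)

    interp_d = interpolate_list(np.array(result['D']))
    # Keep the original nonzero values; use the interpolated value only where D was 0.
    result['D'] = [new if old == 0 else old for new, old in zip(interp_d, result['D'])]
    print(result)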

Output:

       AID       D  R1  R2
    0    1  2011.0   0   1
    1    2  2011.0   1   0
    2    3  2011.0   0   0
    3    4  2011.0   0   0
    4    1  2012.0   0   1
    5    2  2012.0   1   0
    6    4  2012.0   0   0
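Note that D comes out as float again after interpolation. One possible way to get integers back, sketched here with pandas' own `Series.interpolate` instead of SciPy, is to mask the zeros, interpolate, and then cast. Treating 0 as "missing" is an assumption carried over from the code above, and the series simply repeats the D values from the merged frame:

```python
import pandas as pd

# D column from the merged result above; 0 is assumed to mean "missing".
d = pd.Series([2011, 2011, 0, 2011, 2012, 0, 2012], name="D")

# where() keeps the nonzero values and turns zeros into NaN (upcasting to float).
masked = d.where(d != 0)

# Linear interpolation fills each gap from its neighbours.
interpolated = masked.interpolate(method="linear")

# Round before casting so a value such as 2011.5 is not silently truncated.
d_fixed = interpolated.round().astype(int)
print(d_fixed.tolist())  # [2011, 2011, 2011, 2011, 2012, 2012, 2012]
```

Unlike `interp1d`, this does not raise when the last value is missing (it is forward-filled by default), but a missing first value stays NaN and the final cast would then fail.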

answered Oct 08 '22 07:10

Golden Lion
