Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to prevent Pandas from converting my integers to floats when I merge two dataFrames?

Tags:

python

pandas

Here is my code:

import pandas as pd left = pd.DataFrame({'AID': [1, 2, 3, 4],                        'D': [2011, 2011,0, 2011],                        'R1': [0, 1, 0, 0],                        'R2': [1, 0, 0, 0]  })  right = pd.DataFrame({'AID': [1, 2, 3, 4],                        'D': [2012, 0,0, 2012],                        'R1': [0, 1, 0, 0],                        'R2': [1, 0, 0, 0]  })  result = left.merge(right, how = 'outer') 

When I print my result dataFrame, the integer values are now floats:

   AID       D   R1   R2 0  1.0  2011.0  0.0  1.0 1  2.0  2011.0  1.0  0.0 2  3.0     0.0  0.0  0.0 3  4.0  2011.0  0.0  0.0 4  1.0  2012.0  0.0  1.0 5  2.0     0.0  1.0  0.0 6  4.0  2012.0  0.0  0.0 

How do I prevent this?

like image 892
Rakesh Adhikesavan Avatar asked Jul 18 '16 19:07

Rakesh Adhikesavan


People also ask

How avoid duplicates in pandas merge?

Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.

How do I merge two data frames in pandas?

The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.

How do I merge two Dataframes in pandas with common column?

To merge two Pandas DataFrame with common column, use the merge() function and set the ON parameter as the column name.


2 Answers

This bug was fixed in pandas v0.19.0.:

Merging will now preserve the dtype of the join keys

but note you can convert all columns in a dataframe to int dtype with:

result = result.astype(int) 

This behaviour does still occur if there are unmatched records in the join, and hence NaNs in the results. In this case, you need to change the dtype to the extension type 'Int64' to handle the NaNs:

result = result.astype('Int64') 
like image 100
iacob Avatar answered Oct 08 '22 09:10

iacob


you can cast the float back to an using

 result = left.merge(right, on='AID', how = 'outer')   result['D_x']=result['D_x'].astype('Int64')  result['R1_x']=result['R1_x'].astype('Int64')  result['R2_x']=result['R2_x'].astype('Int64')  result['D_y']=result['D_y'].astype('Int64')  result['R1_y']=result['R1_y'].astype('Int64')  result['R2_y']=result['R2_y'].astype('Int64') 

if the data has null or missing data

import numbers import math  left = pd.DataFrame({'AID': [1, 2, 3, 4],                    'D': [2011, 2011,0, 2011],                    'R1': [0, 1, 0, 0],                    'R2': [1, 0, 0, 0]  })  right = pd.DataFrame({'AID': [1, 2, 3, 4],                    'D': [2012, 0,0, 2012],                    'R1': [0, 1, 0, 0],                    'R2': [1, 0, 0, 0]  })  result = left.merge(right, how = 'outer') result['AID']=[int(val) if isinstance(val,numbers.Number) & (math.isnan(val)==False) else 0 for val in result['AID']] result['D']=[int(val) if isinstance(val,numbers.Number) & (math.isnan(val)==False) else 0 for val in result['D']] result['R1']=[int(val) if isinstance(val,numbers.Number) & (math.isnan(val)==False) else 0 for val in result['R1']] result['R2']=[int(val) if isinstance(val,numbers.Number) & (math.isnan(val)==False) else 0 for val in result['R2']]   print(result)  print(result.isna()) 

Output

AID     D  R1  R2 0    1  2011   0   1 1    2  2011   1   0 2    3     0   0   0 3    4  2011   0   0 4    1  2012   0   1 5    2     0   1   0 6    4  2012   0   0     AID      D     R1     R2 0  False  False  False  False 1  False  False  False  False 2  False  False  False  False 3  False  False  False  False 4  False  False  False  False 5  False  False  False  False 6  False  False  False  False     

you can then replace the nan values with either: mean, 0, or interpolation value

Fixing Column D

def interpolate_list(y):     idx = np.nonzero(y)     x = np.arange(len(y))     interp = interp1d(x[idx],y[idx])    new_values = interp(x)    return new_values  interp_d=interpolate_list(np.array(result['D'])) data=list(zip(interp_d,result['D'])) result['D']=[item[0] if item[1]==0 else item[1] for item in data] print(result) 

Output

   AID       D  R1  R2 0    1  2011.0   0   1 1    2  2011.0   1   0 2    3  2011.0   0   0 3    4  2011.0   0   0 4    1  2012.0   0   1 5    2  2012.0   1   0 6    4  2012.0   0   0 
like image 45
Golden Lion Avatar answered Oct 08 '22 07:10

Golden Lion