Here is my code:
import pandas as pd left = pd.DataFrame({'AID': [1, 2, 3, 4], 'D': [2011, 2011,0, 2011], 'R1': [0, 1, 0, 0], 'R2': [1, 0, 0, 0] }) right = pd.DataFrame({'AID': [1, 2, 3, 4], 'D': [2012, 0,0, 2012], 'R1': [0, 1, 0, 0], 'R2': [1, 0, 0, 0] }) result = left.merge(right, how = 'outer')
When I print my result dataFrame, the integer values are now floats:
AID D R1 R2 0 1.0 2011.0 0.0 1.0 1 2.0 2011.0 1.0 0.0 2 3.0 0.0 0.0 0.0 3 4.0 2011.0 0.0 0.0 4 1.0 2012.0 0.0 1.0 5 2.0 0.0 1.0 0.0 6 4.0 2012.0 0.0 0.0
How do I prevent this?
Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.
The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.
To merge two Pandas DataFrame with common column, use the merge() function and set the ON parameter as the column name.
This bug was fixed in pandas v0.19.0.:
Merging will now preserve the dtype of the join keys
but note you can convert all columns in a dataframe to int
dtype with:
result = result.astype(int)
This behaviour does still occur if there are unmatched records in the join, and hence NaN
s in the results. In this case, you need to change the dtype to the extension type 'Int64'
to handle the NaN
s:
result = result.astype('Int64')
you can cast the float back to an using
result = left.merge(right, on='AID', how = 'outer') result['D_x']=result['D_x'].astype('Int64') result['R1_x']=result['R1_x'].astype('Int64') result['R2_x']=result['R2_x'].astype('Int64') result['D_y']=result['D_y'].astype('Int64') result['R1_y']=result['R1_y'].astype('Int64') result['R2_y']=result['R2_y'].astype('Int64')
if the data has null or missing data
import numbers import math left = pd.DataFrame({'AID': [1, 2, 3, 4], 'D': [2011, 2011,0, 2011], 'R1': [0, 1, 0, 0], 'R2': [1, 0, 0, 0] }) right = pd.DataFrame({'AID': [1, 2, 3, 4], 'D': [2012, 0,0, 2012], 'R1': [0, 1, 0, 0], 'R2': [1, 0, 0, 0] }) result = left.merge(right, how = 'outer') result['AID']=[int(val) if isinstance(val,numbers.Number) & (math.isnan(val)==False) else 0 for val in result['AID']] result['D']=[int(val) if isinstance(val,numbers.Number) & (math.isnan(val)==False) else 0 for val in result['D']] result['R1']=[int(val) if isinstance(val,numbers.Number) & (math.isnan(val)==False) else 0 for val in result['R1']] result['R2']=[int(val) if isinstance(val,numbers.Number) & (math.isnan(val)==False) else 0 for val in result['R2']] print(result) print(result.isna())
Output
AID D R1 R2 0 1 2011 0 1 1 2 2011 1 0 2 3 0 0 0 3 4 2011 0 0 4 1 2012 0 1 5 2 0 1 0 6 4 2012 0 0 AID D R1 R2 0 False False False False 1 False False False False 2 False False False False 3 False False False False 4 False False False False 5 False False False False 6 False False False False
you can then replace the nan values with either: mean, 0, or interpolation value
Fixing Column D
def interpolate_list(y): idx = np.nonzero(y) x = np.arange(len(y)) interp = interp1d(x[idx],y[idx]) new_values = interp(x) return new_values interp_d=interpolate_list(np.array(result['D'])) data=list(zip(interp_d,result['D'])) result['D']=[item[0] if item[1]==0 else item[1] for item in data] print(result)
Output
AID D R1 R2 0 1 2011.0 0 1 1 2 2011.0 1 0 2 3 2011.0 0 0 3 4 2011.0 0 0 4 1 2012.0 0 1 5 2 2012.0 1 0 6 4 2012.0 0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With