I am curious why a simple concatenation of two data frames in pandas:
    shape: (66441, 1)
    dtypes: prediction    int64
    dtype: object
    isnull().sum(): prediction    0
    dtype: int64

    shape: (66441, 1)
    CUSTOMER_ID    int64
    dtype: object
    isnull().sum() CUSTOMER_ID    0
    dtype: int64
of the same shape and both without NaN values
    foo = pd.concat([initId, ypred], join='outer', axis=1)
    print(foo.shape)
    print(foo.isnull().sum())
can result in a lot of NaN values if joined.
    (83384, 2)
    CUSTOMER_ID    16943
    prediction     16943
Trying to reproduce it like
    import pandas as pd

    aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
    print(aaa)
    bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
    print(bbb)
    pd.concat([aaa, bbb], axis=1)

failed, i.e. it worked just fine and no NaN values were introduced.
In applied data science, you will usually have missing data. For example, an industrial application with sensors will have sensor data that is missing on certain days. You have a couple of alternatives to work with missing data.
If you want to treat a value as missing, you can use the replace() method to replace it with `float('nan')`, `np.nan`, or `math.nan`.
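As a minimal sketch of that replacement, assuming a hypothetical sensor column in which -999 stands in for a missing reading (the column name and the sentinel are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; -999 is an assumed "no measurement" marker.
readings = pd.DataFrame({"temperature": [21.5, -999, 22.1, -999]})

# Swap the sentinel for NaN so pandas treats those cells as missing.
cleaned = readings.replace(-999, np.nan)

# Count how many cells are now recognised as missing.
missing_count = int(cleaned["temperature"].isna().sum())
print(cleaned)
print(missing_count)
```

Once the sentinel is NaN, methods such as isnull() and dropna() pick those cells up like any other missing value.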
Deleting the row with missing data: if a row has missing data, you can delete the entire row, together with all the features in that row. With dropna(), axis=0 drops the rows that contain `NaN` values and axis=1 drops the columns that contain `NaN` values.
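A small sketch of the two axis settings, using a made-up two-column frame with a single missing cell:

```python
import numpy as np
import pandas as pd

# Illustrative data: only column "a", row 1 is missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# axis=0 removes every row that contains at least one NaN.
rows_dropped = df.dropna(axis=0)

# axis=1 removes every column that contains at least one NaN.
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

Dropping rows keeps both features but loses one observation; dropping columns keeps all observations but loses the whole feature "a".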
You can concatenate two or more text/string columns of a pandas DataFrame simply by using the + operator. Note that when you apply the + operator to numeric columns, it performs addition instead of concatenation.
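To illustrate the difference, here is a short sketch with invented column names; the string columns are joined element by element, while the numeric columns are summed:

```python
import pandas as pd

# Illustrative frame with two string columns and two numeric columns.
people = pd.DataFrame({
    "first": ["Ada", "Alan"],
    "last": ["Lovelace", "Turing"],
    "x": [1, 2],
    "y": [10, 20],
})

# + on string columns concatenates element-wise.
full_name = people["first"] + " " + people["last"]

# + on numeric columns adds element-wise.
total = people["x"] + people["y"]

print(full_name.tolist())
print(total.tolist())
```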
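I think the problem is that the two DataFrames have different index values. concat with axis=1 aligns rows on the index, so wherever a label exists in only one DataFrame, the other column gets NaN: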
    import pandas as pd

    aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
    print(aaa)
        prediction
    4            0
    5            1
    8            0
    7            1
    10           0
    12           0

    bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
    print(bbb)
       groundTruth
    0            0
    1            0
    2            1
    3            0
    4            1
    5            1

    print (pd.concat([aaa, bbb], axis=1))
        prediction  groundTruth
    0          NaN          0.0
    1          NaN          0.0
    2          NaN          1.0
    3          NaN          0.0
    4          0.0          1.0
    5          1.0          1.0
    7          1.0          NaN
    8          0.0          NaN
    10         0.0          NaN
    12         0.0          NaN
The solution is reset_index, if the index values are not needed:
    aaa.reset_index(drop=True, inplace=True)
    bbb.reset_index(drop=True, inplace=True)

    print(aaa)
       prediction
    0           0
    1           1
    2           0
    3           1
    4           0
    5           0

    print(bbb)
       groundTruth
    0            0
    1            0
    2            1
    3            0
    4            1
    5            1

    print (pd.concat([aaa, bbb], axis=1))
       prediction  groundTruth
    0           0            0
    1           1            0
    2           0            1
    3           1            0
    4           0            1
    5           0            1
EDIT: If you need to keep the same index as aaa and both DataFrames have the same length, use:
    bbb.index = aaa.index
    print (pd.concat([aaa, bbb], axis=1))
        prediction  groundTruth
    4            0            0
    5            1            0
    8            0            1
    7            1            0
    10           0            1
    12           0            1
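To confirm that mismatched indexes are the cause before changing anything, it can help to compare the two indexes directly. The sketch below reuses the toy aaa and bbb from this answer; the variable names are only illustrative. Assuming unique index labels, the numbers in the question fit this explanation: 83384 = 66441 + 16943, which is what you would see if each DataFrame had 16943 labels that do not occur in the other.

```python
import pandas as pd

aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=["prediction"],
                   index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=["groundTruth"])

# True only if both indexes hold the same labels in the same order.
same_index = aaa.index.equals(bbb.index)

# Labels present in one frame but not the other; these rows get NaN.
only_in_aaa = aaa.index.difference(bbb.index)
only_in_bbb = bbb.index.difference(aaa.index)

# An outer concat along axis=1 has one row per label in the union.
union_size = len(aaa.index.union(bbb.index))

# Positional concat that leaves the original frames unmodified.
combined = pd.concat(
    [aaa.reset_index(drop=True), bbb.reset_index(drop=True)], axis=1
)

print(same_index, list(only_in_aaa), list(only_in_bbb), union_size)
print(combined)
```

Using reset_index(drop=True) without inplace=True returns new objects, so aaa and bbb keep their original indexes if you still need them later.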