 

pandas concat generates nan values

I am curious why a simple concatenation of two data frames in pandas:

    shape: (66441, 1)
    dtypes: prediction    int64
    dtype: object
    isnull().sum(): prediction    0
    dtype: int64

    shape: (66441, 1)
    CUSTOMER_ID    int64
    dtype: object
    isnull().sum() CUSTOMER_ID    0
    dtype: int64

of the same shape and both without NaN values

    foo = pd.concat([initId, ypred], join='outer', axis=1)
    print(foo.shape)
    print(foo.isnull().sum())

can result in a lot of NaN values if joined.

    (83384, 2)
    CUSTOMER_ID    16943
    prediction     16943

How can I fix this problem and prevent NaN values being introduced?

Trying to reproduce it like

    import pandas as pd

    aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
    print(aaa)
    bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
    print(bbb)
    pd.concat([aaa, bbb], axis=1)

failed, i.e. it worked just fine and no NaN values were introduced.

Georg Heiler asked Oct 31 '16 09:10





1 Answer

I think the problem is that the index values differ, so wherever concat cannot align the rows you get NaN:

    aaa = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
    print(aaa)
        prediction
    4            0
    5            1
    8            0
    7            1
    10           0
    12           0

    bbb = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
    print(bbb)
       groundTruth
    0            0
    1            0
    2            1
    3            0
    4            1
    5            1

    print (pd.concat([aaa, bbb], axis=1))
        prediction  groundTruth
    0          NaN          0.0
    1          NaN          0.0
    2          NaN          1.0
    3          NaN          0.0
    4          0.0          1.0
    5          1.0          1.0
    7          1.0          NaN
    8          0.0          NaN
    10         0.0          NaN
    12         0.0          NaN
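To check whether this is really what happens with your data, you can compare the two indexes before concatenating. The following is only a sketch on the toy frames from above (they stand in for your ypred and initId, which I cannot see). If both indexes are unique, the number of rows after an outer concat is the size of the index union, and the NaN count in each column is the number of labels that exist only in the other frame. That would fit the numbers in the question: 66441 + 16943 = 83384.

```python
import pandas as pd

# Toy frames standing in for the question's data (assumption: unique indexes).
aaa = pd.DataFrame({'prediction': [0, 1, 0, 1, 0, 0]}, index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame({'groundTruth': [0, 0, 1, 0, 1, 1]})

# True only if both indexes hold the same labels in the same order.
same_index = aaa.index.equals(bbb.index)

# Labels present in one frame but not in the other.
only_in_aaa = aaa.index.difference(bbb.index)
only_in_bbb = bbb.index.difference(aaa.index)

# An outer concat along axis=1 produces one row per label in the union.
union_size = len(aaa.index.union(bbb.index))

joined = pd.concat([aaa, bbb], axis=1)
nan_counts = joined.isnull().sum()

print(same_index)
print(len(only_in_aaa), len(only_in_bbb), union_size)
print(joined.shape)
print(nan_counts)
```

If same_index is False and the two difference indexes are not empty, the NaN values come from index alignment and not from the data itself.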

The solution is reset_index, if the index values are not needed:

    aaa.reset_index(drop=True, inplace=True)
    bbb.reset_index(drop=True, inplace=True)

    print(aaa)
       prediction
    0           0
    1           1
    2           0
    3           1
    4           0
    5           0

    print(bbb)
       groundTruth
    0            0
    1            0
    2            1
    3            0
    4            1
    5            1

    print (pd.concat([aaa, bbb], axis=1))
       prediction  groundTruth
    0           0            0
    1           1            0
    2           0            1
    3           1            0
    4           0            1
    5           0            1
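If you would rather not modify the original frames, the same idea also works without inplace=True, because reset_index returns a new DataFrame by default. A minimal sketch, again on the toy data (the variable name combined is only for illustration):

```python
import pandas as pd

aaa = pd.DataFrame({'prediction': [0, 1, 0, 1, 0, 0]}, index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame({'groundTruth': [0, 0, 1, 0, 1, 1]})

# reset_index(drop=True) returns a copy with a fresh 0..n-1 index,
# so both inputs line up row by row and aaa / bbb stay untouched.
combined = pd.concat(
    [aaa.reset_index(drop=True), bbb.reset_index(drop=True)],
    axis=1,
)

print(combined)
print(combined.isnull().sum())
```

Note that this pairs rows purely by position, so it is only correct if row i of one frame really belongs to row i of the other.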

EDIT: If you need the same index as aaa and both DataFrames have the same length, use:

    bbb.index = aaa.index
    print (pd.concat([aaa, bbb], axis=1))
        prediction  groundTruth
    4            0            0
    5            1            0
    8            0            1
    7            1            0
    10           0            1
    12           0            1
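A possible variant, if bbb should keep its own index: attach its values to aaa by position with assign and to_numpy(), which bypasses index alignment. This is a sketch assuming both frames have the same length and the same row order; the length check and the name combined are my additions.

```python
import pandas as pd

aaa = pd.DataFrame({'prediction': [0, 1, 0, 1, 0, 0]}, index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame({'groundTruth': [0, 0, 1, 0, 1, 1]})

# Positional assignment is only meaningful for frames of equal length.
if len(aaa) != len(bbb):
    raise ValueError("frames must have the same number of rows")

# to_numpy() strips the index, so the values are assigned row by row
# and the result keeps the index of aaa; bbb itself is not changed.
combined = aaa.assign(groundTruth=bbb['groundTruth'].to_numpy())

print(combined)
```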
jezrael answered Sep 19 '22 04:09
