 

Replace NaN in a dataframe with random values

Tags: python, pandas

I have a DataFrame (data_train) with NaN values. A sample is given below:

republican                n                          y   
republican                n                          NaN   
democrat                 NaN                         n
democrat                  n                          y   

I want to replace every NaN with a random value, like this:

republican                n                           y   
republican                n                          rnd2
democrat                 rnd1                         n
democrat                  n                           y   

How do I do it?

I tried the following, but had no luck:

df_rand = pd.DataFrame(np.random.randn(data_train.shape[0], data_train.shape[1]))
data_train[pd.isnull(data_train)] = df_rand[pd.isnull(data_train)]

When I run the above on a DataFrame of random numerical data, it works fine.

asked Jun 04 '15 14:06 by Sam



2 Answers

You can use the pandas DataFrame.update method, like this:

1) Generate a random DataFrame with the same columns and index as the original one:

import numpy as np
import pandas as pd
M = len(df.index)
N = len(df.columns)
ran = pd.DataFrame(np.random.randn(M,N), columns=df.columns, index=df.index)

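2) Then use update with overwrite=False, so that only the NaN values in df are replaced by the generated random values (by default, update overwrites every cell for which the other frame holds a non-NaN value):

df.update(ran, overwrite=False)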
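
A self-contained sketch of the update approach on a small hypothetical numeric frame (the column names and values are made up for illustration). It passes overwrite=False, which restricts update to cells that are NaN in the original frame, so values that were already present are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with a few gaps.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, np.nan],
})

# Random frame with the same shape, columns and index as df.
rng = np.random.default_rng(0)  # seeded only to make the run repeatable
ran = pd.DataFrame(rng.standard_normal(df.shape),
                   columns=df.columns, index=df.index)

# overwrite=False: only cells that are NaN in df take values from ran.
df.update(ran, overwrite=False)

print(df)
```

Without overwrite=False, update would replace every cell of df, because ran has no missing values.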

In the above example I used values from a standard normal distribution, but you can also use values randomly picked from the non-NaN entries of the original DataFrame:

import numpy as np
import pandas as pd

M = len(df.index)
N = len(df.columns)

val = np.ravel(df.values)
val = val[~pd.isnull(val)]  # pd.isnull also handles non-numeric values, unlike np.isnan
val = np.random.choice(val, size=(M, N))
ran = pd.DataFrame(val, columns=df.columns, index=df.index)

df.update(ran, overwrite=False)  # fill only the NaN cells
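
For string data like the sample in the question, a possible variant is to work on a copy of the underlying array and assign sampled values directly into the missing positions. This is only a sketch: the column names vote1 and vote2 are hypothetical, and it swaps update for plain boolean-mask assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical frame resembling the question's sample.
df = pd.DataFrame({
    "vote1": ["n", "n", np.nan, "n"],
    "vote2": ["y", np.nan, "n", "y"],
})

# Work on an explicit copy so the original frame is not modified.
values = df.to_numpy(dtype=object, copy=True)
mask = pd.isnull(values)   # True where a value is missing
pool = values[~mask]       # 1-D array of the observed values

# Draw one replacement per missing cell from the observed values.
rng = np.random.default_rng()
values[mask] = rng.choice(pool, size=int(mask.sum()))

filled = pd.DataFrame(values, columns=df.columns, index=df.index)
print(filled)
```

Because the replacements are drawn from all observed values pooled together, a gap in one column can receive a value seen only in another column; sample per column if that matters.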
answered Sep 30 '22 12:09 by Abramodj


If you pass a randomly generated number to fillna, the generator is called only once, so every NaN is filled with the same number.
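
A small sketch of that pitfall, using a hypothetical all-NaN Series: the argument to fillna is evaluated before the call, so a single draw is reused for every gap.

```python
import random

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan])

# random.random() runs once here; fillna receives one plain float.
same = s.fillna(random.random())

print(same)
print(same.nunique())  # number of distinct values after filling
```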

So make sure a new random number is generated for each value. For a DataFrame like this:

          Date         A       B
0   2015-01-01       NaN     NaN
1   2015-01-02       NaN     NaN
2   2015-01-03       NaN     NaN
3   2015-01-04       NaN     NaN
4   2015-01-05       NaN     NaN
5   2015-01-06       NaN     NaN
6   2015-01-07       NaN     NaN
7   2015-01-08       NaN     NaN
8   2015-01-09       NaN     NaN
9   2015-01-10       NaN     NaN
10  2015-01-11       NaN     NaN
11  2015-01-12       NaN     NaN
12  2015-01-13       NaN     NaN
13  2015-01-14       NaN     NaN
14  2015-01-15       NaN     NaN
15  2015-01-16       NaN     NaN

I used the following code to fill up the NaNs in column A:

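import random
import pandas as pd
# Draw a new number only for missing cells; keep existing values as they are.
x['A'] = x['A'].apply(lambda v: random.random() * 1000 if pd.isnull(v) else v)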

This gives something like:

          Date           A       B
0   2015-01-01   96.538211     NaN
1   2015-01-02  404.683392     NaN
2   2015-01-03  849.614253     NaN
3   2015-01-04  590.030660     NaN
4   2015-01-05  203.167519     NaN
5   2015-01-06  980.508258     NaN
6   2015-01-07  221.088002     NaN
7   2015-01-08  285.013762     NaN
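
As an alternative to apply, the same per-cell fill can be sketched with a boolean mask and a single NumPy draw. The frame below is a hypothetical cut-down version of the one above, with one existing value in A to show that it is preserved.

```python
import numpy as np
import pandas as pd

# Hypothetical short frame; A has one real value among the gaps.
x = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=4),
    "A": [np.nan, 5.0, np.nan, np.nan],
    "B": [np.nan] * 4,
})

missing = x["A"].isna()  # True where A needs a value
rng = np.random.default_rng()

# One independent draw in [0, 1000) per missing cell of A.
x.loc[missing, "A"] = rng.random(int(missing.sum())) * 1000

print(x)
```

Column B is untouched here; repeating the last two statements with "B" would fill it the same way.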
answered Sep 30 '22 12:09 by fixxxer