Replace NaN in a dataframe with random values

Tags:

python

pandas

I have a DataFrame (data_train) with NaN values. A sample is given below:

republican    n      y
republican    n      NaN
democrat      NaN    n
democrat      n      y

I want to replace every NaN with a random value, like this:

republican    n       y
republican    n       rnd2
democrat      rnd1    n
democrat      n       y

How do I do this?

I tried the following, but had no luck:

df_rand = pd.DataFrame(np.random.randn(data_train.shape[0], data_train.shape[1]))
data_train[pd.isnull(data_train)] = df_rand[pd.isnull(data_train)]

When I run the same code on a DataFrame of random numerical data, it works fine.


asked Jun 04 '15 14:06

Sam

2 Answers

You can use the pandas update method, like this:

1) Generate a random DataFrame with the same columns and index as the original one:

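import numpy as np
import pandas as pd

M = len(df.index)    # number of rows
N = len(df.columns)  # number of columns
# Random frame sharing df's index and columns, so the two align cell by cell
ran = pd.DataFrame(np.random.randn(M, N), columns=df.columns, index=df.index)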

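2) Then call update with overwrite=False, so that only the NaN values in df are replaced by the generated random values. With the default overwrite=True, update would also replace the values that are not NaN.

df.update(ran, overwrite=False)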
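The two steps can be put together on data shaped like the question's sample. This is a minimal sketch, not the answer's exact code: the column names party, vote1 and vote2 are made up for illustration, the frame is built with dtype=object so that strings and floats can share a column, and overwrite=False is passed so that update only touches cells that are NaN in df.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question's sample (column names are assumed).
df = pd.DataFrame(
    {
        "party": ["republican", "republican", "democrat", "democrat"],
        "vote1": ["n", "n", np.nan, "n"],
        "vote2": ["y", np.nan, "n", "y"],
    },
    dtype=object,
)

# Step 1: random frame with the same index and columns as df.
rng = np.random.default_rng(0)
ran = pd.DataFrame(rng.standard_normal(df.shape), columns=df.columns, index=df.index)

# Step 2: overwrite=False keeps every existing value and fills only the NaN cells.
df.update(ran, overwrite=False)

print(df)
```

After the call, the two NaN cells hold floats drawn from the standard normal, and every string value is unchanged.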

The example above draws values from a standard normal distribution, but you can also draw values at random from the original DataFrame:

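import numpy as np
import pandas as pd

M = len(df.index)
N = len(df.columns)

val = np.ravel(df.values)                 # flatten all cells into one array
val = val[~pd.isnull(val)]                # drop NaN; pd.isnull also handles string data
val = np.random.choice(val, size=(M, N))  # resample observed values to the frame's shape
ran = pd.DataFrame(val, columns=df.columns, index=df.index)

df.update(ran, overwrite=False)           # fill only the cells that are NaN in df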
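Sampling from the whole frame can put a value from one column into another (a party name into a vote column, say). If that is not wanted, one possible variation is to draw each replacement from the observed values of the same column. The sketch below assumes that variation; the helper fill_from_column and the column names are hypothetical and not part of the answer above.

```python
import numpy as np
import pandas as pd

def fill_from_column(col: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Return a copy of col with each NaN replaced by a random observed value of col."""
    col = col.copy()
    missing = col.isna()
    observed = col[~missing].to_numpy()
    if missing.any() and observed.size > 0:
        # One independent draw per missing cell.
        col.loc[missing] = rng.choice(observed, size=int(missing.sum()))
    return col

# Hypothetical frame mirroring the question's sample (column names are assumed).
df = pd.DataFrame(
    {
        "party": ["republican", "republican", "democrat", "democrat"],
        "vote1": ["n", "n", np.nan, "n"],
        "vote2": ["y", np.nan, "n", "y"],
    },
    dtype=object,
)

rng = np.random.default_rng(0)
filled = pd.DataFrame({name: fill_from_column(df[name], rng) for name in df.columns})

print(filled)
```

A column that is entirely NaN has nothing to sample from, so this helper leaves it as it is.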

answered Sep 30 '22 12:09

Abramodj

If you pass a single random number to fillna, the generator is called only once, so every NaN is filled with the same number.
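A small sketch of the pitfall, using a made-up Series of five NaN values: the argument to fillna is evaluated before the call, so fillna receives one float and reuses it everywhere.

```python
import random

import pandas as pd

s = pd.Series([float("nan")] * 5)

# random.random() runs once here; fillna only ever sees the resulting float.
same = s.fillna(random.random() * 1000)

print(same)
print(same.nunique())  # number of distinct values after filling
```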

So make sure that a new random number is generated for each value you fill. For a DataFrame like this:

          Date         A       B
0   2015-01-01       NaN     NaN
1   2015-01-02       NaN     NaN
2   2015-01-03       NaN     NaN
3   2015-01-04       NaN     NaN
4   2015-01-05       NaN     NaN
5   2015-01-06       NaN     NaN
6   2015-01-07       NaN     NaN
7   2015-01-08       NaN     NaN
8   2015-01-09       NaN     NaN
9   2015-01-10       NaN     NaN
10  2015-01-11       NaN     NaN
11  2015-01-12       NaN     NaN
12  2015-01-13       NaN     NaN
13  2015-01-14       NaN     NaN
14  2015-01-15       NaN     NaN
15  2015-01-16       NaN     NaN

I used the following code to fill up the NaNs in column A:

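import random
import pandas as pd
# Draw a fresh number for each NaN and keep existing values as they are
x['A'] = x['A'].apply(lambda v: random.random() * 1000 if pd.isnull(v) else v)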

This gives something like:

          Date           A       B
0   2015-01-01   96.538211     NaN
1   2015-01-02  404.683392     NaN
2   2015-01-03  849.614253     NaN
3   2015-01-04  590.030660     NaN
4   2015-01-05  203.167519     NaN
5   2015-01-06  980.508258     NaN
6   2015-01-07  221.088002     NaN
7   2015-01-08  285.013762     NaN

answered Sep 30 '22 12:09

fixxxer
