I am trying to fill NaN values in a dataframe with values coming from a standard normal distribution. This is currently my code:
sqlStatement = "select * from sn.clustering_normalized_dataset"
df = psql.frame_query(sqlStatement, cnx)
data=df.pivot("user","phrase","tfw")
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
data[np.isnan(data)] = dfrand[np.isnan(data)]
After pivoting the dataframe 'data' it looks like that:
phrase aaron abbas abdul abe able abroad abu abuse \
user
14233664 NaN NaN NaN NaN NaN NaN NaN NaN
52602716 NaN NaN NaN NaN NaN NaN NaN NaN
123456789 NaN NaN NaN NaN NaN NaN NaN NaN
500158258 NaN NaN NaN NaN NaN NaN NaN NaN
517187571 0.4 NaN NaN 0.142857 1 0.4 0.181818 NaN
However, I need that each NaN value will be replaced with a new random value. So I created a new df consists of only random values (dfrand) and then trying to swap the missing numbers (Nan) by the values from dfrand corresponding to indices of the NaN's. Well - unfortunately it doesn't work - Although the expression
np.isnan(data)
returns a dataframe consists of True and False values, the expression
dfrand[np.isnan(data)]
return only NaN values so the overall trick doesn't work. Any ideas what the issue ?
Pandas DataFrame fillna() MethodThe fillna() method replaces the NULL values with a specified value. The fillna() method returns a new DataFrame object unless the inplace parameter is set to True , in that case the fillna() method does the replacing in the original DataFrame instead.
The fillna() function is used to fill NA/NaN values using the specified method. Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled.
Three-thousand columns is not so many. How many rows do you have? You could always make a random dataframe of the same size and do a logical replacement (the size of your dataframe will dictate whether this is feasible or not.
if you know the size of your dataframe:
import pandas as pd
import numpy as np
# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(rows,cols))
# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in
# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
if you do not know the size of your dataframe, just shuffle things around
import pandas as pd
import numpy as np
# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in
# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
EDIT Per "users" last comment: "dfrand[np.isnan(data)] returns NaN only."
Right! And that is exactly what you wanted. In my solution I have: data[np.isnan(data)] = dfrand[np.isnan(data)]. Translated, this means: take the randomly-generated value from dfrand that corresponds to the NaN-location within "data" and insert it in "data" where "data" is NaN. An example will help:
a = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
a[0][5] = np.nan
In [32]: a
Out[33]:
0 1 2
0 2 26 28
1 14 79 82
2 89 32 59
3 65 47 31
4 29 59 15
5 NaN 58 90
6 15 66 60
7 10 19 96
8 90 26 92
9 0 19 23
# define randomly-generated dataframe, much like what you are doing, and replace NaN's
b = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
In [39]: b
Out[39]:
0 1 2
0 92 21 55
1 65 53 89
2 54 98 97
3 48 87 79
4 98 38 62
5 46 16 30
6 95 39 70
7 90 59 9
8 14 85 37
9 48 29 46
a[np.isnan(a)] = b[np.isnan(a)]
In [38]: a
Out[38]:
0 1 2
0 2 26 28
1 14 79 82
2 89 32 59
3 65 47 31
4 29 59 15
5 46 58 90
6 15 66 60
7 10 19 96
8 90 26 92
9 0 19 23
As you can see, all NaN's in have been replaced with the randomly-generated value in based on 's nan-value indices.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With