I have a dataframe which contains NaN values in a few places. I am trying to perform data cleaning in which I fill the NaN values with the mean of the previous five instances. To do so, I have come up with the following.
input_data_frame[var_list].fillna(input_data_frame[var_list].rolling(5).mean(), inplace=True)
But this is not working: it isn't filling the NaN values. There is no change in the dataframe's null count before and after the above operation. Assuming I have a dataframe with just an integer column, how can I fill NaN values with the mean of the previous five instances? Thanks in advance.
You can use the fillna() function to replace NaN values in a pandas DataFrame. For example, the NaN values in a rating column can be filled with the mean value of that column: in the example this refers to, the mean value in the rating column was 85.125, so each of the NaN values in the rating column was filled with this value.
Now, with the help of the fillna() function, we will replace every ‘NaN’ in the particular column whose mean we have computed, and then print the updated column. Syntax: df.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
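A minimal sketch of the column-mean fill described above. The data here is made up for illustration and is not the dataset behind the 85.125 figure; only `fillna` and `mean` are taken from the text.

```python
import numpy as np
import pandas as pd

# Illustrative data only (hypothetical ratings).
df = pd.DataFrame({"rating": [80.0, np.nan, 90.0, 85.0, np.nan]})

# Series.mean() skips NaN by default, so this is the mean of 80, 90 and 85.
rating_mean = df["rating"].mean()

# Assign the result back rather than relying on inplace=True.
df["rating"] = df["rating"].fillna(rating_mean)
print(df)
```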
Another option is to fill with the mode, which replaces missing values with the most frequent value in that column. This can be used, for example, to replace NaN values in a “Color” column.
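As a rough sketch of the most-frequent-value fill mentioned above, assuming a hypothetical "Color" column with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column name "Color" comes from the text.
df = pd.DataFrame({"Color": ["red", "blue", np.nan, "red", np.nan]})

# mode() ignores NaN by default and returns a Series, because ties are possible;
# [0] picks the first (here the only) most frequent value.
most_frequent = df["Color"].mode()[0]

df["Color"] = df["Color"].fillna(most_frequent)
print(df)
```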
If you would like to avoid returning NaN, you could pass min_periods=1 to the rolling method, which reduces the minimum required number of valid observations in the window to 1 instead of 2. Note that using min_periods=1 can lead to high variance for the values in the rolling window.
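To make the effect of min_periods concrete, here is a small sketch with a window of 2 on an illustrative Series (the values are assumptions chosen for the example):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0])

# Default: min_periods equals the window size, so every window with
# fewer than 2 valid values yields NaN.
strict = s.rolling(2).mean()

# min_periods=1: a window with a single valid value already produces a mean.
relaxed = s.rolling(2, min_periods=1).mean()

print(strict.tolist())   # [nan, nan, nan, 3.5]
print(relaxed.tolist())  # [1.0, 1.0, 3.0, 3.5]
```

With min_periods=1, some of the "means" are based on a single observation, which is where the higher variance comes from.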
The rolling_mean function has been replaced in newer pandas versions by the .rolling(...).mean() method. To fill the entire dataset, you can use:
filled_dataset = dataset.fillna(dataset.rolling(6,min_periods=1).mean())
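This should work in older pandas versions (pd.rolling_mean was deprecated in pandas 0.18 and later removed; in current versions use input_data_frame[var_list].rolling(6, min_periods=1).mean() in its place):
input_data_frame[var_list] = input_data_frame[var_list].fillna(pd.rolling_mean(input_data_frame[var_list], 6, min_periods=1))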
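Two things likely explain why the attempt in the question changed nothing. First, `rolling(5).mean()` requires five valid values by default, and a window that ends on a NaN row has at most four, so the rolling mean is itself NaN exactly where a fill value is needed. Second, selecting columns with a list returns a new DataFrame, so an `inplace=True` fill on that selection may never reach the original frame. The sketch below illustrates the first point and the fix, using the question's variable names with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data; the names mirror the question.
input_data_frame = pd.DataFrame(
    {"a": [1, 1, 2, 3, 4, 5, np.nan, 1, 1, 2, 3, 4, 5, np.nan]}
)
var_list = ["a"]

# Strict window: the rolling mean is NaN at the NaN rows, so nothing is filled.
strict = input_data_frame[var_list].rolling(5).mean()
unfilled = input_data_frame[var_list].fillna(strict)
nulls_with_strict_window = int(unfilled["a"].isna().sum())

# Window of 6 with min_periods=1, and the result assigned back to the frame.
relaxed = input_data_frame[var_list].rolling(6, min_periods=1).mean()
input_data_frame[var_list] = input_data_frame[var_list].fillna(relaxed)
nulls_after_fix = int(input_data_frame["a"].isna().sum())

print(nulls_with_strict_window, nulls_after_fix)  # 2 0
```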
Note that the window is 6 because it includes the NaN value itself (which is not counted in the average). The other NaN values are not used for the averages either, so if fewer than 5 values are found in the window, the average is calculated on the values that are actually present.
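An alternative way to express "the previous five values" is to shift the column by one row and use a window of 5. This is a sketch on illustrative data, not the answer's own method; at the NaN rows it should give the same fill values as the window of 6, including when fewer than five values are available:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan])  # illustrative values

# Window of 6 that includes the current (NaN) row, as described above.
filled_window6 = s.fillna(s.rolling(6, min_periods=1).mean())

# Shift by one row so a window of 5 covers only the previous values.
prev5_mean = s.shift(1).rolling(5, min_periods=1).mean()
filled_shifted = s.fillna(prev5_mean)

# Row 2 gets mean(1, 2) = 1.5; row 4 gets mean(1, 2, 4), because the
# rolling mean is computed on the original values, not on earlier fills.
print(filled_shifted.tolist())
```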
Example:
import numpy as np
import pandas as pd

df = {'a': [1, 1, 2, 3, 4, 5, np.nan, 1, 1, 2, 3, 4, 5, np.nan]}
df = pd.DataFrame(data=df)
print(df)
      a
0   1.0
1   1.0
2   2.0
3   3.0
4   4.0
5   5.0
6   NaN
7   1.0
8   1.0
9   2.0
10  3.0
11  4.0
12  5.0
13  NaN
Output after filling:
      a
0   1.0
1   1.0
2   2.0
3   3.0
4   4.0
5   5.0
6   3.0
7   1.0
8   1.0
9   2.0
10  3.0
11  4.0
12  5.0
13  3.0