Replace NaN or missing values with rolling mean or other interpolation

Tags:

python

pandas

missing-data

moving-average

I have a pandas DataFrame with monthly data for which I want to compute a 12-month moving average. However, the data for every January is missing (NaN), so I am using

pd.rolling_mean(data["variable"], 12, center=True)

but it just gives me all NaN values.

Is there a simple way that I can ignore the NaN values? I understand that in practice this would become an 11-month moving average.

The DataFrame has other variables which do have January data, so I don't want to just throw out the January columns and do an 11-month moving average.

868

asked Aug 11 '14 01:08

Alexis Eggermont

1 Answer

There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.

import numpy as np
import pandas as pd
df=pd.DataFrame({ 'month' : [10,11,12,1,2,3],
                  'temp'  : [65,50,45,np.nan,40,43] }).set_index('month')

You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in @user394430's answer.)

df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3']  = df['temp'].rolling( 3,center=True,min_periods=1).mean()
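To see why the original call returned only NaN, here is a minimal sketch on made-up data (the series `values` and its contents are hypothetical, not the asker's data). For an integer window, `min_periods` defaults to the window size, so any 12-month window that contains a January yields NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical two years of monthly data with every January missing.
idx = pd.date_range("2012-01-01", periods=24, freq="MS")
values = pd.Series(np.arange(24, dtype=float), index=idx)
values[values.index.month == 1] = np.nan

# Default: min_periods equals the window size, so one NaN spoils the window.
strict = values.rolling(12, center=True).mean()

# min_periods=1: average whatever non-NaN values fall inside the window.
relaxed = values.rolling(12, center=True, min_periods=1).mean()

print(strict.isna().all())   # every 12-month window contains a January
print(relaxed.isna().any())  # no window is left without data
```

With `min_periods=1`, full windows effectively become 11-month averages, which is what the question asks for.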

Those are improvements, but they still overwrite existing values with rolling means. To avoid this, you could combine them with the update() method (see the pandas documentation for Series.update).

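filled = df['rollmean3'].copy()
filled.update( df['temp'] )  # note: in place on the copy; non-NaN temps replace the rolling means
df['update'] = filled        # assign back, since updating df['update'] directly may not propagate under copy-on-write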
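As an alternative sketch (not part of the original answer), the same result can be obtained without an in-place step by filling only the missing values with `fillna()`. The column name `filled` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                   'temp': [65, 50, 45, np.nan, 40, 43]}).set_index('month')

rollmean3 = df['temp'].rolling(3, center=True, min_periods=1).mean()

# Keep observed temps; use the rolling mean only where temp is NaN.
df['filled'] = df['temp'].fillna(rollmean3)
print(df)
```

Here `fillna()` aligns the two Series on the index, so only the January row takes its value from the rolling mean.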

There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.

df['ffill']   = df['temp'].ffill()         # previous month 
df['bfill']   = df['temp'].bfill()         # next month
df['interp']  = df['temp'].interpolate()   # mean of prev/next

In this case, interpolate() defaults to simple linear interpolation, but several other interpolation options are available. See the pandas documentation on interpolate for more info, or this Stack Overflow question: Interpolation on DataFrame in pandas
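As a rough illustration of two of those options, the sketch below uses made-up series (`s` and `uneven` are hypothetical) to show the `limit` parameter and `method='index'`.

```python
import numpy as np
import pandas as pd

# A three-value gap between two observations.
s = pd.Series([45.0, np.nan, np.nan, np.nan, 61.0])

linear = s.interpolate()          # fills the whole gap on a straight line
capped = s.interpolate(limit=1)   # fills at most one consecutive NaN

# Unevenly spaced index: the default method ignores the spacing,
# while method='index' uses the index values as x-coordinates.
uneven = pd.Series([0.0, np.nan, 8.0], index=[0, 1, 4])
by_position = uneven.interpolate()
by_index = uneven.interpolate(method='index')

print(linear.tolist())
print(capped.tolist())
print(by_position.loc[1], by_index.loc[1])
```

Capping with `limit` can be useful when long gaps should stay missing rather than be filled with a straight line.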

Here is the sample data with all the results:

       temp  rollmean12  rollmean3  update  ffill  bfill  interp
month                                                           
10     65.0        48.6  57.500000    65.0   65.0   65.0    65.0
11     50.0        48.6  53.333333    50.0   50.0   50.0    50.0
12     45.0        48.6  47.500000    45.0   45.0   45.0    45.0
1       NaN        48.6  42.500000    42.5   45.0   40.0    42.5
2      40.0        48.6  41.500000    40.0   40.0   40.0    40.0
3      43.0        48.6  41.500000    43.0   43.0   43.0    43.0

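In particular, note that "update" and "interp" give the same results in all months. It doesn't matter which one you use here, but the two can differ in other cases, for example when several consecutive months are missing, because a 3-month window then no longer spans both neighbors of the gap.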
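A small sketch of a case where the two approaches diverge, assuming a hypothetical series `s` with two consecutive missing months:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a two-month gap.
s = pd.Series([60.0, 50.0, np.nan, np.nan, 38.0, 44.0])

# Rolling-mean fill, mirroring the update() approach above.
roll_fill = s.rolling(3, center=True, min_periods=1).mean()
roll_fill.update(s)   # non-NaN observations replace the rolling means

# Linear interpolation across the gap.
interp = s.interpolate()

print(roll_fill.tolist())  # gap filled with 50.0 and 38.0 (nearest neighbor only)
print(interp.tolist())     # gap filled with 46.0 and 42.0 (straight line)
```

Each 3-month window inside the gap sees only one observed neighbor, so the rolling fill repeats that neighbor, while interpolation draws a line between both ends of the gap.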

127

answered Sep 28 '22 10:09

JohnE
