Sliding Window over Pandas Dataframe

Tags: python, pandas, numpy

I have a large pandas dataframe of time-series data.

I currently manipulate this dataframe to create a new, smaller dataframe that holds the average of every 10 rows, i.e. a rolling window technique. Like this:

import numpy as np
import pandas as pd

def create_new_df(df):
    features = []
    x = df['X'].astype(float)
    i = x.index.values
    # Repeat each index value 10 times so every run of 10 rows shares one group label.
    time_sequence = [i] * 10
    idx = np.array(time_sequence).T.flatten()[:len(x)]
    x = x.groupby(idx).mean()
    x.name = 'X'
    features.append(x)
    new_df = pd.concat(features, axis=1)
    return new_df

Code to test:

columns = ['X']
df_ = pd.DataFrame(columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs
data = np.array([np.arange(20)]*1).T
df = pd.DataFrame(data, columns=columns)

test = create_new_df(df)
print(test)

Output:

      X
0   4.5
1  14.5
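
As a side note, the grouping key built inside create_new_df can probably be produced more directly with integer division. The following is only a sketch, assuming the dataframe has the default 0..n-1 integer index; the names block_id and block_means are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": np.arange(20)})

# Integer division maps rows 0-9 to group 0 and rows 10-19 to group 1.
block_id = np.arange(len(df)) // 10

# Average each non-overlapping block of 10 rows.
block_means = df["X"].astype(float).groupby(block_id).mean().to_frame()
print(block_means)
```

This reproduces the non-overlapping averages shown above; it does not yet give any overlap between windows.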

However, I want the function to make the new dataframe using a sliding window with a 50% overlap.

So the output would look like this:

      X
0   4.5
1   9.5
2  14.5

How can I do this?

Here's what I've tried:

from itertools import tee

def window(iterable, size):
    iters = tee(iterable, size)
    for i in range(1, size):
        for each in iters[i:]:
            next(each, None)
    return zip(*iters)

for each in window(df, 20):
    print(list(each)) # doesn't have the desired sliding window effect
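
One likely reason the attempt above prints nothing useful is that iterating over a DataFrame yields its column labels (here just 'X'), not its rows. A minimal sketch of an explicit window generator over the column values is given below; iter_windows and its size/step parameters are hypothetical names, and a step of half the window size is what gives the 50% overlap.

```python
import numpy as np
import pandas as pd

def iter_windows(values, size, step):
    """Yield slices of `values` that are `size` long, starting every `step` items."""
    # Stop early enough that every window is complete.
    for start in range(0, len(values) - size + 1, step):
        yield values[start:start + size]

df = pd.DataFrame({"X": np.arange(20)})

# Windows start at rows 0, 5 and 10, so neighbours share half their rows.
window_means = [float(w.mean()) for w in iter_windows(df["X"].to_numpy(), size=10, step=5)]
print(window_means)
```

A plain Python loop like this is easy to read but may be slow on a large dataframe.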

Some might also suggest using the pandas rolling_mean() methods, but if so, I can't see how to use this function with window overlap.

Any help would be much appreciated.


asked Apr 29 '16 12:04

cs_stackX

1 Answer

I think pandas rolling techniques are fine here. Note that starting with version 0.18.0 of pandas, you would use rolling().mean() instead of rolling_mean().

>>> import pandas as pd
>>> df = pd.DataFrame({'x': range(30)})
>>> df = df.rolling(10).mean()           # 10-row rolling mean (version 0.18.0 syntax)
>>> df[4::5]                             # take every 5th row, i.e. a 50% overlap

       x
4    NaN
9    4.5
14   9.5
19  14.5
24  19.5
29  24.5
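
The same idea can be wrapped in a small helper that also drops the incomplete leading window and renumbers the rows, so the result matches the layout asked for in the question. This is a sketch rather than a definitive implementation: overlapping_mean and its window/step parameters are illustrative names, and it assumes pandas 0.18.0 or later.

```python
import numpy as np
import pandas as pd

def overlapping_mean(df, window=10, step=5):
    """Rolling mean over `window` rows, keeping one result every `step` rows."""
    rolled = df.rolling(window).mean()
    # The first complete window ends at position window - 1; earlier rows are NaN.
    return rolled.iloc[window - 1::step].reset_index(drop=True)

df = pd.DataFrame({"X": np.arange(20)})
result = overlapping_mean(df, window=10, step=5)
print(result)
```

With step set to half of window, consecutive windows share 50% of their rows. Note that this computes the rolling mean at every row and then discards most of the results, which is simple but does more work than strictly needed.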

answered Sep 18 '22 09:09

JohnE
