How to get rolling pandas dataframe subsets

Tags:

python

pandas

I would like to get DataFrame subsets in a "rolling" manner. I tried several things without success; here is an example of what I would like to do. Consider the following DataFrame.

df
   var1  var2
0    43    74
1    44    74
2    45    66
3    46   268
4    47    66

I would like to create a new column with the following function which performs a conditional sum:

def func(x):
    tmp = (x["var1"] * (x["var2"] == 74)).sum()
    return tmp

and call it like this:

df["newvar"] = df.rolling(2, min_periods=1).apply(func)

This would mean that the function is applied to each rolling sub-DataFrame as a whole, not to each row or column separately.

It would return:

     var1      var2      newvar
0    43         74         43          # 43
1    44         74         87          # 43 * 1 + 44 * 1
2    45         66         44          # 44 * 1 + 45 * 0
3    46        268         0           # 45 * 0 + 46 * 0
4    47         66         0           # 46 * 0 + 47 * 0

Is there a Pythonic way to do this? This is just an example; the real condition (always based on the sub-DataFrame values) depends on more than 2 columns.

asked Jan 17 '17 15:01

user6903745

2 Answers

updated comment

@unutbu posted a great answer to a very similar question here but it appears that his answer is based on pd.rolling_apply which passes the index to the function. I'm not sure how to replicate this with the current DataFrame.rolling.apply method.

original answer

It appears that the object the rolling apply method passes to your function is a NumPy array holding one column's window at a time, not a DataFrame, so unfortunately the function has no access to the other columns.

What you can do instead is use boolean logic to build a temporary column that equals var1 where var2 is 74 and 0 elsewhere, and then apply the rolling sum to that column.

df['new_var'] = df.var2.eq(74).mul(df.var1).rolling(2, min_periods=1).sum()

   var1  var2  new_var
0    43    74     43.0
1    44    74     87.0
2    45    66     44.0
3    46   268      0.0
4    47    66      0.0

The temporary column is the result of the first half of the expression above.

df.var2.eq(74).mul(df.var1)
# or equivalently with operators
# (df['var2'] == 74) * df['var1']

0    43
1    44
2     0
3     0
4     0
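The question mentions that the real condition depends on more than two columns. The same idea should extend to that case: build a row-wise boolean mask from as many columns as needed, zero out var1 where the mask is false, then roll. The following is only a sketch; the column var3 and the condition on it are hypothetical and are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "var1": [43, 44, 45, 46, 47],
    "var2": [74, 74, 66, 268, 66],
    "var3": [1, 0, 1, 1, 0],  # hypothetical extra column, for illustration only
})

# Row-wise condition combining several columns (assumed for this sketch).
mask = (df["var2"] == 74) & (df["var3"] == 1)

# Keep var1 where the condition holds, use 0 elsewhere, then take the rolling sum.
df["new_var"] = df["var1"].where(mask, 0).rolling(2, min_periods=1).sum()
print(df)
```

This only works when the condition can be evaluated row by row before the window is taken. A condition that depends on the window as a whole needs a different approach.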

Finding the type of the variable passed to apply

It's very important to know what is actually being passed to the function you give to apply. I can't always remember, so when I am unsure I print the variable along with its type to make clear what object I am dealing with. See this example with your original DataFrame.

def foo(x):
    print(x)
    print(type(x))
    return x.sum()

df.rolling(2, min_periods=1).apply(foo)

Output

[ 43.]
<class 'numpy.ndarray'>
[ 43.  44.]
<class 'numpy.ndarray'>
[ 44.  45.]
<class 'numpy.ndarray'>
[ 45.  46.]
<class 'numpy.ndarray'>
[ 46.  47.]
<class 'numpy.ndarray'>
[ 74.]
<class 'numpy.ndarray'>
[ 74.  74.]
<class 'numpy.ndarray'>
[ 74.  66.]
<class 'numpy.ndarray'>
[  66.  268.]
<class 'numpy.ndarray'>
[ 268.   66.]
<class 'numpy.ndarray'>
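In recent pandas versions, what gets passed depends on the raw argument of the rolling apply method: raw=True passes a NumPy array, while raw=False passes a Series. Either way the function still sees one column at a time. The sketch below records the types instead of printing them; the helper name record is made up for this example, and it assumes a pandas version that supports the raw argument.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

seen = []

def record(x):
    # Remember the type of each window that the rolling apply hands over.
    seen.append(type(x))
    return x.sum()

df.rolling(2, min_periods=1).apply(record, raw=True)
raw_types = set(seen)      # types seen with raw=True

seen.clear()
df.rolling(2, min_periods=1).apply(record, raw=False)
series_types = set(seen)   # types seen with raw=False

print(raw_types, series_types)
```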

answered Oct 22 '22 03:10

Ted Petrou

Here's how you get dataframe subsets in a rolling manner:

for df_subset in df.rolling(2):
    print(type(df_subset), '\n', df_subset)
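Because each item produced by this loop is a DataFrame, the func from the question can be applied to every window directly. This is a minimal sketch, assuming a pandas version in which the rolling object is iterable (1.1 or later, as far as I know). Iteration yields the leading partial window as well, so the first row is computed from a single row.

```python
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

def func(x):
    # Conditional sum over one window: add up var1 where var2 equals 74.
    return (x["var1"] * (x["var2"] == 74)).sum()

# One sub-DataFrame per row; the first window holds only row 0.
df["newvar"] = [func(window) for window in df.rolling(2)]
print(df)
```

This runs a Python-level loop over the windows, so it is likely to be slower on large frames than the vectorized mask-then-roll approach, but it lets the function see every column of each window.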

answered Oct 22 '22 01:10

Tigger
