How to get rolling pandas dataframe subsets

Tags: python, pandas

I would like to get dataframe subsets in a "rolling" manner. I have tried several things without success; here is an example of what I would like to do. Consider the following dataframe.

df
     var1      var2
0    43         74
1    44         74
2    45         66
3    46        268
4    47         66

I would like to create a new column with the following function which performs a conditional sum:

def func(x):
    tmp = (x["var1"] * (x["var2"] == 74)).sum()
    return tmp

and call it like this:

df["newvar"] = df.rolling(2, min_periods=1).apply(func)

That would mean the function is applied to each rolling sub-dataframe as a whole, rather than to each row or column.

It would return

     var1      var2      newvar
0    43         74         43          # 43
1    44         74         87          # 43 * 1 + 44 * 1
2    45         66         44          # 44 * 1 + 45 * 0
3    46        268         0           # 45 * 0 + 46 * 0
4    47         66         0           # 46 * 0 + 47 * 0

Is there a pythonic way to do this? This is just an example, but the condition (always based on the sub-dataframe values) depends on more than 2 columns.

user6903745 asked Jan 17 '17 15:01



2 Answers

updated comment

@unutbu posted a great answer to a very similar question here, but his answer appears to be based on pd.rolling_apply, which passes the index to the function. I'm not sure how to replicate this with the current DataFrame.rolling.apply method.
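
One possible way to approximate that index-based approach with the current API, offered as a minimal sketch rather than a confirmed equivalent: roll over a Series of integer row positions and use each window of positions to slice the original DataFrame. The `positions` Series is a helper introduced here for illustration, and the `raw=True` argument assumes pandas 0.23 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

def func(x):
    # x is a sub-DataFrame holding the rows of one window
    return (x["var1"] * (x["var2"] == 74)).sum()

# Hypothetical helper: integer row positions, aligned with df's index
positions = pd.Series(np.arange(len(df)), index=df.index)

# Each window arrives as a float ndarray of positions; cast it back to int
# and use it to pull the matching rows out of df.
df["newvar"] = positions.rolling(2, min_periods=1).apply(
    lambda idx: func(df.iloc[idx.astype(int)]), raw=True
)
```

This calls func once per window in Python, so it is likely to be much slower than a vectorised expression on large frames.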

original answer

It appears that the object apply passes to the function is a numpy array of each column (one at a time), not a DataFrame, so unfortunately you do not have access to any other columns.

What you can do instead is use some boolean logic to create a temporary column that holds var1 where var2 is 74 and 0 elsewhere, and then apply the rolling method to that column.

df['new_var'] = df.var2.eq(74).mul(df.var1).rolling(2, min_periods=1).sum()

   var1  var2  new_var
0    43    74     43.0
1    44    74     87.0
2    45    66     44.0
3    46   268      0.0
4    47    66      0.0

The temporary column is based on the first half of the code above.

df.var2.eq(74).mul(df.var1)
# or equivalently with operators
# (df['var2'] == 74) * df['var1']

0    43
1    44
2     0
3     0
4     0
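
The question mentions that the real condition depends on more than two columns. The same masking idea should extend to that case by combining several boolean Series with `&` before rolling. The sketch below assumes a third column, `var3`, and a condition on it; both are hypothetical and appear only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66],
                   "var3": [1, 0, 1, 1, 0]})  # var3 is a made-up column

# Row-wise condition built from more than one column
mask = df["var2"].eq(74) & df["var3"].eq(1)

# Keep var1 where the condition holds, 0 elsewhere, then take the rolling sum
df["new_var"] = df["var1"].where(mask, 0).rolling(2, min_periods=1).sum()
```

This only works when the condition can be evaluated row by row; a condition that depends on the window as a whole would still need one of the slower per-window approaches.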

Finding the type of the variable passed to apply

It's very important to know what is actually being passed to the apply function. I can't always remember, so when I am unsure I print the variable along with its type to make clear what object I am dealing with. See this example with your original DataFrame.

def foo(x):
    print(x)
    print(type(x))
    return x.sum()

df.rolling(2, min_periods=1).apply(foo)

Output

[ 43.]
<class 'numpy.ndarray'>
[ 43.  44.]
<class 'numpy.ndarray'>
[ 44.  45.]
<class 'numpy.ndarray'>
[ 45.  46.]
<class 'numpy.ndarray'>
[ 46.  47.]
<class 'numpy.ndarray'>
[ 74.]
<class 'numpy.ndarray'>
[ 74.  74.]
<class 'numpy.ndarray'>
[ 74.  66.]
<class 'numpy.ndarray'>
[  66.  268.]
<class 'numpy.ndarray'>
[ 268.   66.]
<class 'numpy.ndarray'>
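
The output above shows numpy arrays. More recent pandas versions accept a `raw` argument on `Rolling.apply` that controls this: to my knowledge, `raw=True` passes an ndarray and `raw=False` (the default in current releases) passes a Series, still one column at a time. A small sketch to check this on your own installation; `record_type` and `seen` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

seen = []

def record_type(x):
    # Remember the type of every object handed over by rolling().apply()
    seen.append(type(x).__name__)
    return x.sum()

df.rolling(2, min_periods=1).apply(record_type, raw=True)
raw_types = set(seen)       # types observed with raw=True

seen.clear()
df.rolling(2, min_periods=1).apply(record_type, raw=False)
series_types = set(seen)    # types observed with raw=False
```

Either way the function sees a single column per call, so the limitation described in this answer still applies.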
Ted Petrou answered Oct 22 '22 03:10


Here's how you get dataframe subsets in a rolling manner:

# Iterating over a Rolling object requires pandas 1.1 or later.
for df_subset in df.rolling(2):
    print(type(df_subset), '\n', df_subset)
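
Building on that loop, here is a minimal sketch of how the question's func could be applied to each sub-DataFrame to fill the new column. It assumes a pandas version in which Rolling objects are iterable and reuses the data and function from the question.

```python
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

def func(x):
    # x is a sub-DataFrame, so every column is available
    return (x["var1"] * (x["var2"] == 74)).sum()

# Each window is a DataFrame slice; the first one holds a single row.
df["newvar"] = [func(window) for window in df.rolling(2)]
```

Because iteration also yields the shorter leading window, the result matches what the question expected from min_periods=1. The loop runs in Python, so it may be slow on large frames.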
Tigger answered Oct 22 '22 01:10