I would like to get DataFrame subsets in a "rolling" manner. I have tried several things without success; here is an example of what I would like to do. Consider the following DataFrame:
df
   var1  var2
0    43    74
1    44    74
2    45    66
3    46   268
4    47    66
I would like to create a new column with the following function which performs a conditional sum:
def func(x):
    tmp = (x["var1"] * (x["var2"] == 74)).sum()
    return tmp
and calling it like this
df["newvar"] = df.rolling(2, min_periods=1).apply(func)
That would mean the function is applied to each rolling sub-DataFrame as a whole, not to each row or column separately.
It would return:
   var1  var2  newvar
0    43    74      43  # 43
1    44    74      87  # 43 * 1 + 44 * 1
2    45    66      44  # 44 * 1 + 45 * 0
3    46   268       0  # 45 * 0 + 46 * 0
4    47    66       0  # 46 * 0 + 47 * 0
Is there a pythonic way to do this? This is just an example; the real condition (always based on the sub-DataFrame values) depends on more than 2 columns.
@unutbu posted a great answer to a very similar question here, but it appears that his answer is based on pd.rolling_apply, which passes the index to the function. I'm not sure how to replicate this with the current DataFrame.rolling.apply method.
It appears that the variable passed to the function by apply is a NumPy array of each column (one at a time), not a DataFrame, so unfortunately you do not have access to any other columns.
What you can do instead is use some boolean logic to create a temporary column based on whether var2 is 74 or not, and then use the rolling method on it.
df['new_var'] = df.var2.eq(74).mul(df.var1).rolling(2, min_periods=1).sum()
   var1  var2  new_var
0    43    74     43.0
1    44    74     87.0
2    45    66     44.0
3    46   268      0.0
4    47    66      0.0
The temporary column is based on the first half of the code above.
df.var2.eq(74).mul(df.var1)
# or equivalently with operators
# (df['var2'] == 74) * df['var1']
0 43
1 44
2 0
3 0
4 0
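The same idea can be wrapped in a small helper for conditions that involve several columns. This is only a sketch: the helper name rolling_conditional_sum and its parameters are hypothetical (not part of pandas), the extra var1 > 0 condition is made up purely to show how conditions combine, and the approach assumes the condition can be evaluated row by row as a boolean mask.

```python
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

def rolling_conditional_sum(values, mask, window):
    """Hypothetical helper: rolling sum of `values`, counting only rows where `mask` is True."""
    # Rows that fail the condition contribute 0 to the window sum.
    masked = values.where(mask, 0)
    return masked.rolling(window, min_periods=1).sum()

# The mask may combine any number of columns with & and |.
# The second condition is illustrative only; it is always True for this data.
mask = (df["var2"] == 74) & (df["var1"] > 0)
df["new_var"] = rolling_conditional_sum(df["var1"], mask, window=2)
print(df)
```

Because everything stays vectorized, this should scale better than calling a Python function once per window.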
It's very important to know what is actually being passed to the apply function. I can't always remember, so when I am unsure I print out the variable along with its type to make it clear what object I am dealing with. See this example with your original DataFrame.
def foo(x):
    print(x)
    print(type(x))
    return x.sum()

df.rolling(2, min_periods=1).apply(foo)
Output
[ 43.]
<class 'numpy.ndarray'>
[ 43. 44.]
<class 'numpy.ndarray'>
[ 44. 45.]
<class 'numpy.ndarray'>
[ 45. 46.]
<class 'numpy.ndarray'>
[ 46. 47.]
<class 'numpy.ndarray'>
[ 74.]
<class 'numpy.ndarray'>
[ 74. 74.]
<class 'numpy.ndarray'>
[ 74. 66.]
<class 'numpy.ndarray'>
[ 66. 268.]
<class 'numpy.ndarray'>
[ 268. 66.]
<class 'numpy.ndarray'>
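In more recent pandas versions, the object passed to the function depends on the raw argument of apply. As far as I know, raw=True passes a NumPy array as in the output above, while raw=False passes a Series; either way only one column at a time is visible inside the function. A minimal sketch for checking this on your own installation (seen_types is just an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

seen_types = []  # illustrative: records the type of every object passed to foo

def foo(x):
    seen_types.append(type(x))
    return x.sum()

# raw=True asks pandas to pass plain NumPy arrays to the function.
result = df.rolling(2, min_periods=1).apply(foo, raw=True)
print(set(seen_types))
print(result)
```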
Here's how you get dataframe subsets in a rolling manner:
for df_subset in df.rolling(2):
    print(type(df_subset), '\n', df_subset)
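Building on that loop, the func from the question can be applied to each window directly. This is a sketch that assumes a pandas version in which Rolling objects are iterable (1.1 or later, as far as I know). Note that iteration also yields the shorter windows at the start of the frame, which matches min_periods=1 in the question.

```python
import pandas as pd

df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

def func(x):
    # x is a sub-DataFrame here, so every column is available.
    return (x["var1"] * (x["var2"] == 74)).sum()

# One call to func per rolling window, collected in row order.
df["newvar"] = [func(window) for window in df.rolling(2)]
print(df)
```

This runs a Python-level loop over the windows, so for large frames a vectorized mask is likely to be faster when the condition allows it.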