Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: How to select a column in rolling window

I have a dataframe (with columns 'a', 'b', 'c') on which I am doing a rolling-window.

I want to be able to filter the rolling window using one of the columns (say 'a') in the apply function like below

df.rolling(len(s),min_periods=0).apply(lambda x: x[[x['a']>10][0] if len(x[[x['a']>10]]) >=0 else np.nan)

The intention of above line is to select the first row in the rolling window whose 'a' column has value greater than 10. If there is no such row, then return nan.

But I am unable to do so and get the following error

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

This means that I am not allowed to access the individual columns at all by this syntax. Is there any other way of doing this kind of thing?

like image 273
ishan3243 Avatar asked Jun 21 '17 10:06

ishan3243


1 Answers

Your error stems from assuming what comes to the function inside apply is a dataframe, it is actually a ndarray not a dataframe.

Pandas dataframe apply works on each column/series of the dataframe, so any function passed to apply is applied along each column/series like an internal lambda. In case of windowed dataframe, apply takes each column/series inside the each window and passes to the function as ndarray and the function has to return only array of length 1 per one series per one window. Knowing this saves a lot of pain.

so in your case you cannot use any apply unless you have a complex function that remembers first value of the series a for each window.

For OP's case if a column of the window say a is meeting a condition, say > 10

  1. For case where a in the first row of a window meets condition it is same as searching in dataframe df[df['a']>10].

  2. For other conditions like a in second row of a window is > 10, checking the entire dataframe works except for the first window of the dataframe.

Following example demonstrates another way to solution.

import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,20,size=(20, 4)), columns=list('abcd'))

df looks like

    a   b   b   d
0   13  2   2   6
1   17  19  10  1
2   0   17  15  9
3   0   14  0   15
4   19  14  4   0
5   16  4   17  3
6   2   7   2   15
7   16  7   9   3
8   6   1   2   1
9   12  8   3   10
10  5   0   11  2
11  10  13  18  4
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
15  1   11  5   18
16  17  12  18  17
17  1   19  12  9
18  16  17  3   3
19  11  7   9   2

now to select a window if second row inside rolling window of a meets a condition a > 10 like in OP's question.

roll_window=5
search_index=1

df_roll = df['a'].rolling(roll_window)
df_y = df_roll.apply(lambda x:x[1] if x[1] > 10 else np.nan).dropna()

above line returns all values of a corresponding to condition a in second row of a window greater then 10. Note the values are right based on example dataframe above but the indexes are defined by how rolling window was centered.

4     17.0
7     19.0
8     16.0
10    16.0
12    12.0
15    15.0
16    13.0
17    14.0
19    17.0

to get the right index location and entire row inside the first dataframe

df.loc[df_y.index+searchindex-rollwindow+1]

returns

    a   b   b   d
1   17  19  10  1
4   19  14  4   0
5   16  4   17  3
7   16  7   9   3
9   12  8   3   10
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
16  17  12  18  17

one could also use np.array(df) and make a rolling slice corresponding to rolling window and filter the array using slices correspondingly.

like image 99
suvy Avatar answered Sep 18 '22 01:09

suvy