I'd like to extract the bins for each 'S' individually, where both columns ('X' and 'Y') are > 0.5 for a single bin, or where the sums over multiple consecutive bins are > 0.5 * the number of bins combined.
In the example:
for 'AR1', only bin 4 should be selected, because 'X' and 'Y' are > 0.5 (indicated in blue);
for 'PO1', bins 1, 2, 3 and 4 should be selected, because 'X' and 'Y' are > (4 * 0.5) (indicated in yellow).
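(Put differently: for 'PO1' the combined check is a sum test; with 4 bins, the 'X' values and the 'Y' values must each sum to more than 4 * 0.5 = 2.0, which is the same as requiring their means to exceed 0.5.)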
I've tried this earlier with a for loop, but that didn't work correctly (see Selecting multiple (neighboring) rows conditionally):
import numpy as np
import pandas as pd

np.random.seed(0)
N = 20
S = ['AR1', 'PO1']
df = pd.DataFrame(
    {'X': np.random.uniform(-1, 1, N),
     'Y': np.random.uniform(-1, 1, N),
     'S': np.random.choice(S, N),
     })
df['bins_X'] = df.groupby('S')['X'].apply(pd.qcut, q=5, labels=np.arange(5))  # create bins per column 'S'

def func(df):  # group per 'S' and their bins
    df1 = df.groupby(['S', 'bins_X']).sum()
    new_cols = list(zip(df1.columns.get_level_values(0)))
    df1.columns = pd.MultiIndex.from_tuples(new_cols)
    return df1

print(func(df))
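For reference, pd.qcut with q=5 puts each group's 'X' values into 5 equal-frequency bins labelled 0 through 4; a quick illustration on made-up values:

pd.qcut(pd.Series([0.10, 0.35, 0.40, 0.80, 0.95]), q=5, labels=np.arange(5))
# with 5 distinct values and q=5, each value lands in its own bin (labels 0..4)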
EDIT
The desired output looks like the df shown in the question, but with the rows that do not qualify filtered out. What I check for is this: the values in X and Y must be > 0.5 for any row (bin), either separately or combined. Rows may only be combined consecutively, in groups of 2, 3, 4 or 5.
I.e., the combinations of rows starting at 0 are 0+1, 0+1+2, 0+1+2+3 and 0+1+2+3+4; for 1 they are 1+2, 1+2+3 and 1+2+3+4, etc.
For combined rows the threshold scales with the number of rows (number of rows * 0.5): for rows 0 to 4 combined, for example, both X and Y would have to sum to > 2.5.
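For clarity, this is the rule I have in mind in brute-force form (hypothetical helper brute_force, applied per group; not the vectorized solution I'm after):

def brute_force(g, threshold=0.5):
    # g: one group's binned sums with columns 'X' and 'Y'
    best = None
    n = len(g)
    for i in range(n):
        for j in range(i + 1, n + 1):                        # consecutive bins i..j-1
            window = g.iloc[i:j]
            if (window[['X', 'Y']].sum() > threshold * (j - i)).all():
                if best is None or len(window) > len(best):  # keep the longest qualifying run
                    best = window
    return best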
EDIT2: @JohnE and @piRSquared, both your solutions work; which one, however, would work better when there are other columns in the dataframe that should not be evaluated?
In addition, what if I wanted to add an extra condition to your solutions?
EDIT3: @piRSquared, when subsetting some columns I only get those returned, whereas I need all of them, not only the subsetted ones.
Could you assist? Thanks.
This is a vectorized approach with only one loop at the top level (groupby.apply).
def find_window(df):
    v = df.values
    s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])  # prepend zeros, then cumulative sums

    threshold = .5

    r, c = np.triu_indices(s.shape[0], 1)   # all (start, end) pairs with start < end
    d = (c - r)[:, None]                    # window lengths
    e = s[c] - s[r]                         # window sums
    mask = (e / d > threshold).all(1)       # windows whose means exceed the threshold in both columns
    rng = np.arange(mask.shape[0])

    if mask.any():
        idx = rng[mask][d[mask].argmax()]   # among qualifying windows, take the longest
        i0, i1 = r[idx], c[idx]
        return pd.DataFrame(
            v[i0:i1],
            df.loc[df.name].index[i0:i1],
            df.columns
        )

# columns that I care about
cols = ['X', 'Y']
df1.groupby(level=0)[cols].apply(find_window)
strategy

- numpy.triu_indices: I need to evaluate each possible window for a rolling mean greater than some threshold. I'm going to capture each possible window by starting from position 0 to 0, then 0 to 1, then ... then 1 to 1, 1 to 2 ... so on and so forth. But I must always start at a position before I finish. I can access these start/end combinations with numpy.triu_indices.
- cumsum: It would be a little tricky (doable) to get the expanded arrays specified by each combination of indices that I get from np.triu_indices. A better way is to calculate the cumsum and take the difference from one index to the next. A row of zeros is prepended to the cumsum so that I can take the difference for the first row as well.
- With the window means e / d, I check which are > threshold and identify which combinations of start and end positions have means greater than the threshold for both columns.
- Do this for each group via groupby and apply ... QED
time test

with more data
np.random.seed(0)
N = 300
S = ['AR1', 'PO1', 'AR2', 'PO2', 'AR3', 'PO3']
df = pd.DataFrame(
    {'X': np.random.uniform(-1, 1, N),
     'Y': np.random.uniform(-1, 1, N),
     'S': np.random.choice(S, N),
     })
df['bins_X'] = df.groupby('S')['X'].apply(pd.qcut, q=20, labels=np.arange(20))  # create bins per column 'S'

def func(df):  # group per 'S' and their bins
    df1 = df.groupby(['S', 'bins_X']).sum()
    new_cols = list(zip(df1.columns.get_level_values(0)))
    df1.columns = pd.MultiIndex.from_tuples(new_cols)
    return df1

df1 = func(df)
With this larger dataset, the time difference is even more dramatic.
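A minimal sketch of how the timing could be run (the loop-based alternative from the question would be timed the same way for comparison):

from timeit import timeit

cols = ['X', 'Y']
# time the vectorized solution on the larger frame, 10 repetitions
print(timeit(lambda: df1.groupby(level=0)[cols].apply(find_window), number=10))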