 

Selecting rows in a MultiIndexed dataframe

I'd like to extract the bins for 'S' individually, where either both columns (X and Y) are > 0.5 for a single bin, or their sums over multiple consecutive bins are > 0.5 × the number of bins combined.

In the example:

for 'AR1', only bin 4 should be selected, because X and Y are both > 0.5 (indicated in blue);

for 'PO1', bins 1, 2, 3 and 4 should be selected, because X and Y summed are > (4 × 0.5) (indicated in yellow).

I tried this earlier with a for loop, but that didn't work correctly: Selecting multiple (neighboring) rows conditionally.

import numpy as np
import pandas as pd

np.random.seed(0)

N = 20
S = ['AR1', 'PO1']

df = pd.DataFrame(
    {'X':np.random.uniform(-1,1,N),
     'Y':np.random.uniform(-1,1,N),
     'S':np.random.choice(S,N),
    })

df['bins_X'] = df.groupby('S')['X'].apply(pd.qcut, q=5, labels=np.arange(5))    # create bins per column 'S'

def func(df):                                                                   # create function to group per 'S' and their bins
    df1 = df.groupby(['S','bins_X']).sum()
    new_cols= list(zip(df1.columns.get_level_values(0)))
    df1.columns = pd.MultiIndex.from_tuples(new_cols)
    return df1

print(func(df))

(image of the grouped sums per 'S' and bin not preserved)

EDIT

What it should look like is the df as shown in the question, but with the rows that do not qualify filtered out. What I check for is this: the values in X and Y are > 0.5 for any single row (bin), or for rows combined. Rows are only combined consecutively: 2, 3, 4 or 5 rows together.

I.e., the combinations of rows starting at 0 are: 0+1, 0+1+2, 0+1+2+3 and 0+1+2+3+4. For 1: 1+2, 1+2+3 and 1+2+3+4, etc.

For multiple rows the threshold is the number of rows × 0.5; X and Y would each have to sum to > 2.5 for rows 0 to 4, for example.
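To make the criterion concrete, here is a minimal brute-force sketch (the values in `v` are made up, standing in for the per-bin sums of one group): every run of consecutive bins is tested for a mean above 0.5 in both columns, which is the same as the sum exceeding 0.5 × the number of rows.

```python
import numpy as np

# toy per-bin sums for one group; values are made up for illustration
v = np.array([[0.1, 0.1],
              [0.3, 0.4],
              [0.6, 0.7],
              [0.9, 0.9],
              [0.7, 0.6]])

threshold = 0.5
n = len(v)
windows = []
for i in range(n):                     # window start (inclusive)
    for j in range(i + 1, n + 1):      # window end (exclusive)
        # sum > 0.5 * rows  <=>  mean > 0.5, checked in both columns
        if (v[i:j].mean(axis=0) > threshold).all():
            windows.append((i, j))

widest = max(windows, key=lambda w: w[1] - w[0])
print(widest)   # -> (0, 5): all five bins together qualify here
```

This is O(n²) windows with an O(n) mean each; the answer below gets the same result without the Python loops.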

EDIT2: @JohnE and piRSquared, both your solutions work. Which one, however, would work better when there are other columns in the dataframe that should not be evaluated?

In addition, what if I wanted to add an additional condition to your solutions?

EDIT3: @piRSquared, when subsetting some columns I only get those columns returned, whereas I would need all of them, not only the subsetted ones.

Could you assist? Thanks.
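One way to handle EDIT3 (a sketch of an assumed approach, not taken from the answer; the frame and column names are illustrative) is to evaluate the condition on the subset of columns only, then use the resulting row labels to slice the full frame:

```python
import pandas as pd

# toy frame: X and Y are evaluated, Z should be carried along untouched
df1 = pd.DataFrame({
    'X': [0.2, 0.6, 0.9, 0.7],
    'Y': [0.1, 0.7, 0.8, 0.6],
    'Z': ['a', 'b', 'c', 'd'],
})

cols = ['X', 'Y']

# evaluate the condition on the subset only...
window = df1[cols][(df1[cols] > 0.5).all(axis=1)]

# ...then use the resulting row labels to slice the full frame,
# so the unevaluated columns come back too
full_rows = df1.loc[window.index]
print(full_rows)
```

The same trick applies to the answer's result: take the index of the returned sub-frame and pass it to `.loc` on the original dataframe.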

Asked by Zanshin on Jan 24 '17.




1 Answer

This is a vectorized approach with only one loop at the top level (groupby.apply)

# columns that I care about
cols = ['X', 'Y']
df1.groupby(level=0)[cols].apply(find_window)



def find_window(df):
    v = df.values
    # prepend a zero row so that s[c] - s[r] is the sum of rows r..c-1
    s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])

    threshold = .5

    # every (start, end) position pair with start < end
    r, c = np.triu_indices(s.shape[0], 1)
    d = (c - r)[:, None]                 # window lengths
    e = s[c] - s[r]                      # window sums
    mask = (e / d > threshold).all(1)    # window mean > threshold in every column
    rng = np.arange(mask.shape[0])

    if mask.any():
        # among qualifying windows, pick the one spanning the most rows
        idx = rng[mask][d[mask].argmax()]

        i0, i1 = r[idx], c[idx]
        return pd.DataFrame(
            v[i0:i1],
            df.loc[df.name].index[i0:i1],  # drop the group level from the index
            df.columns
        )

Explanation

strategy

  • numpy.triu_indices: I need to evaluate each possible window for the rolling mean greater than some threshold. I'm going to capture each possible window by starting from position 0 to 0, then 0 to 1 then ... then 1 to 1, 1 to 2 ... so on and so forth. But I must always start at a position before I finish. I can access these combinations with numpy.triu_indices.
  • cumsum: It would be a little tricky (doable) to get the expanded arrays specified by each combination of indices that I get from np.triu_indices. A better way is to calculate the cumsum and take the difference from one index to the next.
  • I have to prepend zeros to my cumsum so that I can take the difference for the first row.
  • But sums are not means. I need to divide by the number of rows to get the means. Conveniently, the difference between the end and start positions is exactly the number of rows and thus the appropriate number to divide the sums by in order to calculate the means.
  • Now that I have the means, e / d, I check which are > threshold and identify which combinations of start and end positions have means greater than the threshold for both columns.
  • I then identify the combination with the greatest number of rows among those that have means greater than the threshold.
  • I unwind the positions and reconstruct a dataframe
  • groupby and apply... QED
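The cumsum/triu_indices steps above can be sketched on a tiny made-up array and cross-checked against explicit slicing:

```python
import numpy as np

# tiny input: three "bins", two columns (values made up for illustration)
v = np.array([[0.1, 0.1],
              [0.6, 0.7],
              [0.9, 0.8]])

# prepend a zero row so s[c] - s[r] is the sum of rows r..c-1
s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])

r, c = np.triu_indices(s.shape[0], 1)   # every (start, end) pair, start < end
d = (c - r)[:, None]                    # window lengths
e = s[c] - s[r]                         # window sums
means = e / d                           # window means

# cross-check each vectorized window mean against a direct slice
for i, j, m in zip(r, c, means):
    assert np.allclose(m, v[i:j].mean(axis=0))

# widest window whose means beat the threshold in both columns
mask = (means > 0.5).all(1)
idx = np.arange(mask.size)[mask][d[mask].ravel().argmax()]
print(r[idx], c[idx])   # -> 0 3
```

Here windows (0,3), (1,2), (1,3) and (2,3) all qualify; the widest one covers all three rows, exactly as the answer's `find_window` would select.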

time test

(timing plot from the original answer not preserved)


with more data

np.random.seed(0)

N = 300
S = ['AR1', 'PO1', 'AR2', 'PO2', 'AR3', 'PO3']

df = pd.DataFrame(
    {'X':np.random.uniform(-1,1,N),
     'Y':np.random.uniform(-1,1,N),
     'S':np.random.choice(S,N),
    })

df['bins_X'] = df.groupby('S')['X'].apply(pd.qcut, q=20, labels=np.arange(20))    # create bins per column 'S'

def func(df):                                                                   # create function to group per 'S' and their bins
    df1 = df.groupby(['S','bins_X']).sum()
    new_cols= list(zip(df1.columns.get_level_values(0)))
    df1.columns = pd.MultiIndex.from_tuples(new_cols)
    return df1

df1 = func(df)

The time difference is even more dramatic

(timing plot not preserved)
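Since the timing plots did not survive, here is a hedged sketch of how such a comparison could be reproduced with timeit. The function names are illustrative, absolute numbers will vary by machine, and only the core window-mean computation is timed: vectorized cumsum/triu_indices versus a naive Python loop.

```python
import timeit
import numpy as np

v = np.random.rand(60, 2)   # 60 rows -> 60 * 61 / 2 = 1830 candidate windows

def window_means_vectorized(v):
    # cumsum + triu_indices, as in the answer's find_window
    s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])
    r, c = np.triu_indices(s.shape[0], 1)
    return (s[c] - s[r]) / (c - r)[:, None]

def window_means_looped(v):
    # naive pairwise loop over every consecutive window
    n = len(v)
    return np.array([v[i:j].mean(0)
                     for i in range(n)
                     for j in range(i + 1, n + 1)])

# both approaches agree (up to float rounding)
assert np.allclose(window_means_vectorized(v), window_means_looped(v))

print(timeit.timeit(lambda: window_means_vectorized(v), number=50))
print(timeit.timeit(lambda: window_means_looped(v), number=50))
```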

Answered by piRSquared on Sep 22 '22.