 

Select particular rows from inside groups in pandas dataframe

Suppose I have a dataframe that looks like this:

   group  level
0      1     10
1      1     10
2      1     11
3      2      5
4      2      5
5      3      9
6      3      9
7      3      9
8      3      8

The desired output is this:

   group  level
0      1     10
5      3      9

The logic is as follows: look inside each group; if there is more than one distinct value in the level column, return the first row of that group. For example, no row from group 2 is selected, because the only value present in its level column is 5.

In addition, how does the situation change if I want the last, rather than the first, row of such groups?

I have attempted to combine groupby statements with building sets from the entries in the level column, but failed to produce anything even remotely sensible.
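For reference, the example dataframe above can be rebuilt with the following minimal sketch (values taken directly from the table shown), so the answers below can be run as-is:

import pandas as pd

# Reconstruct the example input from the question
df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'level': [10, 10, 11, 5, 5, 9, 9, 9, 8],
})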

asked Nov 20 '25 by Baron Yugovich


2 Answers

This can be done with groupby, using apply to run a simple function on each group:

def get_first_val(group):
    # Does this group's 'level' column contain more than one distinct value?
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        # Return the level value of the first row in the group
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        # Groups with a single distinct value become NaN and are dropped below
        return None

df.groupby('group').apply(get_first_val).dropna()
Out[8]: 
group
1    10
3     9
dtype: float64

There's also a last_valid_index() method, so you wouldn't have to make any huge changes to get the last row instead.
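For example, a minimal variant of the function above that returns the last value of each qualifying group instead (same logic, only the index lookup changes):

def get_last_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        # last_valid_index() points at the last row of the group instead of the first
        return group['level'].loc[group['level'].last_valid_index()]
    else:
        return None

df.groupby('group').apply(get_last_val).dropna()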

If you have other columns that you want to keep, you just need a slight tweak:

import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]: 
   group  level  col1  col2
0      1     10    19    21
1      1     10    18    24
2      1     11    14    23
3      2      5    14    26
4      2      5    10    22
5      3      9    13    27
6      3      9    16    20
7      3      9    18    26
8      3      8    11    2

def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        # Return the entire first row of the group, keeping all columns
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None

df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]: 
       group  level  col1  col2
group                          
1          1     10    19    21
3          3      9    13    27
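As a side note (not part of the original answer, just a sketch): if you want whole rows without writing an apply function, one possible alternative combines groupby().filter() with drop_duplicates(), which also makes switching between the first and last row a one-word change:

# Keep only the groups whose 'level' column has more than one distinct value,
# then keep a single row per group; keep='last' would select the last row instead.
multi = df.groupby('group').filter(lambda g: g['level'].nunique() > 1)
result = multi.drop_duplicates(subset='group', keep='first')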
answered Nov 21 '25 by Marius


This would be simpler:

In [121]:

import numpy as np

print(df.groupby('group')
        .agg(lambda x: x.values[0] if (x.values != x.values[0]).any() else np.nan)
        .dropna())

       level
group       
1         10
3          9

For each group, if any of the values are not the same as the first value, aggregate that group into its first value; otherwise, aggregate it to nan.

Finally, dropna() removes the groups that were aggregated to nan.
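To get the last row's value instead (the second part of the question), presumably only the value lookup needs to change; a rough sketch along the same lines:

# Same idea, but take the last value of each qualifying group instead of the first
print(df.groupby('group')['level']
        .agg(lambda x: x.values[-1] if (x.values != x.values[0]).any() else np.nan)
        .dropna())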

answered Nov 21 '25 by CT Zhu


