
Pandas set value in groupby

I have a DataFrame...

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...            'letters' : ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 
...            'is_min' : np.zeros(9),
...            'numbers' : np.random.randn(9)
... })

    is_min  letters numbers
0   0       a       0.322499
1   0       a      -0.196617
2   0       a      -1.194251
3   0       b       1.005323
4   0       b      -0.186364
5   0       b      -1.886273
6   0       c       0.014960
7   0       c      -0.832713
8   0       c       0.689531

I would like to set the 'is_min' column to 1 in each row where 'numbers' is the minimum value within its 'letters' group. I have tried this and feel that I am close...

>>> df.groupby('letters')['numbers'].transform('idxmin')

0    2
1    2
2    2
3    5
4    5
5    5
6    7
7    7
8    7
dtype: int64

I am having a hard time connecting the dots to set the value of 'is_min' to 1 for those rows.

Bruce Pucci asked Jan 27 '16 19:01




2 Answers

Pass the row labels to loc and set the column:

In [34]:
df.loc[df.groupby('letters')['numbers'].transform('idxmin'), 'is_min']=1
df

Out[34]:
   is_min letters   numbers
0       1       a -0.374751
1       0       a  1.663334
2       0       a -0.123599
3       1       b -2.156204
4       0       b  0.201493
5       0       b  1.639512
6       0       c -0.447271
7       0       c  0.017204
8       1       c -1.261621

What's happening here is that transform('idxmin') returns, for every row, the index label of the minimum in that row's group. Passing those labels to loc selects only the minimum rows, and these get set to 1 as desired.

Not sure if it matters much, but you could call unique so that you get just the row labels without repetition, which may be faster:

df.loc[df.groupby('letters')['numbers'].transform('idxmin').unique(), 'is_min']=1
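
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The fixed values in 'numbers' are my own stand-in for np.random.randn so that the result is reproducible, and min_labels is just an illustrative name:

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption, replacing np.random.randn) so the result is reproducible.
df = pd.DataFrame({
    'letters': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'is_min': np.zeros(9),
    'numbers': [0.3, -0.2, -1.2, 1.0, -0.2, -1.9, 0.0, -0.8, 0.7],
})

# Index label of each group's minimum, repeated for every row of that group.
min_labels = df.groupby('letters')['numbers'].transform('idxmin')

# unique() drops the repeats, leaving one label per group.
df.loc[min_labels.unique(), 'is_min'] = 1

print(df)
```

Note that idxmin gives one label per group, so if a group's minimum occurs more than once, only its first occurrence is flagged.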
EdChum answered Sep 21 '22 18:09


I would like to set the 'is_min' col to 1 if 'numbers' is the minimum value by column 'letters'.

A perhaps more intuitive method is to calculate the minimum per group of letters, then call .apply on those minima to assign is_min:

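def set_is_min(m):
    # Compares against the whole frame, so this assumes a group's minimum
    # value does not also occur in another group.
    df.loc[df.numbers == m, 'is_min'] = 1
# Called once per group minimum; the returned Series of None is not needed.
mins = df.groupby('letters').numbers.min().apply(set_is_min)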
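
One caveat: the comparison df.numbers == m looks at the whole frame, so a value that is the minimum of one group would also be flagged in any other group where it happens to appear. With random floats that is unlikely, but if it matters, a group-aware variant (a different technique from the one above, sketched here with made-up data) is to broadcast each group's minimum with transform('min') and compare row by row:

```python
import pandas as pd

# Made-up data: 1.0 is the minimum of group 'a' but not of group 'b',
# and group 'b' has a tied minimum of 0.5.
df = pd.DataFrame({
    'letters': ['a', 'a', 'b', 'b', 'b'],
    'numbers': [1.0, 2.0, 1.0, 0.5, 0.5],
})

# Each row gets the minimum of its own group.
group_min = df.groupby('letters')['numbers'].transform('min')

# 1 where the row equals its own group's minimum, 0 otherwise.
df['is_min'] = (df['numbers'] == group_min).astype(int)

print(df)
```

Unlike idxmin, this flags every tied minimum within a group, which may or may not be what you want.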

On a large DataFrame, this method was roughly 20% faster than using transform in my timings:

# timeit with 100'000 rows
# .apply on group minima
100 loops, best of 3: 16.7 ms per loop
# .transform
10 loops, best of 3: 21.9 ms per loop
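
If you want to check this on your own machine, something like the following harness should do. It is only a sketch: the three-letter grouping, the seed, and the function names via_transform and via_group_minima are my assumptions here, not the exact setup behind the numbers above, and results will vary with the pandas version and the number of groups.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
base = pd.DataFrame({
    'letters': rng.choice(list('abc'), size=n),
    'is_min': np.zeros(n),
    'numbers': rng.standard_normal(n),
})

def via_transform(df):
    # transform('idxmin') + loc, as in the other answer.
    df = df.copy()
    labels = df.groupby('letters')['numbers'].transform('idxmin')
    df.loc[labels.unique(), 'is_min'] = 1
    return df

def via_group_minima(df):
    # One pass over the frame per group minimum.
    df = df.copy()
    for m in df.groupby('letters')['numbers'].min():
        df.loc[df['numbers'] == m, 'is_min'] = 1
    return df

# The copy() is included in both timings, so it affects both equally.
for fn in (via_transform, via_group_minima):
    seconds = timeit.timeit(lambda: fn(base), number=10) / 10
    print(f'{fn.__name__}: {seconds * 1e3:.1f} ms per call')
```

The second method scans the whole frame once per group, so I would expect its advantage to shrink as the number of groups grows.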

I ran some more benchmarks of various methods using apply and transform.

miraculixx answered Sep 20 '22 18:09