Apply function to pandas groupby

I have a pandas dataframe with a column called my_labels which contains the strings 'A', 'B', 'C', 'D', 'E'. I would like to count the number of occurrences of each string and then divide each count by the sum of all the counts. I'm trying to do this in pandas like this:

func = lambda x: x.size() / x.sum()
data = frame.groupby('my_labels').apply(func)

This code throws an error: 'DataFrame' object has no attribute 'size'. How can I apply a function to calculate this in pandas?

asked Mar 13 '13 00:03

turtle

3 Answers

When called on a groupby object, apply passes each group to the function as a DataFrame (and forwards any extra keyword arguments to it). A DataFrame has no .size() method, so the call fails.

Perhaps this would work:

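import pandas as pd

df = pd.DataFrame({"my_label": pd.Series(['A','B','A','C','D','D','E'])})


def as_perc(value, total):
    # Fraction of the total represented by one group's count
    return value / float(total)

def get_count(values):
    # Number of rows in the group
    return len(values)

# Count the rows per label, then divide each count by the total number of rows
grouped_count = df.groupby("my_label").my_label.agg(get_count)
data = grouped_count.apply(as_perc, total=df.my_label.count())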

The .agg() method here takes a function that is applied to the values of each group in the groupby object and returns one result per group.
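The same result can probably be reached without the two helper functions. The following is a minimal sketch, assuming the same example data as above, that uses the built-in size() of the groupby object instead of a custom counting function:

```python
import pandas as pd

df = pd.DataFrame({"my_label": ["A", "B", "A", "C", "D", "D", "E"]})

# size() counts the rows in each group (one entry per label)
counts = df.groupby("my_label").size()

# Dividing by the sum of the counts gives each label's share of all rows
fractions = counts / counts.sum()
print(fractions)
```

Because the division works on the whole Series at once, no per-value apply is needed here.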

answered Oct 23 '22 07:10

monkut

As of pandas version 0.22, there is also an alternative to apply: pipe, which can be considerably faster than apply (see also this question for more differences between the two).
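To illustrate where the difference comes from, here is a small sketch that records what each method hands to the function. The "value" column and the two recording helpers are hypothetical additions for illustration only; apply is called once per group with a DataFrame, while pipe is called once with the groupby object itself.

```python
import pandas as pd

df = pd.DataFrame({
    "my_label": ["A", "B", "A", "C", "D", "D", "E"],
    "value": range(7),  # hypothetical extra column, not part of the question
})

seen_by_apply = set()
seen_by_pipe = set()

def record_apply(obj):
    # Called once per group; obj is that group's rows
    seen_by_apply.add(type(obj).__name__)
    return len(obj)

def record_pipe(obj):
    # Called a single time; obj is the whole groupby object
    seen_by_pipe.add(type(obj).__name__)
    return obj.size()

per_group = df.groupby("my_label")[["value"]].apply(record_apply)
whole = df.groupby("my_label").pipe(record_pipe)

print(seen_by_apply)
print(seen_by_pipe)
```

Since pipe receives the groupby object directly, the function can use the vectorized groupby methods such as size(), which is a plausible reason for the speed difference.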

For your example:

import pandas as pd

df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})

  my_label
0        A
1        B
2        A
3        C
4        D
5        D
6        E

The apply version

df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])

gives

          my_label
my_label          
A         0.285714
B         0.142857
C         0.142857
D         0.285714
E         0.142857

and the pipe version

df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())

yields

my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857

The values are identical, but the timings differ considerably (at least for this small dataframe):

%timeit df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
100 loops, best of 3: 5.52 ms per loop

and

%timeit df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
1000 loops, best of 3: 843 µs per loop

Wrapping it into a function is then also straightforward:

def get_perc(grp_obj):
    gr_size = grp_obj.size()
    return gr_size / gr_size.sum()

Now you can call

df.groupby('my_label').pipe(get_perc)

yielding

my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857

However, for this particular case you do not even need a groupby; you can just use value_counts:

df['my_label'].value_counts(sort=False) / df.shape[0]

yielding

A    0.285714
C    0.142857
B    0.142857
E    0.142857
D    0.285714
Name: my_label, dtype: float64

For this small dataframe it is quite fast:

%timeit df['my_label'].value_counts(sort=False) / df.shape[0]
1000 loops, best of 3: 770 µs per loop

As pointed out by @anmol, the last statement can also be simplified to

df['my_label'].value_counts(sort=False, normalize=True)
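As a quick sanity check, the sketch below compares the groupby result with the value_counts result on the same example data. It assumes nothing beyond the calls already used in this answer; the variable names are only illustrative. Note that value_counts does not necessarily return the labels in alphabetical order, so both results are sorted by label before comparing.

```python
import pandas as pd

df = pd.DataFrame({"my_label": ["A", "B", "A", "C", "D", "D", "E"]})

via_groupby = df.groupby("my_label").size() / len(df)
via_value_counts = df["my_label"].value_counts(normalize=True)

# Align both results by label before comparing them position by position
left = via_groupby.sort_index()
right = via_value_counts.sort_index()

labels_match = left.index.tolist() == right.index.tolist()
max_diff = max(abs(a - b) for a, b in zip(left.tolist(), right.tolist()))
print(labels_match, max_diff)
```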

answered Oct 23 '22 08:10

Cleb

Try:

import pandas as pd

g = pd.DataFrame(['A','B','A','C','D','D','E'])

# Group by the contents of column 0
gg = g.groupby(0)

# Create a DataFrame with the counts of each letter
histo = gg.apply(lambda x: x.count())

# Add a new column that is the count / total number of elements
histo[1] = histo.astype(float)/len(g)

print(histo)

Output:

   0         1
0             
A  2  0.285714
B  1  0.142857
C  1  0.142857
D  2  0.285714
E  1  0.142857
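A similar two-column table can likely be built without apply. This is only a sketch under the assumption that the same single-column frame is used; the column names "count" and "fraction" are illustrative and do not appear in the answer above.

```python
import pandas as pd

g = pd.DataFrame(["A", "B", "A", "C", "D", "D", "E"])

# Number of rows per letter in column 0
counts = g.groupby(0).size()

# "count" and "fraction" are illustrative column names
histo = pd.DataFrame({"count": counts, "fraction": counts / len(g)})
print(histo)
```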

answered Oct 23 '22 07:10

Reservedegotist

Tags:

python

pandas