I have a pandas dataframe with a column called my_labels which contains strings: 'A', 'B', 'C', 'D', 'E'. I would like to count the number of occurrences of each of these strings, then divide each count by the sum of all the counts. I'm trying to do this in Pandas like this:
func = lambda x: x.size() / x.sum()
data = frame.groupby('my_labels').apply(func)
This code throws an error: "'DataFrame' object has no attribute 'size'". How can I apply a function to calculate this in Pandas?
The apply() method lets you apply a function along one of the axes of the DataFrame. The default is axis 0, which is the index (row) axis.
Pandas Apply Function to Single Column: to apply a function to a single column, select that column with df["col_name"] and call apply() on it. For example, a function add_3() that adds 3 to each value can be applied to column B alone.
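The single-column case described above might look like the following minimal sketch. The DataFrame contents and the column names A and B are made up for illustration; only the add_3 name comes from the text.

```python
import pandas as pd

# Hypothetical example data, not taken from the question.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def add_3(x):
    """Return the value increased by 3."""
    return x + 3

# df["B"] is a Series, so apply() calls add_3 once per value in column B.
df["B"] = df["B"].apply(add_3)
print(df)
```

Column A is left untouched because apply() was called on the selected column only.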
The groupby() function splits the data into groups based on some criteria. pandas objects can be split on any of their axes. Abstractly, grouping provides a mapping of labels to group names. The sort parameter controls whether the group keys are sorted.
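To illustrate the splitting and the sort parameter, here is a small sketch. The value column and the data are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data: labels appear in the order B, A, B, C.
df = pd.DataFrame({"my_label": ["B", "A", "B", "C"], "value": [1, 2, 3, 4]})

# sort=True (the default) orders the group keys in the result.
sorted_sums = df.groupby("my_label", sort=True)["value"].sum()

# sort=False keeps the keys in order of first appearance.
unsorted_sums = df.groupby("my_label", sort=False)["value"].sum()

print(sorted_sums)
print(unsorted_sums)
```

Both results hold the same per-group sums; only the order of the index differs.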
apply takes a function to apply to each value, not the series, and accepts kwargs. So, the values do not have the .size() method.
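A quick way to see this for Series.apply is the sketch below. The counts Series and the share function are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical per-label counts.
counts = pd.Series([2, 1, 1], index=["A", "B", "C"])

received_series = []

def share(value, total):
    # Record whether the function was handed a whole Series or a single value.
    received_series.append(isinstance(value, pd.Series))
    return value / total

# Extra keyword arguments (here: total) are forwarded to the function.
shares = counts.apply(share, total=counts.sum())
print(shares)
```

Each call receives one scalar, which is why Series methods are not available inside the function.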
Perhaps this would work:
from pandas import *

d = {"my_label": Series(['A','B','A','C','D','D','E'])}
df = DataFrame(d)

def as_perc(value, total):
    return value/float(total)

def get_count(values):
    return len(values)

grouped_count = df.groupby("my_label").my_label.agg(get_count)
data = grouped_count.apply(as_perc, total=df.my_label.count())
The .agg() method here takes a function that is applied to all values of the groupby object.
As of Pandas version 0.22, there is also an alternative to apply: pipe, which can be considerably faster than using apply (you can also check this question for more differences between the two functionalities).
For your example:
df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})
my_label
0 A
1 B
2 A
3 C
4 D
5 D
6 E
The apply version
df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
gives
my_label
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
and the pipe version
df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
yields
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
So the values are identical; however, the timings differ quite a lot (at least for this small dataframe):
%timeit df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
100 loops, best of 3: 5.52 ms per loop
and
%timeit df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
1000 loops, best of 3: 843 µs per loop
Wrapping it into a function is then also straightforward:
def get_perc(grp_obj):
    gr_size = grp_obj.size()
    return gr_size / gr_size.sum()
Now you can call
df.groupby('my_label').pipe(get_perc)
yielding
my_label
A 0.285714
B 0.142857
C 0.142857
D 0.285714
E 0.142857
However, for this particular case, you do not even need a groupby; you can just use value_counts like this:
df['my_label'].value_counts(sort=False) / df.shape[0]
yielding
A 0.285714
C 0.142857
B 0.142857
E 0.142857
D 0.285714
Name: my_label, dtype: float64
For this small dataframe it is quite fast:
%timeit df['my_label'].value_counts(sort=False) / df.shape[0]
1000 loops, best of 3: 770 µs per loop
As pointed out by @anmol, the last statement can also be simplified to
df['my_label'].value_counts(sort=False, normalize=True)
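Put together as a self-contained sketch, using the same example data as above. The sort_index() call is an addition here, only to make the output order predictable.

```python
import pandas as pd

df = pd.DataFrame({"my_label": ["A", "B", "A", "C", "D", "D", "E"]})

# normalize=True returns relative frequencies instead of raw counts.
fractions = df["my_label"].value_counts(normalize=True).sort_index()
print(fractions)
```

The resulting fractions sum to 1, with A and D each accounting for 2 of the 7 rows.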
Try:
import pandas as pd
g = pd.DataFrame(['A','B','A','C','D','D','E'])
# Group by the contents of column 0
gg = g.groupby(0)
# Create a DataFrame with the counts of each letter
histo = gg.apply(lambda x: x.count())
# Add a new column that is the count / total number of elements
# (the np.float alias has been removed from NumPy; the built-in float works instead)
histo[1] = histo.astype(float)/len(g)
print(histo)
Output:
0 1
0
A 2 0.285714
B 1 0.142857
C 1 0.142857
D 2 0.285714
E 1 0.142857