
Pandas: .groupby().size() and percentages

I have a DataFrame that originates from a df.groupby().size() operation, and looks like this:

Localization                           RNA level      
cytoplasm                              1 Non-expressed     7
                                       2 Very low         13
                                       3 Low               8
                                       4 Medium            6
                                       5 Moderate          8
                                       6 High              2
                                       7 Very high         6
cytoplasm & nucleus                    1 Non-expressed     5
                                       2 Very low          8
                                       3 Low               2
                                       4 Medium           10
                                       5 Moderate         16
                                       6 High              6
                                       7 Very high         5
cytoplasm & nucleus & plasma membrane  1 Non-expressed     6
                                       2 Very low          3
                                       3 Low               3
                                       4 Medium            7
                                       5 Moderate          8
                                       6 High              4
                                       7 Very high         1

What I want to do is to calculate the separate occurrences (i.e. the last column coming from .size()) as a percentage of the total number of occurrences in the applicable Localization.

For example: there are a total of 50 occurrences in the cytoplasm localisation (7 + 13 + 8 + 6 + 8 + 2 + 6), yielding 14 and 26 % for the Non-expressed and Very low RNA-levels, respectively.
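The arithmetic in that example can be checked with a few lines of plain Python. This is only a sketch of the desired result for the cytoplasm group; the variable names are illustrative, and the counts are copied from the table above.

```python
# Counts for the 'cytoplasm' Localization, in RNA-level order (copied from the table)
cytoplasm_counts = [7, 13, 8, 6, 8, 2, 6]

# Total number of occurrences in this Localization
total = sum(cytoplasm_counts)

# Each count expressed as a percentage of the group total
percentages = [100 * c / total for c in cytoplasm_counts]

print(total)            # 50
print(percentages[:2])  # [14.0, 26.0]
```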

Is there a nice way of doing this? So far I've taken what I think is a very roundabout approach: making a new DataFrame for every Localization and working from there. That takes a lot of lines, and all the resulting DataFrames have to be merged at the end. I'm hoping there's a smarter way of doing it!

asked May 13 '14 09:05 by erikfas




1 Answer

Here is a complete example using pandas groupby and sum. The basic idea is to group the data by 'Localization' and apply a function to each group that divides every count by the group total.

import pandas as pd
from io import StringIO
#For Python 2, replace previous line with: from StringIO import StringIO

data = \
"""Localization,RNA level,Size
cytoplasm                            ,1 Non-expressed, 7
cytoplasm                            ,2 Very low     ,13
cytoplasm                            ,3 Low          , 8
cytoplasm                            ,4 Medium       , 6
cytoplasm                            ,5 Moderate     , 8
cytoplasm                            ,6 High         , 2
cytoplasm                            ,7 Very high    , 6
cytoplasm & nucleus                  ,1 Non-expressed, 5
cytoplasm & nucleus                  ,2 Very low     , 8
cytoplasm & nucleus                  ,3 Low          , 2
cytoplasm & nucleus                  ,4 Medium       ,10
cytoplasm & nucleus                  ,5 Moderate     ,16
cytoplasm & nucleus                  ,6 High         , 6
cytoplasm & nucleus                  ,7 Very high    , 5
cytoplasm & nucleus & plasma membrane,1 Non-expressed, 6
cytoplasm & nucleus & plasma membrane,2 Very low     , 3
cytoplasm & nucleus & plasma membrane,3 Low          , 3
cytoplasm & nucleus & plasma membrane,4 Medium       , 7
cytoplasm & nucleus & plasma membrane,5 Moderate     , 8
cytoplasm & nucleus & plasma membrane,6 High         , 4
cytoplasm & nucleus & plasma membrane,7 Very high    , 1"""

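# Create the DataFrame
df = pd.read_csv(StringIO(data))
# str.strip() and astype() return new Series, so assign the results back
df['Localization'] = df['Localization'].str.strip()
df['RNA level'] = df['RNA level'].str.strip()
df['Size'] = df['Size'].astype(int)
# Each count as a percentage (0-100) of the total for its Localization
df['Percent'] = df.groupby('Localization')['Size'].transform(lambda x: 100 * x / x.sum())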
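If the counts are still in the form shown in the question, i.e. the Series with a two-level index that .groupby().size() returns, the same idea can probably be applied without rebuilding a flat DataFrame first. The following is a minimal sketch under that assumption. The Series is reconstructed by hand from the first two Localization groups in the question, and the names counts, totals and percent are illustrative only.

```python
import pandas as pd

# Hand-built stand-in for the Series returned by groupby().size(),
# using only the first two Localization groups from the question.
index = pd.MultiIndex.from_product(
    [
        ["cytoplasm", "cytoplasm & nucleus"],
        ["1 Non-expressed", "2 Very low", "3 Low", "4 Medium",
         "5 Moderate", "6 High", "7 Very high"],
    ],
    names=["Localization", "RNA level"],
)
counts = pd.Series([7, 13, 8, 6, 8, 2, 6, 5, 8, 2, 10, 16, 6, 5], index=index)

# Group on the first index level and broadcast each group's total
# back to every row belonging to that group.
totals = counts.groupby(level="Localization").transform("sum")

# Each count as a percentage of its Localization total
percent = 100 * counts / totals

print(percent.loc[("cytoplasm", "1 Non-expressed")])  # 14.0
print(percent.loc[("cytoplasm", "2 Very low")])       # 26.0
```

Grouping with level="Localization" keeps the original index, so percent lines up row for row with counts and the two can be combined with pd.concat if both columns are wanted side by side.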

answered Oct 06 '22 14:10 by Guillaume Jacquenot