My question is about groupby operation with pandas. I have the following DataFrame :
In [4]: df = pd.DataFrame({"A": range(4), "B": ["PO", "PO", "PA", "PA"], "C": ["Est", "Est", "West", "West"]}) In [5]: df Out[5]: A B C 0 0 PO Est 1 1 PO Est 2 2 PA West 3 3 PA West
This is what I would like to do : I want to group by column B and do a sum on column A. But at the end, I would like column C to still be in the DataFrame. If I do :
In [8]: df.groupby(by="B").aggregate(pd.np.sum) Out[8]: A B PA 5 PO 1
It does the job but column C is missing. I can also do this :
In [9]: df.groupby(by=["B", "C"]).aggregate(pd.np.sum) Out[9]: A B C PA West 5 PO Est 1
or
In [11]: df.groupby(by=["B", "C"], as_index=False).aggregate(pd.np.sum) Out[11]: B C A 0 PA West 5 1 PO Est 1
But in both cases it group by B AND C and not just B and keeps the C value. Is what I want to do irrelevant or is there a way to do it ?
groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
To select all columns except one column in Pandas DataFrame, we can use df. loc[:, df. columns != <column name>].
The “group by” process: split-apply-combine (1) Splitting the data into groups. (2). Applying a function to each group independently, (3) Combining the results into a data structure.
try to use DataFrameGroupBy.agg() method with dict of {column -> function}
:
In [6]: df.groupby('B').agg({'A':'sum', 'C':'first'}) Out[6]: C A B PA West 5 PO Est 1
From docs:
Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names.
or something like this depending on your goals:
In [8]: df = pd.DataFrame({"A": range(4), "B": ["PO", "PO", "PA", "PA"], "C": ["Est1", "Est2", "West1", "West2"]}) In [9]: df.groupby('B').agg({'A':'sum', 'C':'first'}) Out[9]: C A B PA West1 5 PO Est1 1 In [10]: df['sum_A'] = df.groupby('B')['A'].transform('sum') In [11]: df Out[11]: A B C sum_A 0 0 PO Est1 1 1 1 PO Est2 1 2 2 PA West1 5 3 3 PA West2 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With