Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas - possible to aggregate two columns using two different aggregations?

I'm loading a csv file, which has the following columns: date, textA, textB, numberA, numberB

I want to group by the columns: date, textA and textB - but want to apply "sum" to numberA, but "min" to numberB.

data = pd.read_table("file.csv", sep=",", thousands=',') grouped = data.groupby(["date", "textA", "textB"], as_index=False) 

...but I cannot see how to then apply two different aggregate functions, to two different columns? I.e. sum(numberA), min(numberB)

like image 396
marcus adamski Avatar asked Sep 16 '13 21:09

marcus adamski


People also ask

Can you Groupby two columns pandas?

Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.

Can Groupby be used with multiple columns?

Yes, it is possible to use MySQL GROUP BY clause with multiple columns just as we can use MySQL DISTINCT clause.

Do multiple aggregations Groupby pandas?

To apply more than one aggregation when using pandas GroupBy, you simply pass in a dictionary to the . agg function. In your dictionary, your key will be the column name and the value will be a list of operations you want to perform on the column. The result will be a DataFrame with a MultiIndex column.


1 Answers

The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:

grouped.agg({'numberA':'sum', 'numberB':'min'}) 

For example,

import numpy as np import pandas as pd df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',                          'foo', 'bar', 'foo', 'foo'],                    'B': ['one', 'one', 'two', 'three',                          'two', 'two', 'one', 'three'],                    'number A': np.arange(8),                    'number B': np.arange(8) * 2}) grouped = df.groupby('A')  print(grouped.agg({     'number A': 'sum',     'number B': 'min'})) 

yields

     number B  number A A                       bar         2         9 foo         0        19 

This also shows that Pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,

print(df.columns) 

without reassigning the column names, will show show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.

like image 90
unutbu Avatar answered Oct 04 '22 10:10

unutbu