
Transform vs. aggregate in Pandas

When grouping a Pandas DataFrame, when should I use transform and when should I use aggregate? How do they differ with respect to their application in practice and which one do you consider more important?

asked Dec 04 '16 11:12 by Sylvi0202




1 Answer

Consider the DataFrame df:

import pandas as pd

df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))



Aggregation is the standard use of groupby: each group is reduced to a single row.

df.groupby('A').mean() 
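As a minimal sketch of what the aggregation returns (the variable name means is only illustrative), the result has one row per group and is indexed by the grouping key rather than by the original row labels:

```python
import pandas as pd

df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))

# Aggregating collapses each group to one row; the index becomes the group key 'A'.
means = df.groupby('A').mean()

print(means.shape)        # two groups, two value columns
print(list(means.index))  # the group labels
print(means)
```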



If you instead want these values broadcast across each group, so that the result has the same index as the original DataFrame, use transform:

df.groupby('A').transform('mean') 


df.set_index('A').groupby(level='A').transform('mean') 
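Because the transformed result lines up with the original rows, it can be assigned straight back as a new column. The following is a small sketch of that pattern; the column names B_group_mean and B_centered are made up for illustration and are not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))

# transform returns one row per original row, with each group's mean repeated.
group_means = df.groupby('A').transform('mean')

# The index matches df, so the values align row by row when assigned back.
result = df.assign(B_group_mean=group_means['B'])

# A common use: center each value on its own group's mean.
result['B_centered'] = result['B'] - result['B_group_mean']
print(result)
```

This is the kind of task where transform is more convenient than an aggregation followed by a merge back onto the original rows.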



Use agg when you want to run different functions on different columns, or more than one function on the same column.

df.groupby('A').agg(['mean', 'std']) 


df.groupby('A').agg(dict(B='sum', C=['mean', 'prod'])) 
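The list and dict forms above produce two-level column labels. As an alternative not shown in the original answer, and assuming pandas 0.25 or later, named aggregation gives flat column names instead. The output names B_sum, C_mean and C_prod are chosen here purely for illustration:

```python
import pandas as pd

df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))

# Named aggregation: each keyword is an output column,
# each value is a (source column, function) pair.
summary = df.groupby('A').agg(
    B_sum=('B', 'sum'),
    C_mean=('C', 'mean'),
    C_prod=('C', 'prod'),
)
print(summary)
```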


answered Oct 03 '22 20:10 by piRSquared