
How to sum a variable by group

I have a data frame with two columns. The first column, Category, contains categories such as "First", "Second", and "Third". The second column, Frequency, holds the number of times I saw each category.

For example:

    Category     Frequency
    First        10
    First        15
    First        5
    Second       2
    Third        14
    Third        20
    Second       3

I want to group the data by Category and sum the Frequencies within each group:

    Category     Frequency
    First        30
    Second       5
    Third        34

How would I do this in R?

asked Nov 02 '09 at 09:11 by user5243421



2 Answers

Using aggregate:

    aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
      Category  x
    1    First 30
    2   Second  5
    3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ... 
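Since the call above is truncated, here is a minimal runnable sketch of the same idea. It assumes a hypothetical extra column Metric2 that is not part of the question's data; the values are made up for illustration.

```r
# Question's data plus one hypothetical column (Metric2 is an assumption)
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# cbind() builds a matrix of the columns to aggregate; naming its arguments
# gives the result readable column names instead of V1, V2
res <- aggregate(cbind(Frequency = x$Frequency, Metric2 = x$Metric2),
                 by = list(Category = x$Category), FUN = sum)
print(res)
```

The result has one row per Category, with a summed Frequency column and a summed Metric2 column.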

As @thelatemail pointed out in a comment, aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum) 

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum) 
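Between a single column and the . notation, the formula interface can also aggregate a chosen subset of columns by putting cbind on the left-hand side. A small sketch, assuming hypothetical columns Metric2 and Metric3 that are not in the question's data:

```r
# Question's data plus two hypothetical columns (Metric2, Metric3 are assumptions)
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7),
  Metric3   = c(7, 6, 5, 4, 3, 2, 1)
)

# Sum only Frequency and Metric2 per Category; Metric3 is left out of the result
res <- aggregate(cbind(Frequency, Metric2) ~ Category, data = x, FUN = sum)
print(res)
```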

or tapply:

    tapply(x$Frequency, x$Category, FUN=sum)
     First Second  Third
        30      5     34

Using this data:

    x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")),
                    Frequency=c(10,15,5,2,14,20,3))
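Note that tapply returns a named vector rather than a data frame. If the two-column layout from the question is needed, one possible way to convert it is sketched below; the variable names sums and res are illustrative only.

```r
# Same data as in the question
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# tapply gives one sum per factor level, named by level
sums <- tapply(x$Frequency, x$Category, FUN = sum)

# Rebuild a data frame: names become the Category column,
# as.vector() drops the names and dimension from the sums
res <- data.frame(Category = names(sums), Frequency = as.vector(sums))
print(res)
```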
answered Oct 02 '22 at 03:10 by rcs


You can also use the dplyr package for that purpose:

    library(dplyr)
    x %>%
      group_by(Category) %>%
      summarise(Frequency = sum(Frequency))

    #Source: local data frame [3 x 2]
    #
    #  Category Frequency
    #1    First        30
    #2   Second         5
    #3    Third        34

Or, for multiple summary columns (works with one column too):

    x %>%
      group_by(Category) %>%
      summarise(across(everything(), sum))

Here are some more examples of how to summarise data by group with dplyr functions, using the built-in dataset mtcars:

    # several summary columns with arbitrary names
    mtcars %>%
      group_by(cyl, gear) %>%                            # multiple group columns
      summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

    # summarise all columns except grouping columns using "sum"
    mtcars %>%
      group_by(cyl) %>%
      summarise(across(everything(), sum))

    # summarise all columns except grouping columns using "sum" and "mean"
    mtcars %>%
      group_by(cyl) %>%
      summarise(across(everything(), list(mean = mean, sum = sum)))

    # multiple grouping columns
    mtcars %>%
      group_by(cyl, gear) %>%
      summarise(across(everything(), list(mean = mean, sum = sum)))

    # summarise specific variables, not all
    mtcars %>%
      group_by(cyl, gear) %>%
      summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))

    # summarise specific variables (numeric columns except grouping columns)
    mtcars %>%
      group_by(gear) %>%
      summarise(across(where(is.numeric), list(mean = mean, sum = sum)))

For more information, including the %>% operator, see the introduction to dplyr.

answered Oct 02 '22 at 03:10 by talat