I need to get the mean of all columns of a large data set using R, grouped by 2 variables.
Let's try it with mtcars:
library(dplyr)
g_mtcars <- group_by(mtcars, cyl, gear)
summarise(g_mtcars, mean(hp))
# Source: local data frame [8 x 3]
# Groups: cyl [?]
#
# cyl gear `mean(hp)`
# <dbl> <dbl> <dbl>
# 1 4 3 97.0000
# 2 4 4 76.0000
# 3 4 5 102.0000
# 4 6 3 107.5000
# 5 6 4 116.5000
# 6 6 5 175.0000
# 7 8 3 194.1667
# 8 8 5 299.5000
It works for "hp", but I need the mean of every other column of mtcars (except "cyl" and "gear", which define the groups).
The data set is large, with several columns. Typing each one by hand, like this: summarise(g_mtcars, mean(hp), mean(drat), mean(wt),...)
is not practical.
Mean by group (A, B, C): mean = sum / number of terms.
A: 20/3 = 6.67
B: 14/3 = 4.67
C: 8/3 = 2.67
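The same arithmetic can be reproduced in base R with tapply(). This is only a sketch: the individual values below are hypothetical, chosen so that each group's sum matches the totals quoted above (20, 14 and 8), since the original does not list them.

```r
# Hypothetical values: only the group sums (20, 14, 8) come from the text.
scores <- c(5, 7, 8,   # group A, sum 20
            3, 5, 6,   # group B, sum 14
            2, 3, 3)   # group C, sum 8
group <- rep(c("A", "B", "C"), each = 3)

# tapply() splits `scores` by `group` and applies mean() to each piece.
group_means <- tapply(scores, group, mean)
round(group_means, 2)
#    A    B    C
# 6.67 4.67 2.67
```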
How do you get the mean by group in R? Use aggregate() from base R, or group_by() together with summarise() from the dplyr package, to group a data frame by one or more columns and compute the mean of a column for each group.
To get the mean of every column of a data frame without grouping, call colMeans(), a built-in R function, and pass it the data frame; it returns the mean of each column. All columns must be numeric.
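colMeans() on its own ignores grouping. One way to combine it with groups in base R, shown here as a minimal sketch that assumes every column is numeric (true for mtcars), is to split the rows by the grouping columns and call colMeans() on each piece. The names value_cols, pieces and by_group are just illustrative.

```r
# Ungrouped: one mean per column of the whole data frame.
overall <- colMeans(mtcars)

# Grouped: split the rows by cyl and gear, then take colMeans() of each piece.
# drop = TRUE omits cyl/gear combinations that do not occur in the data.
value_cols <- setdiff(names(mtcars), c("cyl", "gear"))
pieces <- split(mtcars[value_cols], mtcars[c("cyl", "gear")], drop = TRUE)

# sapply() returns one column per group, so transpose to get one row per group.
by_group <- t(sapply(pieces, colMeans))

by_group[, c("mpg", "hp")]
```

The row names of by_group combine the two grouping values (for example "4.3" for cyl 4, gear 3), so the grouping columns are not separate columns as they are in the aggregate() and dplyr results.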
Edit2: Recent versions of dplyr suggest using regular summarise with the across function, as in:
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(everything(), mean))
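On a real data set some columns may not be numeric, and mean() returns NA with a warning for those. A hedged variant, assuming dplyr 1.0.0 or later: restrict across() to numeric columns with where(is.numeric), and pass .groups = "drop" so the result comes back ungrouped. The model column is added here purely for illustration.

```r
library(dplyr)

# Illustrative only: add a character column so that not every column is numeric.
cars <- mtcars
cars$model <- rownames(mtcars)

result <- cars %>%
  group_by(cyl, gear) %>%
  # where(is.numeric) skips `model`; the grouping columns are excluded by across().
  summarise(across(where(is.numeric), mean), .groups = "drop")

result
```

If the data contain missing values, `~ mean(.x, na.rm = TRUE)` can be used in place of `mean`.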
What you're looking for is either ?summarise_all or ?summarise_each from dplyr (summarise_each has since been deprecated, so prefer summarise_all).
Edit: full code:
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
summarise_all("mean")
# Source: local data frame [8 x 11]
# Groups: cyl [?]
#
# cyl gear mpg disp hp drat wt qsec vs am carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 3 21.500 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 1.000000
# 2 4 4 26.925 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 1.500000
# 3 4 5 28.200 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 2.000000
# 4 6 3 19.750 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 1.000000
# 5 6 4 19.750 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4.000000
# 6 6 5 19.700 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 6.000000
# 7 8 3 15.050 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3.083333
# 8 8 5 15.400 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 6.000000
aggregate is the easiest way to do this in base:
aggregate(. ~ cyl + gear, data = mtcars, FUN = mean)
# cyl gear mpg disp hp drat wt qsec vs am carb
# 1 4 3 21.500 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 1.000000
# 2 6 3 19.750 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 1.000000
# 3 8 3 15.050 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3.083333
# 4 4 4 26.925 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 1.500000
# 5 6 4 19.750 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4.000000
# 6 4 5 28.200 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 2.000000
# 7 6 5 19.700 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 6.000000
# 8 8 5 15.400 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 6.000000
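One caveat with the formula interface: its default na.action is na.omit, so a row with a missing value in any column is dropped before the means are computed. A sketch of the non-formula interface, which does not drop rows and lets na.rm be passed through to mean() for each column. For mtcars, which has no missing values, it gives the same numbers. The names group_cols, value_cols and agg are illustrative.

```r
group_cols <- c("cyl", "gear")
value_cols <- setdiff(names(mtcars), group_cols)

# x = the columns to summarise, by = a list of grouping columns.
# Extra arguments such as na.rm are passed on to FUN.
agg <- aggregate(mtcars[value_cols], by = mtcars[group_cols],
                 FUN = mean, na.rm = TRUE)
agg
```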
Using data.table (however, you can't call setDT(mtcars) directly because its binding is locked; copy it to a different name like mt_dt and try):
library(data.table)
mt_dt = as.data.table(mtcars)
mt_dt[ , lapply(.SD, mean) , by=c("cyl", "gear")]
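If only some columns should be averaged, .SDcols restricts what .SD contains. A sketch assuming data.table is installed; the choice of mpg, hp and wt here is arbitrary.

```r
library(data.table)

mt_dt <- as.data.table(mtcars)

# .SDcols limits .SD to the named columns; the grouping columns are not in .SD.
# keyby (instead of by) also sorts the result by cyl, then gear.
res <- mt_dt[, lapply(.SD, mean), keyby = .(cyl, gear),
             .SDcols = c("mpg", "hp", "wt")]
res
```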