EDIT: This question is solved: the function worked once a typo was corrected. I have corrected the typo and am leaving the example here as a reference for others in the future. More efficient solutions are also suggested in the answers.
Original (corrected) post:
I would like to write a function that performs a calculation on different subsets of a data set, using a logical operator to define the subsets.
Here is a simplified example using a data frame containing 2 groups ("A" and "B") with 2 observations each:
df <- data.frame(matrix(0, ncol = 2, nrow = 4))
colnames(df) <- c("group","var")
df$group <- c("A","B")
df$var <- c(1,4,1,4)
To calculate e.g. the mean of the different groups, A and B, it is possible to use the logical operator == to subset the data:
>mean(df$var[df$group=="A"])
[1] 1
>mean(df$var[df$group=="B"])
[1] 4
This is of course easy to do with only a few groups, but with a larger dataset it would be convenient to have a function that calculates the mean for several different groups (given their names, for example, as a vector). My idea (which is obviously not right) for constructing such a function would look something like this:
autoMean <- function (q) {
mean(df$var[df$group==q])
}
And be run like this, in order to get the means for the 2 groups, A and B:
groups<-c("A","B")
autoMean(groups)
Now, R does not complain when I define the function, and it works fine for a single group. (But be aware that when it is given several groups at once, it does not return one mean per group: == compares element by element and recycles the shorter vector, so with this data every row matches and the result is the overall mean, 2.5.)
So, putting the argument of a function inside a logical comparison does work, contrary to what I believed when I posted this question.
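To get one mean per group from a vector of group names, the function can be applied to each name in turn. The following is only a minimal sketch that reuses the example data; the names `overall` and `groupMeans` are illustrative and not part of the original post.

```r
# Example data from the question: the groups alternate A, B, A, B
df <- data.frame(group = c("A", "B", "A", "B"), var = c(1, 4, 1, 4))

autoMean <- function(q) {
  mean(df$var[df$group == q])
}

groups <- c("A", "B")

# Passing the whole vector compares element by element (with recycling),
# so with this data every row is selected and a single number comes back
overall <- autoMean(groups)

# Applying the function to one group name at a time gives one mean per group
groupMeans <- sapply(groups, autoMean)
print(groupMeans)
```

Because `sapply` keeps the character input as names, the result is a named vector that can be indexed by group.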
There are other, possibly more elegant, ways of solving this kind of a problem presented in the kindly provided answers below.
Also:
aggregate(var ~ group, data=df, FUN=mean)
library(plyr)
ddply(df, .(group), summarize, mean=mean(var))
### add column with mean of each group
cbind(df, with(df, ave(var, group)))
Be careful: an object named df masks the function df (the density of the F distribution) from package:stats, which is loaded by default.
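On the naming caveat, a small sketch of what happens in practice, assuming the example data frame is stored as `df` (the names `density_via_call` and `density_explicit` are illustrative). When a name is used in call position, R skips objects that are not functions, so the F density stays reachable; the `stats::` prefix makes the intent explicit.

```r
# A data frame stored under the same name as stats::df
df <- data.frame(group = c("A", "B", "A", "B"), var = c(1, 4, 1, 4))

# In call position R looks for a function, so this still finds stats::df
density_via_call <- df(1, df1 = 5, df2 = 10)

# Naming the package explicitly avoids any ambiguity for the reader
density_explicit <- stats::df(1, df1 = 5, df2 = 10)
```

The clash is therefore mostly a readability issue, and a more descriptive name for the data frame avoids it entirely.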
Maybe you are looking for tapply:
tapply(X=df$var, INDEX=df$group, FUN=mean)
# A B
# 1 4
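Since the question asks for means of groups named in a vector, it may help to note that the `tapply` result carries the group names and can be indexed with them. A minimal sketch under that assumption; `means` and `wanted` are hypothetical names introduced here for illustration.

```r
df <- data.frame(group = c("A", "B", "A", "B"), var = c(1, 4, 1, 4))

# One mean per group, labelled with the group names
means <- tapply(X = df$var, INDEX = df$group, FUN = mean)

# Pick out the groups of interest, in any order, by name
wanted <- c("B", "A")
selected <- means[wanted]
print(selected)
```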