I am using by() to apply a function to a range of columns of a data frame based on a factor. Everything works perfectly well if I use mean() as the function, but if I use median() I get an error of the type "Error in median.default(x) : need numeric data", even though I don't have NAs in the data frame.
The line that works using mean():
> by(iris[,1:3], iris$Species, function(x) mean(x,na.rm=T))
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length
5.006 3.428 1.462
------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length
5.936 2.770 4.260
------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length
6.588 2.974 5.552
Warning messages:
1: mean(<data.frame>) is deprecated.
Use colMeans() or sapply(*, mean) instead.
2: mean(<data.frame>) is deprecated.
Use colMeans() or sapply(*, mean) instead.
3: mean(<data.frame>) is deprecated.
Use colMeans() or sapply(*, mean) instead.
But if I use median() (note the na.rm=T option):
> by(iris[,1:3], iris$Species, function(x) median(x,na.rm=T))
Error in median.default(x, na.rm = T) : need numeric data
However, if instead of choosing the range [,1:3] of columns I choose only one of the columns, it works:
> by(iris[,1], iris$Species, function(x) median(x,na.rm=T))
iris$Species: setosa
[1] 5
------------------------------------------------------------
iris$Species: versicolor
[1] 5.9
------------------------------------------------------------
iris$Species: virginica
[1] 6.5
How can I achieve this behaviour while selecting a range of columns?
You are using a split-apply strategy when you use by(). The objects being passed to the function are data frames, not numeric vectors. You get the error because there is no median.data.frame method, and the warning because mean.data.frame is deprecated and about to disappear as well. It might work better if you used aggregate():
> aggregate(iris[,1:3], iris["Species"], function(x) mean(x,na.rm=T))
Species Sepal.Length Sepal.Width Petal.Length
1 setosa 5.006 3.428 1.462
2 versicolor 5.936 2.770 4.260
3 virginica 6.588 2.974 5.552
> aggregate(iris[,1:3], iris["Species"], function(x) median(x,na.rm=T))
Species Sepal.Length Sepal.Width Petal.Length
1 setosa 5.0 3.4 1.50
2 versicolor 5.9 2.8 4.35
3 virginica 6.5 3.0 5.55
aggregate() works on the column vectors individually and then tabulates the results.
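If you would rather stay with by(), here is a minimal sketch of the same idea: since each piece handed to the function is a data frame, apply median() to its columns one at a time with sapply(), as the deprecation warning for mean() suggests. The object names (pieces, medians_by_species, median_matrix) are just illustrative.

```r
# Each piece that by() hands to the function is a data frame,
# which is why median() complains about non-numeric data.
pieces <- by(iris[, 1:3], iris$Species, function(x) class(x))

# Applying median() column by column inside each piece avoids the error.
medians_by_species <- by(iris[, 1:3], iris$Species,
                         function(x) sapply(x, median, na.rm = TRUE))
print(medians_by_species)

# Optionally bind the per-species vectors into a single matrix,
# one row per species and one column per measured variable.
median_matrix <- do.call(rbind, medians_by_species)
print(median_matrix)
```

The values should agree with the aggregate() result for median above; the only difference is the shape of the output (a "by" object or a matrix instead of a data frame).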
The original question is answered. If, however, the range happens to be all columns except the grouping variable named on the right-hand side of the formula, the dot formula notation works and is a nifty alternative:
> aggregate(. ~ Species, data = iris, mean)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
> aggregate(. ~ Species, data = iris, median)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.0 3.4 1.50 0.2
2 versicolor 5.9 2.8 4.35 1.3
3 virginica 6.5 3.0 5.55 2.0
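The formula interface can also be limited to a range of columns, which is closer to what the question asked. A rough sketch, with two variants that should give the same table: name the columns with cbind() on the left-hand side, or keep the dot and subset the data first (columns 1 to 3 plus Species in column 5). The result names are illustrative.

```r
# Variant 1: list the wanted columns explicitly on the left-hand side.
median_cbind <- aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length) ~ Species,
                          data = iris, FUN = median)

# Variant 2: keep the dot notation, but pass only the wanted columns
# (1:3) together with the grouping column (5, Species).
median_dot <- aggregate(. ~ Species, data = iris[, c(1:3, 5)], FUN = median)

print(median_cbind)
print(median_dot)
```

Note that the formula method drops rows with missing values by default (na.action = na.omit), which makes no difference for iris but can matter for data with NAs.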