How to apply a function to a subset of columns in R?

Question

I am using by() to apply a function to a range of columns of a data frame, grouped by a factor. Everything works if I use mean() as the function, but with median() I get an error of the type "Error in median.default(x) : need numeric data", even though there are no NAs in the data frame.

The line that works using mean():

by(iris[,1:3], iris$Species, function(x) mean(x,na.rm=T))

> by(iris[,1:3], iris$Species, function(x) mean(x,na.rm=T))
iris$Species: setosa
Sepal.Length  Sepal.Width Petal.Length 
       5.006        3.428        1.462 
------------------------------------------------------------ 
iris$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length 
       5.936        2.770        4.260 
------------------------------------------------------------ 
iris$Species: virginica
Sepal.Length  Sepal.Width Petal.Length 
       6.588        2.974        5.552 
Warning messages:
1: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
2: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
3: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead.

But if I use median() (note the na.rm=T option):

> by(iris[,1:3], iris$Species, function(x) median(x,na.rm=T))
Error in median.default(x, na.rm = T) : need numeric data

However, if I select a single column instead of the range [,1:3], it works:

> by(iris[,1], iris$Species, function(x) median(x,na.rm=T))
iris$Species: setosa
[1] 5
------------------------------------------------------------ 
iris$Species: versicolor
[1] 5.9
------------------------------------------------------------ 
iris$Species: virginica
[1] 6.5

How can I achieve this behaviour while selecting a range of columns?

IRTFM · Accepted Answer

You are using a split-apply strategy when you use by. It splits the data frame by the factor and passes each piece to your function as a data frame, not as a vector. You get the error because there is no median.data.frame method, and the warning because mean.data.frame is deprecated and on its way out. It might work better if you used aggregate:

> aggregate(iris[,1:3], iris["Species"], function(x) mean(x,na.rm=T))
     Species Sepal.Length Sepal.Width Petal.Length
1     setosa        5.006       3.428        1.462
2 versicolor        5.936       2.770        4.260
3  virginica        6.588       2.974        5.552
> aggregate(iris[,1:3], iris["Species"], function(x) median(x,na.rm=T))
     Species Sepal.Length Sepal.Width Petal.Length
1     setosa          5.0         3.4         1.50
2 versicolor          5.9         2.8         4.35
3  virginica          6.5         3.0         5.55

aggregate applies the function to each column vector individually and then tabulates the results.
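
If you would rather stay with by, one possible workaround is to apply median to each column inside the function, since each group arrives as a data frame. This is only a sketch: the names med_by and med_mat are made up for illustration, and sapply is one of several ways to loop over the columns.

```r
# Each group reaches the function as a data frame, so take the median
# column by column instead of calling median() on the whole data frame.
med_by <- by(iris[, 1:3], iris$Species,
             function(d) sapply(d, median, na.rm = TRUE))
med_by

# Optionally stack the per-group vectors into a matrix, one row per group.
med_mat <- do.call(rbind, med_by)
med_mat
```

The values should match the aggregate median table above; the difference is only in the shape of the result (a by object or matrix rather than a data frame).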

Plantaloons · Answer

The original question is answered. If, however, the range happens to be every column except the grouping variable on the right-hand side of the formula, the dot formula notation works and makes a nifty alternative:

> aggregate(. ~ Species, data = iris, mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

> aggregate(. ~ Species, data = iris, median)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.4         1.50         0.2
2 versicolor          5.9         2.8         4.35         1.3
3  virginica          6.5         3.0         5.55         2.0
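
The formula interface can also be limited to a subset of columns by binding them with cbind on the left-hand side. A minimal sketch, assuming the first three columns from the question are the ones wanted (sub_med is just an illustrative name):

```r
# Name the columns to summarise on the left-hand side of the formula;
# Species on the right-hand side is the grouping variable.
sub_med <- aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length) ~ Species,
                     data = iris, FUN = median)
sub_med
```

Keep in mind that the formula method defaults to na.action = na.omit, so rows with an NA in any of the listed columns are dropped before grouping. That makes no difference for iris, but it can for data with missing values.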

Tags:

r

pedrosaurio