I'm noticing some inconsistent behavior when applying the median() function to dataframes. "Inconsistent behavior" usually means that I don't understand something, so I hope someone will be willing to clear this up for me.
I realize that some functions (e.g., min(), max()) convert the dataframe into a vector and return the corresponding value for the entire df, while mean() and sd() return a value for each column. While a bit confusing, those differences in behavior don't cause many problems, since most code would break if a scalar is returned instead of a vector. However, median() seems to be inconsistent. For example:
dat <- data.frame(x=1:100, y=2:101)
median(dat)
Returns a vector: [1] 50.5 51.5
But, sometimes it breaks:
dat2 <- data.frame(x=1:100, y=rnorm(100))
median(dat2)
Returns: [1] NA NA
Warning messages:
1: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[2L]], ...) :
argument is not numeric or logical: returning NA
However, median(dat2$x) and median(dat2$y) both yield the correct result.
Also consider the following:
dat3 <- data.frame(x=1:100, y=1:100)
dat4 <- data.frame(x=1:100, y=100:199)
In the above, median(dat3) returns [1] 50.5 NA, while median(dat4) returns [1] 50.5 149.5! I would expect both or neither of these to work. So, I clearly am not understanding just how the median() function is working.
Further, functions like sd(), mean(), min() and max() all yield their expected (if seemingly inconsistent) results in all of the above cases.
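To make the whole-frame versus per-column distinction concrete, here is a minimal sketch using only min() and max(). The variable names (overall_min, col_min, and so on) are just illustrative, and the sketch deliberately leaves out mean() and sd(), since I'm less sure how those behave on a dataframe across R versions.

```r
dat <- data.frame(x = 1:100, y = 2:101)

# min() and max() collapse the whole data frame to a single value
overall_min <- min(dat)  # 1
overall_max <- max(dat)  # 101

# sapply() applies the function to each column instead,
# giving one named value per column
col_min <- sapply(dat, min)
col_max <- sapply(dat, max)

print(overall_min)
print(overall_max)
print(col_min)
print(col_max)
```

So the scalar result comes from treating all columns as one pool of values, while the sapply() form keeps the columns separate.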
I know that I can use something like sapply(dat2, median) to get the necessary result, but I am wondering why the R gods chose to implement these core stats functions in a way that, at least on the surface, seems inconsistent. I suspect that I, and probably other neophytes, am missing some fundamental concept, and I'd appreciate your insight.
This exact phenomenon was recently discussed in the "median and data frames" thread on R-devel. The consensus seemed to be that the mean.data.frame method should be deprecated and users should rely on sapply.
median has no method for data.frame class objects, unlike mean. Use the colwise function from the plyr package to achieve the desired result, or use the *apply family of functions.
> sapply(mtcars, median)
mpg cyl disp hp drat wt qsec vs am gear
19.200 6.000 196.300 123.000 3.695 3.325 17.710 0.000 0.000 4.000
carb
2.000
> library(plyr)
> colwise(median)(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
1 19.2 6 196.3 123 3.695 3.325 17.71 0 0 4 2
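As a rough sketch of the *apply route applied to the dat2 example from the question: vapply() works like sapply() but also checks that each column yields exactly one number. The set.seed() call, the variable names, and the numeric-column filter are my own additions for illustration, not part of the original question.

```r
set.seed(1)  # assumption: fixed seed so the random column is reproducible
dat2 <- data.frame(x = 1:100, y = rnorm(100))

# Apply median() to each column; numeric(1) declares that every
# column must return a single numeric value
col_medians <- vapply(dat2, median, numeric(1))

# Same idea, restricted to numeric columns, which helps when the
# data frame also holds factors or character columns
is_num <- vapply(dat2, is.numeric, logical(1))
num_medians <- vapply(dat2[is_num], median, numeric(1))

print(col_medians)
print(num_medians)
```

The result is a named numeric vector with one median per column, and each entry matches what median(dat2$x) and median(dat2$y) return individually.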