
Odd behavior with median()?

Tags:

r

I'm noticing some inconsistent behavior when applying the median() function to dataframes. "Inconsistent behavior" usually means that I don't understand something, so I hope someone will be willing to clear this up for me.

I realize that some functions (e.g., min(), max()) convert the dataframe into a vector and return the corresponding value for the entire df while mean() and sd() return a value for each column. While a bit confusing, those differences in behavior don't cause many problems since most code would break if a scalar is returned instead of a vector. However, median() seems to be inconsistent. For example:

dat <- data.frame(x=1:100, y=2:101)
median(dat)

Returns a vector: [1] 50.5 51.5

But, sometimes it breaks:

dat2 <- data.frame(x=1:100, y=rnorm(100))
median(dat2)

Returns:

[1] NA NA
Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

However, median(dat2$x) and median(dat2$y) both yield the correct result.

Also consider the following:

dat3 <- data.frame(x=1:100, y=1:100)
dat4 <- data.frame(x=1:100, y=100:199)

In the above, median(dat3) returns [1] 50.5 NA while median(dat4) returns [1] 50.5 149.5! I would expect both or neither of these to work. So, I clearly am not understanding just how the median() function is working.

Further, functions like sd(), mean(), min() and max() all yield their expected (if seemingly inconsistent) results in all of the above cases.

I know that I can use something like sapply(dat2, median) to get the necessary result, but I am wondering why the R gods chose to implement these core stats functions in a way that, at least on the surface, seems inconsistent. I suspect that I, and probably other neophytes, am missing some fundamental concept, and I'd appreciate your insight.

Jason B asked May 05 '11 18:05


2 Answers

This exact phenomenon was recently discussed in the median and data frames thread on R-devel. The consensus seemed to be that the mean.data.frame method should be deprecated and users should rely on sapply.
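As a minimal sketch of the sapply approach mentioned above, the following rebuilds the question's dat2 and computes one median per column. The seed value and the variable names col_medians and col_medians_strict are assumptions added for illustration. vapply is shown as a stricter base R alternative, not something the R-devel thread is known to recommend.

```r
# Column-wise medians without relying on any data-frame method of median().
set.seed(1)  # arbitrary seed (an assumption) so that rnorm() is reproducible
dat2 <- data.frame(x = 1:100, y = rnorm(100))

# sapply() calls median() on each column and simplifies the result
# to a named numeric vector.
col_medians <- sapply(dat2, median)

# vapply() does the same, but also checks that every column yields
# exactly one numeric value.
col_medians_strict <- vapply(dat2, median, numeric(1))

print(col_medians)
```

Both calls treat the data frame as a list of columns, so the result does not depend on how median() itself handles a data frame.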

Joshua Ulrich answered Oct 02 '22 12:10


median hasn't got a method for data.frame class objects, unlike mean. Use the plyr package and its colwise function to achieve the desired result, or use the *apply family of functions.

> sapply(mtcars, median)
    mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear
 19.200   6.000 196.300 123.000   3.695   3.325  17.710   0.000   0.000   4.000
   carb
  2.000
> library(plyr)
> colwise(median)(mtcars)
   mpg cyl  disp  hp  drat    wt  qsec vs am gear carb
1 19.2   6 196.3 123 3.695 3.325 17.71  0  0    4    2
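One caveat with the *apply route: median is typically not meaningful for non-numeric columns and may fail or return NA on them. A rough base R sketch, assuming a hypothetical data frame df with an extra character column grp (neither name comes from the question), selects the numeric columns first:

```r
# Hypothetical mixed-type data frame; 'grp' is a non-numeric column
# added purely for illustration.
df <- data.frame(x = 1:100, y = 100:199, grp = rep(c("a", "b"), 50))

# Flag the numeric columns, then take the median of only those.
is_num <- vapply(df, is.numeric, logical(1))
num_medians <- vapply(df[is_num], median, numeric(1))

print(num_medians)
```

Here x and y reuse the values of dat4 from the question, so the medians should match the 50.5 and 149.5 reported there.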
aL3xa answered Oct 02 '22 14:10