
Function for median similar to "which.max" and "which.min" / Extracting median rows from a data.frame

I occasionally need to extract specific rows from a data.frame based on values from one of the variables. R has built-in functions for maximum (which.max()) and minimum (which.min()) that allow me to easily extract those rows.

Is there an equivalent for median? Or is my best bet to just write my own function?

Here's an example data.frame and how I would use which.max() and which.min():

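set.seed(1) # so you can reproduce this example
# Note: sample() changed in R 3.6.0, so V4 may differ there unless
# RNGkind(sample.kind = "Rounding") is set first
dat = data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10), 
                 V4 = sample(1:20, 10, replace = TRUE))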

# To return the first row, which contains the max value in V4
dat[which.max(dat$V4), ]
# To return the seventh row, which contains the min value in V4
dat[which.min(dat$V4), ]

For this particular example, since there are an even number of observations, I would need to have two rows returned, in this case, rows 2 and 10.

Update

It would seem that there is not a built-in function for this. As such, using the reply from Sacha as a starting point, I wrote this function:

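which.median = function(x) {
  if (length(x) %% 2 != 0) {
    # odd length: the median is an element of x
    which(x == median(x))
  } else {
    # even length: match the two middle values of the sorted vector;
    # %in% avoids duplicate indices when both middle values are equal
    a = sort(x)[c(length(x)/2, length(x)/2 + 1)]
    which(x %in% a)
  }
}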

I'm able to use it as follows:

# make one data.frame with an odd number of rows
dat2 = dat[-10, ]
# Median rows from 'dat' (even number of rows) and 'dat2' (odd number of rows)
dat[which.median(dat$V4), ]
dat2[which.median(dat2$V4), ]

Are there any suggestions to improve this?

asked Apr 21 '12 05:04 by A5C1D2H2I1M1N2O1R2T1




2 Answers

While Sacha's solution is quite general, the median (like other quantiles) is an order statistic, so you can calculate the corresponding indices from order (x) (instead of taking the quantile values from sort (x)).

Looking into quantile, types 1 or 3 could be used; all other types return (weighted) averages of two values in certain cases.
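As a small illustration of that difference (the vector here is made up for the example and is not from the question), types 1 and 3 return a value that is actually present in the data, while the default type 7 interpolates:

```r
# Compare quantile() types on a small even-length vector (illustrative data).
x <- c(10, 20, 30, 40)

q1 <- quantile(x, 0.5, type = 1, names = FALSE)  # an actual element of x
q3 <- quantile(x, 0.5, type = 3, names = FALSE)  # an actual element of x
q7 <- quantile(x, 0.5, type = 7, names = FALSE)  # default: interpolated value

q1 %in% x   # TRUE
q3 %in% x   # TRUE
q7 %in% x   # FALSE, 25 is not in x
```

Only a quantile that is guaranteed to be an element of x can be mapped back to a row index, which is why the interpolating types are not usable here.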

I chose type 3, and a bit of copy & paste from quantile leads to:

which.quantile <- function (x, probs, na.rm = FALSE){
  if (! na.rm && any (is.na (x)))
    return (rep (NA_integer_, length (probs)))

  o <- order (x)
  n <- sum (! is.na (x))
  o <- o [seq_len (n)]

  nppm <- n * probs - 0.5
  j <- floor(nppm)
  h <- ifelse((nppm == j) & ((j%%2L) == 0L), 0, 1)
  j <- j + h

  j [j == 0] <- 1
  o[j]
}

A little test:

> x <- c (2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
> probs <- c (0, .23, .5, .6, 1)
> which.quantile (x, probs, na.rm = TRUE)
[1] 10  1  6  6  4
> x [which.quantile (x, probs, na.rm = TRUE)] == quantile (x, probs, na.rm = TRUE, type = 3)
  0%  23%  50%  60% 100% 
TRUE TRUE TRUE TRUE TRUE 

Here's your example:

> dat [which.quantile (dat$V4, c (0, .5, 1)),]
  V1         V2          V3 V4
7  7  0.4874291 -0.01619026  1
2  2  0.1836433  0.38984324 13
1  1 -0.6264538  1.51178117 17
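
Note that this returns a single row for the median. If both middle rows are wanted for an even number of observations, as in the question, the same order-based idea can be sketched for the median alone. The helper name which.median.order is hypothetical and not part of the original answer; it returns one index for odd lengths and two for even lengths, and does not collect further ties:

```r
# Hypothetical helper (not from the thread): indices of the middle
# order statistic(s), found via order() rather than by matching values.
which.median.order <- function(x) {
  o <- order(x, na.last = NA)        # indices of non-NA values, sorted by value
  n <- length(o)
  if (n == 0L) return(integer(0))
  # positions of the middle element(s) in the sorted sequence
  o[unique(c(floor((n + 1) / 2), ceiling((n + 1) / 2)))]
}

which.median.order(c(5, 1, 4, 2))     # even length: indices 4 and 3
which.median.order(c(5, 1, 4, 2, 3))  # odd length: index 5
```

Because the indices come straight from order(), no value comparison is needed and NA values are simply dropped.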
answered Sep 29 '22 12:09 by cbeleites unhappy with SX


I think you can just use:

which(dat$V4 == median(dat$V4))

But be careful: when there isn't a single middle number, median() returns the mean of the two middle numbers, which need not match any element. E.g. median(1:4) gives 2.5, which doesn't match any of the elements.

Edit

Here is a function which gives you the index of the median element or, when no element equals the median, the index of the first element closest to it, similar to how which.min() gives you only the first element that is equal to the minimum:

whichmedian <- function(x) which.min(abs(x - median(x)))

For example:

> whichmedian(1:4)
[1] 2
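
If all closest elements are wanted rather than only the first (for instance both middle rows when the number of observations is even), a variant could look like the sketch below. The name which.median.all is hypothetical and not from the original answer. It compares distances with exact equality, so for non-integer data a tolerance may be needed:

```r
# Hypothetical variant (not from the thread): every index whose value is
# closest to median(x), instead of only the first one.
which.median.all <- function(x) {
  d <- abs(x - median(x, na.rm = TRUE))  # distance of each element to the median
  which(d == min(d, na.rm = TRUE))       # which() drops NA comparisons
}

which.median.all(1:4)          # both middle elements: 2 3
which.median.all(c(3, 1, 2))   # the single median element: 3
```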
answered Sep 29 '22 13:09 by Sacha Epskamp