I occasionally need to extract specific rows from a data.frame based on values from one of the variables. R has built-in functions for maximum (which.max()) and minimum (which.min()) that allow me to easily extract those rows.
Is there an equivalent for median? Or is my best bet to just write my own function?
Here's an example data.frame and how I would use which.max() and which.min():
set.seed(1) # so you can reproduce this example
dat = data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10),
                 V4 = sample(1:20, 10, replace = TRUE))
# To return the first row, which contains the max value in V4
dat[which.max(dat$V4), ]
# To return the seventh row, which contains the min value in V4
dat[which.min(dat$V4), ]
For this particular example, since there are an even number of observations, I would need to have two rows returned, in this case, rows 2 and 10.
It would seem that there is not a built-in function for this. As such, using the reply from Sacha as a starting point, I wrote this function:
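which.median = function(x) {
  n = length(x)
  if (n %% 2 != 0) {
    # odd length: the median is an element of x
    which(x == median(x))
  } else {
    # even length: look up the two middle order statistics
    a = sort(x)[c(n / 2, n / 2 + 1)]
    # unique() avoids repeating an index when both middle values are equal
    unique(c(which(x == a[1]), which(x == a[2])))
  }
}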
I'm able to use it as follows:
# make one data.frame with an odd number of rows
dat2 = dat[-10, ]
# Median rows from 'dat' (even number of rows) and 'dat2' (odd number of rows)
dat[which.median(dat$V4), ]
dat2[which.median(dat2$V4), ]
Are there any suggestions to improve this?
In R, the median of a vector is calculated using the median() function. The function accepts a vector as input. If the vector has an odd number of values, the function returns the middle value. If it has an even number of values, the function returns the average of the two middle values.
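As a quick check of the odd and even cases described above, here is a minimal sketch with two made-up vectors (the names odd_x and even_x are only illustrative):

```r
odd_x  <- c(5, 1, 9)      # three values: the middle value after sorting is 5
even_x <- c(5, 1, 9, 7)   # four values: the two middle values are 5 and 7

median(odd_x)    # 5
median(even_x)   # 6, the mean of 5 and 7

# The even-length median is not necessarily an element of the vector
median(even_x) %in% even_x   # FALSE
```

This is why a plain equality test against median(x) can come back empty for even-length data.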
In R, we can find the minimum or maximum value of a vector or data frame. We use the min() and max() functions to find the minimum and maximum value respectively. The min() function returns the minimum value of a vector or data frame. The max() function returns the maximum value of a vector or data frame.
We can find the index of the maximum value in a column of a data frame using the which.max() function. "$" is used to access a particular column of a data frame.
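To make the difference between the value and its index concrete, here is a small sketch on a hypothetical data frame (df, id and score are made-up names, not part of the question's data):

```r
# Hypothetical data frame for illustration only
df <- data.frame(id = c("a", "b", "c", "d"), score = c(12, 30, 7, 30))

max(df$score)         # 30: the largest value
which.max(df$score)   # 2: position of the first occurrence of that value
min(df$score)         # 7: the smallest value
which.min(df$score)   # 3: position of the smallest value

# Use the position to pull out the whole row
df[which.max(df$score), ]
```

Note that which.max() reports only the first position when the maximum occurs more than once, as it does here in rows 2 and 4.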
While Sacha's solution is quite general, the median (or other quantiles) are order statistics, so you can calculate the corresponding indices from order (x) (instead of sort (x) for the quantile values).
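A minimal sketch of that idea, using made-up values: sort() gives the ordered values, while order() gives the positions in the original vector that those values came from, so indexing order(x) by rank yields the index of an order statistic.

```r
x <- c(40, 10, 30, 20, 50)   # made-up values, odd length

sort(x)    # 10 20 30 40 50: the sorted values
order(x)   # 2 4 3 1 5: positions in x of those sorted values

k <- 3               # the 3rd order statistic is the median of 5 values
order(x)[k]          # 3: index of the median in x
x[order(x)[k]]       # 30: same value as sort(x)[k]
```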
Looking into quantile, types 1 or 3 could be used; all others lead to (weighted) averages of two values in certain cases.
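As a small illustration of that difference, assuming a made-up even-length vector: the default type interpolates between the two middle values, while types 1 and 3 return a value that actually occurs in the data.

```r
x <- c(1, 2, 3, 4)   # made-up values, even length

q7 <- quantile(x, 0.5, type = 7)   # 2.5: the default type averages 2 and 3
q1 <- quantile(x, 0.5, type = 1)   # 2: an element of x
q3 <- quantile(x, 0.5, type = 3)   # 2: an element of x

unname(c(q7, q1, q3)) %in% x       # FALSE TRUE TRUE
```

Only a quantile that is itself an element of x can be mapped back to a row index.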
I chose type 3, and a bit of copy & paste from quantile leads to:
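which.quantile <- function (x, probs, na.rm = FALSE){
  # without na.rm, any NA in x makes every requested index NA
  if (! na.rm && any (is.na (x)))
    return (rep (NA_integer_, length (probs)))
  o <- order (x)          # order () puts NAs last by default
  n <- sum (! is.na (x))
  o <- o [seq_len (n)]    # keep only the indices of non-NA values
  # rank computation for type 3, adapted from quantile
  nppm <- n * probs - 0.5
  j <- floor (nppm)
  h <- ifelse ((nppm == j) & ((j %% 2L) == 0L), 0, 1)
  j <- j + h
  j [j == 0] <- 1
  o [j]
}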
A little test:
> x <-c (2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
> probs <- c (0, .23, .5, .6, 1)
> which.quantile (x, probs, na.rm = TRUE)
[1] 10 1 6 6 4
> x [which.quantile (x, probs, na.rm = TRUE)] == quantile (x, probs, na.rm = TRUE, type = 3)
  0%  23%  50%  60% 100%
TRUE TRUE TRUE TRUE TRUE
Here's your example:
> dat [which.quantile (dat$V4, c (0, .5, 1)),]
  V1         V2          V3 V4
7  7  0.4874291 -0.01619026  1
2  2  0.1836433  0.38984324 13
1  1 -0.6264538  1.51178117 17
I think just:
which(dat$V4 == median(dat$V4))
But be careful there, since the median takes the mean of two numbers if there isn't a single middle number. E.g. median(1:4) gives 2.5, which doesn't match any of the elements.
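To see what that caveat means in practice, here is a short sketch with made-up vectors (x_even and x_odd are illustrative names):

```r
x_even <- c(1, 2, 3, 4)
idx_even <- which(x_even == median(x_even))
idx_even           # integer(0): no element equals 2.5

x_odd <- c(3, 1, 2)
idx_odd <- which(x_odd == median(x_odd))
idx_odd            # 3: the median value 2 sits at position 3
```

Indexing a data.frame with an empty result such as idx_even returns zero rows rather than raising an error, so the problem is easy to miss.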
Here is a function which will give you either the index of the median element or, when the median is the mean of the two middle values, the index of the first element closest to it, similar to how which.min() gives you only the first element that is equal to the minimum:
whichmedian <- function(x) which.min(abs(x - median(x)))
For example:
> whichmedian(1:4)
[1] 2
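A couple of further checks on made-up, unsorted input may help show the "first match" behaviour; the definition is repeated only so the snippet runs on its own:

```r
# Same definition as above, repeated so this snippet is self-contained
whichmedian <- function(x) which.min(abs(x - median(x)))

# Median is 25; 20 and 30 are equally close, so the earlier position wins
whichmedian(c(10, 40, 20, 30))   # 3

# Median is 2 and occurs at positions 3 and 4; only the first is returned
whichmedian(c(3, 1, 2, 2))       # 3
```

So for even-length data this returns one row, not the two middle rows the question asks for.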