I came across R's range
function. It is for sure a useful tool and makes code more readable, but its speed can be doubled by replacing it with a simple one-liner including min
and max
.
I did some benchmarks and the 'bad' performance of the range function surprised me. For comparison I wrote a function called range2
which uses min and max (see code). Except for speed, are there any reasons why this function exists if it can be outperformed by a simple one-liner, which is also easily readable?
require(microbenchmark)
range2 <- function(x) c(min(x),max(x))
n <- 1000000
x <- rnorm(n)
microbenchmark(range(x), range2(x))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# range(x) 4.696101 4.734751 5.321603 4.796301 4.814751 23.0646 100 b
#range2(x) 2.477602 2.516101 2.542540 2.535051 2.544052 3.7636 100 a
n <- 10000000
x <- rnorm(n)
microbenchmark(range(x), range2(x))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# range(x) 47.3246 47.9498 58.27992 55.25795 61.98205 146.5100 100 b
#range2(x) 24.7063 25.5021 25.59192 25.55245 25.63515 27.1088 100 a
For sure this would be not the first bottleneck one wants to get rid of, since we are talking about milliseconds on a vector with 10,000,000 entries, but I expected range
to be faster. My simple intuition was:
range
goes through the data one time and searches for the minimum and maximum at the same time, whereas my range2
function goes through the data two times: One time to find the minimum and one time to find the maximum.
Maybe someone can give some background about the implementation. Maybe the reason is that min
and max
are implemented in C and range
is not?
Addition: I've already talked about that with a friend of mine and he just made this function faster by implementing it in C++ via:
#include <Rcpp.h>
#include <float.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector range3(NumericVector x) {
int xs = x.size();
double minValue = FLT_MAX;
double maxValue = FLT_MIN;
for (int i =0; i < xs; i++) {
if (x[i] < minValue) minValue = x[i];
if (x[i] > maxValue) maxValue = x[i];
}
Rcpp::NumericVector result(2);
result[0] = minValue;
result[1] = maxValue;
return result;
}
and this gives the following benchmarks:
n <- 10000000
x <- rnorm(n)
microbenchmark(range(x), range2(x) ,range3(x))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# range(x) 47.8583 48.30355 58.12575 55.3135 62.10295 149.9648 100 c
# range2(x) 24.8211 25.53615 25.90920 25.6176 25.79175 42.4659 100 b
# range3(x) 13.2458 13.30385 13.47175 13.3797 13.65410 14.3487 100 a
The min is simply the lowest observation, while the max is the highest observation. Obviously, it is easiest to determine the min and max if the data are ordered from lowest to highest. So for our data, the min is 13 and the max is 110.
The MIN() function returns the smallest value of the selected column. The MAX() function returns the largest value of the selected column.
Here's the source for range.default
(run R 3.6.1)
> range.default
function (..., na.rm = FALSE, finite = FALSE)
{
x <- c(..., recursive = TRUE)
if (is.numeric(x)) {
if (finite)
x <- x[is.finite(x)]
else if (na.rm)
x <- x[!is.na(x)]
c(min(x), max(x))
}
else {
if (finite)
na.rm <- TRUE
c(min(x, na.rm = na.rm), max(x, na.rm = na.rm))
}
}
You can see that it does a few extra checks before calling c(min(x), max(x))
itself. It's not optimized for speed. It's just a user friendly function. It seems unlikely that those millisecond differences would be the source a performance bottleneck.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With