Why is the range() function slower than a combination of min and max?

I came across R's range function. It is certainly useful and makes code more readable, but replacing it with a simple one-liner built from min and max roughly halves the run time.

I ran some benchmarks, and the 'bad' performance of the range function surprised me. For comparison I wrote a function called range2, which uses min and max (see code). Speed aside, is there any reason for range to exist if a simple, equally readable one-liner outperforms it?


library(microbenchmark)

# range2: min and max computed in two separate calls
range2 <- function(x) c(min(x), max(x))

n <- 1000000
x <- rnorm(n)
microbenchmark(range(x), range2(x))
#Unit: milliseconds
#  expr      min       lq     mean   median       uq     max neval cld
# range(x) 4.696101 4.734751 5.321603 4.796301 4.814751 23.0646   100   b
#range2(x) 2.477602 2.516101 2.542540 2.535051 2.544052  3.7636   100  a 

n <- 10000000
x <- rnorm(n)
microbenchmark(range(x), range2(x))
# Unit: milliseconds
#  expr     min      lq     mean   median       uq      max neval cld
# range(x) 47.3246 47.9498 58.27992 55.25795 61.98205 146.5100   100   b
#range2(x) 24.7063 25.5021 25.59192 25.55245 25.63515  27.1088   100  a

Of course this would not be the first bottleneck one wants to get rid of, since we are talking about milliseconds on a vector with 10,000,000 entries, but I expected range to be faster. My simple intuition was:

range goes through the data once and searches for the minimum and maximum at the same time, whereas my range2 function goes through the data twice: once to find the minimum and once to find the maximum.
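To make that intuition concrete, here is a minimal sketch in plain C++ (standard library only, not R's actual implementation). The function names range_two_pass and range_one_pass are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Two passes: one scan for the minimum, a second scan for the maximum.
// Conceptually this is what c(min(x), max(x)) does.
std::pair<double, double> range_two_pass(const std::vector<double>& x) {
    assert(!x.empty());  // this sketch does not define a range for empty input
    double lo = *std::min_element(x.begin(), x.end());
    double hi = *std::max_element(x.begin(), x.end());
    return {lo, hi};
}

// One pass: std::minmax_element tracks both extremes in a single scan.
std::pair<double, double> range_one_pass(const std::vector<double>& x) {
    assert(!x.empty());
    auto mm = std::minmax_element(x.begin(), x.end());
    return {*mm.first, *mm.second};
}
```

Both functions return the same pair. Whether the single pass is actually faster in practice depends on the compiler and on memory bandwidth, so it is worth measuring rather than assuming.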

Maybe someone can give some background about the implementation. Maybe the reason is that min and max are implemented in C and range is not?

Addition: I've already talked about that with a friend of mine and he just made this function faster by implementing it in C++ via:

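#include <Rcpp.h>
using namespace Rcpp;

// Single pass over x, tracking the minimum and maximum together.
// No NA/NaN handling, unlike base R's range().
// [[Rcpp::export]]
NumericVector range3(NumericVector x) {
  R_xlen_t xs = x.size();
  // Start at +Inf / -Inf so that any real value replaces them,
  // including when all values are negative.
  double minValue = R_PosInf;
  double maxValue = R_NegInf;
  for (R_xlen_t i = 0; i < xs; i++) {
    if (x[i] < minValue) minValue = x[i];
    if (x[i] > maxValue) maxValue = x[i];
  }
  NumericVector result(2);
  result[0] = minValue;
  result[1] = maxValue;
  return result;
}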

and this gives the following benchmarks:

n <- 10000000
x <- rnorm(n)
microbenchmark(range(x), range2(x), range3(x))
#Unit: milliseconds
#      expr     min       lq     mean  median       uq      max neval cld
#  range(x) 47.8583 48.30355 58.12575 55.3135 62.10295 149.9648   100   c
# range2(x) 24.8211 25.53615 25.90920 25.6176 25.79175  42.4659   100  b 
# range3(x) 13.2458 13.30385 13.47175 13.3797 13.65410  14.3487   100 a
Jonas asked Apr 03 '19 15:04


1 Answer

Here's the source for range.default (from R 3.6.1):

> range.default
function (..., na.rm = FALSE, finite = FALSE) 
{
    x <- c(..., recursive = TRUE)
    if (is.numeric(x)) {
        if (finite) 
            x <- x[is.finite(x)]
        else if (na.rm) 
            x <- x[!is.na(x)]
        c(min(x), max(x))
    }
    else {
        if (finite) 
            na.rm <- TRUE
        c(min(x, na.rm = na.rm), max(x, na.rm = na.rm))
    }
}

You can see that it does some extra work before calling c(min(x), max(x)) itself: it flattens its arguments with c(..., recursive = TRUE), checks is.numeric(x), and branches on finite and na.rm. It's not optimized for speed; it's a user-friendly function. It seems unlikely that those millisecond differences would be the source of a performance bottleneck.
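For the finite = TRUE branch, range.default first builds the filtered vector x[is.finite(x)] and only then calls min and max on it. As a rough illustration of what a speed-oriented version could do instead, here is a sketch in plain C++ that folds the filter into a single scan. It is not how R implements range; the name range_finite is hypothetical, and NaN stands in for R's NA here.

```cpp
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical helper: min and max over the finite elements of x only.
// Non-finite values are skipped during the scan, so no filtered copy is built.
std::pair<double, double> range_finite(const std::vector<double>& x) {
    double lo = std::numeric_limits<double>::infinity();
    double hi = -std::numeric_limits<double>::infinity();
    for (double v : x) {
        if (!std::isfinite(v)) continue;  // drop NaN and +/-Inf
        if (v < lo) lo = v;
        if (v > hi) hi = v;
    }
    // If no element was finite, the result stays at (+Inf, -Inf).
    return {lo, hi};
}
```

The trade-off is the usual one: the base R version is short and handles many input types through c(), min and max, while a hand-written loop like this covers only one case but avoids the intermediate vector.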

MrFlick answered Nov 11 '22 06:11
