I came across R's <code>range</code> function. It is for sure a useful tool and makes code more readable, but its speed can be doubled by replacing it with a simple one-liner including <code>min</code> and <code>max</code>. I did some benchmarks and the 'bad' performance of the range function surprised me. For comparison I wrote a function called <code>range2</code> which uses min and max (see code). Except for speed, are there any reasons why this function exists if it can be outperformed by a simple one-liner, which is also easily readable? <pre class="prettyprint"><code>require(microbenchmark) range2 <- function(x) c(min(x),max(x)) n <- 1000000 x <- rnorm(n) microbenchmark(range(x), range2(x)) #Unit: milliseconds # expr min lq mean median uq max neval cld # range(x) 4.696101 4.734751 5.321603 4.796301 4.814751 23.0646 100 b #range2(x) 2.477602 2.516101 2.542540 2.535051 2.544052 3.7636 100 a n <- 10000000 x <- rnorm(n) microbenchmark(range(x), range2(x)) # Unit: milliseconds # expr min lq mean median uq max neval cld # range(x) 47.3246 47.9498 58.27992 55.25795 61.98205 146.5100 100 b #range2(x) 24.7063 25.5021 25.59192 25.55245 25.63515 27.1088 100 a </code></pre> For sure this would be not the first bottleneck one wants to get rid of, since we are talking about milliseconds on a vector with 10,000,000 entries, but I expected <code>range</code> to be faster. My simple intuition was: <code>range</code> goes through the data one time and searches for the minimum and maximum at the same time, whereas my <code>range2</code> function goes through the data two times: One time to find the minimum and one time to find the maximum. Maybe someone can give some background about the implementation. Maybe the reason is that <code>min</code> and <code>max</code> are implemented in C and <code>range</code> is not? Addition: I've already talked about that with a friend of mine and he just made this function faster by implementing it in C++ via: <pre class="prettyprint"><code>#include <Rcpp.h> #include <float.h> using namespace Rcpp; // [[Rcpp::export]] NumericVector range3(NumericVector x) { int xs = x.size(); double minValue = FLT_MAX; double maxValue = FLT_MIN; for (int i =0; i < xs; i++) { if (x[i] < minValue) minValue = x[i]; if (x[i] > maxValue) maxValue = x[i]; } Rcpp::NumericVector result(2); result[0] = minValue; result[1] = maxValue; return result; } </code></pre> and this gives the following benchmarks: <pre class="prettyprint"><code>n <- 10000000 x <- rnorm(n) microbenchmark(range(x), range2(x) ,range3(x)) #Unit: milliseconds # expr min lq mean median uq max neval cld # range(x) 47.8583 48.30355 58.12575 55.3135 62.10295 149.9648 100 c # range2(x) 24.8211 25.53615 25.90920 25.6176 25.79175 42.4659 100 b # range3(x) 13.2458 13.30385 13.47175 13.3797 13.65410 14.3487 100 a </code></pre>

Here's the source for <code>range.default</code> (run R 3.6.1) <pre class="prettyprint"><code> > range.default function (..., na.rm = FALSE, finite = FALSE) { x <- c(..., recursive = TRUE) if (is.numeric(x)) { if (finite) x <- x[is.finite(x)] else if (na.rm) x <- x[!is.na(x)] c(min(x), max(x)) } else { if (finite) na.rm <- TRUE c(min(x, na.rm = na.rm), max(x, na.rm = na.rm)) } } </code></pre> You can see that it does a few extra checks before calling <code>c(min(x), max(x))</code> itself. It's not optimized for speed. It's just a user friendly function. It seems unlikely that those millisecond differences would be the source a performance bottleneck.

Why is the range() function slower than a combination of min and max?

Tags:

performance

range

r

max

min

I came across R's range function. It is for sure a useful tool and makes code more readable, but its speed can be doubled by replacing it with a simple one-liner including min and max.

I did some benchmarks and the 'bad' performance of the range function surprised me. For comparison I wrote a function called range2 which uses min and max (see code). Except for speed, are there any reasons why this function exists if it can be outperformed by a simple one-liner, which is also easily readable?

require(microbenchmark)

range2 <- function(x) c(min(x),max(x))  

n <- 1000000
x <- rnorm(n)
microbenchmark(range(x), range2(x))
#Unit: milliseconds
#  expr      min       lq     mean   median       uq     max neval cld
# range(x) 4.696101 4.734751 5.321603 4.796301 4.814751 23.0646   100   b
#range2(x) 2.477602 2.516101 2.542540 2.535051 2.544052  3.7636   100  a 

n <- 10000000
x <- rnorm(n)
microbenchmark(range(x), range2(x))
# Unit: milliseconds
#  expr     min      lq     mean   median       uq      max neval cld
# range(x) 47.3246 47.9498 58.27992 55.25795 61.98205 146.5100   100   b
#range2(x) 24.7063 25.5021 25.59192 25.55245 25.63515  27.1088   100  a

For sure this would be not the first bottleneck one wants to get rid of, since we are talking about milliseconds on a vector with 10,000,000 entries, but I expected range to be faster. My simple intuition was:

range goes through the data one time and searches for the minimum and maximum at the same time, whereas my range2 function goes through the data two times: One time to find the minimum and one time to find the maximum.

Maybe someone can give some background about the implementation. Maybe the reason is that min and max are implemented in C and range is not?

Addition: I've already talked about that with a friend of mine and he just made this function faster by implementing it in C++ via:

#include <Rcpp.h>
#include <float.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector range3(NumericVector x) {
  int xs = x.size();
  double minValue = FLT_MAX;
  double maxValue = FLT_MIN;
  for (int i =0; i < xs; i++) {
    if (x[i] < minValue) minValue = x[i];
    if (x[i] > maxValue) maxValue = x[i];
  }
  Rcpp::NumericVector result(2);
  result[0] = minValue;
  result[1] = maxValue;
  return result;
}

and this gives the following benchmarks:

n <- 10000000
x <- rnorm(n)
microbenchmark(range(x), range2(x) ,range3(x))
#Unit: milliseconds
#      expr     min       lq     mean  median       uq      max neval cld
#  range(x) 47.8583 48.30355 58.12575 55.3135 62.10295 149.9648   100   c
# range2(x) 24.8211 25.53615 25.90920 25.6176 25.79175  42.4659   100  b 
# range3(x) 13.2458 13.30385 13.47175 13.3797 13.65410  14.3487   100 a

560

asked Apr 03 '19 15:04

Jonas

1 Answers

Here's the source for range.default (run R 3.6.1)

 > range.default
function (..., na.rm = FALSE, finite = FALSE) 
{
    x <- c(..., recursive = TRUE)
    if (is.numeric(x)) {
        if (finite) 
            x <- x[is.finite(x)]
        else if (na.rm) 
            x <- x[!is.na(x)]
        c(min(x), max(x))
    }
    else {
        if (finite) 
            na.rm <- TRUE
        c(min(x, na.rm = na.rm), max(x, na.rm = na.rm))
    }
}

You can see that it does a few extra checks before calling c(min(x), max(x)) itself. It's not optimized for speed. It's just a user friendly function. It seems unlikely that those millisecond differences would be the source a performance bottleneck.

158

answered Nov 11 '22 06:11

MrFlick

Related questions
                            
                                Alternative to do.call for large datasets
                            
                                Datetime formatting in R3.0.3 and R3.1.3
                            
                                RODBC ERROR: Could not SQLExecDirect in mysql
                            
                                Embed shiny app into a jupyter notebook
                            
                                .Internal(La_rs(x,FALSE)) crashes R after long (reproducible) script on windows only
                            
                                Package has a VignetteBuilder field but no prebuilt vignette index
                            
                                Controlling placement of figures with the knitr::include_graphics function
                            
                                result of rpart is a root, but data shows Information Gain
                            
                                sidebarMenu does not function properly when using includeHTML
                            
                                Given a finite palette, how to sort from Hot to Cold, i.e. Diverging
                            
                                SQLilte query much slower through RSqlite than sqlite3 command-line interface
                            
                                Matching algorithms in R (bipartite matching, Hungarian algorithm)
                            
                                How insert clickable link in pdf with R
                            
                                Adding Total/Subtotal to the bottom of a DataTable in Shiny
                            
                                How is memory managed while overwriting R objects?
                            
                                How to generate performance stats of clustering from flexclust?
                            
                                "Embedded" data.frame in R. What is it, what is it called, why does it behave the way it does?
                            
                                I see it, but I don't believe it. Legal names in R, piping operations, and the dot
                            
                                R Highcharter Sankey nodes column property trouble
                            
                                Is it possible to get a progress bar with foreach and a "multicore-kind" of backend

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With