Fast bounding of data in R

Tags:

Suppose I have a vector, vec, which is long (starting at 1E8 entries) and would like to bound it to the range [a,b]. I can certainly code vec[vec < a] = a and vec[vec > b] = b, but this requires two passes over the data and a large RAM allocation for the temporary indicator vector (~800MB, twice). The two passes burn time because we can do better if we copy data from main memory to the local cache just once (calls to main memory are bad, as are cache misses). And who knows how much this could be improved with multiple threads, but let's not get greedy. :)

Is there a nice implementation in base R or some package that I'm overlooking, or is this a job for Rcpp (or my old friend data.table)?

514

asked May 06 '12 20:05

Iterator

2 Answers

A naive C solution is

library(inline)

fun4 <-
    cfunction(c(x="numeric", a="numeric", b="numeric"), body4,
              language="C")
body4 <- "
    R_len_t len = Rf_length(x);
    SEXP result = Rf_allocVector(REALSXP, len);
    const double aa = REAL(a)[0], bb = REAL(b)[0], *xp = REAL(x);
    double *rp = REAL(result);

    for (int i = 0; i < len; ++i)
        if (xp[i] < aa)
            rp[i] = aa;
        else if (xp[i] > bb)
            rp[i] = bb;
        else
            rp[i] = xp[i];

    return result;
"
fun4 <-
    cfunction(c(x="numeric", a="numeric", b="numeric"), body4,
              language="C")

With a simple parallel version (as Dirk points out, this is with CFLAGS = -fopenmp in ~/.R/Makevars, and on a platform / compiler supporting openmp)

body5 <- "
    R_len_t len = Rf_length(x);
    const double aa = REAL(a)[0], bb = REAL(b)[0], *xp = REAL(x);
    SEXP result = Rf_allocVector(REALSXP, len);
    double *rp = REAL(result);

#pragma omp parallel for
    for (int i = 0; i < len; ++i)
        if (xp[i] < aa)
            rp[i] = aa;
        else if (xp[i] > bb)
            rp[i] = bb;
        else
            rp[i] = xp[i];

    return result;
"
fun5 <-
    cfunction(c(x="numeric", a="numeric", b="numeric"), body5,
              language="C")

And benchmarks

> z <- runif(1e7)
> benchmark(fun1(z,0.25,0.75), fun4(z, .25, .75), fun5(z, .25, .75),
+           replications=10)
                 test replications elapsed  relative user.self sys.self
1 fun1(z, 0.25, 0.75)           10   9.087 14.609325     8.335    0.739
2 fun4(z, 0.25, 0.75)           10   1.505  2.419614     1.305    0.198
3 fun5(z, 0.25, 0.75)           10   0.622  1.000000     2.156    0.320
  user.child sys.child
1          0         0
2          0         0
3          0         0
> identical(res1 <- fun1(z,0.25,0.75), fun4(z,0.25,0.75))
[1] TRUE
> identical(res1, fun5(z, 0.25, 0.75))
[1] TRUE

on my quad-core laptop. Assumes numeric input, no error checking, NA handling, etc.

196

answered Oct 20 '22 18:10

Martin Morgan

Just to start things off: not much difference between your solution and the pmin/pmax solution (trying things out with n=1e7 rather than n=1e8 because I'm impatient) -- pmin/pmax is actually marginally slower.

fun1 <- function(x,a,b) {x[x<a] <- a; x[x>b] <- b; x}
fun2 <- function(x,a,b) pmin(pmax(x,a),b)
library(rbenchmark)
z <- runif(1e7)

benchmark(fun1(z,0.25,0.75),fun2(z,0.25,0.75),rep=50)

                 test replications elapsed relative user.self sys.self
1 fun1(z, 0.25, 0.75)           10  21.607  1.00000     6.556   15.001
2 fun2(z, 0.25, 0.75)           10  23.336  1.08002     5.656   17.605

answered Oct 20 '22 18:10

Ben Bolker

Related questions
                            
                                Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision
                            
                                Slow len function on dask distributed dataframe
                            
                                Why is this improved sieve slower with pypy?
                            
                                Can the Intel performance monitor counters be used to measure memory bandwidth?
                            
                                OrderBy and Top in LINQ with good performance
                            
                                (1+N) selects with OnetoOne associations
                            
                                Is "SELECT COUNT(column)" faster/slower than "SELECT COUNT(*)"? [duplicate]
                            
                                OpenGL performance for 10,000 static cubes
                            
                                Python: List Comprehensions vs. map
                            
                                What is the default for FACELETS_REFRESH_PERIOD?
                            
                                Postgres: Optimizing querying by datetime
                            
                                Get Firebase child nodes' names without getting their children too in Firebase response?
                            
                                Faster reading of time series from netCDF?
                            
                                Slow rasterization in Dev Tools
                            
                                How is the CLR faster than me when calling Windows API
                            
                                Time complexity of stream filter
                            
                                WPF Data virtualizing ListView
                            
                                SQL Server Latches and their indication of performance issues
                            
                                How does Android emulator performance compare to real device performance?
                            
                                MySQL database backup: performance issues

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fast bounding of data in R

Tags:

performance

r

data.table

rcpp

bigdata

Iterator

People also ask

2 Answers

Martin Morgan

Ben Bolker

Recent Activity

Donate For Us