Fastest way to categorize integer data

Question

I have a mapping given in the table below:

Input  Output
<4     0
5      0.4
6      0.5
7      0.65
8      0.75
9      0.85
>=10   1

Until now, I wrote 3 versions:

k1 <- function(h) {
  if (h <= 4) { k <- 0
  } else if (h == 5) { k <- 0.4
  } else if (h == 6) { k <- 0.5
  } else if (h == 7) { k <- 0.65
  } else if (h == 8) { k <- 0.75
  } else if (h == 9) { k <- 0.85
  } else if (h >= 10) { k <- 1}
  return(k)
}

Second:

k2 <- function(h) {
  k <- 0
  k[h == 5] <- 0.4
  k[h == 6] <- 0.5
  k[h == 7] <- 0.65
  k[h == 8] <- 0.75
  k[h == 9] <- 0.85
  k[h >= 10] <- 1.0
  return(k)
}

Third:

k3 <- function(h) {
  k <- cut(h, breaks=c(0, 5, 6, 7, 8, 9, Inf), labels=c(0, 0.5, 0.65, 0.75, 0.85, 1), right=FALSE)  
  return(k)
}

I need the function in two different scenarios. First, to evaluate a scalar input and second, to evaluate a vector of values.

For scalar input:

h <- 5
microbenchmark(k1(h), k2(h), k3(h))

Unit: microseconds
  expr    min      lq     mean  median     uq     max neval
 k1(h)  1.208  1.5110  2.38264  1.8125  2.114  15.698   100
 k2(h)  4.529  5.5855  8.71286  6.3400  7.849  73.053   100
 k3(h) 52.224 54.0360 71.74953 68.9785 79.393 304.286   100

For vector input:

h <- rep(5, 250)
microbenchmark(sapply(h, k1), k2(h), k3(h))
Unit: microseconds
          expr     min      lq     mean   median       uq     max neval
 sapply(h, k1) 595.592 617.327 641.8598 637.8535 654.9100 857.918   100
         k2(h)  15.397  17.207  19.5470  18.1130  19.6225  49.508   100
         k3(h) 110.486 116.070 131.3117 121.2020 140.6720 275.910   100

Thus, k1 is fastest for scalar input and k2 for vector input.

Do you see any possibility to improve the speed? I cannot believe that such a clumsy if / else code should be fastest in the scalar case. Moreover, I'd like to have a unified function and not two separate ones.

David Robinson · Accepted Answer

First, why are you optimizing for a few microseconds on a scalar input? If the answer is "because the scalar version must be called many, many times," perhaps in a loop, than that's the problem; the operation should be vectorized. (Notice that your k2 can process 250 inputs in the time it would take k1 to process just 15).

In any case, an alternative is:

outputs <- c(0, .4, .5, .65, .75, .85, 1)
k4 <- function(h) {
  output[pmin.int(pmax.int(h, 4), 10) - 3]
}

On my system this just about ties k2 in the vectorized case, but it's about twice as fast as k2 in the scalar case:

h <- 5
microbenchmark(k1(h), k2(h), k3(h), k4(h))
# Unit: nanoseconds
#   expr   min      lq     mean  median      uq    max neval
#  k1(h)   748   933.5  1314.29  1181.5  1655.0   3091   100
#  k2(h)  4131  5424.5  6378.31  6236.5  7021.5  18140   100
#  k3(h) 72149 74495.0 79796.22 75716.0 80936.5 176857   100
#  k4(h)  1730  2259.5  3396.04  3338.5  3801.0  17001   100

h <- rep(5, 250)
microbenchmark(sapply(h, k1), k2(h), k3(h), k4(h))
# Unit: microseconds
#           expr     min       lq      mean   median       uq     max neval
#  sapply(h, k1) 311.099 327.5710 341.05200 335.9330 348.6405 405.830   100
#          k2(h)  13.973  18.4965  20.64351  20.4160  22.4015  34.289   100
#          k3(h) 117.401 125.0180 134.49228 129.2455 138.8240 241.896   100
#          k4(h)  15.042  17.8870  20.33141  19.0690  20.4260  37.386   100

It is also more concise than k2 and easier to extend to a greater number of integer inputs.

Finally, if you're willing to rely on Rcpp, you can get a 5x speedup relative to k2 and k4:

library(Rcpp)
cppFunction('NumericVector k5(IntegerVector h) {
              int n = h.size();
              NumericVector out(n);

              for (int i = 0; i < n; ++i) {
                int val = h[i];
                if (val <= 4) out[i] = 0;
                else if (val == 5) out[i] = .4;
                else if (val == 6) out[i] = .5;
                else if (val == 7) out[i] = .65;
                else if (val == 8) out[i] = .75;
                else if (val == 9) out[i] = .85;
                else if (val >= 10) out[i] = 1;
              }
              return out;
            }')

h <- rep(5, 250)
microbenchmark(sapply(h, k1), k2(h), k3(h), k4(h), k5(h))

# Unit: microseconds
#           expr     min       lq      mean   median       uq     max neval
#  sapply(h, k1) 382.383 410.7310 429.88844 423.7150 442.5765 501.400   100
#          k2(h)  17.129  20.5865  23.95221  22.1340  23.7915  46.827   100
#          k3(h) 123.519 127.6830 142.24084 140.5400 150.1525 218.919   100
#          k4(h)  15.168  18.2705  20.45797  19.1985  20.6105  52.650   100
#          k5(h)   2.988   4.9045   6.49218   5.9135   6.8455  33.219   100

(See the "Vector input, vector output" section of this Rcpp intro as a guide for functions like these). However, note that it's still 2x slower than k1 in the scalar case!

Khashaa · Answer

findInterval is the fastest function in R.

out <- c(0, .4, .5, .65, .75, .85, 1)
k6 <- function(h){
    ind <- findInterval(h, c(4, 5, 6, 7, 8, 9) +0.1) + 1
    out[ind]
}

h <- rep(5, 250)
microbenchmark(k2(h), k4(h), k6(h), unit="relative")
Unit: relative
#  expr      min       lq     mean   median       uq      max neval
# k2(h) 2.283983 2.347714 2.225037 2.392578 2.319125 1.184224   100
# k4(h) 1.830939 1.725286 1.699866 1.701196 1.599973 1.414026   100
# k6(h) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100

Fastest way to categorize integer data

Tags:

r

BayerSe

2 Answers

David Robinson

Khashaa

Recent Activity

Donate For Us

Fastest way to categorize integer data

Tags:

r

BayerSe

2 Answers

David Robinson

Khashaa

Related questions

Recent Activity

Donate For Us