Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recommended way to subset two vectors with the same index vector

I am looking for an efficient way to subset two vectors with the same index vector. Efficient categories can be:

  • Typo-save
  • Memory Consumption
  • Readability
  • Computation Time

Suppose I have two vectors x and y

x <- 1:10
y <- 10:1

and I want to overwrite the values of x with those of y when x is smaller than y. I can do this with (1):

x[x < y] <- y[x < y]

But here I have to write x < y twice, what has disadvantages to have a typo and when making an update I can forget to do this on both sides. So I can create an index vector (2):

idx <- x < y
x[idx] <- y[idx]
rm(idx)

But here I create an additional vector which might need memory and time. I can also use a for-loop (3):

for(i in seq_along(x)) {
   if(x[i] < y[i]) x[i] <- y[i]
}

This might be slow and I don't know if seq_along(x) allocates memory or not. I can use delayedAssign in an environment like (4):

(function() {
  delayedAssign("idx", x < y)
  x[idx] <<- y[idx]
})()

or (5):

evalq({
  delayedAssign("idx", x < y)
  x[idx] <<- y[idx]}, envir = new.env(), enclos = parent.frame())

where I hope that delayedAssign does not create an idxvector in memory. There are several other possibilities already in base like:

x <- ifelse(x < y, y, x) #(6)
x <- sapply(seq_along(x), function(i) {if(x[i] < y[i]) y[i] else x[i]}) #(7)
with(data.frame(idx = x < y), x[idx] <<- y[idx]) #(8)

x < y can be replaced with which(x < y) what can reduce the vector size and improve execution time.

In the end I'm not satisfied with all of those methods.

Is there a recommended way to subset two vectors with one index vector? For me Typo-save and Memory Consumption are more important than Execution time.

Is there a way to see the Memory Consumption of the different methods during execution like using microbenchmark to see execution time, or can it only be done by creating huge vectors and have a look on the system processes?

like image 314
GKi Avatar asked Jun 25 '19 11:06

GKi


People also ask

What is subsetting and indexing vector?

Subsetting and Indexing Vectors First, remember that a vector is just a single sequence of data elements (e.g., the numbers 1 to 10). These data elements are all the same type (e.g., all numbers, or all integers, or all text). The sequence can be of any length, even of length 1.

How do I select a subset of a vector in R?

So, the way to tell R that you want to select some particular elements (i.e., a ‘subset’) from a vector is by placing an ‘index’ in square brackets immediately following the name of the vector. This index tells R which locations in the vector to pull out. For a simple example, try x[1:10]to view the first ten elements of x.

What are the different types of index vectors?

So far, we’ve covered three types of index vectors—logical, positive integer, and negative integer. The only remaining type requires us to introduce the concept of ‘named’ elements. Create a numeric vector with three named elements using vect1 <- c(foo = 11, bar = 2, norf = NA).

How to subset matrices and arrays?

And you can repeat these indices, as with integers. You can subset matrices (2d) and arrays (>2d) higher-dimensional structures very simply with an extension of the method used to index atomic vectors, by giving the ‘coordinate’ of each element, using brackets ( [) as before, separated by commas. E.g., for a 2d matrix: [row, col].


3 Answers

[…] But here I create an additional vector

No you’re not. In fact you are creating one fewer vector than in your previous code, because you’re only computing x < y once instead of twice.

Incidentally, I’d see any explicit use of rm in code as a code smell. Instead, restrict the scope of the computation so that the idx variable is short-lived. To make this happen explicitly, you could use local1:

x = local({
    idx = x < y
    x[idx] = y[idx]
})

(But as shown this would require re-assignment to x which incurs yet another copy that R is unlikely to optimise away; the alternative would be to use global reassignment via <<- or assign inside the local call.)

[…] where I hope that delayedAssign does not create an idxvector in memory

Again, what makes you think that? Of course it creates a vector in memory — after all, you’re subsequently using it. You might be thinking that the computation is performed lazily but while R has recently gained this feature via ALTREP, there are very few situations where such expressions are created automatically, and they aren’t relevant here.

1 Your use of evalq is similar, just more convoluted. local is a convenience wrapper around eval.parent(quote(…)).

like image 78
Konrad Rudolph Avatar answered Nov 10 '22 08:11

Konrad Rudolph


I'd like to present this benchmark, at least for the speed part. I hope it's all right and that it contributes something useful.

Benchmark

# Unit: milliseconds
# expr       min          lq       mean      median          uq        max neval cld
# M1    669.5364    772.9185   2360.234   2285.9725   2691.0597   6748.697    10  a 
# M2    460.5103    540.4875   1477.873    710.7905   2112.8210   4538.728    10  a 
# M2a   776.5886   1879.7114   3596.316   2482.3449   3514.3662  10049.601    10  a 
# M3   7530.8926   7556.8909   7589.172   7587.4924   7619.6962   7668.825    10  a 
# M4    442.4082    545.2067   1671.283    641.0817   2232.8275   6821.164    10  a 
# M5    572.0651    603.7959   1536.910    783.5842   1681.0030   6199.584    10  a 
# M6   2045.0549   2222.2072   5613.928   3949.3604   7877.2988  14514.625    10  a 
# M7 143646.6301 156567.2780 165822.856 165018.5859 166944.0531 221897.671    10   b
# M8    446.6539    552.2921   1044.842    827.3231   1766.1650   2168.388    10  a 
# M9    388.7266    406.7127    684.946    529.0503    554.9486   2093.648    10  a 

Code

set.seed(42)
n <- 1e8
x <- sample(1:9, n, replace=TRUE)
y <- sample(1:9, n, replace=TRUE)

library(microbenchmark)
microbenchmark(M1=x[x < y] <- y[x < y],
               M2={
                 idx <- x < y
                 x[idx] <- y[idx]
               },
               M2a=local({
                 idx=x < y
                 x[idx]=y[idx]
               }),
               M3=for(i in seq_along(x)) {
                 if(x[i] < y[i]) x[i] <- y[i]
               },
               M4={
                 (function() {
                   delayedAssign("idx", x < y)
                   x[idx] <<- y[idx]
                 })()
               },
               M5=evalq({
                 delayedAssign("idx", x < y)
                 x[idx] <<- y[idx]}, envir=new.env(), enclos=parent.frame()),
               M6=ifelse(x < y, y, x),
               M7=sapply(seq_along(x), function(i) {
                 if(x[i] < y[i]) y[i] else x[i]
                 }),
               M8=with(data.frame(idx=x < y), x[idx] <<- y[idx]),
               M9=pmax(x, y),  # off topic
               times=5L)
like image 36
jay.sf Avatar answered Nov 10 '22 08:11

jay.sf


If you define assignments functions such as the one below (improved from this question, and sorry I don't have a packaged version to propose)

`<<-` <- function(e1, e2, value){
  if(missing(value)) 
    eval.parent(substitute(.Primitive("<<-")(e1, e2)))
  else {
    cond <- e1 < e2
    if(any(cond)) 
      replace(e1, cond, value)
    else e1
  }
}

You can do x < y <- y[x <y] instead of x[x < y] <- y[x <y]

Now if on top of it you use dotdot, which replaces .. in the rhs by the lhs, you can do the following:

library(dotdot)
x <- 1:10
y <- 10:1
x < y := y[..]
x
# [1] 10  9  8  7  6  6  7  8  9 10

Not sure how readable it is, I'd usually use the modified <<- for things like x < 0 <- NA and dotdot with a simple lhs, but it's probably the most compact you can get!

Note that x < y will still be evaluated twice here, as dotdot was not coded with this weird case in mind :).

like image 20
Moody_Mudskipper Avatar answered Nov 10 '22 06:11

Moody_Mudskipper