I am looking for an efficient way to subset two vectors with the same index vector. Efficient categories can be: <ul> <li>Typo-save</li> <li>Memory Consumption</li> <li>Readability</li> <li>Computation Time</li> </ul> Suppose I have two vectors <code>x</code> and <code>y</code> <pre class="prettyprint"><code>x <- 1:10 y <- 10:1 </code></pre> and I want to overwrite the values of <code>x</code> with those of <code>y</code> when <code>x</code> is smaller than <code>y</code>. I can do this with (1): <pre class="prettyprint"><code>x[x < y] <- y[x < y] </code></pre> But here I have to write <code>x < y</code> twice, what has disadvantages to have a typo and when making an update I can forget to do this on both sides. So I can create an index vector (2): <pre class="prettyprint"><code>idx <- x < y x[idx] <- y[idx] rm(idx) </code></pre> But here I create an additional vector which might need memory and time. I can also use a <code>for-loop</code> (3): <pre class="prettyprint"><code>for(i in seq_along(x)) { if(x[i] < y[i]) x[i] <- y[i] } </code></pre> This might be slow and I don't know if <code>seq_along(x)</code> allocates memory or not. I can use <code>delayedAssign</code> in an environment like (4): <pre class="prettyprint"><code>(function() { delayedAssign("idx", x < y) x[idx] <<- y[idx] })() </code></pre> or (5): <pre class="prettyprint"><code>evalq({ delayedAssign("idx", x < y) x[idx] <<- y[idx]}, envir = new.env(), enclos = parent.frame()) </code></pre> where I hope that <code>delayedAssign</code> does not create an <code>idx</code>vector in memory. There are several other possibilities already in base like: <pre class="prettyprint"><code>x <- ifelse(x < y, y, x) #(6) x <- sapply(seq_along(x), function(i) {if(x[i] < y[i]) y[i] else x[i]}) #(7) with(data.frame(idx = x < y), x[idx] <<- y[idx]) #(8) </code></pre> <code>x < y</code> can be replaced with <code>which(x < y)</code> what can reduce the vector size and improve execution time. In the end I'm not satisfied with all of those methods. Is there a recommended way to subset two vectors with one index vector? For me Typo-save and Memory Consumption are more important than Execution time. Is there a way to see the Memory Consumption of the different methods during execution like using <code>microbenchmark</code> to see execution time, or can it only be done by creating huge vectors and have a look on the system processes?

<blockquote> […] But here I create an additional vector </blockquote> No you’re not. In fact you are creating one fewer vector than in your previous code, because you’re only computing <code>x < y</code> once instead of twice. Incidentally, I’d see any explicit use of <code>rm</code> in code as a code smell. Instead, restrict the scope of the computation so that the <code>idx</code> variable is short-lived. To make this happen explicitly, you could use <code>local</code>1: <pre class="prettyprint"><code>x = local({ idx = x < y x[idx] = y[idx] }) </code></pre> (But as shown this would require re-assignment to <code>x</code> which incurs yet another copy that R is unlikely to optimise away; the alternative would be to use global reassignment via <code><<-</code> or <code>assign</code> inside the <code>local</code> call.) <blockquote> […] where I hope that <code>delayedAssign</code> does not create an <code>idx</code>vector in memory </blockquote> Again, what makes you think that? Of course it creates a vector in memory — after all, you’re subsequently using it. You might be thinking that the computation is performed lazily but while R has recently gained this feature via ALTREP, there are very few situations where such expressions are created automatically, and they aren’t relevant here. 1 Your use of <code>evalq</code> is similar, just more convoluted. <code>local</code> is a convenience wrapper around <code>eval.parent(quote(…))</code>.

If you define assignments functions such as the one below (improved from this question, and sorry I don't have a packaged version to propose) <pre class="prettyprint"><code>`<<-` <- function(e1, e2, value){ if(missing(value)) eval.parent(substitute(.Primitive("<<-")(e1, e2))) else { cond <- e1 < e2 if(any(cond)) replace(e1, cond, value) else e1 } } </code></pre> You can do <code>x < y <- y[x <y]</code> instead of <code>x[x < y] <- y[x <y]</code> Now if on top of it you use dotdot, which replaces <code>..</code> in the rhs by the lhs, you can do the following: <pre class="prettyprint"><code>library(dotdot) x <- 1:10 y <- 10:1 x < y := y[..] x # [1] 10 9 8 7 6 6 7 8 9 10 </code></pre> Not sure how readable it is, I'd usually use the modified <code><<-</code> for things like <code>x < 0 <- NA</code> and dotdot with a simple lhs, but it's probably the most compact you can get! Note that <code>x < y</code> will still be evaluated twice here, as dotdot was not coded with this weird case in mind :).

Recommended way to subset two vectors with the same index vector

Q: What is subsetting and indexing vector?

Subsetting and Indexing Vectors First, remember that a vector is just a single sequence of data elements (e.g., the numbers 1 to 10). These data elements are all the same type (e.g., all numbers, or all integers, or all text). The sequence can be of any length, even of length 1.

Q: How do I select a subset of a vector in R?

So, the way to tell R that you want to select some particular elements (i.e., a ‘subset’) from a vector is by placing an ‘index’ in square brackets immediately following the name of the vector. This index tells R which locations in the vector to pull out. For a simple example, try x[1:10]to view the first ten elements of x.

Q: What are the different types of index vectors?

So far, we’ve covered three types of index vectors—logical, positive integer, and negative integer. The only remaining type requires us to introduce the concept of ‘named’ elements. Create a numeric vector with three named elements using vect1 <- c(foo = 11, bar = 2, norf = NA).

Q: How to subset matrices and arrays?

And you can repeat these indices, as with integers. You can subset matrices (2d) and arrays (>2d) higher-dimensional structures very simply with an extension of the method used to index atomic vectors, by giving the ‘coordinate’ of each element, using brackets ( [) as before, separated by commas. E.g., for a 2d matrix: [row, col].

Tags:

performance

indexing

memory

r

subset

I am looking for an efficient way to subset two vectors with the same index vector. Efficient categories can be:

Typo-save
Memory Consumption
Readability
Computation Time

Suppose I have two vectors x and y

x <- 1:10
y <- 10:1

and I want to overwrite the values of x with those of y when x is smaller than y. I can do this with (1):

x[x < y] <- y[x < y]

But here I have to write x < y twice, what has disadvantages to have a typo and when making an update I can forget to do this on both sides. So I can create an index vector (2):

idx <- x < y
x[idx] <- y[idx]
rm(idx)

But here I create an additional vector which might need memory and time. I can also use a for-loop (3):

for(i in seq_along(x)) {
   if(x[i] < y[i]) x[i] <- y[i]
}

This might be slow and I don't know if seq_along(x) allocates memory or not. I can use delayedAssign in an environment like (4):

(function() {
  delayedAssign("idx", x < y)
  x[idx] <<- y[idx]
})()

or (5):

evalq({
  delayedAssign("idx", x < y)
  x[idx] <<- y[idx]}, envir = new.env(), enclos = parent.frame())

where I hope that delayedAssign does not create an idxvector in memory. There are several other possibilities already in base like:

x <- ifelse(x < y, y, x) #(6)
x <- sapply(seq_along(x), function(i) {if(x[i] < y[i]) y[i] else x[i]}) #(7)
with(data.frame(idx = x < y), x[idx] <<- y[idx]) #(8)

x < y can be replaced with which(x < y) what can reduce the vector size and improve execution time.

In the end I'm not satisfied with all of those methods.

Is there a recommended way to subset two vectors with one index vector? For me Typo-save and Memory Consumption are more important than Execution time.

Is there a way to see the Memory Consumption of the different methods during execution like using microbenchmark to see execution time, or can it only be done by creating huge vectors and have a look on the system processes?

314

asked Jun 25 '19 11:06

GKi

3 Answers

[…] But here I create an additional vector

No you’re not. In fact you are creating one fewer vector than in your previous code, because you’re only computing x < y once instead of twice.

Incidentally, I’d see any explicit use of rm in code as a code smell. Instead, restrict the scope of the computation so that the idx variable is short-lived. To make this happen explicitly, you could use local¹:

x = local({
    idx = x < y
    x[idx] = y[idx]
})

(But as shown this would require re-assignment to x which incurs yet another copy that R is unlikely to optimise away; the alternative would be to use global reassignment via <<- or assign inside the local call.)

[…] where I hope that delayedAssign does not create an idxvector in memory

Again, what makes you think that? Of course it creates a vector in memory — after all, you’re subsequently using it. You might be thinking that the computation is performed lazily but while R has recently gained this feature via ALTREP, there are very few situations where such expressions are created automatically, and they aren’t relevant here.

¹ Your use of evalq is similar, just more convoluted. local is a convenience wrapper around eval.parent(quote(…)).

answered Nov 10 '22 08:11

Konrad Rudolph

I'd like to present this benchmark, at least for the speed part. I hope it's all right and that it contributes something useful.

Benchmark

# Unit: milliseconds
# expr       min          lq       mean      median          uq        max neval cld
# M1    669.5364    772.9185   2360.234   2285.9725   2691.0597   6748.697    10  a 
# M2    460.5103    540.4875   1477.873    710.7905   2112.8210   4538.728    10  a 
# M2a   776.5886   1879.7114   3596.316   2482.3449   3514.3662  10049.601    10  a 
# M3   7530.8926   7556.8909   7589.172   7587.4924   7619.6962   7668.825    10  a 
# M4    442.4082    545.2067   1671.283    641.0817   2232.8275   6821.164    10  a 
# M5    572.0651    603.7959   1536.910    783.5842   1681.0030   6199.584    10  a 
# M6   2045.0549   2222.2072   5613.928   3949.3604   7877.2988  14514.625    10  a 
# M7 143646.6301 156567.2780 165822.856 165018.5859 166944.0531 221897.671    10   b
# M8    446.6539    552.2921   1044.842    827.3231   1766.1650   2168.388    10  a 
# M9    388.7266    406.7127    684.946    529.0503    554.9486   2093.648    10  a

Code

set.seed(42)
n <- 1e8
x <- sample(1:9, n, replace=TRUE)
y <- sample(1:9, n, replace=TRUE)

library(microbenchmark)
microbenchmark(M1=x[x < y] <- y[x < y],
               M2={
                 idx <- x < y
                 x[idx] <- y[idx]
               },
               M2a=local({
                 idx=x < y
                 x[idx]=y[idx]
               }),
               M3=for(i in seq_along(x)) {
                 if(x[i] < y[i]) x[i] <- y[i]
               },
               M4={
                 (function() {
                   delayedAssign("idx", x < y)
                   x[idx] <<- y[idx]
                 })()
               },
               M5=evalq({
                 delayedAssign("idx", x < y)
                 x[idx] <<- y[idx]}, envir=new.env(), enclos=parent.frame()),
               M6=ifelse(x < y, y, x),
               M7=sapply(seq_along(x), function(i) {
                 if(x[i] < y[i]) y[i] else x[i]
                 }),
               M8=with(data.frame(idx=x < y), x[idx] <<- y[idx]),
               M9=pmax(x, y),  # off topic
               times=5L)

answered Nov 10 '22 08:11

jay.sf

If you define assignments functions such as the one below (improved from this question, and sorry I don't have a packaged version to propose)

`<<-` <- function(e1, e2, value){
  if(missing(value)) 
    eval.parent(substitute(.Primitive("<<-")(e1, e2)))
  else {
    cond <- e1 < e2
    if(any(cond)) 
      replace(e1, cond, value)
    else e1
  }
}

You can do x < y <- y[x <y] instead of x[x < y] <- y[x <y]

Now if on top of it you use dotdot, which replaces .. in the rhs by the lhs, you can do the following:

library(dotdot)
x <- 1:10
y <- 10:1
x < y := y[..]
x
# [1] 10  9  8  7  6  6  7  8  9 10

Not sure how readable it is, I'd usually use the modified <<- for things like x < 0 <- NA and dotdot with a simple lhs, but it's probably the most compact you can get!

Note that x < y will still be evaluated twice here, as dotdot was not coded with this weird case in mind :).

answered Nov 10 '22 06:11

Moody_Mudskipper

Related questions
                            
                                R plotly ignoring text label alignment hjust
                            
                                How to interpret the probabilities (p0, p1) of the result of h2o.predict()
                            
                                Change output width of plotly chart size in R Markdown PDF output
                            
                                Loading shiny module only when menu items is clicked
                            
                                How can I have the search option based on typing letters in pickerInput using shinyWidgets?
                            
                                How to add a complex label with italics and a variable to ggplot?
                            
                                Stack a named Date list to data.frame
                            
                                Naive Bayes in Quanteda vs caret: wildly different results
                            
                                Mutliple formatted text on pptx by using officer package on R
                            
                                Image processing: Average grayscale images
                            
                                Unable to pass user inputs into R shiny modules
                            
                                R's equivalent of string.replace() in python
                            
                                Shiny widgets in DT Table
                            
                                R Mutate multiple columns with ifelse()-condition
                            
                                Reading numpy ndarrays into R?
                            
                                How to format the input of Shiny updated numericInput but not change the actual value?
                            
                                Extract p-value from checkresiduals function
                            
                                Converting unit abbreviations to numbers
                            
                                Change filename when downloading data from datatable R
                            
                                Using the R cut function - how do the breaks and labels options work

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With