I am looking for an efficient way to subset two vectors with the same index vector. Efficient categories can be:
Suppose I have two vectors x
and y
x <- 1:10
y <- 10:1
and I want to overwrite the values of x
with those of y
when x
is smaller than y
. I can do this with (1):
x[x < y] <- y[x < y]
But here I have to write x < y
twice, what has disadvantages to have a typo and when making an update I can forget to do this on both sides. So I can create an index vector (2):
idx <- x < y
x[idx] <- y[idx]
rm(idx)
But here I create an additional vector which might need memory and time. I can also use a for-loop
(3):
for(i in seq_along(x)) {
if(x[i] < y[i]) x[i] <- y[i]
}
This might be slow and I don't know if seq_along(x)
allocates memory or not.
I can use delayedAssign
in an environment like (4):
(function() {
delayedAssign("idx", x < y)
x[idx] <<- y[idx]
})()
or (5):
evalq({
delayedAssign("idx", x < y)
x[idx] <<- y[idx]}, envir = new.env(), enclos = parent.frame())
where I hope that delayedAssign
does not create an idx
vector in memory. There are several other possibilities already in base like:
x <- ifelse(x < y, y, x) #(6)
x <- sapply(seq_along(x), function(i) {if(x[i] < y[i]) y[i] else x[i]}) #(7)
with(data.frame(idx = x < y), x[idx] <<- y[idx]) #(8)
x < y
can be replaced with which(x < y)
what can reduce the vector size and improve execution time.
In the end I'm not satisfied with all of those methods.
Is there a recommended way to subset two vectors with one index vector? For me Typo-save and Memory Consumption are more important than Execution time.
Is there a way to see the Memory Consumption of the different methods during execution like using microbenchmark
to see execution time, or can it only be done by creating huge vectors and have a look on the system processes?
Subsetting and Indexing Vectors First, remember that a vector is just a single sequence of data elements (e.g., the numbers 1 to 10). These data elements are all the same type (e.g., all numbers, or all integers, or all text). The sequence can be of any length, even of length 1.
So, the way to tell R that you want to select some particular elements (i.e., a ‘subset’) from a vector is by placing an ‘index’ in square brackets immediately following the name of the vector. This index tells R which locations in the vector to pull out. For a simple example, try x[1:10]to view the first ten elements of x.
So far, we’ve covered three types of index vectors—logical, positive integer, and negative integer. The only remaining type requires us to introduce the concept of ‘named’ elements. Create a numeric vector with three named elements using vect1 <- c(foo = 11, bar = 2, norf = NA).
And you can repeat these indices, as with integers. You can subset matrices (2d) and arrays (>2d) higher-dimensional structures very simply with an extension of the method used to index atomic vectors, by giving the ‘coordinate’ of each element, using brackets ( [) as before, separated by commas. E.g., for a 2d matrix: [row, col].
[…] But here I create an additional vector
No you’re not. In fact you are creating one fewer vector than in your previous code, because you’re only computing x < y
once instead of twice.
Incidentally, I’d see any explicit use of rm
in code as a code smell. Instead, restrict the scope of the computation so that the idx
variable is short-lived. To make this happen explicitly, you could use local
1:
x = local({
idx = x < y
x[idx] = y[idx]
})
(But as shown this would require re-assignment to x
which incurs yet another copy that R is unlikely to optimise away; the alternative would be to use global reassignment via <<-
or assign
inside the local
call.)
[…] where I hope that
delayedAssign
does not create anidx
vector in memory
Again, what makes you think that? Of course it creates a vector in memory — after all, you’re subsequently using it. You might be thinking that the computation is performed lazily but while R has recently gained this feature via ALTREP, there are very few situations where such expressions are created automatically, and they aren’t relevant here.
1 Your use of evalq
is similar, just more convoluted. local
is a convenience wrapper around eval.parent(quote(…))
.
I'd like to present this benchmark, at least for the speed part. I hope it's all right and that it contributes something useful.
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# M1 669.5364 772.9185 2360.234 2285.9725 2691.0597 6748.697 10 a
# M2 460.5103 540.4875 1477.873 710.7905 2112.8210 4538.728 10 a
# M2a 776.5886 1879.7114 3596.316 2482.3449 3514.3662 10049.601 10 a
# M3 7530.8926 7556.8909 7589.172 7587.4924 7619.6962 7668.825 10 a
# M4 442.4082 545.2067 1671.283 641.0817 2232.8275 6821.164 10 a
# M5 572.0651 603.7959 1536.910 783.5842 1681.0030 6199.584 10 a
# M6 2045.0549 2222.2072 5613.928 3949.3604 7877.2988 14514.625 10 a
# M7 143646.6301 156567.2780 165822.856 165018.5859 166944.0531 221897.671 10 b
# M8 446.6539 552.2921 1044.842 827.3231 1766.1650 2168.388 10 a
# M9 388.7266 406.7127 684.946 529.0503 554.9486 2093.648 10 a
set.seed(42)
n <- 1e8
x <- sample(1:9, n, replace=TRUE)
y <- sample(1:9, n, replace=TRUE)
library(microbenchmark)
microbenchmark(M1=x[x < y] <- y[x < y],
M2={
idx <- x < y
x[idx] <- y[idx]
},
M2a=local({
idx=x < y
x[idx]=y[idx]
}),
M3=for(i in seq_along(x)) {
if(x[i] < y[i]) x[i] <- y[i]
},
M4={
(function() {
delayedAssign("idx", x < y)
x[idx] <<- y[idx]
})()
},
M5=evalq({
delayedAssign("idx", x < y)
x[idx] <<- y[idx]}, envir=new.env(), enclos=parent.frame()),
M6=ifelse(x < y, y, x),
M7=sapply(seq_along(x), function(i) {
if(x[i] < y[i]) y[i] else x[i]
}),
M8=with(data.frame(idx=x < y), x[idx] <<- y[idx]),
M9=pmax(x, y), # off topic
times=5L)
If you define assignments functions such as the one below (improved from this question, and sorry I don't have a packaged version to propose)
`<<-` <- function(e1, e2, value){
if(missing(value))
eval.parent(substitute(.Primitive("<<-")(e1, e2)))
else {
cond <- e1 < e2
if(any(cond))
replace(e1, cond, value)
else e1
}
}
You can do x < y <- y[x <y]
instead of x[x < y] <- y[x <y]
Now if on top of it you use dotdot, which replaces ..
in the rhs by the lhs, you can do the following:
library(dotdot)
x <- 1:10
y <- 10:1
x < y := y[..]
x
# [1] 10 9 8 7 6 6 7 8 9 10
Not sure how readable it is, I'd usually use the modified <<-
for things like x < 0 <- NA
and dotdot with a simple lhs, but it's probably the most compact you can get!
Note that x < y
will still be evaluated twice here, as dotdot was not coded with this weird case in mind :).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With