Can anyone help me make this R code more efficient?
I'm trying to write a function that changes a list of strings to a vector of strings, or a list of numbers to a vector of numbers, of lists of typed elements to vectors of a certain type in general.
I want to able to change lists to a particular type of vector if they have the folllowing properties:
They are homogenously typed. Every element of the list is of type 'character', or 'complex' or so on.
Each element of the list is length-one.
as_atomic <- local({
assert_is_valid_elem <- function (elem, mode) {
if (length(elem) != 1 || !is(elem, mode)) {
stop("")
}
TRUE
}
function (coll, mode) {
if (length(coll) == 0) {
vector(mode)
} else {
# check that the generic vector is composed only
# of length-one values, and each value has the correct type.
# uses more memory that 'for', but is presumably faster.
vapply(coll, assert_is_valid_elem, logical(1), mode = mode)
as.vector(coll, mode = mode)
}
}
})
For example,
as_atomic(list(1, 2, 3), 'numeric')
as.numeric(c(1,2,3))
# this fails (mixed types)
as_atomic( list(1, 'a', 2), 'character' )
# ERROR.
# this fails (non-length one element)
as_atomic( list(1, c(2,3,4), 5), 'numeric' )
# ERROR.
# this fails (cannot convert numbers to strings)
as_atomic( list(1, 2, 3), 'character' )
# ERROR.
The above code works fine, but it is very slow and I can't see any way to optimise it without changing the behaviour of the function. It's important the function 'as_atomic' behaves as it does; I can't switch to a base function that I'm familiar with (unlist, for example), since I need to throw an error for bad lists.
require(microbenchmark)
microbenchmark(
as_atomic( as.list(1:1000), 'numeric'),
vapply(1:1000, identity, integer(1)),
unit = 'ns'
)
On my (fairly fast) machine the benchmark has a frequency of about 40Hz, so this function is almost always rate limiting in my code. The vapply control benchmark has a frequency of about 1650Hz, which is still quite slow.
Is there any way to dramatically improve the efficiency of this operation? Any advice is appreciated.
If any clarification or edits are needed, please leave a comment below.
Hello all,
Sorry for the very belated reply; I had exams I needed to get to before I could try re-implement this.
Thank you all for the performance tips. I got the performance up from a terrible 40hz to a more acceptable 600hz using plain R code.
The largest speedups was from using typeof or mode instead of is; this really sped up the tight inner checking loop.
I'll probably have to bite the bullet and rewrite this in rcpp to get it really performant though.
There are two parts to this problem:
First, I'd avoid is()
because it's known to be slow. That gives:
check_valid <- function (elem, mode) {
if (length(elem) != 1) stop("Must be length 1")
if (mode(elem) != mode) stop("Not desired type")
TRUE
}
Now we need to figure out whether a loop or apply variant is faster. We'll benchmark with the worst possible case where all inputs are valid.
worst <- as.list(0:101)
library(microbenchmark)
options(digits = 3)
microbenchmark(
`for` = for(i in seq_along(worst)) check_valid(worst[[i]], "numeric"),
lapply = lapply(worst, check_valid, "numeric"),
vapply = vapply(worst, check_valid, "numeric", FUN.VALUE = logical(1))
)
## Unit: microseconds
## expr min lq median uq max neval
## for 278 293 301 318 1184 100
## lapply 274 282 291 310 1041 100
## vapply 273 284 288 298 1062 100
The three methods are basically tied. lapply()
is very slightly
faster, probably because of the special C tricks that it uses
Now let's look at a few ways of coercing a list to a vector:
change_mode <- function(x, mode) {
mode(x) <- mode
x
}
microbenchmark(
change_mode = change_mode(worst, "numeric"),
unlist = unlist(worst),
as.vector = as.vector(worst, "numeric")
)
## Unit: microseconds
## expr min lq median uq max neval
## change_mode 19.13 20.83 22.36 23.9 167.51 100
## unlist 2.42 2.75 3.11 3.3 22.58 100
## as.vector 1.79 2.13 2.37 2.6 8.05 100
So it looks like you're already using the fastest method, and the total cost is dominated by the check.
Another idea is that we might be able to get a little faster by looping over the vector once, instead of once to check and once to coerce:
as_atomic_for <- function (x, mode) {
out <- vector(mode, length(x))
for (i in seq_along(x)) {
check_valid(x[[i]], mode)
out[i] <- x[[i]]
}
out
}
microbenchmark(
as_atomic_for(worst, "numeric")
)
## Unit: microseconds
## expr min lq median uq max neval
## as_atomic_for(worst, "numeric") 497 524 557 685 1279 100
That's definitely worse.
All in all, I think this suggests if you want to make this function faster, you should try vectorising the check function in Rcpp.
Try:
as_atomic_2 <- function(x, mode) {
if(!length(unique(vapply(x, typeof, ""))) == 1L) stop("mixed types")
as.vector(x, mode)
}
as_atomic_2(list(1, 2, 3), 'numeric')
# [1] 1 2 3
as_atomic_2(list(1, 'a', 2), 'character')
# Error in as_atomic_2(list(1, "a", 2), "character") : mixed types
as_atomic_2(list(1, c(2,3,4), 5), 'numeric' )
# Error in as.vector(x, mode) :
# (list) object cannot be coerced to type 'double'
microbenchmark(
as_atomic( as.list(1:1000), 'numeric'),
as_atomic_2(as.list(1:1000), 'numeric'),
vapply(1:1000, identity, integer(1)),
unit = 'ns'
)
# Unit: nanoseconds
# expr min lq median
# as_atomic(as.list(1:1000), "numeric") 23571781 24059432 24747115.5
# as_atomic_2(as.list(1:1000), "numeric") 1008945 1038749 1062153.5
# vapply(1:1000, identity, integer(1)) 719317 762286 778376.5
Defining your own function to do the type checking seems to be the bottleneck. Using one of the builtin functions gives a large speedup. However, the call changes somewhat (although it might be possible to change that). The examples below are the fastest versions I could come up with:
As mentioned using is.numeric
, is.character
gives the largest speedup:
as_atomic2 <- function(l, check_type) {
if (!all(vapply(l, check_type, logical(1)))) stop("")
r <- unlist(l)
if (length(r) != length(l)) stop("")
r
}
The following is the fastest I could come up with using the original interface:
as_atomic3 <- function(l, type) {
if (!all(vapply(l, mode, character(length(type))) == type)) stop("")
r <- unlist(l)
if (length(r) != length(l)) stop("")
r
}
Benchmarking against original:
res <- microbenchmark(
as_atomic( as.list(1:1000), 'numeric'),
as_atomic2( as.list(1:1000), is.numeric),
as_atomic3( as.list(1:1000), 'numeric'),
unit = 'ns'
)
# expr min lq median uq max neval
# as_atomic(as.list(1:1000), "numeric") 13566275 14399729.0 14793812.0 15093380.5 34037349 100
# as_atomic2(as.list(1:1000), is.numeric) 314328 325977.0 346353.5 369852.5 896991 100
# as_atomic3(as.list(1:1000), "numeric") 856423 899942.5 967705.5 1023238.0 1598593 100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With