Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient conversion to vectors in R

Can anyone help me make this R code more efficient?

I'm trying to write a function that changes a list of strings to a vector of strings, or a list of numbers to a vector of numbers, of lists of typed elements to vectors of a certain type in general.

I want to able to change lists to a particular type of vector if they have the folllowing properties:

  1. They are homogenously typed. Every element of the list is of type 'character', or 'complex' or so on.

  2. Each element of the list is length-one.

    as_atomic <- local({
    
        assert_is_valid_elem <- function (elem, mode) {
    
            if (length(elem) != 1 || !is(elem, mode)) {
                stop("")
            }
            TRUE
        }
    
        function (coll, mode) {
    
            if (length(coll) == 0) {
                vector(mode)
            } else {
                # check that the generic vector is composed only
                # of length-one values, and each value has the correct type.
    
                # uses more memory that 'for', but is presumably faster.
                vapply(coll, assert_is_valid_elem, logical(1), mode = mode)
    
                as.vector(coll, mode = mode)
            }
        }
    })
    

For example,

as_atomic(list(1, 2, 3), 'numeric')
as.numeric(c(1,2,3))

# this fails (mixed types)
as_atomic( list(1, 'a', 2), 'character' )
# ERROR.

# this fails (non-length one element)
as_atomic( list(1, c(2,3,4), 5), 'numeric' )
# ERROR.

# this fails (cannot convert numbers to strings)
as_atomic( list(1, 2, 3), 'character' )
# ERROR.

The above code works fine, but it is very slow and I can't see any way to optimise it without changing the behaviour of the function. It's important the function 'as_atomic' behaves as it does; I can't switch to a base function that I'm familiar with (unlist, for example), since I need to throw an error for bad lists.

require(microbenchmark)

microbenchmark(
    as_atomic( as.list(1:1000), 'numeric'),
    vapply(1:1000, identity, integer(1)),
    unit = 'ns'
)

On my (fairly fast) machine the benchmark has a frequency of about 40Hz, so this function is almost always rate limiting in my code. The vapply control benchmark has a frequency of about 1650Hz, which is still quite slow.

Is there any way to dramatically improve the efficiency of this operation? Any advice is appreciated.

If any clarification or edits are needed, please leave a comment below.

Edit:

Hello all,

Sorry for the very belated reply; I had exams I needed to get to before I could try re-implement this.

Thank you all for the performance tips. I got the performance up from a terrible 40hz to a more acceptable 600hz using plain R code.

The largest speedups was from using typeof or mode instead of is; this really sped up the tight inner checking loop.

I'll probably have to bite the bullet and rewrite this in rcpp to get it really performant though.

like image 348
Róisín Grannell Avatar asked Mar 19 '14 18:03

Róisín Grannell


3 Answers

There are two parts to this problem:

  1. checking that inputs are valid
  2. coercing a list to a vector

Checking valid inputs

First, I'd avoid is() because it's known to be slow. That gives:

check_valid <- function (elem, mode) {
  if (length(elem) != 1) stop("Must be length 1")
  if (mode(elem) != mode) stop("Not desired type")

  TRUE
}

Now we need to figure out whether a loop or apply variant is faster. We'll benchmark with the worst possible case where all inputs are valid.

worst <- as.list(0:101)

library(microbenchmark)
options(digits = 3)
microbenchmark(
  `for` = for(i in seq_along(worst)) check_valid(worst[[i]], "numeric"),
  lapply = lapply(worst, check_valid, "numeric"),
  vapply = vapply(worst, check_valid, "numeric", FUN.VALUE = logical(1))
)

## Unit: microseconds
##    expr min  lq median  uq  max neval
##     for 278 293    301 318 1184   100
##  lapply 274 282    291 310 1041   100
##  vapply 273 284    288 298 1062   100

The three methods are basically tied. lapply() is very slightly faster, probably because of the special C tricks that it uses

Coercing list to vector

Now let's look at a few ways of coercing a list to a vector:

change_mode <- function(x, mode) {
  mode(x) <- mode
  x
}

microbenchmark(
  change_mode = change_mode(worst, "numeric"),
  unlist = unlist(worst),
  as.vector = as.vector(worst, "numeric")
)

## Unit: microseconds
##         expr   min    lq median   uq    max neval
##  change_mode 19.13 20.83  22.36 23.9 167.51   100
##       unlist  2.42  2.75   3.11  3.3  22.58   100
##    as.vector  1.79  2.13   2.37  2.6   8.05   100

So it looks like you're already using the fastest method, and the total cost is dominated by the check.

Alternative approach

Another idea is that we might be able to get a little faster by looping over the vector once, instead of once to check and once to coerce:

as_atomic_for <- function (x, mode) {
  out <- vector(mode, length(x))

  for (i in seq_along(x)) {
    check_valid(x[[i]], mode)
    out[i] <- x[[i]]
  }

  out
}
microbenchmark(
  as_atomic_for(worst, "numeric")
)

## Unit: microseconds
##                             expr min  lq median  uq  max neval
##  as_atomic_for(worst, "numeric") 497 524    557 685 1279   100

That's definitely worse.

All in all, I think this suggests if you want to make this function faster, you should try vectorising the check function in Rcpp.

like image 151
hadley Avatar answered Oct 09 '22 20:10

hadley


Try:

as_atomic_2 <- function(x, mode) {
  if(!length(unique(vapply(x, typeof, ""))) == 1L) stop("mixed types")
  as.vector(x, mode)
}
as_atomic_2(list(1, 2, 3), 'numeric')
# [1] 1 2 3
as_atomic_2(list(1, 'a', 2), 'character')
# Error in as_atomic_2(list(1, "a", 2), "character") : mixed types
as_atomic_2(list(1, c(2,3,4), 5), 'numeric' )
# Error in as.vector(x, mode) : 
#   (list) object cannot be coerced to type 'double'

microbenchmark(
  as_atomic( as.list(1:1000), 'numeric'),
  as_atomic_2(as.list(1:1000), 'numeric'),
  vapply(1:1000, identity, integer(1)),
  unit = 'ns'
)    
# Unit: nanoseconds
#                                     expr      min       lq     median 
#    as_atomic(as.list(1:1000), "numeric") 23571781 24059432 24747115.5 
#  as_atomic_2(as.list(1:1000), "numeric")  1008945  1038749  1062153.5 
#     vapply(1:1000, identity, integer(1))   719317   762286   778376.5 
like image 4
BrodieG Avatar answered Oct 09 '22 19:10

BrodieG


Defining your own function to do the type checking seems to be the bottleneck. Using one of the builtin functions gives a large speedup. However, the call changes somewhat (although it might be possible to change that). The examples below are the fastest versions I could come up with:

As mentioned using is.numeric, is.character gives the largest speedup:

as_atomic2 <- function(l, check_type) {
  if (!all(vapply(l, check_type, logical(1)))) stop("")
  r <- unlist(l)
  if (length(r) != length(l)) stop("")
  r
} 

The following is the fastest I could come up with using the original interface:

as_atomic3 <- function(l, type) {
  if (!all(vapply(l, mode, character(length(type))) == type)) stop("")
  r <- unlist(l)
  if (length(r) != length(l)) stop("")
  r
}

Benchmarking against original:

res <- microbenchmark(
    as_atomic( as.list(1:1000), 'numeric'),
    as_atomic2( as.list(1:1000), is.numeric),
    as_atomic3( as.list(1:1000), 'numeric'),
    unit = 'ns'
)
#                                    expr      min         lq     median         uq      max neval
#   as_atomic(as.list(1:1000), "numeric") 13566275 14399729.0 14793812.0 15093380.5 34037349   100
# as_atomic2(as.list(1:1000), is.numeric)   314328   325977.0   346353.5   369852.5   896991   100
#  as_atomic3(as.list(1:1000), "numeric")   856423   899942.5   967705.5  1023238.0  1598593   100
like image 3
Jan van der Laan Avatar answered Oct 09 '22 19:10

Jan van der Laan