Can my use of paste0() in R be corrected so that this function runs as fast as the original Python example?

Tags: python, string, r
I am playing around with some R code I found recently that imitates parts of Norvig's spell checker written in Python; in particular, I am trying to work out the right way to implement the edit2 function in R:

alphabet = 'abcdefghijklmnopqrstuvwxyz'  # as in Norvig's original code

def splits(word):
    return [(word[:i], word[i:]) 
            for i in range(len(word)+1)]

def edits1(word):
    pairs      = splits(word)
    deletes    = [a+b[1:]           for (a, b) in pairs if b]
    transposes = [a+b[1]+b[0]+b[2:] for (a, b) in pairs if len(b) > 1]
    replaces   = [a+c+b[1:]         for (a, b) in pairs for c in alphabet if b]
    inserts    = [a+c+b             for (a, b) in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

However, in my benchmarks, generating thousands of small strings in R using paste0 (or str_c from stringr, or stri_join from stringi) results in code that is roughly 10x (or ~100x, or ~50x, respectively) slower than the Python implementation shown by Norvig. (Yes, interestingly, the stringr- and stringi-based functions are even slower than paste0; a bare-bones comparison sketch follows the question list below.) My questions are (with #3 being the main one I want resolved):

  1. Am I doing this correctly (is the code "right")?

  2. If so, is this a known issue of R (extremely slow string concatenation)?

  3. Is there anything I can do about this to make this significantly faster (one or more orders of magnitude, at least) without rewriting the whole function in Rcpp11 or something like that?
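For reference, here is roughly how the single-call comparison behind the numbers above can be reproduced (illustrative only; my figures above come from benchmarking the full edit2 function, not isolated calls):

# compare the three concatenation functions on a single small join
library(rbenchmark)
library(stringr)
library(stringi)

benchmark(paste0("spel", "ling"),
          str_c("spel", "ling"),
          stri_join("spel", "ling"),
          replications = 100000)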

Here is the R code I came up with for the edit2 function:

library(stringi)  # for stri_sub() and stri_reverse()

# 1. generate a list of all binary splits of a word
binary.splits <- function(w) {
  n <- nchar(w)
  lapply(0:n, function(x)
         c(stri_sub(w, 0, x), stri_sub(w, x + 1, n)))
}

# 2. generate a list of all bigrams for a word
bigram.unsafe <- function(word)
  sapply(2:nchar(word), function(i) substr(word, i-1, i))
bigram <- function(word)
  if (nchar(word) > 1) bigram.unsafe(word) else word

# 3. four edit types: deletion, transposition, replacement, and insertion
alphabet = letters
deletions <- function(splits) if (length(splits) > 1) {
   sapply(1:(length(splits)-1), function(i)
          paste0(splits[[i]][1], splits[[i+1]][2]), simplify=FALSE) 
} else {
    splits[[1]][2]
}   
transpositions <- function(splits) if (length(splits) > 2) {
  swaps <- rev(bigram.unsafe(stri_reverse(splits[[1]][2])))
  sapply(1:length(swaps), function(i)
         paste0(splits[[i]][1], swaps[i], splits[[i+2]][2]), simplify=FALSE)
} else {
  stri_reverse(splits[[1]][2])
} 
replacements <- function(splits) if (length(splits) > 1) {
  sapply(1:(length(splits)-1), function(i)
         lapply(alphabet, function(symbol)
                paste0(splits[[i]][1], symbol, splits[[i+1]][2])))
} else {
  alphabet
} 
insertions <- function(splits)
  sapply(splits, function(pair)
         lapply(alphabet, function(symbol)
                paste0(pair[1], symbol, pair[2]))) 

# 4. create a vector of all words at edit distance 1 given the input word
edit.1 <- function(word) {
  splits <- binary.splits(word)
  unique(unlist(c(deletions(splits),
                  transpositions(splits),
                  replacements(splits),
                  insertions(splits))))
}

# 5. create a simple function to generate all words of edit distance 1 and 2
edit.2 <- function(word) { 
  e1 <- edit.1(word)
  unique(c(unlist(lapply(e1, edit.1)), e1))
} 

If you profile this code, you will see that replacements and insertions have nested lapply calls and take roughly 10x longer than deletions or transpositions, because they generate far more spelling variants.
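One way to see this is with R's built-in profiler (a sketch; the output file name is just a placeholder):

Rprof("edit2-profile.out")
invisible(replicate(100, edit.2("abcd")))
Rprof(NULL)
# functions dominating total time, e.g. paste0 inside replacements/insertions
head(summaryRprof("edit2-profile.out")$by.total)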

library(rbenchmark)
benchmark(edit.2('abcd'), replications=20)

This takes about 8 seconds on my Core i5 MacBook Air, while the corresponding Python benchmark (running the corresponding edit2 function 20 times) takes about 0.6 seconds, i.e., it is about 10-15 times faster!

I have tried using expand.grid to get rid of the inner lapply, but this made the code slower, not faster. I also know that using lapply in place of sapply makes my code a bit faster, but I do not see the point of using the "wrong" function (I want a vector back) for a minor speed bump. Can the result of the edit.2 function be generated much faster in pure R?
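Roughly, what I mean by the expand.grid attempt is something like the following variant of replacements (a sketch, not my exact code):

replacements.grid <- function(splits) if (length(splits) > 1) {
  # all (split position, letter) combinations in a single data frame
  grid <- expand.grid(i = seq_len(length(splits) - 1), sym = alphabet,
                      stringsAsFactors = FALSE)
  left  <- vapply(splits[grid$i],     `[[`, character(1), 1)
  right <- vapply(splits[grid$i + 1], `[[`, character(1), 2)
  paste0(left, grid$sym, right)
} else {
  alphabet
}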

asked May 31 '14 by fnl



1 Answer

Performance of R's paste0 vs. Python's ''.join

The original title asked whether paste0 in R was 10x slower than string concatenation in Python. If it is, then there is no hope of writing an algorithm that relies heavily on string concatenation in R and is as fast as the corresponding Python algorithm.

I have

> R.version.string
[1] "R version 3.1.0 Patched (2014-05-31 r65803)"

and

>>> sys.version
'3.4.0 (default, Apr 11 2014, 13:05:11) \n[GCC 4.8.2]'

Here's a first comparison

> library(microbenchmark)
> microbenchmark(paste0("a", "b"), times=1e6)
Unit: nanoseconds
             expr min   lq median   uq      max neval
 paste0("a", "b") 951 1071   1162 1293 21794972 1e+06

(so about 1s for all replicates) versus

>>> import timeit
>>> timeit.timeit("''.join(x)", "x=('a', 'b')", number=int(1e6))
0.119668865998392

I guess that's the 10x performance difference the original poster observed. However, R works better on vectors, and the algorithm involves vectors of words anyway, so we might be interested in the comparison

> x = y = sample(LETTERS, 1e7, TRUE); system.time(z <- paste0(x, y))
   user  system elapsed 
  1.479   0.009   1.488 

and

>>> setup = '''
import random
import string
y = x = [random.choice(string.ascii_uppercase) for _ in range(10000000)]
'''
>>> timeit.Timer("map(''.join, zip(x, y))", setup=setup).repeat(1)
[0.362522566007101]

This suggests that we would be on the right track if our R algorithm ran at about 1/4 the speed of Python; the OP found a 10-fold difference, so it looks like there is room for improvement.

R iteration versus vectorization

The OP uses iteration (lapply and friends), rather than vectorization. We can compare the vector version to various approaches to iteration with the following

f0 = paste0

f1 = function(x, y) 
   vapply(seq_along(x), function(i, x, y) paste0(x[i], y[i]), character(1), x, y)

f2 = function(x, y) Map(paste0, x, y)

f3 = function(x, y) {
    z = character(length(x))
    for (i in seq_along(x)) 
        z[i] = paste0(x[i], y[i])
    z 
}

f3c = compiler::cmpfun(f3)    # explicitly compile

f4 = function(x, y) {
    z = character()
    for (i in seq_along(x)) 
        z[i] = paste0(x[i], y[i])
    z 
}

Scaling the data back, defining the 'vectorized' solution as f0, and comparing these approaches

> x = y = sample(LETTERS, 100000, TRUE)
> library(microbenchmark)
> microbenchmark(f0(x, y), f1(x, y), f2(x, y), f3(x, y), f3c(x, y), times=5)
Unit: milliseconds
      expr       min        lq    median        uq       max neval
  f0(x, y)  14.69877  14.70235  14.75409  14.98777  15.14739     5
  f1(x, y) 241.34212 250.19018 268.21613 279.01582 292.21065     5
  f2(x, y) 198.74594 199.07489 214.79558 229.50684 271.77853     5
  f3(x, y) 250.64388 251.88353 256.09757 280.04688 296.29095     5
 f3c(x, y) 174.15546 175.46522 200.09589 201.18543 214.18290     5

with f4 being too painfully slow to include

> system.time(f4(x, y))
   user  system elapsed 
 24.325   0.000  24.330 

So this bears out Dr. Tierney's advice: there can be a real benefit to vectorizing those lapply calls.

Further vectorizing the updated original post

@fnl adapted the original code by partly unrolling the loops. There remain opportunities for more of the same; for instance,

replacements <- function(splits) if (length(splits$left) > 1) {
  lapply(1:(length(splits$left)-1), function(i)
         paste0(splits$left[i], alphabet, splits$right[i+1]))
} else {
  splits$right[1]
}

might be revised to perform a single paste call, relying on argument recycling (short vectors recycled until their length matches longer vectors)

replacements1 <- function(splits) if (length(splits$left) > 1) {
    len <- length(splits$left)
    paste0(splits$left[-len], rep(alphabet, each = len - 1), splits$right[-1])
} else {
  splits$right[1]
}
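As a quick illustration of the recycling plus rep(..., each =) pattern used here (not part of the original benchmarks):

# the shorter vectors are recycled against the repeated middle vector
paste0(c("a", "b"), rep(c("X", "Y"), each = 2), c("c", "d"))
## [1] "aXc" "bXd" "aYc" "bYd"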

The values are in a different order, but that is not important for the algorithm. Dropping elements with negative subscripts (prefixing the index with -) is potentially more memory efficient. Similarly,

deletions1 <- function(splits) if (length(splits$left) > 1) {
  paste0(splits$left[-length(splits$left)], splits$right[-1])
} else {
  splits$right[1]
}

insertions1 <- function(splits)
    paste0(splits$left, rep(alphabet, each=length(splits$left)), splits$right)

We then have

edit.1.1 <- function(word) {
  splits <- binary.splits(word)
  unique(c(deletions1(splits),
           transpositions(splits),
           replacements1(splits),
           insertions1(splits)))
}

with some speed-up

> identical(sort(edit.1("word")), sort(edit.1.1("word")))
[1] TRUE
> microbenchmark(edit.1("word"), edit.1.1("word"))
Unit: microseconds
             expr     min       lq   median       uq     max neval
   edit.1("word") 354.125 358.7635 362.5260 372.9185 521.337   100
 edit.1.1("word") 296.575 298.9830 300.8305 307.3725 369.419   100

The OP indicates that their original version was 10x slower than Python, and that their original modifications resulted in a 5x speed-up. We gain a further 1.2x speed-up, so we are perhaps at the expected performance of the algorithm using R's paste0. A next step is to ask whether alternative algorithms or implementations are more performant; in particular, substr might be promising.
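For instance, deletions can be built straight from the word with a vectorized substr call, along these lines (a sketch of that idea, not code benchmarked in this answer):

deletions.substr <- function(word) {
  n <- nchar(word)
  if (n < 2) return("")
  i <- seq_len(n)
  # drop character i by joining the prefix before it and the suffix after it
  paste0(substr(rep(word, n), 1, i - 1),
         substr(rep(word, n), i + 1, n))
}

deletions.substr("word")   # "ord" "wrd" "wod" "wor"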

answered Oct 04 '22 by Martin Morgan