Can my use of paste0() in R be corrected so that this function runs as fast as the original Python example?

Tags: python, string, r
I am playing around with some R code I found recently that imitates parts of Norvig's spell checker written in Python; in particular, I am trying to work out the right way to implement the edit2 function in R:

alphabet = 'abcdefghijklmnopqrstuvwxyz'  # as in Norvig's original code

def splits(word):
    return [(word[:i], word[i:]) 
            for i in range(len(word)+1)]

def edits1(word):
    pairs      = splits(word)
    deletes    = [a+b[1:]           for (a, b) in pairs if b]
    transposes = [a+b[1]+b[0]+b[2:] for (a, b) in pairs if len(b) > 1]
    replaces   = [a+c+b[1:]         for (a, b) in pairs for c in alphabet if b]
    inserts    = [a+c+b             for (a, b) in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

However, in my benchmarks, generating thousands of small strings in R using paste0 (or str_c from stringr, or stri_join from stringi) results in code that is roughly 10x (or ~100x, or ~50x, respectively) slower than the Python implementation shown by Norvig. (Yes, interestingly, the stringr- and stringi-based functions are even slower than paste0; a bare-bones comparison sketch follows the question list below.) My questions are (with #3 being the main one I want resolved):

  1. Am I doing this correctly (is the code "right")?

  2. If so, is this a known issue of R (extremely slow string concatenation)?

  3. Is there anything I can do about this to make this significantly faster (one or more orders of magnitude, at least) without rewriting the whole function in Rcpp11 or something like that?
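For reference, here is roughly how the single-call comparison behind the numbers above can be reproduced (illustrative only; my figures above come from benchmarking the full edit2 function, not isolated calls):

# compare the three concatenation functions on a single small join
library(rbenchmark)
library(stringr)
library(stringi)

benchmark(paste0("spel", "ling"),
          str_c("spel", "ling"),
          stri_join("spel", "ling"),
          replications = 100000)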

Here is the R code I came up with for the edit2 function:

library(stringi)  # for stri_sub() and stri_reverse()

# 1. generate a list of all binary splits of a word
binary.splits <- function(w) {
  n <- nchar(w)
  lapply(0:n, function(x)
         c(stri_sub(w, 0, x), stri_sub(w, x + 1, n)))
}

# 2. generate a list of all bigrams for a word
bigram.unsafe <- function(word)
  sapply(2:nchar(word), function(i) substr(word, i-1, i))
bigram <- function(word)
  if (nchar(word) > 1) bigram.unsafe(word) else word

# 3. four edit types: deletion, transposition, replacement, and insertion
alphabet = letters
deletions <- function(splits) if (length(splits) > 1) {
   sapply(1:(length(splits)-1), function(i)
          paste0(splits[[i]][1], splits[[i+1]][2]), simplify=FALSE) 
} else {
    splits[[1]][2]
}   
transpositions <- function(splits) if (length(splits) > 2) {
  swaps <- rev(bigram.unsafe(stri_reverse(splits[[1]][2])))
  sapply(1:length(swaps), function(i)
         paste0(splits[[i]][1], swaps[i], splits[[i+2]][2]), simplify=FALSE)
} else {
  stri_reverse(splits[[1]][2])
} 
replacements <- function(splits) if (length(splits) > 1) {
  sapply(1:(length(splits)-1), function(i)
         lapply(alphabet, function(symbol)
                paste0(splits[[i]][1], symbol, splits[[i+1]][2])))
} else {
  alphabet
} 
insertions <- function(splits)
  sapply(splits, function(pair)
         lapply(alphabet, function(symbol)
                paste0(pair[1], symbol, pair[2]))) 

# 4. create a vector of all words at edit distance 1 given the input word
edit.1 <- function(word) {
  splits <- binary.splits(word)
  unique(unlist(c(deletions(splits),
                  transpositions(splits),
                  replacements(splits),
                  insertions(splits))))
}

# 5. create a simple function to generate all words of edit distance 1 and 2
edit.2 <- function(word) { 
  e1 <- edit.1(word)
  unique(c(unlist(lapply(e1, edit.1)), e1))
} 

If you profile this code, you will see that replacements and insertions have nested lapply calls and take roughly 10x longer than deletions or transpositions, because they generate far more spelling variants.
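One way to see this is with R's built-in profiler (a sketch; the output file name is just a placeholder):

Rprof("edit2-profile.out")
invisible(replicate(100, edit.2("abcd")))
Rprof(NULL)
# functions dominating total time, e.g. paste0 inside replacements/insertions
head(summaryRprof("edit2-profile.out")$by.total)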

library(rbenchmark)
benchmark(edit.2('abcd'), replications=20)

This takes about 8 seconds on my Core i5 MacBook Air, while the corresponding Python benchmark (running the corresponding edit2 function 20 times) takes about 0.6 seconds, i.e., it is about 10-15 times faster!

I have tried using expand.grid to get rid of the inner lapply, but this made the code slower, not faster. I also know that using lapply in place of sapply makes my code a bit faster, but I do not see the point of using the "wrong" function (I want a vector back) for a minor speed bump. Can the result of the edit.2 function be generated much faster in pure R?
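Roughly, what I mean by the expand.grid attempt is something like the following variant of replacements (a sketch, not my exact code):

replacements.grid <- function(splits) if (length(splits) > 1) {
  # all (split position, letter) combinations in a single data frame
  grid <- expand.grid(i = seq_len(length(splits) - 1), sym = alphabet,
                      stringsAsFactors = FALSE)
  left  <- vapply(splits[grid$i],     `[[`, character(1), 1)
  right <- vapply(splits[grid$i + 1], `[[`, character(1), 2)
  paste0(left, grid$sym, right)
} else {
  alphabet
}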

asked May 31 '14 by fnl



1 Answer

Performance of R's paste0 vs. Python's ''.join

The original title asked whether paste0 in R was 10x slower than string concatenation in Python. If it is, then there is no hope of writing an algorithm that relies heavily on string concatenation in R and is as fast as the corresponding Python algorithm.

I have

> R.version.string
[1] "R version 3.1.0 Patched (2014-05-31 r65803)"

and

>>> sys.version
'3.4.0 (default, Apr 11 2014, 13:05:11) \n[GCC 4.8.2]'

Here's a first comparison

> library(microbenchmark)
> microbenchmark(paste0("a", "b"), times=1e6)
Unit: nanoseconds
             expr min   lq median   uq      max neval
 paste0("a", "b") 951 1071   1162 1293 21794972 1e+06

(so about 1s for all replicates) versus

>>> import timeit
>>> timeit.timeit("''.join(x)", "x=('a', 'b')", number=int(1e6))
0.119668865998392

I guess that's the 10x performance difference the original poster observed. However, R works better on vectors, and the algorithm involves vectors of words anyway, so we might be interested in the comparison

> x = y = sample(LETTERS, 1e7, TRUE); system.time(z <- paste0(x, y))
   user  system elapsed 
  1.479   0.009   1.488 

and

>>> setup = '''
import random
import string
y = x = [random.choice(string.ascii_uppercase) for _ in range(10000000)]
'''
>>> timeit.Timer("map(''.join, zip(x, y))", setup=setup).repeat(1)
[0.362522566007101]

This suggests that we would be on the right track if our R algorithm ran at about 1/4 the speed of Python; the OP found a 10-fold difference, so it looks like there is room for improvement.

R iteration versus vectorization

The OP uses iteration (lapply and friends), rather than vectorization. We can compare the vector version to various approaches to iteration with the following

f0 = paste0

f1 = function(x, y) 
   vapply(seq_along(x), function(i, x, y) paste0(x[i], y[i]), character(1), x, y)

f2 = function(x, y) Map(paste0, x, y)

f3 = function(x, y) {
    z = character(length(x))
    for (i in seq_along(x)) 
        z[i] = paste0(x[i], y[i])
    z 
}

f3c = compiler::cmpfun(f3)    # explicitly compile

f4 = function(x, y) {
    z = character()
    for (i in seq_along(x)) 
        z[i] = paste0(x[i], y[i])
    z 
}

Scaling the data back, defining the 'vectorized' solution as f0, and comparing these approaches

> x = y = sample(LETTERS, 100000, TRUE)
> library(microbenchmark)
> microbenchmark(f0(x, y), f1(x, y), f2(x, y), f3(x, y), f3c(x, y), times=5)
Unit: milliseconds
      expr       min        lq    median        uq       max neval
  f0(x, y)  14.69877  14.70235  14.75409  14.98777  15.14739     5
  f1(x, y) 241.34212 250.19018 268.21613 279.01582 292.21065     5
  f2(x, y) 198.74594 199.07489 214.79558 229.50684 271.77853     5
  f3(x, y) 250.64388 251.88353 256.09757 280.04688 296.29095     5
 f3c(x, y) 174.15546 175.46522 200.09589 201.18543 214.18290     5

with f4 being too painfully slow to include

> system.time(f4(x, y))
   user  system elapsed 
 24.325   0.000  24.330 

So this bears out Dr. Tierney's advice: there can be a real benefit to vectorizing those lapply calls.

Further vectorizing the updated original post

@fnl adapted the original code by partly unrolling the loops. There remain opportunities for more of the same; for instance,

replacements <- function(splits) if (length(splits$left) > 1) {
  lapply(1:(length(splits$left)-1), function(i)
         paste0(splits$left[i], alphabet, splits$right[i+1]))
} else {
  splits$right[1]
}

might be revised to perform a single paste call, relying on argument recycling (short vectors recycled until their length matches longer vectors)

replacements1 <- function(splits) if (length(splits$left) > 1) {
    len <- length(splits$left)
    paste0(splits$left[-len], rep(alphabet, each = len - 1), splits$right[-1])
} else {
  splits$right[1]
}
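As a quick illustration of the recycling plus rep(..., each =) pattern used here (not part of the original benchmarks):

# the shorter vectors are recycled against the repeated middle vector
paste0(c("a", "b"), rep(c("X", "Y"), each = 2), c("c", "d"))
## [1] "aXc" "bXd" "aYc" "bYd"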

The values are in a different order, but that is not important for the algorithm. Dropping elements with negative subscripts (prefixing the index with -) is potentially more memory efficient. Similarly,

deletions1 <- function(splits) if (length(splits$left) > 1) {
  paste0(splits$left[-length(splits$left)], splits$right[-1])
} else {
  splits$right[1]
}

insertions1 <- function(splits)
    paste0(splits$left, rep(alphabet, each=length(splits$left)), splits$right)

We then have

edit.1.1 <- function(word) {
  splits <- binary.splits(word)
  unique(c(deletions1(splits),
           transpositions(splits),
           replacements1(splits),
           insertions1(splits)))
}

with some speed-up

> identical(sort(edit.1("word")), sort(edit.1.1("word")))
[1] TRUE
> microbenchmark(edit.1("word"), edit.1.1("word"))
Unit: microseconds
             expr     min       lq   median       uq     max neval
   edit.1("word") 354.125 358.7635 362.5260 372.9185 521.337   100
 edit.1.1("word") 296.575 298.9830 300.8305 307.3725 369.419   100

The OP indicates that their original version was 10x slower than Python, and that their original modifications resulted in a 5x speed-up. We gain a further 1.2x speed-up, so we are perhaps at the expected performance of the algorithm using R's paste0. A next step is to ask whether alternative algorithms or implementations are more performant; in particular, substr might be promising.
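For instance, deletions can be built straight from the word with a vectorized substr call, along these lines (a sketch of that idea, not code benchmarked in this answer):

deletions.substr <- function(word) {
  n <- nchar(word)
  if (n < 2) return("")
  i <- seq_len(n)
  # drop character i by joining the prefix before it and the suffix after it
  paste0(substr(rep(word, n), 1, i - 1),
         substr(rep(word, n), i + 1, n))
}

deletions.substr("word")   # "ord" "wrd" "wod" "wor"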

answered Oct 04 '22 by Martin Morgan