Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

gsub speed vs pattern length

I've been using gsub extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here's a fully reproducible code:

library(microbenchmark)
set.seed(12345)
n = 0
rpt = seq(20, 1461, 20)
msecFF = numeric(length(rpt))
msecFT = numeric(length(rpt))
inp = rep("aaaaaaaaaa",15000)

for (i in rpt) {
  n = n + 1
  print(n)
  patt = paste(rep("a", rpt[n]), collapse = "")
  #time = microbenchmark(func(count[1:10000,12], patt, "b"), times = 10)
  timeFF = microbenchmark(gsub(patt, "b", inp, fixed=F), times = 10)
  msecFF[n] = mean(timeFF$time)/1000000.

  timeFT = microbenchmark(gsub(patt, "b", inp, fixed=T), times = 10)
  msecFT[n] = mean(timeFT$time)/1000000.
}

library(ggplot2)
library(grid)
library(gridExtra)

axis(1,at=seq(0,1000,200),labels=T)

p1 = qplot(rpt, msecFT, xlab="pattern length, characters", ylab="time, msec",main="fixed = TRUE" )
p2 = qplot(rpt, msecFF, xlab="pattern length, characters", ylab="time, msec",main="fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)

As you see, I'm looking for a pattern that contains a replicated rpt[n] times. The slope is positive, as expected. However, I noticed a kink at 300 characters with fixed=T and 600 characters with fixed=F and then the slope seems to be approximately as before (see plot below). I suppose, it is due to memory, object size, etc. I also noticed that the longest allowed pattern is 1463 symbols, with object size of 1552 bytes.

Can someone explain the kink better and why at 300 and 600 characters?

gsub speed in milliseconds with fixed turned on/off as a function of pattern length in characters

Added: it is worth mentioning, that most of my patterns are 5-10 characters long, which gives me on my real data (not the mock-up inp in the example above) the following timing.

gsub, fixed = TRUE: ~50 msec per one pattern
gsub, fixed = FALSE: ~190 msec per one pattern
stringi, fixed = FALSE: ~55 msec per one pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per one pattern

(I have 4k patterns, so total timing of my module is roughly 200 sec, which is exactly 0.05 x 4000 with gsub and fixed = TRUE. It is the fastest method for my data and patterns)

Real data timing

like image 515
Alexey Ferapontov Avatar asked Dec 17 '14 20:12

Alexey Ferapontov


1 Answers

The kinks might be related to the bits required to hold patterns of that length.

There is another solution that scales much better, use the repetition operator {} to specify how many repeats you want to find. In order to find more than 255 (8 bit integer max) you'll have to specify perl = TRUE.

patt2 <- paste0('a{',rpt[n],'}')
timeRF <- microbenchmark(gsub(patt2, "b", inp, perl = T), times = 10)

I get speeds of around 2.1 ms per search with no penalty for pattern length. That's about 8x faster than fixed = FALSE for small pattern lengths and about 60x faster for large pattern lengths.

like image 93
joeyreid Avatar answered Oct 18 '22 13:10

joeyreid