Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: find largest common substring starting at the beginning

Tags:

substring

r

I've got 2 vectors:

word1 <- "bestelling"   
word2 <- "bestelbon"

Now I want to find the largest common substring that starts at the beginnig, so here that would be "bestel".

But take for example two other words like "bestelling" and "stel", then I want to return "".

like image 207
Anita Avatar asked Oct 09 '14 17:10

Anita


2 Answers

fun <- function(words) {
  #extract substrings from length 1 to length of shortest word
  subs <- sapply(seq_len(min(nchar(words))), 
                 function(x, words) substring(words, 1, x), 
                 words=words)
  #max length for which substrings are equal
  neqal <- max(cumsum(apply(subs, 2, function(x) length(unique(x)) == 1L)))
  #return substring
  substring(words[1], 1, neqal)
}

words1 <- c("bestelling", "bestelbon")
fun(words1)
#[1] "bestel"

words2 <- c("bestelling", "stel")
fun(words2)
#[1] ""
like image 82
Roland Avatar answered Oct 08 '22 19:10

Roland


Matthew Plourde called, and Mr. Benchmarker responds!
Sorry, BondedDust, but I can't get to bioconductor from behind workplace walls.

library(microbenchmark)
wfoo1 <-'bestelling'
wfoo2<-'bestelbon'


microbenchmark(stu(wfoo1,wfoo2),nathan(wfoo1,wfoo2),plourde(),scriven(wfoo1,wfoo2),dmt(wfoo1,wfoo2),mrflick(wfoo1,wfoo2),roland(c(wfoo1,wfoo2)))
Unit: microseconds
                    expr     min       lq   median       uq
       stu(wfoo1, wfoo2) 171.905 183.0230 187.5135 191.1490
    nathan(wfoo1, wfoo2)  35.921  42.3360  43.6180  46.1840
               plourde() 551.208 581.3545 591.6175 602.5220
   scriven(wfoo1, wfoo2)  16.678  21.1680  22.6645  23.7335
       dmt(wfoo1, wfoo2)  79.966  86.1665  88.7325  91.5125
   mrflick(wfoo1, wfoo2) 100.492 108.4030 111.1830 113.9625
 roland(c(wfoo1, wfoo2)) 215.950 226.8545 231.7725 237.5455
     max neval
 435.321   100
  59.012   100
 730.809   100
  85.525   100
 286.081   100
 466.537   100
 291.213   100

I think it's incumbent on me to modify these functions so they measure an input word against, say, a vector of 1000 reference words (rather than just a single pair) to see how that speed test goes. Maybe later.

Later... :-). I didn't make loops,but I tried it out on long words:

EDIT: this was, as flodel points out, a typo, which led to testing a rather long vector of very short words!

wfoo1 <-rep(letters,100)
wfoo2<-c(rep(letters,99),'foo')
Unit: microseconds
                    expr        min          lq      median
       stu(wfoo1, wfoo2)  31215.243  32718.5535  35270.6110
    nathan(wfoo1, wfoo2)    202.266    216.3780    227.2825
               plourde()    569.168    617.0615    661.5340
   scriven(wfoo1, wfoo2)    794.953    828.3070    847.5505
       dmt(wfoo1, wfoo2)   1081.033   1156.9365   1205.8990
   mrflick(wfoo1, wfoo2) 126058.316 131283.4485 241018.5150
 roland(c(wfoo1, wfoo2))    946.759   1004.4885   1045.3260
          uq        max neval
 146451.2595 167000.713   100
    236.0485    356.211   100
    694.6750    795.381   100
    868.9310   1021.594   100
   1307.6740 116075.442   100
 246739.6910 991550.586   100
   1082.1020   1243.103   100

Sorry Richard, but looks like you need to give your chicken dinner to Nathan.

EDIT2: made sure the inputs were single words, and added flodel's code to the pile.

Edited the "plourde" function to accept inputs and reran the longword case

wfoo1 <-paste(rep(letters,100),collapse='')
wfoo2<-paste(c(rep(letters,99),'foo'),collapse='')

Looks like 3 folks' code perform similarly, so just as in Tour de France, I give the first-place award to mrflick, dmt, and flodel.

 microbenchmark(stu(wfoo1,wfoo2),nathan(wfoo1,wfoo2),plourde(c(wfoo1,wfoo2)),scriven(wfoo1,wfoo2),dmt(wfoo1,wfoo2),mrflick(wfoo1,wfoo2),roland(c(wfoo1,wfoo2)),flodel(wfoo1,wfoo2) )
Unit: microseconds
                     expr        min          lq     median
        stu(wfoo1, wfoo2)  17786.578  18243.2795  18420.317
     nathan(wfoo1, wfoo2)  36651.195  37703.3625  38095.493
 plourde(c(wfoo1, wfoo2)) 183616.029 187673.5350 190706.457
    scriven(wfoo1, wfoo2)  17546.253  17994.1890  18244.990
        dmt(wfoo1, wfoo2)    737.651    781.0550    821.466
    mrflick(wfoo1, wfoo2)    870.643    951.4630    976.479
  roland(c(wfoo1, wfoo2))  99540.947 102644.2115 103654.258
     flodel(wfoo1, wfoo2)    666.239    705.5795    717.553
         uq         max neval
  18602.270   20835.107   100
  38450.848  155422.375   100
 303856.952 1079715.032   100
  18404.281   18992.905   100
    853.751    1719.047   100
   1012.186  116669.839   100
 105423.123  226522.073   100
    732.947     822.748   100
like image 32
Carl Witthoft Avatar answered Oct 08 '22 18:10

Carl Witthoft