Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

longest common substring in R finding non-contiguous matches between the two strings

Tags:

I have a question regarding finding the longest common substring in R. While searching through a few posts on StackOverflow, I got to know about the qualV package. However, I see that the LCS function in this package actually finds all characters from string1 which are present in string2, even if they are not contiguous.

To explain, if the strings are string1 : "hello" string2 : "hel12345lo" I expect the output to be hel, however I get the output as hello. I must be doing something wrong. Please see my code below.

library(qualV)
a= "hello"
b="hel123l5678o" 
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

I have also tried the Rlibstree method but I still get substrings which are not contiguous. Also, the length of the substring is also off from my expectation.s Please see below.

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21
like image 905
IAMTubby Avatar asked Feb 01 '15 10:02

IAMTubby


People also ask

How do you find the common substring of two strings?

For every character in string 1 we increment vector index of that character eg: v[s1[i]-'a']++, for every character of string 2 we check vector for the common characters if v[s2[i]-'a'] > 0 then set flag = true and v[s2[i]-'a']– such that one character of string 2 is compared with only one character of string 1.

Which is longest common substring in s1 Abcdaf s2 Azbcdf?

The longest common substring is “abcdez” and is of length 6. Approach: Let m and n be the lengths of the first and second strings respectively. Dynamic Programming can be used to find the longest common substring in O(m*n) time.

What is a time complexity for finding the longest substring that is common?

Θ(n1+n2)


1 Answers

Here are three possible solutions.

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"

There are also adist() and agrep() in base R, and the stringdist package has a few functions that run the LCS method. Here's a look at stringsidt. It returns the number of unpaired characters.

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0 

Now that I've explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE we get a sequence of Matches, Insertions, etc. So if you give that to stri_locate() we can use that matrix to get the matches from a to b.

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"

So the M values denote straight across matches. We can go and get the substrings with stri_sub()

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o" 

Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.

like image 190
Rich Scriven Avatar answered Sep 30 '22 13:09

Rich Scriven