I have two character vectors (names of objects) and I want to extract the longest common substring from each pair of corresponding elements.
a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')
I want the following as a result:
[1] "ABC" "DEF"
These vectors as input should give the same result:
a <- c('textABCxx', 'textDEFxx')
b <- c('zzABCblah', 'zzDEFblah')
These examples are representative. The strings contain identifying elements, and the remainder of the text in each vector element is common, but unknown.
Is there a solution, in one of the following places (in order of preference):
Base R
Recommended Packages
Packages available on CRAN
The answer to the supposed-duplicate does not fulfill these requirements.
The longest common substrings of a set of strings can be found by building a generalized suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it.
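Building a generalized suffix tree is the asymptotically efficient route, but for short strings like the ones in the question a plain dynamic-programming table is enough and needs only base R. The following is a minimal sketch of that simpler technique (not a suffix tree); `lcs_substring` is a hypothetical helper name, and the function compares exactly two strings, returning the first longest match it finds when there are ties.

```r
# Longest common substring of two strings via dynamic programming.
# lcs_substring() is a hypothetical helper, not part of base R.
lcs_substring <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  if (length(xs) == 0 || length(ys) == 0) return("")

  # prev[j + 1] = length of the common suffix ending at xs[i - 1] and ys[j]
  prev <- integer(length(ys) + 1)
  best_len <- 0L   # length of the best match seen so far
  best_end <- 0L   # index in xs where that match ends

  for (i in seq_along(xs)) {
    cur <- integer(length(ys) + 1)
    for (j in seq_along(ys)) {
      if (xs[i] == ys[j]) {
        # extend the match that ended at xs[i - 1], ys[j - 1]
        cur[j + 1] <- prev[j] + 1L
        if (cur[j + 1] > best_len) {
          best_len <- cur[j + 1]
          best_end <- i
        }
      }
    }
    prev <- cur
  }

  if (best_len == 0L) return("")
  paste(xs[(best_end - best_len + 1):best_end], collapse = "")
}

# Apply pairwise to the vectors from the question
a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')
mapply(lcs_substring, a, b, USE.NAMES = FALSE)
# [1] "ABC" "DEF"
```

The running time is proportional to the product of the two string lengths, which is fine for object names but would be slow for very long inputs; that is where the suffix-tree approach pays off.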
To find common substrings between two strings in Python, we can use the difflib module. Given two strings, string1 and string2, we construct a SequenceMatcher from them and use it to locate the substring that occurs in both.
Approach (this counts matching characters; it does not find a common substring): Count the frequency of every character in both strings. If a character occurs freq1 times in string s1 and freq2 times in string s2, it contributes min(freq1, freq2) valid pairs. The sum of these values over all characters is the required answer.
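The frequency-counting approach can be sketched in base R as follows. Note that it yields a count of matchable characters, not a substring, so it does not answer the original question by itself; `count_common_chars` is a hypothetical helper name introduced only for illustration.

```r
# Count character pairs that can be matched between two strings:
# for each character, take min(freq in s1, freq in s2) and sum.
# count_common_chars() is a hypothetical helper, not a base R function.
count_common_chars <- function(s1, s2) {
  t1 <- table(strsplit(s1, "")[[1]])
  t2 <- table(strsplit(s2, "")[[1]])
  common <- intersect(names(t1), names(t2))
  sum(pmin(as.integer(t1[common]), as.integer(t2[common])))
}

count_common_chars("aab", "abb")
# [1] 2
```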
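Here's a CRAN package for that. (Note that LCS in qualV computes the longest common subsequence; it gives the desired result for the examples here, but a subsequence need not be contiguous in general.)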
library(qualV)
sapply(seq_along(a), function(i)
  paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
        collapse = ""))
If you don't mind using Bioconductor packages, you can use Rlibstree. The installation is pretty straightforward.
source("http://bioconductor.org/biocLite.R")
biocLite("Rlibstree")
Then, you can do:
require(Rlibstree)
ll <- list(a,b)
lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE),
       function(x) getLongestCommonSubstring(x))
# $X1
# [1] "ABC"
# $X2
# [1] "DEF"
On a side note: I'm not quite sure whether Rlibstree uses libstree 0.42 or libstree 0.43. Both libraries are present in the source package. I remember running into a memory leak (and hence an error) on a huge array in Perl that was using libstree 0.42. Just a heads up.