
Find common substrings between two character variables

Tags: r, lcs

I have two character vectors (names of objects), and for each pair of corresponding elements I want to extract the longest common substring.

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')

I want the following as a result:

[1] "ABC" "DEF"

These vectors as input should give the same result:

a <- c('textABCxx', 'textDEFxx')
b <- c('zzABCblah', 'zzDEFblah')

These examples are representative. The strings contain identifying elements, and the remainder of the text in each vector element is common, but unknown.

Is there a solution in one of the following places (in order of preference)?

  1. Base R

  2. Recommended Packages

  3. Packages available on CRAN

The answer to the supposed-duplicate does not fulfill these requirements.

Matthew Lundberg asked Apr 24 '13 15:04


2 Answers

Here's a CRAN package for that:

library(qualV)

sapply(seq_along(a), function(i)
    paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
          collapse = ""))
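
Since the question lists base R as the first preference, here is a minimal base R sketch of a longest common substring search using dynamic programming. The helper name `longest_common_substring` is hypothetical and not part of any package. As far as I know, `LCS` in qualV computes the longest common subsequence, which coincides with the longest common substring on the sample inputs but may differ on other data, so treat the two as separate techniques.

```r
# Hypothetical helper (not from qualV or Rlibstree): longest common
# substring of two single strings, found by dynamic programming.
longest_common_substring <- function(x, y) {
  if (!nzchar(x) || !nzchar(y)) return("")
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  # prev[j + 1] is the length of the common run ending at xs[i - 1] and ys[j]
  prev <- integer(length(ys) + 1L)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(xs)) {
    cur <- integer(length(ys) + 1L)
    hits <- which(ys == xs[i])
    # a match extends the run that ended one position earlier in both strings
    cur[hits + 1L] <- prev[hits] + 1L
    if (length(hits) > 0L && max(cur) > best_len) {
      best_len <- max(cur)
      best_end <- i
    }
    prev <- cur
  }
  if (best_len == 0L) return("")
  substr(x, best_end - best_len + 1L, best_end)
}

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')

# Apply the helper pairwise over the two vectors
res <- mapply(longest_common_substring, a, b, USE.NAMES = FALSE)
res
# [1] "ABC" "DEF"
```

When several substrings tie for the longest length, this sketch returns the one that occurs first in `x`. It runs in time proportional to `nchar(x) * nchar(y)` per pair, which should be fine for short object names.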
eddi answered Oct 05 '22 00:10


If you don't mind using Bioconductor packages, you can use Rlibstree. The installation is pretty straightforward.

source("http://bioconductor.org/biocLite.R")
biocLite("Rlibstree") 

Then, you can do:

require(Rlibstree)
ll <- list(a,b)
lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), 
           function(x) getLongestCommonSubstring(x))

# $X1
# [1] "ABC"

# $X2
# [1] "DEF"

On a side note: I'm not sure whether Rlibstree uses libstree 0.42 or libstree 0.43; both versions are present in the source package. I remember running into a memory leak (and hence an error) on a huge array in Perl code that was using libstree 0.42. Just a heads up.

Arun answered Oct 04 '22 23:10