Find common substrings between two character variables

Tags: r, lcs

I have two character variables (names of objects) and I want to extract the largest common substring.

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')

I want the following as a result:

[1] "ABC" "DEF"

These vectors as input should give the same result:

a <- c('textABCxx', 'textDEFxx')
b <- c('zzABCblah', 'zzDEFblah')

These examples are representative. The strings contain identifying elements, and the remainder of the text in each vector element is common, but unknown.

Is there a solution, in one of the following places (in order of preference):

1. Base R
2. Recommended Packages
3. Packages available on CRAN

The answer to the supposed-duplicate does not fulfill these requirements.

asked Apr 24 '13 15:04

Matthew Lundberg

2 Answers

Here's a CRAN package for that:

library(qualV)

sapply(seq_along(a), function(i)
    paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
          collapse = ""))

answered Oct 05 '22 00:10

eddi

If you don't mind using Bioconductor packages, you can use Rlibstree. The installation is pretty straightforward.

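# Installer current when this answer was written; newer Bioconductor
# releases replace biocLite() with BiocManager::install().
source("http://bioconductor.org/biocLite.R")
biocLite("Rlibstree")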

Then, you can do:

require(Rlibstree)
ll <- list(a, b)
# rbind() stacks a over b, so each data frame column pairs a[i] with b[i]
lapply(data.frame(do.call(rbind, ll), stringsAsFactors = FALSE),
       function(x) getLongestCommonSubstring(x))

# $X1
# [1] "ABC"

# $X2
# [1] "DEF"

On a side note: I'm not sure whether Rlibstree uses libstree 0.42 or libstree 0.43, since both libraries are present in the source package. I remember running into a memory leak (and hence an error) on a huge array in Perl code that was using libstree 0.42. Just a heads up.

answered Oct 04 '22 23:10

Arun
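
For the base R option the question prefers, here is a minimal sketch rather than a tested library routine. It assumes the goal is the longest contiguous run of characters shared by each pair a[i], b[i]. The function name longest_common_substring is hypothetical and does not come from either answer. One thing to keep in mind when comparing: qualV's LCS computes the longest common subsequence, which need not be contiguous; it happens to coincide with the common substring for the example vectors in the question.

```r
# Hypothetical helper (not from either answer): longest common contiguous
# substring of two single strings, using only base R.
longest_common_substring <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  if (length(xs) == 0L || length(ys) == 0L) return("")

  # len[i + 1, j + 1] holds the length of the longest common suffix of
  # xs[1:i] and ys[1:j]; the extra first row and column stay at zero.
  len <- matrix(0L, length(xs) + 1L, length(ys) + 1L)
  best_len <- 0L
  best_end <- 0L

  for (i in seq_along(xs)) {
    for (j in seq_along(ys)) {
      if (xs[i] == ys[j]) {
        len[i + 1L, j + 1L] <- len[i, j] + 1L
        if (len[i + 1L, j + 1L] > best_len) {
          best_len <- len[i + 1L, j + 1L]
          best_end <- i  # position in x where the best match ends
        }
      }
    }
  }

  if (best_len == 0L) return("")
  paste(xs[(best_end - best_len + 1L):best_end], collapse = "")
}

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')

# mapply() pairs a[i] with b[i]; USE.NAMES = FALSE drops the element names
result <- mapply(longest_common_substring, a, b, USE.NAMES = FALSE)
print(result)
# [1] "ABC" "DEF"
```

On ties, this sketch returns the first longest match found in x. The double loop costs time proportional to nchar(x) * nchar(y), which is fine for short object names like these but slow for very long strings.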
