This question is just asking for an implementation in R of the following question : Find the longest common starting substring in a set of strings (JavaScript)
"This problem is a more specific case of the Longest common substring problem. I need to only find the longest common starting substring in an array".
So im just looking an R implementation for this question (preferably not in a for / while loop fashion that was suggested in the JavaScript version), if possible i would like to wrap it up as a function, so i could apply on many groups in a data table.
After some searches, i couldn't find an R example for this, hence this question.
Example Data: I have the following vector of characters:
dput(data)
c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
I'm looking to run an algorithm in R that will find the following output: ADA44
.
From what I've seen in the JavaScript accepted answer, the idea is to first sort the vector, extract the first and last elements (for example : "ADA4417-3ARMZ-R7"
and "ADA4432-1BRJZ-R2"
, break them into single characters, and loop through them until one of the characters don't match (hope im right)
Any Help on that would be great!
Taking inspiration from what you suggested, you can try this function :
comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# compute the cumulative sum of common elements
cs_x<-cumsum(d_x[[1]]==d_x[[2]])
# check if there is at least one common element
if(cs_x[1]!=0) {
# see when it stops incrementing and get the position of last common element
der_com<-which(diff(cs_x)==0)[1]
# return the common part
return(substr(x[1],1,der_com))
} else { # else, return an empty vector
return(character(0))
}
}
UPDATE
Following @nicola suggestion, a simpler and more elegant variant for the function:
comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# search for the first not common element and so, get the last matching one
der_com<-match(FALSE,do.call("==",d_x))-1
# if there is no matching element, return an empty vector, else return the common part
ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}
Examples:
With your data
x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"
When there is no common starting substring
x<-c("abc","def")
> comsub(x)
# character(0)
A non-base
alternative, using the lcprefix
function in Biostrings
to find the "Longest Common Prefix [...] of two strings"
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
x2 <- sort(x)
substr(x2[1], start = 1, stop = lcprefix(x2[1], x2[length(x2)]))
# [1] "ADA44"
Piggybacking off Henrik's answer, Bioconductor has a C based prefix function and an R based one. The R based one is:
lcPrefix <- function (x, ignore.case = FALSE)
{
x <- as.character(x)
if (ignore.case)
x <- toupper(x)
nc <- nchar(x, type = "char")
for (i in 1:min(nc)) {
ss <- substr(x, 1, i)
if (any(ss != ss[1])) {
return(substr(x[1], 1, i - 1))
}
}
substr(x[1], 1, i)
}
<environment: namespace:Biobase>
... and doesn't require any special features of Bioconductor (as far as I can tell).
--- Citation ---
Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods,
2015:12, 115.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With