Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R implementation for Finding the longest common starting substrings in a set of strings

Tags:

string

r

This question is just asking for an implementation in R of the following question : Find the longest common starting substring in a set of strings (JavaScript)

"This problem is a more specific case of the Longest common substring problem. I need to only find the longest common starting substring in an array".

So im just looking an R implementation for this question (preferably not in a for / while loop fashion that was suggested in the JavaScript version), if possible i would like to wrap it up as a function, so i could apply on many groups in a data table.

After some searches, i couldn't find an R example for this, hence this question.

Example Data: I have the following vector of characters:

dput(data)
c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")

I'm looking to run an algorithm in R that will find the following output: ADA44.

From what I've seen in the JavaScript accepted answer, the idea is to first sort the vector, extract the first and last elements (for example : "ADA4417-3ARMZ-R7" and "ADA4432-1BRJZ-R2" , break them into single characters, and loop through them until one of the characters don't match (hope im right)

Any Help on that would be great!

like image 356
Yehoshaphat Schellekens Avatar asked Feb 02 '15 08:02

Yehoshaphat Schellekens


3 Answers

Taking inspiration from what you suggested, you can try this function :

comsub<-function(x) {
    # sort the vector
    x<-sort(x)
    # split the first and last element by character
    d_x<-strsplit(x[c(1,length(x))],"")
    # compute the cumulative sum of common elements
    cs_x<-cumsum(d_x[[1]]==d_x[[2]])
    # check if there is at least one common element
    if(cs_x[1]!=0) {
        # see when it stops incrementing and get the position of last common element
        der_com<-which(diff(cs_x)==0)[1]
        # return the common part
        return(substr(x[1],1,der_com))
    } else { # else, return an empty vector
        return(character(0))
    }
}

UPDATE

Following @nicola suggestion, a simpler and more elegant variant for the function:

comsub<-function(x) {
    # sort the vector
    x<-sort(x)
    # split the first and last element by character
    d_x<-strsplit(x[c(1,length(x))],"")
    # search for the first not common element and so, get the last matching one
    der_com<-match(FALSE,do.call("==",d_x))-1
    # if there is no matching element, return an empty vector, else return the common part
    ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}

Examples:

With your data

x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"

When there is no common starting substring

x<-c("abc","def")
> comsub(x)
# character(0)
like image 59
Cath Avatar answered Oct 28 '22 16:10

Cath


A non-base alternative, using the lcprefix function in Biostrings to find the "Longest Common Prefix [...] of two strings"

source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)

x2 <- sort(x)
substr(x2[1], start = 1, stop = lcprefix(x2[1], x2[length(x2)]))
# [1] "ADA44"
like image 27
Henrik Avatar answered Oct 28 '22 16:10

Henrik


Piggybacking off Henrik's answer, Bioconductor has a C based prefix function and an R based one. The R based one is:

lcPrefix <- function (x, ignore.case = FALSE) 
{
    x <- as.character(x)
    if (ignore.case) 
        x <- toupper(x)
    nc <- nchar(x, type = "char")
    for (i in 1:min(nc)) {
        ss <- substr(x, 1, i)
        if (any(ss != ss[1])) {
            return(substr(x[1], 1, i - 1))
        }
    }
    substr(x[1], 1, i)
}
<environment: namespace:Biobase>

... and doesn't require any special features of Bioconductor (as far as I can tell).

--- Citation ---

Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods,
2015:12, 115.

like image 45
russellpierce Avatar answered Oct 28 '22 16:10

russellpierce