algorithm/code in R to find pattern from any position in a string

Tags: string, loops, pattern-matching, r

I want to find a pattern, starting at any position in a given string, that repeats at least a threshold number of times. For example, for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out as "aaaab". Another example: for the string "ff00f0f0f0f0f0f0f0f0000" the pattern should be "0f". In both cases the threshold is 3, i.e. the pattern should be repeated at least 3 times.

If someone can suggest an optimized method in R for solving this problem, please share it. Currently I am doing this with 3 nested loops, and it takes a lot of time.

Thanks!

asked Jan 09 '14 12:01

phoenix

4 Answers

Use regular expressions, which are made for this type of stuff. There may be more optimized ways of doing it, but in terms of easy to write code, it's hard to beat. The data:

vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw","ff00f0f0f0f0f0f0f0f0000")

The function that does the matching:

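find_rep_path <- function(vec, reps) {
  # Build "(.+)\\1\\1...": one capture group followed by reps - 1 back references
  regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse="")
  match <- regmatches(vec, regexpr(regexp, vec, perl=TRUE))
  # The match is the pattern repeated reps times; keep only the first copy
  substr(match, 1, nchar(match) / reps)
}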

And some tests:

sapply(vec, find_rep_path, reps=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw      ff00f0f0f0f0f0f0f0f0000 
#                      "aaaab"                       "0f0f" 
sapply(vec, find_rep_path, reps=5L)
# $a0cc0vaaaabaaaabaaaabaa00bvw
# character(0)
# 
# $ff00f0f0f0f0f0f0f0f0000
# [1] "0f"

Note that with a threshold of 3, the actual longest pattern for the second string is 0f0f, not 0f (it reverts to 0f at threshold 5). To do this, I use back references (\\1) and repeat them as many times as necessary to reach the threshold. I then need to substr the result because, annoyingly, base R doesn't have an easy way to get just the captured sub-expressions when using Perl-compatible regular expressions. There is probably a not-too-hard way to do this, but the substr approach works well in this example.
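For what it's worth, regexpr with perl = TRUE attaches capture.start and capture.length attributes to its result (another answer on this page reads capture.length), so the captured group can be pulled out without substr-ing the full match. A minimal sketch, assuming the pattern has exactly one capture group; first_capture is a hypothetical helper name, not part of base R:

```r
# first_capture() is a hypothetical helper, not a base R function.
# It assumes the pattern has exactly one capture group.
first_capture <- function(vec, reps) {
  regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse = "")
  m <- regexpr(regexp, vec, perl = TRUE)
  # With perl = TRUE the result carries per-group start/length attributes;
  # as.integer() flattens the one-column matrices to plain vectors.
  start <- as.integer(attr(m, "capture.start"))
  len <- as.integer(attr(m, "capture.length"))
  out <- rep(NA_character_, length(vec))
  ok <- as.integer(m) > 0L            # -1 marks "no match"
  out[ok] <- substring(vec[ok], start[ok], start[ok] + len[ok] - 1L)
  out
}

vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw", "ff00f0f0f0f0f0f0f0f0000")
first_capture(vec, reps = 3L)  # "aaaab" "0f0f"
first_capture(vec, reps = 5L)  # NA "0f"
```

Unlike the substr version, this sketch returns NA rather than character(0) when nothing matches, so the result always has the same length as the input.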

Also, as per the discussion in @G. Grothendieck's answer, here is the version with a cap on the length of the pattern, which just adds the limit argument and slightly modifies the regexp.

find_rep_path <- function(vec, reps, limit) {
  # Same as before, but the capture group is limited to 1..limit characters
  regexp <- paste0(c("(.{1,", limit,"})", rep("\\1", reps - 1L)), collapse="")
  match <- regmatches(vec, regexpr(regexp, vec, perl=TRUE))
  substr(match, 1, nchar(match) / reps)
}
sapply(vec, find_rep_path, reps=3L, limit=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw      ff00f0f0f0f0f0f0f0f0000 
#                          "a"                         "0f"

answered Oct 11 '22 01:10

BrodieG

find.string finds the substring of maximum length subject to two conditions: (1) the substring must be repeated consecutively at least th times and (2) the substring must be no longer than len.

reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

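find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
    # try candidate lengths from longest to shortest; stop at the first hit
    for(k in len:1) {
        pat <- paste0("(.{", k, "})", reps("\\1", th-1))
        r <- regexpr(pat, string, perl = TRUE)
        if (attr(r, "capture.length") > 0) break
    }
    # r is the match start (-1 if none); the capture holds one copy of the pattern
    if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}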

and here are some tests. The last test processes the entire text of James Joyce's Ulysses in 1.4 seconds on my laptop:

> find.string("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
> 
> joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt") 
> joycec <- paste(joyce, collapse = " ") 
> system.time(result <- find.string(joycec, len = 25))

   user  system elapsed 
   1.36    0.00    1.39 
> result
[1] " Hoopsa boyaboy hoopsa!"

ADDED

Although I developed my answer before seeing BrodieG's, as he points out they are very similar to each other. I have added some features of his to the above to get the solution below and tried the tests again. Unfortunately, with this variation of his code the James Joyce example no longer works, although the other two examples shown still do. The problem seems to lie in adding the len constraint to the code, and it may represent a fundamental advantage of the code above (i.e. it can handle such a constraint, and such constraints may be essential for very long strings).

find.string2 <- function(string, th = 3, len = floor(nchar(string)/th)) {
    pat <- paste0(c("(.", "{1,", len, "})", rep("\\1", th-1)), collapse = "")
    r <- regexpr(pat, string, perl = TRUE)
    ifelse(r > 0, substring(string, r, r + attr(r, "capture.length")-1), "")
}

> find.string2("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string2("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"

> system.time(result <- find.string2(joycec, len = 25))
   user  system elapsed 
      0       0       0 
> result
[1] "w"

REVISED The James Joyce test that was supposed to be testing find.string2 was actually using find.string. This is now fixed.
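A likely explanation for the "w" result is that regexpr reports the leftmost match in the string, and the greedy {1,len} quantifier only maximizes the pattern length at that leftmost position. The loop version instead tries each length in turn, longest first. A small sketch of the difference on a made-up string; longest_first is a hypothetical name and the string is an assumption chosen for illustration:

```r
s <- "wwwxyzxyzxyz"

# One-shot pattern with a length range: the engine stops at the leftmost
# position where any length from 1 to 3 works, which here is "www".
one_shot <- regmatches(s, regexpr("(.{1,3})\\1\\1", s, perl = TRUE))
substr(one_shot, 1, nchar(one_shot) / 3)  # "w"

# Trying each exact length from longest to shortest finds "xyz" instead.
longest_first <- function(s, th = 3L, len = 3L) {
  for (k in seq(len, 1L)) {
    pat <- paste0(c("(.{", k, "})", rep("\\1", th - 1L)), collapse = "")
    hit <- regmatches(s, regexpr(pat, s, perl = TRUE))
    if (length(hit) == 1L) return(substr(hit, 1L, k))
  }
  ""  # no pattern repeats th times
}
longest_first(s)  # "xyz"
```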

answered Oct 11 '22 01:10

G. Grothendieck

This function is not optimized (even though it is fast), but I think it is a more R-like way to do this.

1. Get all patterns of a certain length > threshold: vectorized using mapply and substr.
2. Count the occurrences of these patterns and extract the one with the maximum occurrence: vectorized using str_locate_all.
3. Repeat steps 1-2 for all lengths and take the one with the maximum occurrence.

Here is my code. I am creating 2 functions: one for steps 1-2 and one for step 3:

library(stringr)
ss = "ff00f0f0f0f0f0f0f0f0000" 
ss <- "a0cc0vaaaabaaaabaaaabaa00bvw"
find_pattern_length <- 
function(length=1,ss){
  patt = mapply(function(x,y) substr(ss,x,y),
                1:(nchar(ss)-length),
                (length+1):nchar(ss))
  res = str_locate_all(ss,unique(patt))
  ll = unlist(lapply(res,length))
  list(patt = patt[which.max(ll)],
       rep = max(ll))
}

get_pattern_threshold <- 
function(ss,threshold =3 ){
  res <- 
  sapply(seq(threshold,nchar(ss)),find_pattern_length,ss=ss)
  res[,which.max(res['rep',])]
}

some tests:

> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',5)
$patt
[1] "0f0f0"

$rep
[1] 6

> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',2)
$patt
[1] "f0"

$rep
[1] 18

answered Oct 11 '22 01:10

agstudy

Since you want at least three repetitions, there is a nice O(n^2) approach.

For each possible pattern length d, cut the string into parts of length d. For d=5 it would be:

a0cc0
vaaaa
baaaa
baaaa
baa00
bvw

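Now look at each pair of subsequent strings A[k] and A[k+1]. If they are equal, there is a pattern with at least two repetitions. This check is enough to start from because three consecutive repetitions span 3d characters, so at least two whole parts fall inside them whatever the alignment, and those two parts are equal. Then go further (k+2, k+3 and so on). Finally, also check whether the suffix of A[k-1] and the prefix of A[k+n] fit (where k+n is the first string that doesn't match).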

Repeat this for each d, starting from some upper bound (at most n/3).

You have n/3 possible lengths, and for each d there are n/d strings of length d to check, each comparison costing O(d). That is O(n) work per length, so the overall complexity should be O(n (n/d) d) = O(n^2).

Maybe not optimal but I found this cutting idea quite neat ;)
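Here is a rough R sketch in the same O(n^2) spirit. Note that it is not the exact chunk-cutting procedure described above but a closely related lag-comparison variant: for each length d it compares every character with the one d positions later and looks for a long enough run of equal pairs. The function name find_rep_lag and its arguments are assumptions made for illustration.

```r
# find_rep_lag() is a hypothetical helper: longest block of length d that
# repeats consecutively at least th times, or "" if there is none.
find_rep_lag <- function(s, th = 3L) {
  ch <- strsplit(s, "", fixed = TRUE)[[1]]
  n <- length(ch)
  if (n < th) return("")
  for (d in seq(n %/% th, 1L)) {          # longest candidate length first
    idx <- seq_len(n - d)
    eq <- ch[idx] == ch[idx + d]          # TRUE where char i equals char i + d
    r <- rle(eq)
    # (th - 1) * d consecutive TRUEs mean a stretch of th * d characters
    # with period d, i.e. a block of length d repeated th times
    hit <- which(r$values & r$lengths >= (th - 1L) * d)
    if (length(hit) > 0L) {
      start <- sum(r$lengths[seq_len(hit[1L] - 1L)]) + 1L
      return(paste(ch[start:(start + d - 1L)], collapse = ""))
    }
  }
  ""
}

find_rep_lag("a0cc0vaaaabaaaabaaaabaa00bvw")  # "aaaab"
find_rep_lag("ff00f0f0f0f0f0f0f0f0000")       # "0f0f"
```

Each length d costs O(n) for the comparison and the run-length encoding, and there are at most n/th lengths, which gives the same O(n^2) bound.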

answered Oct 11 '22 02:10

Łukasz Kidziński
