Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get often occuring string patterns from column in R or Python

Tags:

python

regex

r

I have a column of string names, and I would like to find often occuring patterns (words). Is there a way to return, say, strings with a higher (or equal) length than X, and occur more often than Y times in the the whole column?

column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns <- function(.....) 
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
>     what   times 
[1]   bla    5
[2]   okay   5

Referring to the comment by Tim:

I would like the nested ones to be removed, so if there is "aaaaaaa" and "aaaa" and both would occur in the output, only "aaaaaaa" and the times that one occurs counts.

If atleaststringsize=3 and atleaststringsize=4, both the output will be the same. Lets say atleasttimes=10, and "aaaaaaaa" occurs 15 times and "aaaaaa" occurs 15 times, then:

getOftenOccurringPatterns(column, atleaststringsize=3, atleasttimes=10)
>    what      times
[1]  aaaaaaaa    15

and

getOftenOccurringPatterns(column, atleaststringsize=4, atleasttimes=10) 
>    what      times
[1]  aaaaaaaa    15

The longest one stays, and it's the same thing for both atleast=3, and atleast=4.

like image 827
PascalVKooten Avatar asked May 26 '13 08:05

PascalVKooten


2 Answers

Its in no way tested and wont win any speed races:

getOftenOccuringPatterns <- function(column, atleaststringsize, atleasttimes, uniqueInColumns = FALSE){

  res <- 
  lapply(column,function(x){
    lapply(atleaststringsize:nchar(x),function(y){
      if(uniqueInColumns){
        unique(substring(x, 1:(nchar(x)-y+1), y:nchar(x)))
      }else{
        substring(x, 1:(nchar(x)-y+1), y:nchar(x))
      }
    })
  })

  orderedRes <- unlist(res)[order(unlist(res))]
  encodedRes <- rle(orderedRes)

  partRes <- with(encodedRes, {check = (lengths >= atleasttimes);
                               list(what = values[check], times = lengths[check])})
  testRes <- sapply(partRes$what, function(x){length(grep(x, partRes$what)) > 1})

  lapply(partRes, '[', !testRes)

}


column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
$what

 "bla" "okay" 

$times

5 5 


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4)
$what
[1] "aaaaaa"

$times
[1] 6


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4, uniqueInColumn = TRUE)
$what
[1] "aaaaa"

$times
[1] 4
like image 174
user1609452 Avatar answered Oct 03 '22 09:10

user1609452


This creates a vector of all occurrences of all substrings; it does so naively, iterating over the maximum length of the input string max(nchar(x)) and looking for all subsequences of length 1, 2, ... max(nchar(x)), so scales in polynomial time -- it won't be efficient for super-large problems.

This revision incorporates the following changes:

  1. .accumulate in inner and outer loops of the previous version implemented the dreaded "copy-and-append" pattern; now we accumulate results in a pre-allocated list answer0 and then accumulate these after the inner loop.

  2. allSubstrings() has arguments min_occur, min_nchar (and max_nchar) to restrict the search space. In particular, min_occur (the minimum number of times a substring must occur to be retained) helps to reduce the length of the character vector in which longer substrings are searched.

  3. The function .filter() can be used to more aggressively remove strings that do not contain substrings of length i; this can be costly, so there's a heuristic and argument useFilter that can be set. The use of a filter makes the whole solution seem more like a hack than an algorithm -- the information about substrings has already been extracted, so we shouldn't have to go back and search for their occurrence again.

Here is the revised main function

allSubstrings <-
    function(x, min_occur=1L, min_nchar=1L, max_nchar=max(nchar(x)),
             ..., useFilter=max(nchar(x)) > 100L)
{
    len <- nchar(x)
    x <- x[len >= min_nchar]; len <- len[len >= min_nchar]
    answer <- vector("list", max_nchar - min_nchar + 1L)
    for (i in seq(min_nchar, max_nchar)) {
        ## suffix of length i, starting at character j
        x0 <- x; len0 <- len; n <- max(len0) - i + 1L
        answer0 <- vector("list", n)
        for (j in seq_len(n)) {
            end <- j + i - 1L
            f <- factor(substr(x0, j, end))
            answer0[[j]] <- setNames(tabulate(f), levels(f))
            x0 <- x0[len0 != end]; len0 <- len0[len0 != end]
        }
        answer0 <- unlist(answer0)        # accumulate across start positions
        answer0 <- vapply(split(answer0, names(answer0)), sum, integer(1))
        answer0 <- answer0[answer0 >= min_occur]
        if (length(answer0) == 0L)
            break
        answer[[i - min_nchar + 1L]] <- answer0

        idx <- len != i                   # no need to process some strings
        if (useFilter)
            idx[idx] <- .filter(x[idx], names(answer0))
        x <- x[idx]; len <- len[idx]
        if (length(x) == 0L)
            break
    }
    unlist(answer[seq_len(i)])
}

and the .filter function

.filter <-
    function(s, q)
{
    ## which 's' contain at least one 'q'
    answer <- rep(FALSE, length(s))
    idx <- !answer      # use this to minimize the number of greps
    for (elt in q) {
        answer[idx] <- answer[idx] | grepl(elt, s[idx], fixed=TRUE)
        idx[idx] <- !answer[idx]
    }
    answer
}

As before result is a named vector, where the names are the strings and the values are the counts of their occurrence.

> column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
> xx <- allSubstrings(column)
> head(sort(xx, decreasing=TRUE))
 a  b  o  k  l  y 
10  5  5  5  5  5 
> xtabs(~nchar(names(xx)) + xx)
                xx
nchar(names(xx))  1  2  3  5 10
              1   2  1  1  5  1
              2   8  2  0  5  0
              3  15  1  0  3  0
              4  20  1  0  1  0
              5  22  0  0  0  0
....

Queries like in the original question are then easy to perform, e.g., all substrings of >= 3 characters occurring more than 4 times:

> (ok <- xx[nchar(names(xx)) >= 3 & xx > 4])
 bla  oka  kay okay 
   5    5    5    5 

The code doesn't fully answer the question, e.g., nested substrings are present, but might replace the nested lapply portion of @user1609452's answer. Post-processing this result to eliminate nested subsequences is a little inelegant, but since the result being post-processed is not large will likely be fast enough, e.g., to eliminate nested substrings

> fun <- function(p, q) length(grep(p, q, fixed=TRUE))
> ok[ sapply(names(ok), fun, names(ok)) == 1L ]   
 bla okay 
   5    5 

Here we use the 99k word dictionary on my laptop for input, with some basic timings for the revised algorithm

> timer <- function(n, x, ...)
    system.time(allSubstrings(head(x, n), ...))[[3]]
> n <- c(100, 1000, 10000, 20000)
> data.frame(n=n, elapsed=sapply(n, timer, words))
      n elapsed
1   100   0.050
2  1000   0.074
3 10000   0.490
4 20000   1.031

This is about 10x faster than the original algorithm, due in this case entirely to revision 1 (using pre-allocate and fill, followed by accumulation).

Here's a corpus of longer sentences

shakes <- readLines("http://www.gutenberg.org/cache/epub/100/pg100.txt")
shakes <- paste(shakes[nchar(shakes) != 0], collapse=" ")
shakes <- gsub(" +", " ", shakes)
shakes <- strsplit(shakes, "\\. +",)[[1]]

and some timings. This benefits alot from specifying a min_occur argument, and from use of the filter.

> n <- c(100, 1000, 2000, 5000)
> data.frame(n=n, elapsed=sapply(n, timer, shakes, min_occur=10))
     n elapsed
1  100   1.725
2 1000   7.724
3 2000  12.415
4 5000  60.914

The need to use a filter and the poor performance on longer strings leads one to want to arrive at a better algorithm, like suffix array; the "Rlibstree" package might also be useful, although I'm unsure of where to get a current version or whether the exposed part of the interface is sufficient to answer the original question.

like image 21
Martin Morgan Avatar answered Oct 03 '22 11:10

Martin Morgan