I have a column of string names, and I would like to find frequently occurring patterns (words). Is there a way to return, say, strings with a length greater than (or equal to) X that occur more than Y times in the whole column?
column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns <- function(.....)
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
> what times
[1] bla 5
[2] okay 5
Referring to the comment by Tim:
I would like the nested ones to be removed, so if both "aaaaaaa" and "aaaa" would occur in the output, only "aaaaaaa" is kept, and only the number of times it occurs counts.
If atleaststringsize=3 and atleaststringsize=4, the output will be the same for both. Let's say atleasttimes=10, and "aaaaaaaa" occurs 15 times and "aaaaaa" occurs 15 times; then:
getOftenOccurringPatterns(column, atleaststringsize=3, atleasttimes=10)
> what times
[1] aaaaaaaa 15
and
getOftenOccurringPatterns(column, atleaststringsize=4, atleasttimes=10)
> what times
[1] aaaaaaaa 15
The longest one stays, and the result is the same for both atleaststringsize=3 and atleaststringsize=4.
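For illustration, a minimal sketch of the rule I have in mind (dropNested is a hypothetical helper with made-up inputs, not something that exists yet): a pattern is dropped when it is contained in another pattern that is also in the output.

dropNested <- function(what, times) {
    ## TRUE for patterns contained in some other retained pattern
    nested <- sapply(seq_along(what), function(i)
        any(grepl(what[i], what[-i], fixed = TRUE)))
    data.frame(what = what[!nested], times = times[!nested])
}
dropNested(c("aaaaaaaa", "aaaa"), c(15, 20))
##       what times
## 1 aaaaaaaa    15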
It's in no way tested and won't win any speed races:
getOftenOccuringPatterns <- function(column, atleaststringsize, atleasttimes,
                                     uniqueInColumns = FALSE){
  ## enumerate every substring of each element, from length
  ## 'atleaststringsize' up to the full string length
  res <-
    lapply(column, function(x){
      lapply(atleaststringsize:nchar(x), function(y){
        if(uniqueInColumns){
          ## count each substring at most once per element
          unique(substring(x, 1:(nchar(x)-y+1), y:nchar(x)))
        }else{
          substring(x, 1:(nchar(x)-y+1), y:nchar(x))
        }
      })
    })
  ## sort so identical substrings are adjacent, then count them with rle()
  orderedRes <- unlist(res)[order(unlist(res))]
  encodedRes <- rle(orderedRes)
  ## keep only substrings occurring at least 'atleasttimes' times
  partRes <- with(encodedRes, {check = (lengths >= atleasttimes);
                               list(what = values[check], times = lengths[check])})
  ## drop substrings that are contained in another retained substring
  testRes <- sapply(partRes$what, function(x){length(grep(x, partRes$what)) > 1})
  lapply(partRes, '[', !testRes)
}
column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
$what
[1] "bla"  "okay"
$times
[1] 5 5
getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4)
$what
[1] "aaaaaa"
$times
[1] 6
getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4, uniqueInColumn = TRUE)
$what
[1] "aaaaa"
$times
[1] 4
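The counting step in the function above relies on sorting the substrings and then run-length encoding them; a tiny standalone illustration with a made-up vector (not from the answer itself):

subs <- c("bla", "okay", "bla", "kay", "bla")  # made-up example substrings
rle(sort(subs))   # identical values are adjacent after sorting,
                  # so run lengths are occurrence counts
## Run Length Encoding
##   lengths: int [1:3] 3 1 1
##   values : chr [1:3] "bla" "kay" "okay"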
This creates a vector of all occurrences of all substrings; it does so naively, iterating over the maximum length of the input strings max(nchar(x)) and looking for all substrings of length 1, 2, ..., max(nchar(x)), so scales in polynomial time -- it won't be efficient for super-large problems.
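As a rough illustration of the inner step (using the example column from the question; this snippet is not part of the answer), one fixed substring length i and start position j are handled with a single vectorized substr() call:

column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
i <- 3; j <- 1                  # substring length and start position
substr(column, j, j + i - 1)    # one (length, start) slice of the search
## [1] "bla" "oka" "bla" "bla" "oka"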
This revision incorporates the following changes:

1. The inner and outer loops of the previous version accumulated results with the dreaded "copy-and-append" pattern; now results are collected in a pre-allocated list answer0 and combined after the inner loop.

2. allSubstrings() has arguments min_occur, min_nchar (and max_nchar) to restrict the search space. In particular, min_occur (the minimum number of times a substring must occur to be retained) helps to reduce the length of the character vector in which longer substrings are searched.

3. The function .filter() can be used to more aggressively remove strings that do not contain substrings of length i; this can be costly, so there's a heuristic and an argument useFilter that can be set. The use of a filter makes the whole solution seem more like a hack than an algorithm -- the information about substrings has already been extracted, so we shouldn't have to go back and search for their occurrence again.
Here is the revised main function:
allSubstrings <-
    function(x, min_occur=1L, min_nchar=1L, max_nchar=max(nchar(x)),
             ..., useFilter=max(nchar(x)) > 100L)
{
    len <- nchar(x)
    x <- x[len >= min_nchar]; len <- len[len >= min_nchar]
    answer <- vector("list", max_nchar - min_nchar + 1L)
    for (i in seq(min_nchar, max_nchar)) {
        ## substrings of length i, starting at character j
        x0 <- x; len0 <- len; n <- max(len0) - i + 1L
        answer0 <- vector("list", n)
        for (j in seq_len(n)) {
            end <- j + i - 1L
            f <- factor(substr(x0, j, end))
            answer0[[j]] <- setNames(tabulate(f), levels(f))
            ## drop strings too short to contribute at later start positions
            x0 <- x0[len0 != end]; len0 <- len0[len0 != end]
        }
        answer0 <- unlist(answer0)          # accumulate across start positions
        answer0 <- vapply(split(answer0, names(answer0)), sum, integer(1))
        answer0 <- answer0[answer0 >= min_occur]
        if (length(answer0) == 0L)
            break
        answer[[i - min_nchar + 1L]] <- answer0
        idx <- len != i                     # no need to process some strings
        if (useFilter)
            idx[idx] <- .filter(x[idx], names(answer0))
        x <- x[idx]; len <- len[idx]
        if (length(x) == 0L)
            break
    }
    unlist(answer[seq_len(i)])
}
and the .filter() function:
.filter <-
    function(s, q)
{
    ## which 's' contain at least one 'q'
    answer <- rep(FALSE, length(s))
    idx <- !answer              # use this to minimize the number of greps
    for (elt in q) {
        answer[idx] <- answer[idx] | grepl(elt, s[idx], fixed=TRUE)
        idx[idx] <- !answer[idx]
    }
    answer
}
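A quick sanity check of what .filter() returns, with made-up inputs (not from the answer); only strings containing at least one of the query substrings come back TRUE:

.filter(c("bla1okay", "zzz", "okaybla"), c("bla", "okay"))
## [1]  TRUE FALSE  TRUE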
As before, the result is a named vector, where the names are the substrings and the values are the counts of their occurrences.
> column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
> xx <- allSubstrings(column)
> head(sort(xx, decreasing=TRUE))
a b o k l y
10 5 5 5 5 5
> xtabs(~nchar(names(xx)) + xx)
                xx
nchar(names(xx))  1  2  3  5 10
               1  2  1  1  5  1
               2  8  2  0  5  0
               3 15  1  0  3  0
               4 20  1  0  1  0
               5 22  0  0  0  0
....
Queries like the one in the original question are then easy to perform, e.g., all substrings of >= 3 characters occurring more than 4 times:
> (ok <- xx[nchar(names(xx)) >= 3 & xx > 4])
bla oka kay okay
5 5 5 5
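The same restriction can also be pushed into the search itself via the min_nchar and min_occur arguments, which keeps the intermediate vectors smaller; a sketch (this particular call is not from the answer, but with the example column it should reproduce the same set of substrings, before nested ones are removed):

## restrict the search up front instead of filtering afterwards
allSubstrings(column, min_nchar = 3, min_occur = 5)
## should return bla, kay, oka and okay, each with a count of 5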
The code doesn't fully answer the question, e.g., nested substrings are present, but it might replace the nested lapply portion of @user1609452's answer. Post-processing this result to eliminate nested substrings is a little inelegant, but since the result being post-processed is not large, it will likely be fast enough:
> fun <- function(p, q) length(grep(p, q, fixed=TRUE))
> ok[ sapply(names(ok), fun, names(ok)) == 1L ]
bla okay
5 5
Here we use the 99k-word dictionary on my laptop as input, with some basic timings for the revised algorithm:
> timer <- function(n, x, ...)
system.time(allSubstrings(head(x, n), ...))[[3]]
> n <- c(100, 1000, 10000, 20000)
> data.frame(n=n, elapsed=sapply(n, timer, words))
n elapsed
1 100 0.050
2 1000 0.074
3 10000 0.490
4 20000 1.031
This is about 10x faster than the original algorithm, due in this case entirely to revision 1 (using pre-allocate and fill, followed by accumulation).
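To make the effect of revision 1 concrete, here is a small self-contained comparison of the two accumulation patterns (illustrative code, not taken from either answer):

## "copy-and-append": the result vector is copied on every iteration
grow_by_append <- function(n) {
    out <- integer(0)
    for (i in seq_len(n))
        out <- c(out, i)             # quadratic copying overall
    out
}
## pre-allocate and fill, then accumulate once at the end
grow_prealloc <- function(n) {
    out <- vector("list", n)
    for (i in seq_len(n))
        out[[i]] <- i
    unlist(out)
}
identical(grow_by_append(1e4), grow_prealloc(1e4))  # TRUE; the second is much faster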
Here's a corpus of longer sentences:
shakes <- readLines("http://www.gutenberg.org/cache/epub/100/pg100.txt")
shakes <- paste(shakes[nchar(shakes) != 0], collapse=" ")
shakes <- gsub(" +", " ", shakes)
shakes <- strsplit(shakes, "\\. +")[[1]]
and some timings. This benefits a lot from specifying a min_occur argument and from use of the filter.
> n <- c(100, 1000, 2000, 5000)
> data.frame(n=n, elapsed=sapply(n, timer, shakes, min_occur=10))
n elapsed
1 100 1.725
2 1000 7.724
3 2000 12.415
4 5000 60.914
The need to use a filter and the poor performance on longer strings suggest that a better algorithm is called for, e.g., one based on a suffix array; the "Rlibstree" package might also be useful, although I'm not sure where to get a current version or whether the exposed part of the interface is sufficient to answer the original question.
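For what it's worth, here is a toy sketch of the suffix-array idea (it does not use the Rlibstree interface, and the helper functions are invented for illustration): sort all suffixes of the input, and repeated substrings then show up as common prefixes of adjacent entries.

## all suffixes of every string in the input
suffixes <- function(x)
    unlist(lapply(x, function(s) substring(s, seq_len(nchar(s)), nchar(s))))
## length of the longest common prefix of two strings
lcp <- function(a, b) {
    n <- min(nchar(a), nchar(b))
    i <- 0L
    while (i < n && substr(a, i + 1L, i + 1L) == substr(b, i + 1L, i + 1L))
        i <- i + 1L
    i
}
sa <- sort(suffixes(column))                    # poor man's suffix array
head(mapply(lcp, head(sa, -1L), tail(sa, -1L))) # long shared prefixes = repeats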