Here's an appeal for a better way to do something that I can already do inefficiently: filter a series of n-gram tokens using "stop words" so that the occurrence of any stop word term in an n-gram triggers removal.
I'd very much like to have one solution that works for both unigrams and n-grams, although it would be fine to have two versions, one with a "fixed" flag and one with a "regex" flag. I'm putting the two aspects of the question together since someone may have an approach that handles both fixed and regular-expression stopword patterns.
Formats:
tokens are a list of character vectors, which may be unigrams, or n-grams concatenated by a "_" (underscore) character.
stopwords are a character vector. Right now I am content for these to be fixed strings, but it would be a nice bonus to be able to supply regular-expression-formatted stopwords too.
Desired Output: A list of character vectors matching the input tokens, but with any token removed in which a component matches a stop word. (This means either a unigram match, or a match to any one of the terms the n-gram comprises.)
Examples, test data, and working code and benchmarks to build on:
tokens1 <- list(text1 = c("this", "is", "a", "test", "text", "with", "a", "few", "words"),
text2 = c("some", "more", "words", "in", "this", "test", "text"))
tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"),
text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens3 <- list(text1 = c("this_is_a", "is_a_test", "a_test_text", "test_text_with", "text_with_a", "with_a_few", "a_few_words"),
text2 = c("some_more_words", "more_words_in", "words_in_this", "in_this_text", "this_text_text"))
stopwords <- c("is", "a", "in", "this")
# remove any single token that matches a stopword
removeTokensOP1 <- function(w, stopwords) {
  # logical indexing is safe even when no token matches a stopword
  lapply(w, function(x) x[!x %in% stopwords])
}
# remove any n-gram in which any component word matches a stopword
removeTokensOP2 <- function(w, stopwords) {
  # build one alternation with each stopword anchored at a token boundary (^, $, or _)
  matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
  # grepl() keeps the result correct even when nothing matches
  lapply(w, function(x) x[!grepl(matchPattern, x)])
}
removeTokensOP1(tokens1, stopwords)
## $text1
## [1] "test" "text" "with" "few" "words"
##
## $text2
## [1] "some" "more" "words" "test" "text"
removeTokensOP2(tokens1, stopwords)
## $text1
## [1] "test" "text" "with" "few" "words"
##
## $text2
## [1] "some" "more" "words" "test" "text"
removeTokensOP2(tokens2, stopwords)
## $text1
## [1] "test_text" "text_with" "few_words"
##
## $text2
## [1] "some_more" "more_words" "text_text"
removeTokensOP2(tokens3, stopwords)
## $text1
## [1] "test_text_with"
##
## $text2
## [1] "some_more_words"
# performance benchmarks for answers to build on
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
OP2_1 = removeTokensOP2(tokens1, stopwords),
OP2_2 = removeTokensOP2(tokens2, stopwords),
OP2_3 = removeTokensOP2(tokens3, stopwords),
unit = "relative")
## Unit: relative
## expr min lq mean median uq max neval
## OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
## OP2_1 5.119066 3.812845 3.438076 3.714492 3.547187 2.838351 100
## OP2_2 5.230429 3.903135 3.509935 3.790143 3.631305 2.510629 100
## OP2_3 5.204924 3.884746 3.578178 3.753979 3.553729 8.240244 100
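To make the fixed/regex part of the ask concrete, here is a rough, minimal sketch of the single interface I have in mind, splitting each token on "_" and switching between fixed and regular-expression matching; the function name and the valuetype argument are only illustrative, not an existing API:
# sketch only: unified fixed/regex stopword removal for unigrams and n-grams
removeTokensSketch <- function(w, stopwords, valuetype = c("fixed", "regex")) {
  valuetype <- match.arg(valuetype)
  lapply(w, function(x) {
    parts <- strsplit(x, "_", fixed = TRUE)   # component words of each (n-)gram
    drop <- vapply(parts, function(y) {
      if (valuetype == "fixed") any(y %in% stopwords)
      else any(grepl(paste(stopwords, collapse = "|"), y))
    }, logical(1))
    x[!drop]
  })
}
removeTokensSketch(tokens2, stopwords)                   # fixed matching, as above
removeTokensSketch(tokens2, c("^thi", "^i.$"), "regex")  # regex-style stopword patterns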
This isn't really an answer, more of a comment in reply to rawr's comment about going through all combinations of stopwords. With a longer stopwords list, using something like %in% doesn't seem to suffer from that dimensionality issue.
library(purrr)
# for each document, flag elements whose component words (split on "_") contain a stopword,
# then drop the flagged elements
removetokenstst <- function(tokens, stopwords)
  map2(tokens,
       lapply(tokens, function(x) {
         unlist(lapply(strsplit(x, "_"), function(y) {
           any(y %in% stopwords)
         }))
       }),
       ~ .x[!.y])
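As a quick sanity check, applying it to the question's tokens2 with the short stopwords vector should reproduce the removeTokensOP2(tokens2, stopwords) output shown above:
removetokenstst(tokens2, stopwords)
## should match removeTokensOP2(tokens2, stopwords)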
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, morestopwords),
OP2_1 = removeTokensOP2(tokens1, morestopwords),
OP2_2 = removeTokensOP2(tokens2, morestopwords),
OP2_3 = removeTokensOP2(tokens3, morestopwords),
Ak_3 = removetokenstst(tokens3, stopwords),
Ak_3msw = removetokenstst(tokens3, morestopwords),
unit = "relative")
Unit: relative
expr min lq mean median uq max neval
OP1_1 1.00000 1.00000 1.000000 1.000000 1.000000 1.00000 100
OP2_1 278.48260 176.22273 96.462854 79.787932 76.904987 38.31767 100
OP2_2 280.90242 181.22013 98.545148 81.407928 77.637006 64.94842 100
OP2_3 279.43728 183.11366 114.879904 81.404236 82.614739 72.04741 100
Ak_3 15.74301 14.83731 9.340444 7.902213 8.164234 11.27133 100
Ak_3msw 18.57697 14.45574 12.936594 8.513725 8.997922 24.03969 100
Stopwords
morestopwords = c("a", "about", "above", "after", "again", "against", "all",
"am", "an", "and", "any", "are", "arent", "as", "at", "be", "because",
"been", "before", "being", "below", "between", "both", "but",
"by", "cant", "cannot", "could", "couldnt", "did", "didnt", "do",
"does", "doesnt", "doing", "dont", "down", "during", "each",
"few", "for", "from", "further", "had", "hadnt", "has", "hasnt",
"have", "havent", "having", "he", "hed", "hell", "hes", "her",
"here", "heres", "hers", "herself", "him", "himself", "his",
"how", "hows", "i", "id", "ill", "im", "ive", "if", "in", "into",
"is", "isnt", "it", "its", "its", "itself", "lets", "me", "more",
"most", "mustnt", "my", "myself", "no", "nor", "not", "of", "off",
"on", "once", "only", "or", "other", "ought", "our", "ours",
"ourselves", "out", "over", "own", "same", "shant", "she", "shed",
"shell", "shes", "should", "shouldnt", "so", "some", "such",
"than", "that", "thats", "the", "their", "theirs", "them", "themselves",
"then", "there", "theres", "these", "they", "theyd", "theyll",
"theyre", "theyve", "this", "those", "through", "to", "too",
"under", "until", "up", "very", "was", "wasnt", "we", "wed",
"well", "were", "weve", "were", "werent", "what", "whats", "when",
"whens", "where", "wheres", "which", "while", "who", "whos",
"whom", "why", "whys", "with", "wont", "would", "wouldnt", "you",
"youd", "youll", "youre", "youve", "your", "yours", "yourself",
"yourselves", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
"k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w",
"x", "y", "z")
We can improve on the lapply if you have many levels in your list by using the parallel package.
Create many levels
tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"),
text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens2 <- lapply(1:500,function(x) sample(tokens2,1)[[1]])
We do this because the parallel package has a lot of overhead to set up, so just increasing the number of iterations in microbenchmark would keep incurring that cost. By increasing the size of the list instead, you see the true improvement.
library(parallel)
library(microbenchmark)
#Setup
cl <- detectCores()
cl <- makeCluster(cl)
#Two functions:
#original: operates on the whole list
removeTokensOP2 <- function(w, stopwords) {
  matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
  lapply(w, function(x) x[!grepl(matchPattern, x)])
}
#new: operates on a single character vector
removeTokensOPP <- function(w, stopwords) {
  matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
  w[!grepl(matchPattern, w)]
}
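#removeTokensOPP works on a single character vector rather than a list,
#so parLapply can hand each list element to a separate worker, e.g.
removeTokensOPP(tokens2[[1]], stopwords)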
#compare
microbenchmark(
OP2_P = parLapply(cl,tokens2,removeTokensOPP,stopwords),
OP2_2 = removeTokensOP2(tokens2, stopwords),
unit = 'relative'
)
Unit: relative
expr min lq mean median uq max neval
OP2_P 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 100
OP2_2 1.730565 1.653872 1.678781 1.562258 1.471347 10.11306 100
As the number of levels in your list increases, the performance will improve.
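Once the benchmarking is done, the worker processes should be shut down:
stopCluster(cl)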