I have a very long vector of brief texts in R (say, length 10 million). The first five items of the vector are as follows:

c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
  "I am an angry, angry, tiger.", "Beep boop.")
I have a dictionary, which we will say is composed of the words "angry" and "unhappy".
What is the fastest way to obtain a count of matches from this dictionary on the vector of texts? In this case, the correct answer would be the vector [1, 1, 2, 2, 0].
I have tried solutions involving quanteda and tm, and basically they all fail because I cannot store any large document-feature matrix in memory. Bonus points for any solution using qdap, dplyr, and termco.
Using the stringi package:
library(stringi)
stri_count_regex(v1, paste(v2, collapse = '|'))
#[1] 1 1 2 2 0
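If only whole-word matches should count (an assumption on my part; the question does not say whether substring hits are acceptable), the same stri_count_regex call can be written with word boundaries, which ICU regular expressions support:

# whole-word variant (assumes substring hits should not be counted)
stri_count_regex(v1, paste0("\\b(", paste(v2, collapse = "|"), ")\\b"))
#[1] 1 1 2 2 0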
DATA
v1 <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
        "I am an angry, angry, tiger.", "Beep boop.")
v2 <- c("angry", "unhappy")
We can use base R methods with gregexpr and Reduce:
# For each dictionary word, count its matches in every text, then add the
# per-word count vectors element-wise
Reduce(`+`, lapply(dict, function(x) lengths(regmatches(txt, gregexpr(x, txt)))))
#[1] 1 1 2 2 0
Or a faster approach would be
# Same idea, but read the number of matches straight from the match.length
# attribute instead of materializing the matched substrings with regmatches()
Reduce(`+`, lapply(dict, function(x) vapply(gregexpr(x, txt),
       function(y) sum(attr(y, "match.length") > 0), 0)))
#[1] 1 1 2 2 0
NOTE: Because this works directly on the text vector and never builds a document-feature matrix, it avoids the memory problems described in the question, even with large datasets and a large number of dictionary elements.
txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
         "I am an angry, angry, tiger.", "Beep boop.")
dict <- c("angry", "unhappy")
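A single-pass variant (a sketch of mine, not part of the answer above): the dictionary can also be collapsed into one alternation so gregexpr() scans each text only once, which may matter as the dictionary grows; whether it beats the per-word loop would need benchmarking on the real data.

# Collapse the dictionary into one regex and count all of its matches per text
pattern <- paste(dict, collapse = "|")
vapply(gregexpr(pattern, txt),
       function(m) sum(attr(m, "match.length") > 0), 0)
#[1] 1 1 2 2 0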