I have a very long vector of brief texts in R (say, length 10 million). The first five items of the vector are as follows:

c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
  "I am an angry, angry, tiger.", "Beep boop.")
I have a dictionary, which we will say is composed of the words "angry" and "unhappy".
What is the fastest way to obtain a count of matches from this dictionary on the vector of texts? In this case, the correct answer would be the vector [1, 1, 2, 2, 0].
I have tried solutions involving quanteda and tm, and basically they all fail because I cannot store any large document-feature matrix in memory. Bonus points for any solution using qdap, dplyr, and termco.
Using the stringi package:
library(stringi)
stri_count_regex(v1, paste(v2, collapse = '|'))
#[1] 1 1 2 2 0
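If only whole-word matches should count (an assumption on my part; the question does not say whether substring hits are acceptable), the same stri_count_regex call can be written with word boundaries, which ICU regular expressions support:

# whole-word variant (assumes substring hits should not be counted)
stri_count_regex(v1, paste0("\\b(", paste(v2, collapse = "|"), ")\\b"))
#[1] 1 1 2 2 0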
DATA
v1 <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
        "I am an angry, angry, tiger.", "Beep boop.")
v2 <- c("angry", "unhappy")
We can use base R methods with gregexpr and Reduce:
# For each dictionary word, count its matches in every text, then add the
# per-word count vectors element-wise
Reduce(`+`, lapply(dict, function(x) lengths(regmatches(txt, gregexpr(x, txt)))))
#[1] 1 1 2 2 0
Or a faster approach would be
# Same idea, but read the number of matches straight from the match.length
# attribute instead of materializing the matched substrings with regmatches()
Reduce(`+`, lapply(dict, function(x) vapply(gregexpr(x, txt),
       function(y) sum(attr(y, "match.length") > 0), 0)))
#[1] 1 1 2 2 0
NOTE: Because this works directly on the text vector and never builds a document-feature matrix, it avoids the memory problems described in the question, even with large datasets and a large number of dictionary elements.
txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.",
         "I am an angry, angry, tiger.", "Beep boop.")
dict <- c("angry", "unhappy")
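A single-pass variant (a sketch of mine, not part of the answer above): the dictionary can also be collapsed into one alternation so gregexpr() scans each text only once, which may matter as the dictionary grows; whether it beats the per-word loop would need benchmarking on the real data.

# Collapse the dictionary into one regex and count all of its matches per text
pattern <- paste(dict, collapse = "|")
vapply(gregexpr(pattern, txt),
       function(m) sum(attr(m, "match.length") > 0), 0)
#[1] 1 1 2 2 0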