
Fast count of word matches in dictionary for vector of texts in R

Tags: text, r

I have a very long vector of brief texts in R (say, length 10 million). The first five items of the list are as follows:

  1. "I am an angry tiger."
  2. "I am unhappy clam."
  3. "I am an angry and unhappy tiger."
  4. "I am an angry, angry, tiger."
  5. "Beep boop."

I have a dictionary, which we will say is composed of the words "angry" and "unhappy".

What is the fastest way to obtain a count of matches from this dictionary on the vector of texts? In this case, the correct answer would be the vector [1, 1, 2, 2, 0].

I have tried solutions involving quanteda and tm, and they all fail because I cannot store a large document-feature matrix in memory. Bonus points for any solution using qdap, dplyr, and termco.

asked Dec 08 '22 by mlachans

2 Answers

Using the stringi package:

library(stringi)
stri_count_regex(v1, paste(v2, collapse = '|'))
#[1] 1 1 2 2 0

DATA

dput(v1)
c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.", 
"I am an angry, angry, tiger.", "Beep boop.")
dput(v2)
c("angry", "unhappy")
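One caveat worth noting (my addition, not part of the original answer): the alternation pattern also counts substring hits, so for example "angry" would match inside "angrily". If whole-word matches are needed, wrapping the alternation in word boundaries is a small tweak:

```r
library(stringi)

v1 <- c("I am an angry tiger.", "I am unhappy clam.",
        "I am an angry and unhappy tiger.",
        "I am an angry, angry, tiger.", "Beep boop.")
v2 <- c("angry", "unhappy")

# \b anchors each dictionary word at word boundaries, so a word like
# "angrily" would no longer be counted as a match for "angry"
pattern <- paste0("\\b(", paste(v2, collapse = "|"), ")\\b")
counts <- stri_count_regex(v1, pattern)
counts
# [1] 1 1 2 2 0
```

On this sample the result is identical to the plain alternation; the two only diverge when a dictionary word occurs as a substring of a longer word.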
answered May 24 '23 by Sotos


We can use base R methods with gregexpr and Reduce:

Reduce(`+`, lapply(dict, function(x) lengths(regmatches(txt, gregexpr(x, txt)))))
#[1] 1 1 2 2 0

Or, a faster approach:

Reduce(`+`, lapply(dict, function(x) vapply(gregexpr(x, txt),
          function(y) sum(attr(y, "match.length")>0), 0)))
#[1] 1 1 2 2 0

NOTE: Because this computes counts one dictionary word at a time without ever building a document-feature matrix, it avoids the memory limitations the question ran into, even with large datasets and a large number of dictionary elements.

data

txt <- c("I am an angry tiger.", "I am unhappy clam.", "I am an angry and unhappy tiger.", 
          "I am an angry, angry, tiger." ,"Beep boop.") 
dict <- c("angry", "unhappy")
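For a vector of 10 million texts, it can also help to process the input in fixed-size chunks so that no intermediate object larger than one chunk is held in memory. A minimal sketch of that idea, using stri_count_regex from the first answer (the chunk size here is arbitrary and would be tuned upward, e.g. to 1e6, on real data):

```r
library(stringi)

txt <- c("I am an angry tiger.", "I am unhappy clam.",
         "I am an angry and unhappy tiger.",
         "I am an angry, angry, tiger.", "Beep boop.")
dict <- c("angry", "unhappy")

pattern <- paste(dict, collapse = "|")

# Split the indices into consecutive chunks and count matches per chunk;
# unlist() reassembles the per-chunk results in the original order
chunk_size <- 2L
idx <- split(seq_along(txt), ceiling(seq_along(txt) / chunk_size))
counts <- unlist(lapply(idx, function(i) stri_count_regex(txt[i], pattern)),
                 use.names = FALSE)
counts
# [1] 1 1 2 2 0
```

The same chunking pattern works with the gregexpr-based approach above; only the per-chunk counting expression changes.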
answered May 24 '23 by akrun