 

How to optimize string detection for speed?

When I do text analysis, I frequently want to figure out whether each of a large number of documents contains any element of a list of strings. With millions of documents (e.g. tweets) and a long list of patterns, this can take a long time.

I usually use the following packages to optimize for speed: data.table, dtplyr, and stringr.

What are some best practices to optimize string detection and analysis thereof? Are there packages that would allow me to optimize code like this:

library(data.table)
library(dtplyr)
library(dplyr)
library(stringr)

my_dt <- data.table(text = c("this is some text", "this is some more text")) # imagine many more strings
my_string <- paste(words, collapse = "|") # words is stringr's built-in vector of common words

lazy_dt(my_dt, immutable = FALSE) %>%
  filter(str_detect(text, my_string)) %>%
  as.data.table()

I would assume that using data.table directly instead of the dtplyr implementation would increase speed. Are there any other ways to improve performance for this kind of application?


I looked at this question and was hoping I could get some similar guidance. Hopefully, the question is specific enough as it is now.

asked Mar 20 '26 by Tea Tree

1 Answer

As I mentioned in the comments, str_detect(text, my_string) is the bottleneck in your code. Also note that it does not do exactly what you are expecting: it performs a regex search, and because the words vector includes single-letter words like "a", every text that merely contains the letter "a" is counted as a match. See the examples below.
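A quick way to see the problem (a minimal check, using stringr's built-in words vector as in your code):

library(stringr)

# "palabras" contains the letter "a", and words includes the one-letter
# word "a", so the pattern matches even though no whole word matches
str_detect("text palabras", paste(words, collapse = "|"))
[1] TRUE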

library(data.table)
library(dtplyr)
library(stringr)
library(dplyr)


my_dt <- data.table(id = 1:300000,
                    text = rep(c("this is some text", "this is some more text", 
                             "text palabras"), 100000)) #imagine many more strings
my_string <- paste(stringr::words, collapse = "|")

# start timing (alternatively, wrap the expression in system.time())
timing <- Sys.time()

# run the code
lazy_dt(my_dt, immutable = FALSE) %>%
  filter(str_detect(text, my_string)) %>%
  as.data.table()

            id                   text
     1:      1      this is some text
     2:      2 this is some more text
     3:      3          text palabras
     4:      4      this is some text
     5:      5 this is some more text
    ---                              
299996: 299996 this is some more text
299997: 299997          text palabras
299998: 299998      this is some text
299999: 299999 this is some more text
300000: 300000          text palabras

Sys.time() - timing
Time difference of 6.708245 secs

Note: the data.table equivalent of your code above is the following:

my_dt[str_detect(text, my_string), ]

Timing this is about 6.52 seconds, so not much of an improvement.

As you can see from the result above, this selection returns all the sentences, because there is an "a" in palabras; those rows shouldn't be here. Now, data.table has a function called %chin%, which is like %in% but for character vectors and a lot faster. To match on whole words we first need to tokenize everything, which can be done with unnest_tokens from tidytext; this function respects the data.table format. Afterwards I filter the data on the matching words, drop the word column, and take the distinct (unique) rows of the data.table. That last step is needed because the result can contain duplicate lines when multiple words in a text match. Even though there are more function calls, this is about three times as fast.

library(tidytext)

timing <- Sys.time()
my_dt <- unnest_tokens(my_dt, word, text, drop = FALSE) # one row per token, keeping the original text column
my_dt <- unique(my_dt[word %chin% words, ], by = c("id", "text"))[, c("id", "text")] # keep matching rows, then de-duplicate


           id                   text
     1:     1      this is some text
     2:     2 this is some more text
     3:     4      this is some text
     4:     5 this is some more text
     5:     7      this is some text
    ---                             
199996: 299993 this is some more text
199997: 299995      this is some text
199998: 299996 this is some more text
199999: 299998      this is some text
200000: 299999 this is some more text

Sys.time() - timing
Time difference of 2.380911 secs
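As a side note: if you would rather keep a single regex call, wrapping the pattern in word boundaries also avoids the false matches. This is just a sketch of an alternative (not part of the solution above), and in my experience the regex scan itself remains the slow part:

# whole-word matching: \b marks word boundaries, so the one-letter word
# "a" no longer matches inside "palabras"
# (run this on the original my_dt, before it was tokenized above)
my_word_pattern <- paste0("\\b(", my_string, ")\\b")
my_dt[str_detect(text, my_word_pattern), ]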

Now, to speed things up a bit more, you can set the number of threads data.table uses. By default (on my system) this is 2; you can check it with getDTthreads(). When I add one thread with setDTthreads(3), the new code returns in about 1.6 secs.
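For reference, the calls look like this (the thread counts are from my system and will differ on yours):

getDTthreads()  # how many threads data.table currently uses
[1] 2
setDTthreads(3) # let data.table use 3 threads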

Maybe someone can speed this up a bit more, e.g. by doing this within the .SD part of data.table.

answered Mar 21 '26 by phiver


