
Efficiently counting occurrences of a word list in a column using R

If I have a word list, how do I efficiently count the number of occurrences of these words in a dataset?

An example:

set.seed(123) 
df_data <- data.frame(   
   data_strings = sample(c("tom smith", "smith jim", "sam sam", "ted", "xxx"), 10, replace = TRUE)
)

df_names <- data.frame(
   names = c("tom", "jim", "sam", "ted", "yyy")
)

That is:

> df_data
   data_strings
1       sam sam
2       sam sam
3     smith jim
4     smith jim
5       sam sam
6           xxx
7           ted
8     tom smith
9     smith jim
10      sam sam

and

> df_names
  names
1   tom
2   jim
3   sam
4   ted
5   yyy

I can do this with str_count from the stringr package:

library(stringr)
library(tictoc)
tic()
df_data$counts <- as.vector(sapply(
  paste(df_names[,"names"], collapse='|'), 
  str_count, 
  string=df_data$data_strings
))
toc()

This produces the desired result:

> df_data
   data_strings counts
1       sam sam      2
2       sam sam      2
3     smith jim      1
4     smith jim      1
5       sam sam      2
6           xxx      0
7           ted      1
8     tom smith      1
9     smith jim      1
10      sam sam      2

However, my real data contains millions of rows, and my word list is also in the millions, so this turns out to be a very inefficient way to get the result. How can I speed it up? I tried to utilize more cores with the parallel package, but it finishes in the same time (it only uses one core, although I tell it to use more than one). I'm on Windows, so I cannot test mclapply(). The parallel package itself seems to be working correctly, since I can get it to use more cores on other examples.

library(stringr)
library(parallel)
library(tictoc)

cl <- makeCluster(4, type = "PSOCK")
tic()
df_data$counts <- as.vector(parSapply(
  cl = cl,
  paste(df_names[,"names"], collapse='|'),
  FUN=str_count, 
  string=df_data$data_strings
))
toc()
stopCluster(cl)

What other approaches could I try? Something with data.table? Can the paste() inside the apply be done differently?
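One likely reason only a single core is used: parSapply() distributes work over its X argument, and here X is the single collapsed pattern, so there is only one task to hand out. A minimal sketch that splits the rows into one chunk per worker instead (the chunk count of 4 and the splitting scheme are illustrative choices, not the only way to do it):

library(stringr)
library(parallel)

cl <- makeCluster(4, type = "PSOCK")
pattern <- paste(df_names$names, collapse = "|")
strings <- as.character(df_data$data_strings)  # coerce factors up front

# one chunk of rows per worker, so parLapply() has several tasks to distribute
chunks <- split(strings, cut(seq_along(strings), 4, labels = FALSE))

# stringr:: is resolved on each worker, so nothing needs to be exported manually
counts <- parLapply(cl, chunks, function(x, p) stringr::str_count(x, p), p = pattern)
df_data$counts <- unlist(counts, use.names = FALSE)

stopCluster(cl)

unlist() restores the original row order because split() keeps the rows in order within and across the sequential bins.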

Asked by Joshua


2 Answers

I'm not sure if it is faster on a dataset of the real size, but you can use quanteda, which has multicore support built in and should be pretty efficient in this case:

library(dplyr)
library(quanteda)
quanteda_options("threads" = 4) # choose how many threads are used

df_data$counts <- df_data %>%
  pull(data_strings) %>%
  dfm() %>%                               # construct document-feature matrix (quanteda >= 3 requires tokens() first)
  dfm_keep(pattern = df_names$names) %>%  # keep only the features that are names
  convert(to = "data.frame") %>%          # convert to data.frame
  select(-document) %>%                   # drop the non-numeric document-id column
  rowSums()                               # row sums = number of name matches per string

df_data
#>    data_strings counts
#> 1       sam sam      2
#> 2       sam sam      2
#> 3     smith jim      1
#> 4     smith jim      1
#> 5       sam sam      2
#> 6           xxx      0
#> 7           ted      1
#> 8     tom smith      1
#> 9     smith jim      1
#> 10      sam sam      2

Created on 2020-01-13 by the reprex package (v0.3.0)

Note that I set the option stringsAsFactors = FALSE while constructing the data.frames (it defaulted to TRUE before R 4.0.0); otherwise you will run into problems with factors.
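Concretely, constructing the question's example data with that option looks like this (a sketch of the note above; the sample values are from the question):

set.seed(123)
df_data <- data.frame(
  data_strings = sample(c("tom smith", "smith jim", "sam sam", "ted", "xxx"), 10, replace = TRUE),
  stringsAsFactors = FALSE  # keep the strings as character, not factor
)

df_names <- data.frame(
  names = c("tom", "jim", "sam", "ted", "yyy"),
  stringsAsFactors = FALSE
)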

I could imagine that this is faster if you have a lot of names in your set, but in my benchmarking stringr::str_count and stringi::stri_count_regex were faster with the small set of names you provided.
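A minimal sketch of such a comparison, assuming the bench package is installed (bench::mark() also verifies that both expressions return the same counts):

library(bench)
library(stringr)
library(stringi)

pattern <- paste(df_names$names, collapse = "|")

bench::mark(
  stringr = str_count(df_data$data_strings, pattern),
  stringi = stri_count_regex(df_data$data_strings, pattern)
)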

Answered by JBGruber


str_count() is already vectorized over its string argument, so you don't need sapply(); just use stringr::str_count(df_data$data_strings, paste(df_names$names, collapse = '|')).
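Spelled out on the question's example, that is simply:

library(stringr)

# one regex containing all names, applied to every row in a single vectorized call
pattern <- paste(df_names$names, collapse = "|")
df_data$counts <- str_count(df_data$data_strings, pattern)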

Answered by F. Privé