If I have a word list, how do I efficiently count the number of occurrences of these words in a dataset?
An example:
set.seed(123)
df_data <- data.frame(
  data_strings = sample(c("tom smith", "smith jim", "sam sam", "ted", "xxx"), 10, replace = TRUE)
)
df_names <- data.frame(
  names = c("tom", "jim", "sam", "ted", "yyy")
)
That is:
> df_data
data_strings
1 sam sam
2 sam sam
3 smith jim
4 smith jim
5 sam sam
6 xxx
7 ted
8 tom smith
9 smith jim
10 sam sam
and
> df_names
names
1 tom
2 jim
3 sam
4 ted
5 yyy
I can do this with str_count() from the stringr package:
library(stringr)
library(tictoc)
tic()
df_data$counts <- as.vector(sapply(
  paste(df_names[, "names"], collapse = '|'),  # one big alternation pattern: "tom|jim|sam|ted|yyy"
  str_count,
  string = df_data$data_strings
))
toc()
This produces the desired result:
> df_data
data_strings counts
1 sam sam 2
2 sam sam 2
3 smith jim 1
4 smith jim 1
5 sam sam 2
6 xxx 0
7 ted 1
8 tom smith 1
9 smith jim 1
10 sam sam 2
However, my real data contains millions of rows, and my word list is also in the millions, so this turns out to be a very inefficient way to get the result. How can I speed it up? I tried to utilize more cores with the parallel package, but it finishes in the same time (it only uses one core, although I tell it to use more than one). I'm on Windows, so I cannot test mclapply(). parallel itself seems to be working correctly, since I can get it to use more cores on other examples.
library(stringr)
library(parallel)
library(tictoc)
cl <- makeCluster(4, type = "PSOCK")  # PSOCK clusters also work on Windows
tic()
df_data$counts <- as.vector(parSapply(
  cl = cl,
  paste(df_names[, "names"], collapse = '|'),
  FUN = str_count,
  string = df_data$data_strings
))
toc()
stopCluster(cl)
What other approaches could I try? Something with data.table? Can the paste() inside the apply be done differently?
I'm not sure if it is faster on a dataset of your real size, but you can use quanteda, which has multicore support built in and should be pretty efficient in this case:
library(dplyr)
library(quanteda)
quanteda_options("threads" = 4) # choose how many threads are used
df_data$counts <- df_data %>%
  pull(data_strings) %>%
  dfm() %>%                              # construct document-feature matrix
  dfm_keep(pattern = df_names$names) %>% # keep features that are names
  convert(to = "data.frame") %>%         # convert to data.frame
  select(-document) %>%                  # remove non-numeric columns
  rowSums()                              # only keep sums
df_data
#> data_strings counts
#> 1 sam sam 2
#> 2 sam sam 2
#> 3 smith jim 1
#> 4 smith jim 1
#> 5 sam sam 2
#> 6 xxx 0
#> 7 ted 1
#> 8 tom smith 1
#> 9 smith jim 1
#> 10 sam sam 2
Created on 2020-01-13 by the reprex package (v0.3.0)
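A version note: newer quanteda releases deprecate calling dfm() directly on a character vector in favour of an explicit tokens() step, so on a current version the same pipeline would look roughly like this (a sketch; the logic is identical):
library(dplyr)
library(quanteda)
df_data$counts <- df_data %>%
  pull(data_strings) %>%
  tokens() %>%                           # tokenize explicitly first
  dfm() %>%                              # then build the document-feature matrix
  dfm_keep(pattern = df_names$names) %>% # keep features that are names
  convert(to = "data.frame") %>%
  select(-1) %>%                         # drop the document id column
  rowSums()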
Note that I set the option stringsAsFactors = FALSE while constructing the data.frames. Otherwise you will run into problems with factors.
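For example, the data from the question can be rebuilt like this (relevant for R versions before 4.0.0, where stringsAsFactors still defaulted to TRUE):
set.seed(123)
df_data <- data.frame(
  data_strings = sample(c("tom smith", "smith jim", "sam sam", "ted", "xxx"), 10, replace = TRUE),
  stringsAsFactors = FALSE  # keep the strings as character, not factor
)
df_names <- data.frame(
  names = c("tom", "jim", "sam", "ted", "yyy"),
  stringsAsFactors = FALSE
)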
I could imagine that this is faster if you have a lot of names in your set. But in my benchmarking, stringr::str_count() and stringi::stri_count_regex() were faster with the small set of names you provided.
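For completeness, a minimal sketch of the stringi variant mentioned above; stri_count_regex() takes the strings as its first argument and the pattern as its second, and is vectorized in the same way as str_count():
library(stringi)
# count matches of the collapsed name pattern in every string at once
df_data$counts <- stri_count_regex(
  df_data$data_strings,
  paste(df_names$names, collapse = "|")
)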
str_count() is already vectorized, so you don't need sapply(); just use stringr::str_count(df_data$data_strings, paste(df_names$names, collapse = '|')).
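That is, the whole computation collapses to a single vectorized call (a sketch, reusing the tictoc timing from the question):
library(stringr)
library(tictoc)
tic()
# str_count() recycles the single collapsed pattern across all strings,
# so no sapply() wrapper or cluster is needed
df_data$counts <- str_count(
  df_data$data_strings,
  paste(df_names$names, collapse = "|")
)
toc()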