I'm attempting to do word matching in a large dataset, and I'm wondering if there is a way to speed up the slowest operation in my workflow.
What I aim to do is find the locations of the matches between a dictionary of words and a list of word vectors.
words <- c("cat", "dog", "snake", "cow")
scores <- c(1.5, 0.7, 3.5, 4.6)
dic <- data.frame(words, scores)
wordList <- list(c("jiraffe", "dog"), c("cat", "elephant"), c("snake", "cow"))
The fastest way I have found so far is this:
matches <- function(wordList) {
  subD <- which(dic$words %in% wordList)
  subD
}
My desired output is:
lapply(wordList, matches)
list(c(2), c(1), c(3, 4))
which I can later use to get the average score per wordList cell by doing
averageScore <- sapply(matches, function(x) mean(dic[x, "scores"]))
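For reference, the full baseline pipeline I'm benchmarking against looks roughly like this (matchList is just my name for the intermediate list of match indices):

matchList <- lapply(wordList, matches)    ## list(2, 1, c(3, 4))
averageScore <- sapply(matchList, function(x) mean(dic[x, "scores"]))
averageScore    ## 0.70 1.50 4.05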
Is there a faster way of doing the string matching than what I am doing in the function:
subD <- which(dic$words %in% wordList)
I have also tried the dplyr way, thinking it might be faster: first using "filter" to get a subset of "dic" and then applying "colMeans" to it, but it turned out to be about twice as slow.
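(Roughly, the dplyr attempt looked something like this, applied per cell; my exact code may have differed slightly:)

library(dplyr)
dplyrScore <- sapply(wordList, function(w) {
  dic %>% filter(words %in% w) %>% select(scores) %>% colMeans()
})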
Also, running my matches function in a loop is just as slow as using "lapply" on it.
Am I missing something? Is there a way that is faster than both?
Here's one option:
library(data.table)
nn <- lengths(wordList) ## Or, for < R-3.2.0, `nn <- sapply(wordList, length)`
## one row per word, with a group id recording which wordList cell it came from
dt <- data.table(grp = rep(seq_along(nn), times = nn), X = unlist(wordList), key = "grp")
## look up each word's score with data.table's fast character matching (chmatch)
dt[, Score := scores[chmatch(X, words)]]
## average the matched scores within each group
dt[!is.na(Score), list(avgScore = mean(Score)), by = "grp"]
# grp avgScore
# 1: 1 0.70
# 2: 2 1.50
# 3: 3 4.05
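The gain comes from doing a single vectorized chmatch over the whole unlisted word vector instead of one %in% lookup per cell. If you want to check it on your own data, here is a rough benchmark sketch (assuming the microbenchmark package; on the toy example above the data.table setup overhead will dominate, the speedup shows up on large inputs):

library(microbenchmark)
microbenchmark(
  base = sapply(lapply(wordList, matches), function(x) mean(dic[x, "scores"])),
  dtbl = {
    dtb <- data.table(grp = rep(seq_along(lengths(wordList)), times = lengths(wordList)),
                      X = unlist(wordList), key = "grp")
    dtb[, Score := scores[chmatch(X, words)]]
    dtb[!is.na(Score), list(avgScore = mean(Score)), by = "grp"]
  },
  times = 100L
)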