 

Fast string matching in R

Tags: string, r

I'm attempting to do word matching in a large dataset, and I'm wondering if there is a way of speeding up the slowest operation in my workflow.

What I aim to do is to find the locations of the matches between a dictionary of words and a list of word vectors.

words <- c("cat", "dog", "snake", "cow")
scores <- c(1.5, 0.7, 3.5, 4.6)
dic <- data.frame(words, scores)

wordList <- list(c("jiraffe", "dog"), c("cat", "elephant"), c("snake", "cow"))

The fastest way I have found so far is this function, which I apply to one element of wordList at a time:

matches <- function(wordVec) {
    which(dic$words %in% wordVec)
}

My desired output (one vector of match positions per element of wordList) is:

matchList <- lapply(wordList, matches)
# list(2, 1, c(3, 4))

which I can later use to get the average score per wordList cell by doing

averageScore <- sapply(matchList, function(x) mean(dic[x, "scores"]))

Is there a faster way of doing the string matching than the core operation in my function:

which(dic$words %in% wordVec)

I have tried the dplyr way, thinking it might be faster, first using "filter" to get a subset of "dic" and then applying "colMeans" to it, but it seems to be twice as slow.
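
A sketch of that dplyr attempt (the exact code isn't shown here, so the verbs below are only an approximation of the description):

library(dplyr)
# Approximate reconstruction of the slower dplyr attempt: filter dic per
# list element, then average the matched scores
averageScoreDplyr <- sapply(wordList, function(w) {
    dic %>%
        filter(words %in% w) %>%
        summarise(avg = mean(scores)) %>%
        .$avg
})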

Also, running my matches function in a loop is just as slow as using "lapply" on it.
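
A minimal sketch of how the two variants can be timed against each other (assuming the microbenchmark package is available):

library(microbenchmark)
# Compare an explicit for-loop against lapply over the same word vectors
microbenchmark(
    loop = {
        res <- vector("list", length(wordList))
        for (i in seq_along(wordList)) res[[i]] <- matches(wordList[[i]])
    },
    lapply = lapply(wordList, matches),
    times = 100L
)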

Am I missing something? Is there a way that is faster than both?

asked May 06 '15 by alexvicegrab

1 Answer

Here's one option:

library(data.table)
nn <- lengths(wordList)  ## or, for R < 3.2.0, nn <- sapply(wordList, length)
## one row per word, keyed by the list element ("grp") it came from
dt <- data.table(grp = rep(seq_along(nn), times = nn), X = unlist(wordList), key = "grp")
## chmatch() is data.table's fast match() for character vectors
dt[, Score := scores[chmatch(X, words)]]
## average score per group, dropping words with no dictionary match
dt[!is.na(Score), list(avgScore = mean(Score)), by = "grp"]
#    grp avgScore
# 1:   1     0.70
# 2:   2     1.50
# 3:   3     4.05
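
The main gain comes from flattening wordList once into a single keyed table and doing one vectorised lookup, instead of scanning the dictionary separately for each list element. For comparison, a roughly equivalent base-R sketch of the same flatten-once idea (not from the original answer, shown only to illustrate the approach):

grp <- rep(seq_along(wordList), lengths(wordList))  ## group id per word
sc  <- scores[match(unlist(wordList), words)]       ## NA where no dictionary hit
tapply(sc, grp, mean, na.rm = TRUE)
##    1    2    3
## 0.70 1.50 4.05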
answered Nov 15 '22 by Josh O'Brien