I'm using the 'agrep' function in R, which returns a vector of matches. I would like a function similar to agrep that returns only the best match, or all of the best matches if there are ties. Currently, I do this by calling the 'sdists()' function from the package 'cba' on each element of the resulting vector, but this seems very redundant.
Edit: here is the function I'm currently using. I'd like to speed it up, since it calculates the distance twice (once inside agrep and once in sdists).
    library(cba)
    word <- 'test'
    words <- c('Teest','teeeest','New York City','yeast','text','Test')
    ClosestMatch <- function(string,StringVector) {
      matches <- agrep(string,StringVector,value=TRUE)
      distance <- sdists(string,matches,method = "ow",weight = c(1, 0, 2))
      matches <- data.frame(matches,as.numeric(distance))
      matches <- subset(matches,distance==min(distance))
      as.character(matches$matches)
    }
    ClosestMatch(word,words)
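The agrep function (part of base R, not a package) uses the generalized Levenshtein edit distance to match strings. The package RecordLinkage has a C function to calculate the Levenshtein distance, which can be used directly to speed up your computation. Here is a reworked ClosestMatch function that is around 10x faster. Note that levenshteinSim returns a similarity rather than a distance, so the best matches are the ones with the maximum value: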
    library(RecordLinkage)
    ClosestMatch2 = function(string, stringVector){
      distance = levenshteinSim(string, stringVector);
      stringVector[distance == max(distance)]
    }
The RecordLinkage package was removed from CRAN; use stringdist instead:
    library(stringdist)
    ClosestMatch2 = function(string, stringVector){
      stringVector[amatch(string, stringVector, maxDist=Inf)]
    }
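One caveat: amatch() returns only the position of the first closest match, so the stringdist version above drops ties, which the question asked to keep. Below is a minimal sketch that keeps them. It swaps in base R's adist() (from utils, so no extra package is needed) and is not part of the original answer; the name ClosestMatchTies is hypothetical. stringdist::stringdist() could probably be used the same way.

```r
# Sketch: return every string tied for the smallest edit distance.
# Assumption: plain, case-sensitive Levenshtein distance via utils::adist().
ClosestMatchTies <- function(string, stringVector) {
  # adist() returns a 1 x length(stringVector) matrix; flatten it to a vector
  distance <- as.vector(adist(string, stringVector))
  # keep all candidates whose distance equals the minimum
  stringVector[distance == min(distance)]
}

words <- c('Teest','teeeest','New York City','yeast','text','Test')
ClosestMatchTies('test', words)
# returns "text" "Test" (both are one edit away from 'test')
```

Because the comparison is case-sensitive, 'Test' counts as one substitution away from 'test'; passing ignore.case = TRUE to adist() would treat it as an exact match.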