Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Very Fast string fuzzy matching in R

Tags:

matching

r

fuzzy

I have a set of 40.000 rows x 4 columns and I need to compare each column to itself in order to find the most closest result or the minimum levenshtein distance. The idea is to get an "almost duplicate" for every row. I have calculated with "adist" but seems too slow. For example, for only one column, 5.000 rows compared to all column dataset, 40.000 rows, takes almost 2 hours. This is, for 4 columns, 8 hours, and for the entire dataset, 32 hours. Is there any faster way to achieve the same? I need it to be in 1 or 2 hours if possible. This is an example of what have I done so far:


    #vector example
    a<-as.character(c("hello","allo","hola"))
    b<-as.character(c("hello","allo","hola"))
    
    #execution time
    start_time <- Sys.time()
    
    #Matrix with distance
    dist.name<-adist(a,b, partial = TRUE, ignore.case = TRUE)
    
    #time elapsed
    end_time <- Sys.time()
    end_time - start_time
    
    Output:
    Time difference of 5.873202 secs
    
    #result
    dist.name
          [,1] [,2] [,3]
    [1,]    0    4    5
    [2,]    2    0    2
    [3,]    5    4    0

desired output (minimum distance for every row, but no for the same row), but I need it faster.

[1,] 4
[2,] 2
[3,] 4
like image 294
ecp Avatar asked May 10 '19 06:05

ecp


People also ask

How do I make fuzzy match in R?

Often you may want to join together two datasets in R based on imperfectly matching strings. This is sometimes called fuzzy matching. The easiest way to perform fuzzy matching in R is to use the stringdist_join() function from the fuzzyjoin package.

Is Fuzzy Wuzzy slow?

FuzzyWuzzy package is a Levenshtein distance based method which widely used in computing similarity scores of strings. But why we should not use it? The answer is simple: it is way too slow. The estimated time of computing similarity scores for a 406,000-entity dataset of addresses is 337 hours.

Is fuzzy matching good?

Fuzzy string matching can help improve data quality and accuracy by data deduplication, identification of false-positives etc.

What is levenshtein fuzzy matching?

The Levenshtein Distance (LD) is one of the fuzzy matching techniques that measure between two strings, with the given number representing how far the two strings are from being an exact match. The higher the number of the Levenshtein edit distance, the further the two terms are from being identical.


1 Answers

You could try stringsdist-package.

It's written in C, uses parallel processing and offers various distance metrics, including levenshtein-distance.

library(stringdist)

a<-as.character(c("hello","allo","hola"))
b<-as.character(c("hello","allo","hola"))

start_time <- Sys.time()
res <- stringdistmatrix(a,b, method = "lv")
end_time <- Sys.time()

> end_time - start_time
Time difference of 0.006981134 secs
> res
     [,1] [,2] [,3]
[1,]    0    2    3
[2,]    2    0    3
[3,]    3    3    0


diag(res) <- NA
apply(res, 1, FUN = min, na.rm = T)
[1] 2 2 3
like image 191
Humpelstielzchen Avatar answered Nov 14 '22 23:11

Humpelstielzchen