Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Similarity scores based on string comparison in R (edit distance)

Tags:

I am trying to assign similarity score based on comparison between 2 strings. Is there a function for the same in R. I am aware of such a function in SAS by the name of SPEDIS. Please let me know if there is such a function in R.

like image 672
Kunal Batra Avatar asked Jul 18 '12 06:07

Kunal Batra


People also ask

How do you compare similarity between two strings?

The way to check the similarity between any data point or groups is by calculating the distance between those data points. In textual data as well, we check the similarity between the strings by calculating the distance between one text to another text.

How do you calculate normalized levenshtein distance?

If you want the result to be in the range [0, 1] , you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.

How does levenshtein distance work?

How Does the Levenshtein Distance Work? The Levenshtein distance is a similarity measure between words. Given two words, the distance measures the number of edits needed to transform one word into another.

How do you check if two strings are similar in Python?

The simplest way to check if two strings are equal in Python is to use the == operator. And if you are looking for the opposite, then != is what you need. That's it!


1 Answers

The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).

The levenshteinSim function in the RecordLinkage package also does this directly, and might be faster than adist.

library(RecordLinkage) > levenshteinSim("apple", "apple") [1] 1 > levenshteinSim("apple", "aaple") [1] 0.8 > levenshteinSim("apple", "appled") [1] 0.8333333 > levenshteinSim("appl", "apple") [1] 0.8 

ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:

> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)                                          test replications elapsed relative 1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1   user.self sys.self user.child sys.child 1     3.583    0.452          0         0 > benchmark(adist("applesauce", "aaplesauce"), replications=100000)                                test replications elapsed relative user.self 1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707   sys.self user.child sys.child 1    0.461          0         0 > benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)                                         test replications elapsed relative 1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1   user.self sys.self user.child sys.child 1      6.49    0.743          0         0 

This overhead is due simply to the code for levenshteinSim, which is just a wrapper around levenshteinDist:

> levenshteinSim function (str1, str2)  {     return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1),          nchar(str2)))) } 

FYI: if you are always comparing two strings rather than vectors, you can create a new version that uses max instead of pmax and shave ~25% off the running time:

mylevsim = function (str1, str2)  {     return(1 - (levenshteinDist(str1, str2)/max(nchar(str1),          nchar(str2)))) } > benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)                                   test replications elapsed relative user.self 1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987   sys.self user.child sys.child 1    0.627          0         0 

Long story short- there is little difference between adist and levenshteinDist in terms of performance, though the former is preferable if you don't want to add package dependencies. How you turn it into a similarity measure does have a bit of an effect on performance.

like image 194
David Robinson Avatar answered Oct 25 '22 12:10

David Robinson