I am trying to assign similarity score based on comparison between 2 strings. Is there a function for the same in R. I am aware of such a function in SAS by the name of SPEDIS. Please let me know if there is such a function in R.
The way to check the similarity between any data point or groups is by calculating the distance between those data points. In textual data as well, we check the similarity between the strings by calculating the distance between one text to another text.
If you want the result to be in the range [0, 1] , you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.
How Does the Levenshtein Distance Work? The Levenshtein distance is a similarity measure between words. Given two words, the distance measures the number of edits needed to transform one word into another.
The simplest way to check if two strings are equal in Python is to use the == operator. And if you are looking for the opposite, then != is what you need. That's it!
The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).
The levenshteinSim
function in the RecordLinkage package also does this directly, and might be faster than adist
.
library(RecordLinkage) > levenshteinSim("apple", "apple") [1] 1 > levenshteinSim("apple", "aaple") [1] 0.8 > levenshteinSim("apple", "appled") [1] 0.8333333 > levenshteinSim("appl", "apple") [1] 0.8
ETA: Interestingly, while levenshteinDist
in the RecordLinkage package appears to be slightly faster than adist
, levenshteinSim
is considerably slower than either. Using the rbenchmark package:
> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000) test replications elapsed relative 1 levenshteinDist("applesauce", "aaplesauce") 100000 4.012 1 user.self sys.self user.child sys.child 1 3.583 0.452 0 0 > benchmark(adist("applesauce", "aaplesauce"), replications=100000) test replications elapsed relative user.self 1 adist("applesauce", "aaplesauce") 100000 4.277 1 3.707 sys.self user.child sys.child 1 0.461 0 0 > benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000) test replications elapsed relative 1 levenshteinSim("applesauce", "aaplesauce") 100000 7.206 1 user.self sys.self user.child sys.child 1 6.49 0.743 0 0
This overhead is due simply to the code for levenshteinSim
, which is just a wrapper around levenshteinDist
:
> levenshteinSim function (str1, str2) { return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1), nchar(str2)))) }
FYI: if you are always comparing two strings rather than vectors, you can create a new version that uses max
instead of pmax
and shave ~25% off the running time:
mylevsim = function (str1, str2) { return(1 - (levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2)))) } > benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000) test replications elapsed relative user.self 1 mylevsim("applesauce", "aaplesauce") 100000 5.608 1 4.987 sys.self user.child sys.child 1 0.627 0 0
Long story short- there is little difference between adist
and levenshteinDist
in terms of performance, though the former is preferable if you don't want to add package dependencies. How you turn it into a similarity measure does have a bit of an effect on performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With