Similarity scores based on string comparison in R (edit distance)

Tags:

I am trying to assign similarity score based on comparison between 2 strings. Is there a function for the same in R. I am aware of such a function in SAS by the name of SPEDIS. Please let me know if there is such a function in R.

672

asked Jul 18 '12 06:07

Kunal Batra

1 Answers

The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).

The levenshteinSim function in the RecordLinkage package also does this directly, and might be faster than adist.

library(RecordLinkage) > levenshteinSim("apple", "apple") [1] 1 > levenshteinSim("apple", "aaple") [1] 0.8 > levenshteinSim("apple", "appled") [1] 0.8333333 > levenshteinSim("appl", "apple") [1] 0.8

ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:

> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)                                          test replications elapsed relative 1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1   user.self sys.self user.child sys.child 1     3.583    0.452          0         0 > benchmark(adist("applesauce", "aaplesauce"), replications=100000)                                test replications elapsed relative user.self 1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707   sys.self user.child sys.child 1    0.461          0         0 > benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)                                         test replications elapsed relative 1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1   user.self sys.self user.child sys.child 1      6.49    0.743          0         0

This overhead is due simply to the code for levenshteinSim, which is just a wrapper around levenshteinDist:

> levenshteinSim function (str1, str2)  {     return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1),          nchar(str2)))) }

FYI: if you are always comparing two strings rather than vectors, you can create a new version that uses max instead of pmax and shave ~25% off the running time:

mylevsim = function (str1, str2)  {     return(1 - (levenshteinDist(str1, str2)/max(nchar(str1),          nchar(str2)))) } > benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)                                   test replications elapsed relative user.self 1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987   sys.self user.child sys.child 1    0.627          0         0

Long story short- there is little difference between adist and levenshteinDist in terms of performance, though the former is preferable if you don't want to add package dependencies. How you turn it into a similarity measure does have a bit of an effect on performance.

194

answered Oct 25 '22 12:10

David Robinson

Related questions
                            
                                SVG to PNG Server side - using node.js
                            
                                Combine a list of data.tables
                            
                                Android - How to set a custom font for whole app [duplicate]
                            
                                How to get the exact full path to a file in Perl
                            
                                Is there any way to debug c++ dll called from C# DllImport?
                            
                                Django update table using data from another table
                            
                                Can I convert embedded base64 encode font to a font file?
                            
                                Finding the nth occurrence of a character in a string in javascript
                            
                                Aligning div right next to the list items
                            
                                Why does WebStorm's LiveEdit not work in Chrome?
                            
                                jQuery sum of input values in sections
                            
                                Javascript Error Null is not an Object

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With