I am trying to use the Jaro-Winkler similarity to check whether two strings are similar. I tried two libraries, jellyfish and pyjarowinkler, to compare the words carol and elephant. The results do not agree:
import jellyfish
jellyfish.jaro_winkler('Carol','elephant')
returns 0.4416666, while
from pyjarowinkler import distance
distance.get_jaro_distance('Carol','elephant')
returns 0.0, which makes more sense to me.
Is there a bug in one of the two libraries?
The Jellyfish implementation is correct.
Carol and elephant have no matching prefix, so the Jaro-Winkler similarity is equal to the Jaro similarity in this case. I calculated the Jaro similarity by hand and found that the Jellyfish result is correct. There is an online calculator, but it is also wrong. I also found other implementations, such as the python-Levenshtein package, which also implements Jaro-Winkler, and they confirmed my calculation. There is also an implementation on npm. If you would like to compute the score on your own, you can find the paper here
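If you want to check the hand calculation yourself, here is a minimal sketch of the textbook Jaro and Jaro-Winkler definitions in plain Python. It is not the code of Jellyfish or any other library; the function names `jaro` and `jaro_winkler` and the parameters `p` and `max_prefix` are my own, and I am assuming the commonly used scaling factor of 0.1 and a prefix cap of 4.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity following the textbook definition (illustrative sketch)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and at most
    # `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order,
    # then halve that count to get the number of transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler: Jaro plus a bonus for a shared prefix (assumed p=0.1, cap 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler('Carol', 'elephant'), 7))  # 0.4416667
```

The only character that matches within the window is the l, so there is 1 match and 0 transpositions, which gives (1/5 + 1/8 + 1/1) / 3 ≈ 0.4417. Since the strings share no prefix, the Winkler bonus is zero and the score stays at the Jaro value.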
Perhaps worth noting that two different implementations in R match the Jellyfish result, not the pyjarowinkler one:
> library(stringdist)
> 1 - stringdist("Elephant", "Carol", method = 'jw')
[1] 0.4416667
> library(RecordLinkage)
> jarowinkler('Carol','elephant')
[1] 0.4416667