
jellyfish vs pyjarowinkler

I am trying to use the Jaro-Winkler similarity to check whether two strings are similar. I tried both libraries on the words Carol and elephant, and the results do not agree:

import jellyfish

jellyfish.jaro_winkler('Carol','elephant') 

returns 0.4416666, while

from pyjarowinkler import distance

distance.get_jaro_distance('Carol','elephant')

returns 0.0 which makes more sense to me.

Is there a bug in one of the two libraries?

turtle_in_mind asked Jan 24 '18 17:01


2 Answers

The jellyfish implementation is correct.

Carol and elephant do not share a matching prefix, so the Jaro-Winkler similarity equals the plain Jaro similarity in this case. I calculated the Jaro similarity by hand and found that jellyfish returns the correct value. There is an online calculator, but it is also wrong. I also checked other implementations, such as the python-Levenshtein package, which also implements Jaro-Winkler, and they agreed with my calculation. There is also an implementation on npm. If you would like to compute the score on your own, you can find the paper here
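To make the hand calculation reproducible, here is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler definitions. It is not the code of jellyfish or pyjarowinkler; the function names `jaro` and `jaro_winkler`, the scaling factor `p = 0.1`, and the 4-character prefix cap are the usual textbook choices and are assumptions here. For Carol and elephant the only match inside the window is the letter l, which gives (1/5 + 1/8 + 1) / 3 ≈ 0.4417.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (case-sensitive), in the range 0..1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk both strings' matched characters in order and count mismatched pairs;
    # the number of transpositions is half of that count.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (capped at 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("Carol", "elephant"), 7))  # 0.4416667
```

Because the two words share no prefix, the Winkler adjustment adds nothing and the result is the plain Jaro value, which agrees with the number jellyfish returns.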

Bierbarbar answered Sep 17 '22 11:09


Perhaps worth noting that two different implementations in R match the jellyfish result, not the pyjarowinkler one:

> library(stringdist)
> 1 - stringdist("Elephant", "Carol", method = 'jw')
[1] 0.4416667

> library(RecordLinkage)
> jarowinkler('Carol','elephant')
[1] 0.4416667
AidanGawronski answered Sep 19 '22 11:09