
How python-Levenshtein.ratio is computed

According to the python-Levenshtein.ratio source:

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722

it's computed as (lensum - ldist) / lensum. This works for

# pip install python-Levenshtein
import Levenshtein

Levenshtein.distance('ab', 'a')  # returns 1
Levenshtein.ratio('ab', 'a')     # returns 0.666666

However, it seems to break with

Levenshtein.distance('ab', 'ac')  # returns 1
Levenshtein.ratio('ab', 'ac')     # returns 0.5

I feel I must be missing something very simple... but why not 0.75?
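To spell out the arithmetic I had in mind (my own worked numbers, applying the formula above with every edit costing 1):

import Levenshtein

# My expectation: every edit operation costs 1.
lensum = len('ab') + len('ac')            # 2 + 2 = 4
ldist = Levenshtein.distance('ab', 'ac')  # 1 (one substitution)
print((lensum - ldist) / lensum)          # 0.75 -- yet ratio() returns 0.5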

asked Jan 10 '13 by cjauvin


2 Answers

The Levenshtein distance for 'ab' and 'ac' is computed as below:

(image: edit distance computation for 'ab' and 'ac')

so the alignment is:

a c
a b

Alignment length = 2
Number of mismatches = 1

The Levenshtein distance is 1 because only one substitution is required to transform 'ac' into 'ab' (or the reverse).

Distance ratio = (Levenshtein distance) / (alignment length) = 1/2 = 0.5
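A minimal sketch of that calculation (my own illustration, not library code; I use the longer string's length as the alignment length, which is the same thing here because no gaps are needed):

import Levenshtein

def distance_ratio(a, b):
    # Edits divided by the alignment length (approximated by the longer length,
    # which matches the hand alignment above when no gaps are involved).
    return Levenshtein.distance(a, b) / max(len(a), len(b))

print(distance_ratio('ab', 'ac'))  # 1 / 2 = 0.5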

EDIT

You wrote:

(lensum - ldist) / lensum = (1 - ldist/lensum) = 1 - 0.5 = 0.5.

But this is matching (not distance). In the REFERENCE you may notice it is written as:

Matching %

p = (1 - l/m) × 100 

where l is the Levenshtein distance and m is the length of the longer of the two words:

(Note: some authors use the longer of the two words; I used the alignment length.)

(1 - 3/7) × 100 = 57.14...

Word 1    Word 2    RATIO    Mis-Match    Match%
AB        AB        0        0            (1 - 0/2)*100 = 100%
CD        AB        1        2            (1 - 2/2)*100 = 0%
AB        AC        .5       1            (1 - 1/2)*100 = 50%
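The Match% column can be reproduced with a short sketch (my own code, assuming m is the longer word's length):

import Levenshtein

def match_percent(w1, w2):
    # p = (1 - l/m) * 100, with l the Levenshtein distance and m the longer length.
    l = Levenshtein.distance(w1, w2)
    m = max(len(w1), len(w2))
    return (1 - l / m) * 100

for w1, w2 in [('AB', 'AB'), ('CD', 'AB'), ('AB', 'AC')]:
    print(w1, w2, match_percent(w1, w2))  # 100.0, 0.0, 50.0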

Why do some authors divide by the alignment length and others by the maximum length of the two strings? Because Levenshtein does not consider gaps. Distance = number of edits (insertions + deletions + replacements), while the Needleman–Wunsch algorithm, the standard global alignment algorithm, does consider gaps. This (gaps) is the difference between Needleman–Wunsch and Levenshtein, which is why many papers use the maximum length of the two sequences (BUT THIS IS MY OWN UNDERSTANDING, AND I AM NOT 100% SURE).

Here is an IEEE Transactions on Pattern Analysis paper: Computation of Normalized Edit Distance and Applications. In this paper the normalized edit distance is defined as follows:

Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P)/L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (the length of P).
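Restated as a formula (my own transcription of the quoted definition, with P ranging over the editing paths between X and Y):

\[
d(X, Y) \;=\; \min_{P} \frac{W(P)}{L(P)}
\]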

answered Oct 05 '22 by Grijesh Chauhan


By looking more carefully at the C code, I found that this apparent contradiction is due to the fact that ratio treats the "replace" edit operation differently from the other operations (i.e., with a cost of 2), whereas distance treats them all the same, with a cost of 1.

This can be seen in the calls to the internal levenshtein_common function made within the ratio_py function:


https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L727

static PyObject*
ratio_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "ratio", 1, &lensum)) < 0) // Call
    return NULL;

  if (lensum == 0)
    return PyFloat_FromDouble(1.0);

  return PyFloat_FromDouble((double)(lensum - ldist)/(lensum));
}

and by the distance_py function:

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L715

static PyObject*
distance_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "distance", 0, &lensum)) < 0)
    return NULL;

  return PyInt_FromLong((long)ldist);
}

which ultimately results in different cost arguments being sent to another internal function, lev_edit_distance, which has the following doc snippet:

@xcost: If nonzero, the replace operation has weight 2, otherwise all
        edit operations have equal weights of 1.

Code of lev_edit_distance():

/**
 * lev_edit_distance:
 * @len1: The length of @string1.
 * @string1: A sequence of bytes of length @len1, may contain NUL characters.
 * @len2: The length of @string2.
 * @string2: A sequence of bytes of length @len2, may contain NUL characters.
 * @xcost: If nonzero, the replace operation has weight 2, otherwise all
 *         edit operations have equal weights of 1.
 *
 * Computes Levenshtein edit distance of two strings.
 *
 * Returns: The edit distance.
 **/
_LEV_STATIC_PY size_t
lev_edit_distance(size_t len1, const lev_byte *string1,
                  size_t len2, const lev_byte *string2,
                  int xcost)
{
  size_t i;
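To tie this back to Python, here is a minimal sketch (my own code, not part of the library) of a dynamic-programming edit distance in which a replacement costs 2, mirroring the xcost behaviour documented above; plugging its result into (lensum - ldist) / lensum reproduces ratio:

import Levenshtein

def weighted_distance(s1, s2):
    # Classic DP edit distance, except a replacement costs 2
    # (insertions and deletions cost 1), as with xcost != 0.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            sub = 0 if c1 == c2 else 2
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + sub))   # match / replacement
        prev = cur
    return prev[-1]

a, b = 'ab', 'ac'
lensum = len(a) + len(b)                 # 4
ldist = weighted_distance(a, b)          # 2 (one replacement, counted as 2)
print((lensum - ldist) / lensum)         # 0.5
print(Levenshtein.ratio(a, b))           # 0.5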

[ANSWER]

So in my example,

ratio('ab', 'ac') implies a replacement operation (cost of 2), over the total length of the strings (4), hence (4 - 2) / 4 = 0.5.

That explains the "how", I guess the only remaining aspect would be the "why", but for the moment I'm satisfied with this understanding.

answered Oct 05 '22 by cjauvin