According to the python-Levenshtein.ratio source (https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722), it's computed as (lensum - ldist) / lensum. This works for:
```python
# pip install python-Levenshtein
import Levenshtein

Levenshtein.distance('ab', 'a')  # returns 1
Levenshtein.ratio('ab', 'a')     # returns 0.666666
```
However, it seems to break with:

```python
Levenshtein.distance('ab', 'ac')  # returns 1
Levenshtein.ratio('ab', 'ac')     # returns 0.5
```
I feel I must be missing something very simple, but why not (lensum - ldist) / lensum = (4 - 1) / 4 = 0.75?
The Levenshtein distance is a lexical similarity measure: a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
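To make that definition concrete, here is a minimal sketch of the textbook distance, where every insertion, deletion and substitution costs 1; the helper name levenshtein_distance is illustrative only and is not part of the python-Levenshtein API:

```python
def levenshtein_distance(s1, s2):
    """Textbook Levenshtein distance: all edit operations cost 1."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution (or match)
        previous = current
    return previous[-1]

print(levenshtein_distance('ab', 'a'))   # 1
print(levenshtein_distance('ab', 'ac'))  # 1
```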
The Levenshtein distance for 'ab' and 'ac' is computed as below. The alignment is:

```
a c
a b
```
Alignment length = 2
Number of mismatches = 1

The Levenshtein distance is 1, because only one substitution is required to transform 'ac' into 'ab' (or the reverse).

Distance ratio = (Levenshtein distance) / (alignment length) = 1/2 = 0.5
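For these two-character strings the alignment contains no gaps, so the alignment length is simply 2, and the distance ratio can be checked directly (a small sketch, assuming python-Levenshtein is installed):

```python
import Levenshtein

dist = Levenshtein.distance('ab', 'ac')  # 1
alignment_length = 2                     # 'ab' and 'ac' aligned position by position
print(dist / alignment_length)           # 0.5
```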
EDIT:

You are writing (lensum - ldist) / lensum = (1 - ldist/lensum) = 1 - 0.5 = 0.5, but this is matching (not distance).
REFERENCE: you may notice it is written as

Matching %: p = (1 - l/m) × 100

where l is the Levenshtein distance and m is the length of the longest of the two words. (Notice: some authors use the longest of the two; I used the alignment length.) For example:

(1 - 3/7) × 100 = 57.14...

| Word 1 | Word 2 | RATIO | Mis-Match | Match % |
|--------|--------|-------|-----------|---------|
| AB     | AB     | 0     | 0         | (1 - 0/2) × 100 = 100% |
| CD     | AB     | 1     | 2         | (1 - 2/2) × 100 = 0%   |
| AB     | AC     | 0.5   | 1         | (1 - 1/2) × 100 = 50%  |
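Here is a small sketch reproducing the table above, taking m as the length of the longest word (which coincides with the alignment length for these equal-length pairs); it relies only on Levenshtein.distance:

```python
import Levenshtein

pairs = [('AB', 'AB'), ('CD', 'AB'), ('AB', 'AC')]
for w1, w2 in pairs:
    l = Levenshtein.distance(w1, w2)  # number of mismatches for these equal-length pairs
    m = max(len(w1), len(w2))         # length of the longest of the two words
    match_pct = (1 - l / m) * 100
    print(w1, w2, l, f'{match_pct:.0f}%')
# AB AB 0 100%
# CD AB 2 0%
# AB AC 1 50%
```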
Why do some authors divide by the alignment length and others by the max length of the two strings? Because Levenshtein doesn't consider gaps. Distance = number of edits (insertions + deletions + replacements), while the Needleman–Wunsch algorithm, the standard global alignment, does consider gaps. This (gap handling) is the difference between Needleman–Wunsch and Levenshtein, so many papers use the max length of the two sequences (BUT THIS IS MY OWN UNDERSTANDING, AND I AM NOT 100% SURE).
Here is IEEE Transactions on Pattern Analysis and Machine Intelligence: "Computation of Normalized Edit Distance and Applications". In this paper the normalized edit distance is defined as follows:

Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P) / L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P).
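As a quick worked example (mine, not from the paper): with unit weights, the best editing path between 'ab' and 'ac' consists of one match and one substitution, so W(P) = 1, L(P) = 2 and d(X, Y) = 1/2, which agrees with the distance ratio above.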
By looking more carefully at the C code, I found that this apparent contradiction is due to the fact that ratio
treats the "replace" edit operation differently than the other operations (i.e. with a cost of 2), whereas distance
treats them all the same with a cost of 1.
This can be seen in the calls to the internal levenshtein_common function made within the ratio_py function:
https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L727
```c
static PyObject*
ratio_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "ratio", 1, &lensum)) < 0) // Call
    return NULL;
  if (lensum == 0)
    return PyFloat_FromDouble(1.0);

  return PyFloat_FromDouble((double)(lensum - ldist)/(lensum));
}
```
and by the distance_py function:
https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L715
```c
static PyObject*
distance_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "distance", 0, &lensum)) < 0)
    return NULL;

  return PyInt_FromLong((long)ldist);
}
```
which ultimately results in different cost arguments being sent to another internal function, lev_edit_distance
, which has the following doc snippet:
@xcost: If nonzero, the replace operation has weight 2, otherwise all edit operations have equal weights of 1.
Code of lev_edit_distance():
```c
/**
 * lev_edit_distance:
 * @len1: The length of @string1.
 * @string1: A sequence of bytes of length @len1, may contain NUL characters.
 * @len2: The length of @string2.
 * @string2: A sequence of bytes of length @len2, may contain NUL characters.
 * @xcost: If nonzero, the replace operation has weight 2, otherwise all
 *         edit operations have equal weights of 1.
 *
 * Computes Levenshtein edit distance of two strings.
 *
 * Returns: The edit distance.
 **/
_LEV_STATIC_PY size_t
lev_edit_distance(size_t len1, const lev_byte *string1,
                  size_t len2, const lev_byte *string2,
                  int xcost)
{
  size_t i;
  /* ... (rest of function omitted) */
```
So in my example, ratio('ab', 'ac') implies a replacement operation (cost of 2) over the total length of the strings (4), hence 2/4 = 0.5.
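As a rough cross-check of this reading, here is a hedged sketch of the same dynamic-programming recurrence with the replace weight raised to 2, mimicking the xcost behaviour described above; the helper weighted_distance is illustrative only and not part of the library:

```python
import Levenshtein

def weighted_distance(s1, s2):
    """Edit distance where a replacement costs 2 (insertions/deletions cost 1)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 2   # replace weighted 2, as with xcost != 0
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

lensum = len('ab') + len('ac')           # 4
ldist = weighted_distance('ab', 'ac')    # 2 (one replacement, cost 2)
print((lensum - ldist) / lensum)         # 0.5
print(Levenshtein.ratio('ab', 'ac'))     # 0.5
```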
That explains the "how"; I guess the only remaining aspect would be the "why", but for the moment I'm satisfied with this understanding.