I have 2 queries:
query1:你好世界
query2:你好
When I run this code using the Python library Levenshtein:
from Levenshtein import distance, hamming, median
lev_edit_dist = distance(query1,query2)
print lev_edit_dist
I get an output of 12. How is the value 12 derived?
In terms of stroke differences, there's definitely more than 12.
The Levenshtein distance is usually calculated by building a matrix of size (M+1)x(N+1), where M and N are the lengths of the two strings, and filling it in with two nested for loops. Each cell is computed from its left, upper, and upper-left neighbors, and the bottom-right cell holds the final distance.
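The matrix approach can be sketched in plain Python as follows. This is a minimal illustration written for Python 3, not the implementation used by the Levenshtein library; the function name `levenshtein_matrix` is made up for this example.

```python
def levenshtein_matrix(a, b):
    """Return the Levenshtein distance between sequences a and b.

    Hypothetical helper for illustration; not part of the Levenshtein library.
    """
    m, n = len(a), len(b)
    # dist[i][j] holds the distance between the prefixes a[:i] and b[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i items of a[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j items of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[m][n]


print(levenshtein_matrix("kitten", "sitting"))  # 3
```

The function compares whatever items the two sequences contain, so the result depends on whether it is given characters or raw bytes.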
Used as a metric, the Levenshtein distance can boost the accuracy of an NLP model by verifying each named entity in an entry. A vector search solution does a good job of finding the most similar entry as defined by the vectorization.
The Levenshtein distance (a.k.a. edit distance) is a measure of similarity between two strings. It is defined as the minimum number of single-character changes (insertions, deletions, or replacements) required to convert string a into string b.
The restricted Damerau-Levenshtein distance between two strings is commonly used for checking typographical errors in strings. It takes into account the deletion or insertion of a character, a wrong character (substitution), and the swapping (transposition) of two adjacent characters.
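For comparison, the restricted variant (also called optimal string alignment) adds one extra case to the same matrix recurrence. The sketch below is for illustration only, written for Python 3; `osa_distance` is a hypothetical name and not a function of the Levenshtein library.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Hypothetical helper for illustration only.
    """
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent items counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, dist[i - 2][j - 2] + 1)
            dist[i][j] = best
    return dist[m][n]


print(osa_distance("ab", "ba"))  # 1 (plain Levenshtein would give 2)
```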
According to its documentation, it supports Unicode:
It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).
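You need to make sure the Chinese characters are Unicode strings, though. In Python 2 a plain string literal is a byte string, and each of these characters occupies three bytes in UTF-8, so the distance is counted in bytes unless you decode first: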
In [1]: from Levenshtein import distance, hamming, median
In [2]: query1 = '你好世界'
In [3]: query2 = '你好'
In [4]: print distance(query1,query2)
6
In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2
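The same effect can be reproduced in Python 3 without the library, which makes the byte-versus-character difference easier to see. This is a sketch under the assumption that the strings are UTF-8 encoded; `edit_distance` is a hypothetical pure-Python stand-in for `Levenshtein.distance`, not the library's own code.

```python
def edit_distance(a, b):
    """Levenshtein distance over two sequences, keeping only two rows.

    Hypothetical stand-in for Levenshtein.distance, for illustration only.
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


query1 = "你好世界"
query2 = "你好"

# Encoding to UTF-8 mimics what a Python 2 byte string literal holds.
bytes1 = query1.encode("utf8")
bytes2 = query2.encode("utf8")

print(len(bytes1), len(bytes2))       # 12 6
print(edit_distance(bytes1, bytes2))  # 6 (counted in bytes)
print(edit_distance(query1, query2))  # 2 (counted in characters)
```

Each of these characters takes three bytes in UTF-8, so removing the two extra characters costs six byte deletions but only two character deletions.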