How is Levenshtein Distance calculated on Simplified Chinese characters?

I have 2 queries:

    query1:你好世界
    query2:你好

When I run this code using the Python library Levenshtein:

    from Levenshtein import distance, hamming, median
    lev_edit_dist = distance(query1, query2)
    print lev_edit_dist

I get an output of 12. How is the value 12 derived?

In terms of stroke differences, there's definitely more than 12.

asked Jun 19 '15 00:06

jxn

1 Answer

According to its documentation, the library supports Unicode:

It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).

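You need to make sure the Chinese characters are Unicode strings, though. In Python 2 a plain string literal is a byte string, so the distance is counted in bytes rather than in characters: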

    In [1]: from Levenshtein import distance, hamming, median

    In [2]: query1 = '你好世界'

    In [3]: query2 = '你好'

    In [4]: print distance(query1, query2)
    6

    In [5]: print distance(query1.decode('utf8'), query2.decode('utf8'))
    2
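
To see where the two numbers come from, here is a minimal sketch in Python 3 that does not use the Levenshtein library at all. The `levenshtein` function below is a hypothetical helper written for illustration (the textbook dynamic-programming algorithm, not the library's implementation). It compares the strings once as characters and once as UTF-8 bytes; each of these four characters takes 3 bytes in UTF-8, so deleting two characters costs 2 and deleting their six bytes costs 6. Strokes never enter into it: the algorithm only counts inserted, deleted, or substituted sequence elements.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (str or bytes), row-by-row DP."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


query1 = "你好世界"
query2 = "你好"

# Compared as Unicode strings: one unit per character.
char_distance = levenshtein(query1, query2)

# Compared as UTF-8 byte strings: three units per character here.
byte_distance = levenshtein(query1.encode("utf-8"), query2.encode("utf-8"))

print(char_distance)  # 2
print(byte_distance)  # 6
```

In Python 3, string literals are already Unicode, so the `.decode('utf8')` step above is only needed when the input arrives as bytes.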

answered Sep 29 '22 03:09

Fabricator

Tags: python, string, unicode, levenshtein-distance, edit-distance
