 

How can I determine Levenshtein distance for Mandarin Chinese characters?

We are developing a system to do fuzzy matching on over 50 international languages using the UTF-8, UTF-16, and UTF-32 encodings of the Unicode character standard. So far, we have been able to use Levenshtein distance to detect misspellings of German words containing extended Unicode characters.

We would like to extend this system to handle Mandarin Chinese ideographs represented in Unicode. How would we perform Levenshtein distance calculation between similar Chinese characters?

asked Sep 12 '12 by Frank


People also ask

How is Levenshtein distance calculated?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1) x (N+1), where M and N are the lengths of the two words, and looping through that matrix with two nested for loops, filling each cell with the minimum cost of an insertion, a deletion, or a substitution.
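
A minimal Python sketch of that matrix-based approach (the function name and the example words are just illustrations):

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance over an (M+1) x (N+1) matrix.

    Works on any two sequences (strings, lists of components, ...).
    """
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i elements of a and the first j elements of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                # i deletions
    for j in range(n + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```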

What is the difference between edit distance and Levenshtein distance?

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.

How does R calculate Levenshtein distance?

To calculate the Levenshtein distance between two vectors in the R Language, we use the stringdist() function of the stringdist package library. The stringdist() function takes two string vectors as arguments and returns a vector that contains the Levenshtein distance between each string pair in them.

Does order matter in Levenshtein distance?

If the order of the characters should not affect the distance, a much simpler solution than modifying the iterative or recursive algorithm is to write a function that just sorts both input strings and then calls LDistance (the ordinary Levenshtein routine) in the normal way.
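
A tiny sketch of that suggestion, reusing the levenshtein() function from the sketch above (the wrapper name is illustrative):

```python
def order_insensitive_distance(a, b):
    # Sort both strings first, so that e.g. "abc" and "cab" compare as identical.
    return levenshtein("".join(sorted(a)), "".join(sorted(b)))

print(order_insensitive_distance("abc", "cab"))  # 0
print(order_insensitive_distance("abc", "abd"))  # 1
```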


1 Answer

Firstly, just to clarify: A Chinese character is not as such equivalent to a German or English word. Most of the things you'd consider as words (using a semantic or syntactic definition of "word") consist of 1-3 characters. It is straightforward to apply Levenshtein distance to such character sequences by representing them as sequences of UCS-2 or UCS-4 code points. As most words are short (esp. words of length 1 or 2 characters), it may be of limited use, though.
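
For example, reusing the levenshtein() function sketched above (Python 3 strings are already sequences of code points, so short Chinese words can be compared directly; the example words are just illustrations):

```python
# Whole words treated as sequences of code points:
print(levenshtein("中国人", "中國人"))  # 1: simplified 国 vs. traditional 國
print(levenshtein("北京", "南京"))      # 1: first character differs
print(levenshtein("图书馆", "图书"))    # 1: one character deleted
```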

However, as your question is specifically about the edit distance between individual characters, I believe a different approach is required, and it may be very difficult indeed.

For a start, you'd have to represent each character as a sequence of the components / strokes it consists of (a toy sketch of what such a comparison might look like follows the list below). There are two problems:

  • Some components consist themselves of even smaller components, so how to break a character down into "atomic" components is not uniquely defined. If you do it down to the level of individual strokes, you'd need a characterisation of every single stroke (position within the character, shape, direction etc.). I don't think anyone has ever done this (I'd be most interested if anyone tells me otherwise).

  • You'd need to put the strokes or components into an order. The obvious candidate is the canonical stroke order of the character, which is described in lexica, and there are even dictionary websites with animated stroke order diagrams. However, the data sources I know (for Japanese) generate these animations as sequences of bitmap graphics; I have never seen human- or machine-readable codes that represent the sequence of strokes (or even the names of individual strokes) in a form that is suitable for edit distance calculation.
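
Purely as an illustration of what such data would have to look like, here is a toy sketch that applies the same levenshtein() routine from above to hand-written component sequences (the decomposition table is invented for this example and not taken from any real data source):

```python
# Hypothetical decomposition table: character -> ordered sequence of components.
# Real data of this kind would have to come from a stroke/component database.
DECOMP = {
    "好": ["女", "子"],   # woman + child, side by side
    "如": ["女", "口"],   # woman + mouth, side by side
    "明": ["日", "月"],   # sun + moon, side by side
}

# levenshtein() from the first sketch works on any sequence, so it can be run
# on component lists instead of on the characters themselves:
print(levenshtein(DECOMP["好"], DECOMP["如"]))  # 1: they share the 女 component
print(levenshtein(DECOMP["好"], DECOMP["明"]))  # 2: no components in common
```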

One final thing you could try, though, is to render the character glyphs and calculate the edit distance based on how many pixels (or vectors) need to be changed to turn one character into another. I once did this for Latin characters and character combinations (on pixel basis) in the context of OCR post-correction, and the results were quite encouraging.
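
A rough sketch of that pixel-based idea using Pillow; the font file name is an assumption (any CJK-capable font will do), and counting differing pixels is only a crude stand-in for a real pixel- or vector-level edit distance:

```python
from PIL import Image, ImageDraw, ImageFont

# Assumed font: any CJK-capable TrueType/OpenType font installed on the system.
FONT = ImageFont.truetype("NotoSansCJK-Regular.ttc", 64)
SIZE = (72, 72)

def render(ch):
    """Rasterise one character as a thresholded black-and-white bitmap."""
    img = Image.new("L", SIZE, color=255)                  # white background
    ImageDraw.Draw(img).text((4, 4), ch, font=FONT, fill=0)
    return img.point(lambda p: 0 if p < 128 else 255)      # binarise

def pixel_difference(c1, c2):
    """Number of pixels that differ between the two rendered glyphs."""
    a, b = render(c1), render(c2)
    return sum(pa != pb for pa, pb in zip(a.getdata(), b.getdata()))

print(pixel_difference("土", "士"))  # visually similar characters -> small value
print(pixel_difference("土", "鬱"))  # very different characters   -> large value
```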


A quick answer to larsmans' comment below: there are two related concepts defined by the Unicode Standard (below I refer to version 6.0, chapter 12):

  1. An index based on radicals and stroke counts. Each Han character consists of several components, one of which is the radical. A radical/stroke count index is a character list sorted by radical (i.e. all characters that share the same radical grouped together), and each radical-specific group internally sorted by the number of strokes used in the rest of the character. Unfortunately, even this is not uniquely defined – there are characters whose radical is defined differently by different traditional lexica, and stroke counting can also be difficult. Here is what the Unicode Standard says:

    To expedite locating specific Han ideographic characters in the code charts, radical-stroke indices are provided on the Unicode web site. [...] The most influential authority for radical-stroke information is the eighteenth-century KangXi dictionary, which contains 214 radicals. The main problem in using KangXi radicals today is that many simplified characters are difficult to classify under any of the 214 KangXi radicals. As a result, various modern radical sets have been introduced. None, however, is in general use, and the 214 KangXi radicals remain the best known. [...] The Unicode radical-stroke charts are based on the KangXi radicals. The Unicode Standard follows a number of different sources for radical-stroke classification. Where two sources are at odds as to radical or stroke count for a given character, the character is shown in both positions in the radical-stroke charts.

    Note that even if we assume the radical/stroke index to be unambiguous and correct, it wouldn't suffice as a source of information to transform a character into a sequence of components, because the only component of the character fully described by this is the radical.

  2. Ideographic description sequences (section 12.2): Unicode defines code points for the basic components of characters (most of them can themselves be used as standalone characters anyway), and there are codepoints used to glue those together to form a sequence of components that describes the composition of a more complex character. So this works in a way similar to combining characters, but there are important differences:

    1. The order of components is not uniquely defined
    2. There is no definition of a rendering mechanism for such sequences
    3. There is no mapping from ordinary characters to corresponding ideographic description sequences (although the Standard mentions that such mappings, to some extent, exist in the sources they used to compile the Han character set).

    The Standard suggests that ideographic description sequences be used to describe complex or rare characters that are not represented by any existing code point (a small illustration of such sequences follows this list); but it explicitly discourages the use of description sequences in place of ordinary characters:

    In particular, Ideographic Description Sequences should not be used to provide alternative graphic representations of encoded ideographs in data interchange. Searching, collation, and other content-based text operations would then fail.
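
For illustration, here is a small sketch of what such ideographic description sequences look like; the decompositions are hand-written examples in the spirit of section 12.2, not data taken from the Standard:

```python
# Ideographic description characters (U+2FF0..U+2FFB) describe how components
# are arranged, e.g. U+2FF0 ⿰ = left-to-right, U+2FF1 ⿱ = top-to-bottom.
# The decompositions below are hand-written illustrations.
IDS = {
    "好": "\u2FF0\u5973\u5B50",              # ⿰女子 : 女 beside 子
    "明": "\u2FF0\u65E5\u6708",              # ⿰日月 : 日 beside 月
    "森": "\u2FF1\u6728\u2FF0\u6728\u6728",  # ⿱木⿰木木 : one 木 above two 木
}

for char, seq in IDS.items():
    print(char, "->", seq, [hex(ord(c)) for c in seq])

# Note the non-uniqueness described above: 森 could equally well be written as
# ⿱木林, because 林 (itself ⿰木木) is an encoded character in its own right.
# If such sequences could be obtained reliably for every character, the
# levenshtein() routine from the first sketch could be run on them instead of
# on whole characters.
```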

answered Nov 16 '22 by jogojapan