 

How can I determine Levenshtein distance for Mandarin Chinese characters?

We are developing a system to do fuzzy matching on over 50 international languages using the UTF-8, UTF-16, and UTF-32 encodings of the Unicode character standard. So far, we have been able to use Levenshtein distance to detect misspellings of German words containing extended Unicode characters.

We would like to extend this system to handle Mandarin Chinese ideographs represented in Unicode. How would we perform Levenshtein distance calculation between similar Chinese characters?

asked Sep 12 '12 by Frank


People also ask

How is Levenshtein distance calculated?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1) x (N+1), where M and N are the lengths of the two words, and looping through that matrix with two nested for loops, filling each cell with the minimum cost of an insertion, a deletion, or a substitution.
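
A minimal Python sketch of that matrix-based approach (the function name and the example words are just illustrations):

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance over an (M+1) x (N+1) matrix.

    Works on any two sequences (strings, lists of components, ...).
    """
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i elements of a and the first j elements of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                # i deletions
    for j in range(n + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```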

What is the difference between edit distance and Levenshtein distance?

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.

How does R calculate Levenshtein distance?

To calculate the Levenshtein distance between two vectors in the R Language, we use the stringdist() function of the stringdist package library. The stringdist() function takes two string vectors as arguments and returns a vector that contains the Levenshtein distance between each string pair in them.

Does order matter in Levenshtein distance?

If the order of the characters should not affect the distance, a much simpler solution than modifying the iterative or recursive algorithm is to write a function that just sorts both input strings and then calls LDistance (the ordinary Levenshtein routine) in the normal way.
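
A tiny sketch of that suggestion, reusing the levenshtein() function from the sketch above (the wrapper name is illustrative):

```python
def order_insensitive_distance(a, b):
    # Sort both strings first, so that e.g. "abc" and "cab" compare as identical.
    return levenshtein("".join(sorted(a)), "".join(sorted(b)))

print(order_insensitive_distance("abc", "cab"))  # 0
print(order_insensitive_distance("abc", "abd"))  # 1
```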


1 Answer

Firstly, just to clarify: A Chinese character is not as such equivalent to a German or English word. Most of the things you'd consider as words (using a semantic or syntactic definition of "word") consist of 1-3 characters. It is straightforward to apply Levenshtein distance to such character sequences by representing them as sequences of UCS-2 or UCS-4 code points. As most words are short (esp. words of length 1 or 2 characters), it may be of limited use, though.
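
For example, reusing the levenshtein() function sketched above (Python 3 strings are already sequences of code points, so short Chinese words can be compared directly; the example words are just illustrations):

```python
# Whole words treated as sequences of code points:
print(levenshtein("中国人", "中國人"))  # 1: simplified 国 vs. traditional 國
print(levenshtein("北京", "南京"))      # 1: first character differs
print(levenshtein("图书馆", "图书"))    # 1: one character deleted
```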

However, as your question is specifically about the edit distance between individual characters, I believe a different approach is required, and it may be very difficult indeed.

For a start, you'd have to represent each character as a sequence of the components / strokes it consists of (a toy sketch of what such a comparison might look like follows the list below). There are two problems:

  • Some components consist themselves of even smaller components, so how to break a character down into "atomic" components is not uniquely defined. If you do it down to the level of individual strokes, you'd need a characterisation of every single stroke (position within the character, shape, direction etc.). I don't think anyone has ever done this (I'd be most interested if anyone tells me otherwise).

  • You'd need to put the strokes or components into an order. The obvious candidate is the canonical stroke order of the character, which is described in lexica, and there are even dictionary websites with animated stroke order diagrams. However, the data sources I know (for Japanese) generate these animations as sequences of bitmap graphics; I have never seen human- or machine-readable codes that represent the sequence of strokes (or even the names of individual strokes) in a form that is suitable for edit distance calculation.
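
Purely as an illustration of what such data would have to look like, here is a toy sketch that applies the same levenshtein() routine from above to hand-written component sequences (the decomposition table is invented for this example and not taken from any real data source):

```python
# Hypothetical decomposition table: character -> ordered sequence of components.
# Real data of this kind would have to come from a stroke/component database.
DECOMP = {
    "好": ["女", "子"],   # woman + child, side by side
    "如": ["女", "口"],   # woman + mouth, side by side
    "明": ["日", "月"],   # sun + moon, side by side
}

# levenshtein() from the first sketch works on any sequence, so it can be run
# on component lists instead of on the characters themselves:
print(levenshtein(DECOMP["好"], DECOMP["如"]))  # 1: they share the 女 component
print(levenshtein(DECOMP["好"], DECOMP["明"]))  # 2: no components in common
```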

One final thing you could try, though, is to render the character glyphs and calculate the edit distance based on how many pixels (or vectors) need to be changed to turn one character into another. I once did this for Latin characters and character combinations (on pixel basis) in the context of OCR post-correction, and the results were quite encouraging.
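
A rough sketch of that pixel-based idea using Pillow; the font file name is an assumption (any CJK-capable font will do), and counting differing pixels is only a crude stand-in for a real pixel- or vector-level edit distance:

```python
from PIL import Image, ImageDraw, ImageFont

# Assumed font: any CJK-capable TrueType/OpenType font installed on the system.
FONT = ImageFont.truetype("NotoSansCJK-Regular.ttc", 64)
SIZE = (72, 72)

def render(ch):
    """Rasterise one character as a thresholded black-and-white bitmap."""
    img = Image.new("L", SIZE, color=255)                  # white background
    ImageDraw.Draw(img).text((4, 4), ch, font=FONT, fill=0)
    return img.point(lambda p: 0 if p < 128 else 255)      # binarise

def pixel_difference(c1, c2):
    """Number of pixels that differ between the two rendered glyphs."""
    a, b = render(c1), render(c2)
    return sum(pa != pb for pa, pb in zip(a.getdata(), b.getdata()))

print(pixel_difference("土", "士"))  # visually similar characters -> small value
print(pixel_difference("土", "鬱"))  # very different characters   -> large value
```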


A quick answer to larsmans' comment below: there are two related concepts defined by the Unicode Standard (below I refer to version 6.0, chapter 12):

  1. An index based on radicals and stroke counts. Each Han character consists of several components, one of which is the radical. A radical/stroke count index is a character list sorted by radical (i.e. all characters that share the same radical grouped together), and each radical-specific group internally sorted by the number of strokes used in the rest of the character. Unfortunately, even this is not uniquely defined – there are characters whose radical is defined differently by different traditional lexica, and stroke counting can also be difficult. Here is what the Unicode Standard says:

    To expedite locating specific Han ideographic characters in the code charts, radical-stroke indices are provided on the Unicode web site. [...] The most influential authority for radical-stroke information is the eighteenth-century KangXi dictionary, which contains 214 radicals. The main problem in using KangXi radicals today is that many simplified characters are difficult to classify under any of the 214 KangXi radicals. As a result, various modern radical sets have been introduced. None, however, is in general use, and the 214 KangXi radicals remain the best known. [...] The Unicode radical-stroke charts are based on the KangXi radicals. The Unicode Standard follows a number of different sources for radical-stroke classification. Where two sources are at odds as to radical or stroke count for a given character, the character is shown in both positions in the radical-stroke charts.

    Note that even if we assume the radical/stroke index to be unambiguous and correct, it wouldn't suffice as a source of information to transform a character into a sequence of components, because the only component of the character fully described by this is the radical.

  2. Ideographic description sequences (section 12.2): Unicode defines code points for the basic components of characters (most of them can themselves be used as standalone characters anyway), and there are codepoints used to glue those together to form a sequence of components that describes the composition of a more complex character. So this works in a way similar to combining characters, but there are important differences:

    1. The order of components is not uniquely defined
    2. There is no definition of a rendering mechanism for such sequences
    3. There is no mapping from ordinary characters to corresponding ideographic description sequences (although the Standard mentions that such mappings, to some extent, exist in the sources they used to compile the Han character set).

    The Standard suggests that ideographic description sequences be used to describe complex or rare characters that are not represented by any existing code point (a small illustration of such sequences follows this list); but it explicitly discourages the use of description sequences in place of ordinary characters:

    In particular, Ideographic Description Sequences should not be used to provide alternative graphic representations of encoded ideographs in data interchange. Searching, collation, and other content-based text operations would then fail.
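
For illustration, here is a small sketch of what such ideographic description sequences look like; the decompositions are hand-written examples in the spirit of section 12.2, not data taken from the Standard:

```python
# Ideographic description characters (U+2FF0..U+2FFB) describe how components
# are arranged, e.g. U+2FF0 ⿰ = left-to-right, U+2FF1 ⿱ = top-to-bottom.
# The decompositions below are hand-written illustrations.
IDS = {
    "好": "\u2FF0\u5973\u5B50",              # ⿰女子 : 女 beside 子
    "明": "\u2FF0\u65E5\u6708",              # ⿰日月 : 日 beside 月
    "森": "\u2FF1\u6728\u2FF0\u6728\u6728",  # ⿱木⿰木木 : one 木 above two 木
}

for char, seq in IDS.items():
    print(char, "->", seq, [hex(ord(c)) for c in seq])

# Note the non-uniqueness described above: 森 could equally well be written as
# ⿱木林, because 林 (itself ⿰木木) is an encoded character in its own right.
# If such sequences could be obtained reliably for every character, the
# levenshtein() routine from the first sketch could be run on them instead of
# on whole characters.
```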

answered Nov 16 '22 by jogojapan