I found this question, which gives me a way to check whether a string contains a Chinese character. I'm not sure the Unicode ranges are correct, but they seem to return false for Japanese and Korean and true for Chinese.
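For context, a range-based check along the lines of that question might look like the sketch below; the exact ranges (the CJK Unified Ideographs block plus Extension A) are my assumption, not necessarily the ones the linked question uses:

```python
def contains_han(text):
    """Return True if any character falls in a common Han (CJK ideograph) range."""
    for ch in text:
        cp = ord(ch)
        # CJK Unified Ideographs and Extension A (assumed ranges; note that kanji
        # in Japanese text fall in the same blocks, which is part of the problem).
        if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            return True
    return False

print(contains_han("你好"))    # True
print(contains_han("hello"))   # False
```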
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
Q: How can I recognize from the 32-bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument is that the characters, regardless of their shape, have the same meaning and should therefore be represented by the same code point. Well, the distinction is not meaningless to me: I am analyzing individual characters, so their suggested solution doesn't work for me:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
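To make that suggestion concrete, here is a minimal Python sketch of the whole-text heuristic; the block ranges and the decision order are my own assumptions, not part of the quoted advice:

```python
def guess_cjk_language(text):
    """Very rough guess based on counting kana, hangul, and Han characters."""
    kana = hangul = han = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3:    # Hangul syllables
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs
            han += 1
    if kana:
        return "probably Japanese"
    if hangul:
        return "probably Korean"
    if han:
        return "probably Chinese (script variant still unknown)"
    return "no CJK detected"

print(guess_cjk_language("これは日本語の文章です"))  # probably Japanese
print(guess_cjk_language("한국어 문장입니다"))        # probably Korean
print(guess_cjk_language("这是一个中文句子"))         # probably Chinese (script variant still unknown)
```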
The most obvious difference between traditional Chinese and simplified Chinese is the way that the characters look. Traditional characters are typically more complicated and have more strokes, while simplified characters are, as the name suggests, simpler and have fewer strokes.
Optical character recognition (OCR) – Many apps and websites provide OCR features where you can scan or take pictures of the character(s) you want to look up. Google Docs has such a feature and there are others online you can easily find by searching for “Chinese” and “OCR”.
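If you would rather script the OCR route than use an app, a hedged sketch with pytesseract is below; it assumes a local Tesseract installation with the chi_sim and chi_tra language data, and the image file name is just a placeholder:

```python
from PIL import Image
import pytesseract

# Hypothetical image containing the character(s) you want to look up.
img = Image.open("sample_chinese.png")

# Run OCR twice, once with the Simplified model and once with the Traditional model.
simplified_text = pytesseract.image_to_string(img, lang="chi_sim")
traditional_text = pytesseract.image_to_string(img, lang="chi_tra")

print("Simplified model output: ", simplified_text)
print("Traditional model output:", traditional_text)
```

Note that OCR only gets you the characters; you still need one of the other approaches to decide which script they belong to.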
Geography will help us here: Simplified Chinese is used in mainland China (PRC) and in Singapore. If you are targeting Hong Kong, Taiwan, or Malaysia, your choice should be Traditional Chinese.
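If the target locale is known, this boils down to a small lookup table. Here is a Python sketch that follows the region-to-script rule of thumb above; the locale-tag handling and the default are my own simplifications:

```python
# Region -> script variant, following the geographic rule of thumb above.
REGION_TO_SCRIPT = {
    "CN": "Simplified",   # mainland China (PRC)
    "SG": "Simplified",   # Singapore
    "HK": "Traditional",  # Hong Kong
    "TW": "Traditional",  # Taiwan
    "MY": "Traditional",  # Malaysia (per the rule of thumb above)
}

def script_for_locale(locale_tag):
    """Map a locale tag such as 'zh-TW' to a script variant (assumed default: Simplified)."""
    region = locale_tag.split("-")[-1].upper()
    return REGION_TO_SCRIPT.get(region, "Simplified")

print(script_for_locale("zh-TW"))  # Traditional
print(script_for_locale("zh-CN"))  # Simplified
```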
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
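In the same spirit as the gem, a whole-sample detector can count characters that occur in only one of the two scripts. Below is a minimal Python sketch with a deliberately tiny, hand-picked set of marker characters; a real implementation would use complete tables, for example derived from the Unihan data discussed in the next answer:

```python
# Tiny illustrative marker sets; a real detector needs complete tables.
SIMPLIFIED_ONLY = set("国门马韩时书东")
TRADITIONAL_ONLY = set("國門馬韓時書東")

def guess_script_variant(text):
    """Count simplified-only vs. traditional-only markers in a text sample."""
    simp = sum(1 for ch in text if ch in SIMPLIFIED_ONLY)
    trad = sum(1 for ch in text if ch in TRADITIONAL_ONLY)
    if simp > trad:
        return "probably Simplified"
    if trad > simp:
        return "probably Traditional"
    return "undetermined"

print(guess_script_variant("韩国的时间"))  # probably Simplified
print(guess_script_variant("韓國的時間"))  # probably Traditional
```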
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:

1. Characters that are used only in Traditional Chinese
2. Characters that are used only in Simplified Chinese
3. Characters that are used in both
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for both 面 and 麵, face and noodles, whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. From that you can deduce that 麵 is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks down: if you used this data to deduce that 面 is a simplified character only, you would be wrong...
On the other hand, 韩 has a kTraditionalVariant pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
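If you want to experiment with this, the variant fields live in Unihan_Variants.txt, part of the Unihan database downloadable from unicode.org. Below is a rough Python sketch that parses that file and applies the naive rule; as explained above, it cannot correctly resolve characters like 面:

```python
from collections import defaultdict

def load_variants(path="Unihan_Variants.txt"):
    """Parse Unihan_Variants.txt (downloaded separately) into {char: {field: [variants]}}."""
    variants = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            code, field, value = line.rstrip("\n").split("\t")
            if field not in ("kSimplifiedVariant", "kTraditionalVariant"):
                continue
            char = chr(int(code[2:], 16))
            # Values are space-separated code points like "U+9762"; strip any
            # trailing source annotation just in case.
            variants[char][field] = [chr(int(v.split("<")[0][2:], 16)) for v in value.split()]
    return variants

def classify(char, variants):
    """Naive rule based on which variant fields are present.

    As explained above, this misclassifies characters like 面, which has a
    kTraditionalVariant (麵) but is itself also a valid traditional character.
    """
    fields = variants.get(char, {})
    has_simp = "kSimplifiedVariant" in fields
    has_trad = "kTraditionalVariant" in fields
    if has_simp and not has_trad:
        return "traditional only"         # e.g. 麵 -> kSimplifiedVariant 面
    if has_trad and not has_simp:
        return "simplified (naive rule)"  # e.g. 韩 -> kTraditionalVariant 韓
    if has_simp and has_trad:
        return "ambiguous"
    return "no variant data"

variants = load_variants()
for ch in "麵韩面":
    print(ch, classify(ch, variants))
```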