
Recognizing text as Simplified vs. Traditional Chinese

Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?

philfreo asked Nov 02 '10 23:11



2 Answers

I don't know if this will work, but I'd try using iconv to see if it will translate between the charsets correctly, comparing the results from the same conversion with //TRANSLIT and //IGNORE. If the two results match, then the charset conversion hasn't encountered any characters that fail to translate, so you should have a match.

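// If //TRANSLIT and //IGNORE produce the same Big5 output,
// every character in $text was representable in Big5.
$test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
$test2 = iconv("UTF-8", "big5//IGNORE", $text);
// iconv() returns false on failure, so rule that out before comparing.
if ($test1 !== false && $test1 === $test2) {
   echo 'traditional';
} else {
   // Otherwise repeat the same check against GB2312.
   $test3 = iconv("UTF-8", "gb2312//TRANSLIT", $text);
   $test4 = iconv("UTF-8", "gb2312//IGNORE", $text);
   if ($test3 !== false && $test3 === $test4) {
      echo 'simplified';
   } else {
      echo 'Failed to match either traditional or simplified';
   }
}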
Mark Baker answered Sep 20 '22 17:09


Because big5 and gb2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the //TRANSLIT and //IGNORE results fails in many normal use cases. For example, it would fail to identify 説話 as Traditional Chinese, even though 説話 is a common variant in Hong Kong of 說話, the form that big5 contains.
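To see the problem on your own system, a small probe like the one below may help. It is only a sketch: `canEncode` is a hypothetical helper name, and it assumes that plain `iconv()` (without `//IGNORE` or `//TRANSLIT`) returns false for a character the target charset cannot represent, which can vary between iconv implementations. The two characters are written as code point escapes so that the variant (U+8AAC) and the more common form (U+8AAA) cannot be confused.

```php
<?php
// Hypothetical helper: can $charset encode this single character?
function canEncode(string $char, string $charset): bool
{
    // Plain iconv() returns false (and raises a notice, silenced here)
    // when the character has no representation in $charset.
    return @iconv('UTF-8', $charset, $char) !== false;
}

$variant  = "\u{8AAC}"; // the variant form used in the example above
$standard = "\u{8AAA}"; // the more common Traditional form

foreach ([$variant, $standard] as $ch) {
    // Print the code point and whether each legacy charset can encode it.
    printf(
        "U+%04X big5=%s gb2312=%s\n",
        mb_ord($ch, 'UTF-8'),
        var_export(canEncode($ch, 'BIG5'), true),
        var_export(canEncode($ch, 'GB2312'), true)
    );
}
```

If the variant reports `false` for big5 while the common form reports `true`, the exact-match test will reject text that any reader would consider Traditional.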

A simple fix is to do it in a fuzzy way:

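function detectChineseVariant($text) {
    // Drop every character the target charset cannot represent.
    $test1 = iconv("UTF-8", "big5//IGNORE", $text);
    $test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
    // Count characters in each result using the encoding it is actually in;
    // EUC-CN is the encoded form of GB2312.
    $len1 = mb_strlen($test1, "BIG-5");
    $len2 = mb_strlen($test2, "EUC-CN");
    $len0 = mb_strlen($text, "UTF-8") * 0.8; // threshold: 80% of the input must survive
    if ($len1 > $len2 && $len1 > $len0) {
        return 'Likely Traditional';
    }
    if ($len2 > $len1 && $len2 > $len0) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}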
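On some platforms `iconv()` with `//IGNORE` can return false for the whole string when it meets a character it cannot convert, which would zero out the counts above. One possible workaround, offered as a sketch and not as part of the original answer, is to test each character separately and compare the fraction that each charset can encode. The names `charsetCoverage` and `guessChineseVariant` are hypothetical, the 0.8 threshold is carried over from the snippet above, and `mb_str_split` requires PHP 7.4 or later.

```php
<?php
// Hypothetical helper: fraction of the characters in $text that $charset can encode.
function charsetCoverage(string $text, string $charset): float
{
    $chars = mb_str_split($text, 1, 'UTF-8');
    if ($chars === []) {
        return 0.0; // empty input: nothing to measure
    }
    $encodable = 0;
    foreach ($chars as $ch) {
        // Assumption: plain iconv() returns false for an unrepresentable
        // character (or an unsupported charset); the notice is silenced.
        if (@iconv('UTF-8', $charset, $ch) !== false) {
            $encodable++;
        }
    }
    return $encodable / count($chars);
}

// Same decision rule as the fuzzy check, expressed on coverage ratios.
function guessChineseVariant(string $text, float $threshold = 0.8): string
{
    $big5 = charsetCoverage($text, 'BIG5');
    $gb   = charsetCoverage($text, 'GB2312');
    if ($big5 > $gb && $big5 > $threshold) {
        return 'Likely Traditional';
    }
    if ($gb > $big5 && $gb > $threshold) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}
```

Calling `guessChineseVariant($text)` returns one of the same three strings as before. Text that both charsets cover equally well, such as plain ASCII or characters shared by both scripts, falls through to 'Could not identify'.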
Henry answered Sep 19 '22 17:09