Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?
The most obvious difference between traditional Chinese and simplified Chinese is the way that the characters look. Traditional characters are typically more complicated and have more strokes, while simplified characters are, as the name suggests, simpler and have fewer strokes.
Typically, the rule of thumb is that if your target audience is in Mainland China, Singapore, or Malaysia, you should always use “Simplified” Chinese characters. The “Simplified” character system is also the version utilized officially by the United Nations (UN).
Generally speaking, it's much easier for someone who reads Traditional Chinese to read Simplified Chinese than the other way around. Separately, because the written language is largely shared across spoken varieties, it is not uncommon for people who cannot communicate verbally in Chinese to understand each other through writing.
I don't know if this will work, but I'd try using iconv to see if it will translate between the charsets correctly, comparing the results from the same conversion with //TRANSLIT and //IGNORE. If the two results match, then the charset conversion hasn't encountered any characters that fail to translate, so you should have a match.
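// If TRANSLIT and IGNORE produce the same bytes, nothing had to be substituted or dropped.
$test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
$test2 = iconv("UTF-8", "big5//IGNORE", $text);
// iconv() returns false on failure, so rule that out before comparing.
if ($test1 !== false && $test1 === $test2) {
    echo 'traditional';
} else {
    $test3 = iconv("UTF-8", "gb2312//TRANSLIT", $text);
    $test4 = iconv("UTF-8", "gb2312//IGNORE", $text);
    if ($test3 !== false && $test3 === $test4) {
        echo 'simplified';
    } else {
        echo 'Failed to match either traditional or simplified';
    }
}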
Since big5 and gb2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the //TRANSLIT and //IGNORE results would fail in many normal use cases. For example, it would fail to identify 説話 as Traditional Chinese, even though 説 is a common variant in Hong Kong for 說, which is the form used in big5.
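To make the variant problem concrete, here is a small sketch that prints the Unicode code point of each form. The helper name codePointLabel is made up for illustration. It only shows that the two forms are separate code points, which is why a legacy charset table can contain one and not the other; it does not check any particular table.

```php
<?php
// Hypothetical helper: format the first character of $char as U+XXXX.
function codePointLabel(string $char): string
{
    return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

// The two forms look alike but are separate code points, so a charset
// table has to list each one explicitly to be able to encode it.
foreach (['説', '說'] as $char) {
    echo $char, ' ', codePointLabel($char), PHP_EOL;
}
```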
A simple fix is to do it in a fuzzy way:
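// Drop whatever each charset cannot represent, then compare how much survives.
$test1 = iconv("UTF-8", "big5//IGNORE", $text);
$test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
// The converted strings are no longer UTF-8, so name their encoding when counting.
// The cast turns a false return from iconv() into an empty string.
$len1 = mb_strlen((string) $test1, 'BIG-5');
$len2 = mb_strlen((string) $test2, 'EUC-CN'); // mbstring's name for GB2312
$len0 = mb_strlen($text, 'UTF-8') * 0.8; // threshold: 80% of the characters must survive
if ($len1 > $len2 && $len1 > $len0) {
    return 'Likely Traditional';
}
if ($len2 > $len1 && $len2 > $len0) {
    return 'Likely Simplified';
}
return 'Could not identify';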
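The behaviour of //IGNORE differs between iconv implementations, so one possible variation on the same fuzzy idea is to count representable characters one at a time with mbstring instead. The following is only a sketch under stated assumptions: the function names countRepresentable and guessChineseScript are made up for illustration, mbstring's BIG-5 and EUC-CN tables stand in for iconv's big5 and gb2312, and mb_str_split requires PHP 7.4 or later.

```php
<?php
// Sketch only: both function names are hypothetical, and mbstring's
// conversion tables are assumed to be close enough to iconv's.

// Count the characters of $text that survive a round trip through $encoding.
function countRepresentable(string $text, string $encoding): int
{
    $count = 0;
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // A character the target charset lacks comes back as a substitute
        // (by default "?"), so the round trip no longer matches the input.
        $converted = mb_convert_encoding($char, $encoding, 'UTF-8');
        $roundTrip = mb_convert_encoding($converted, 'UTF-8', $encoding);
        if ($roundTrip === $char) {
            $count++;
        }
    }
    return $count;
}

// Same decision rule as the fuzzy approach: the better-covered charset wins,
// provided it covers more than $threshold of the characters.
function guessChineseScript(string $text, float $threshold = 0.8): string
{
    $minimum = mb_strlen($text, 'UTF-8') * $threshold;
    $big5 = countRepresentable($text, 'BIG-5');
    $gb2312 = countRepresentable($text, 'EUC-CN');

    if ($big5 > $gb2312 && $big5 > $minimum) {
        return 'Likely Traditional';
    }
    if ($gb2312 > $big5 && $gb2312 > $minimum) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}

echo guessChineseScript('這是繁體中文'), PHP_EOL;
echo guessChineseScript('这是简体中文'), PHP_EOL;
```

Text made up only of characters that both charsets share scores the same on both sides and falls through to 'Could not identify', which seems like the honest answer for such input.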