Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?


Recognizing text as Simplified vs. Traditional Chinese

2 Answers

I don't know if this will work, but I'd try using iconv to check whether the text converts cleanly to each legacy charset: run the same conversion once with //TRANSLIT and once with //IGNORE, then compare the results. If the two results match, the conversion didn't encounter any characters that fail to translate, so the text fits that charset.

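// iconv() returns false on failure, so compare strictly and rule that case out.
$test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
$test2 = iconv("UTF-8", "big5//IGNORE", $text);
if ($test1 !== false && $test1 === $test2) {
   // Nothing was dropped or transliterated: every character exists in big5.
   echo 'traditional';
} else {
   $test3 = iconv("UTF-8", "gb2312//TRANSLIT", $text);
   $test4 = iconv("UTF-8", "gb2312//IGNORE", $text);
   if ($test3 !== false && $test3 === $test4) {
      // Every character exists in gb2312.
      echo 'simplified';
   } else {
      echo 'Failed to match either traditional or simplified';
   }
}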
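One thing the snippet above doesn't spell out: text that fits both charsets (plain ASCII, or characters shared by both scripts) is reported as traditional simply because big5 is tested first. Here is a rough sketch that makes that case explicit. The function names `fitsCharset`, `classifyExact` and `detectExact` are made up for illustration, and the behaviour of //TRANSLIT and //IGNORE depends on the iconv implementation PHP was built against, so treat this as untested against your platform.

```php
<?php
// Hypothetical helper (not part of the answer above): intended to return true
// when $text converts to $charset with nothing dropped or transliterated.
function fitsCharset(string $text, string $charset): bool
{
    // @ suppresses the notice iconv() may raise for unconvertible characters.
    $translit = @iconv('UTF-8', $charset . '//TRANSLIT', $text);
    $ignore   = @iconv('UTF-8', $charset . '//IGNORE', $text);

    // iconv() returns false on failure; treat that as "does not fit".
    return $translit !== false && $translit === $ignore;
}

// Hypothetical decision table over the two charset checks.
function classifyExact(bool $fitsBig5, bool $fitsGb2312): string
{
    if ($fitsBig5 && $fitsGb2312) {
        return 'ambiguous'; // fits both charsets, so the check cannot decide
    }
    if ($fitsBig5) {
        return 'traditional';
    }
    if ($fitsGb2312) {
        return 'simplified';
    }
    return 'unknown';
}

// Convenience wrapper combining the two helpers above.
function detectExact(string $text): string
{
    return classifyExact(fitsCharset($text, 'big5'), fitsCharset($text, 'gb2312'));
}
```

Calling `detectExact($text)` gives one of the four labels; keeping the decision in `classifyExact` means the logic can be checked without depending on a particular iconv build.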


answered Sep 20 '22 17:09

Mark Baker

Since big5 and gb2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the translit and ignore modes fails in a lot of normal use cases. For example, it fails to identify 説話 as Traditional Chinese, even though 説 is a common Hong Kong variant of 說, the form that big5 includes.

A simple fix is to do it in a fuzzy way:

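// Drop whatever each legacy charset cannot represent, then see how much survives.
$test1 = iconv("UTF-8", "big5//IGNORE", $text);
$test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
// Count characters in the converted encodings rather than mbstring's default one.
$len1 = mb_strlen($test1, 'BIG-5');
$len2 = mb_strlen($test2, 'EUC-CN'); // mbstring's name for the gb2312 encoding
$len0 = mb_strlen($text, 'UTF-8') * 0.8; // threshold: 80% of the original characters
if ($len1 > $len2 && $len1 > $len0) {
    return 'Likely Traditional';
}
if ($len2 > $len1 && $len2 > $len0) {
    return 'Likely Simplified';
}
return 'Could not identify';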
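To see how the threshold behaves, it may help to separate the decision rule from the conversion. The sketch below is only an illustration: `classifyByCoverage` and its parameters are hypothetical names, and it takes the three character counts as plain integers so the rule can be tried with made-up numbers.

```php
<?php
// Hypothetical helper isolating the fuzzy decision rule.
// $big5Len / $gbLen: characters that survived each //IGNORE conversion.
// $totalLen: characters in the original UTF-8 text.
// $threshold: fraction of the text that must survive (0.8 as in the answer).
function classifyByCoverage(int $big5Len, int $gbLen, int $totalLen, float $threshold = 0.8): string
{
    $minimum = $totalLen * $threshold;

    if ($big5Len > $gbLen && $big5Len > $minimum) {
        return 'Likely Traditional';
    }
    if ($gbLen > $big5Len && $gbLen > $minimum) {
        return 'Likely Simplified';
    }
    // Ties and low coverage in both charsets end up here.
    return 'Could not identify';
}

// Worked examples with made-up counts for a 10-character text:
echo classifyByCoverage(9, 4, 10), "\n";   // Likely Traditional
echo classifyByCoverage(3, 10, 10), "\n";  // Likely Simplified
echo classifyByCoverage(7, 6, 10), "\n";   // Could not identify (7 is below the 80% bar)
echo classifyByCoverage(10, 10, 10), "\n"; // Could not identify (tie)
```

The last case is worth noting: text made up only of characters that both charsets share survives both conversions in full, so the rule reports a tie rather than guessing.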

answered Sep 19 '22 17:09

Henry


Tags:

php

unicode

cjk

language-detection

philfreo
