
How do you sort CJK (Asian) characters in Perl, or with any other programming language?

How do you sort Chinese, Japanese and Korean (CJK) characters in Perl?

As far as I can tell, CJK characters are conventionally sorted by stroke count, then by radical. There are also methods that sort by sound, but these seem to be less common.

I've tried using:

perl -e 'print join(" ", sort qw(工 然 一 人 三 古 二 )), "\n";'
# Prints: 一 三 二 人 古 工 然, which is incorrect (it is codepoint order, not stroke order)
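
The order printed above is plain Unicode codepoint order, which is what a default string sort compares (a bytewise comparison of UTF-8 yields the same order). The following is a minimal Python sketch, not part of the original question, that makes this visible by printing the codepoint next to each character:

```python
# Sketch: a default sort compares codepoints, not stroke counts.
chars = ["工", "然", "一", "人", "三", "古", "二"]

by_codepoint = sorted(chars)

for ch in by_codepoint:
    # ord() gives the Unicode codepoint that the sort actually compares.
    print(f"{ch} U+{ord(ch):04X}")
```

The codepoints come out in ascending order, so the result says nothing about strokes or radicals.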

And I've tried using Unicode::Collate from CPAN, but it says:

By default, CJK Unified Ideographs are ordered in Unicode codepoint order...

If I could get a database of stroke counts per character, I could easily sort all of the characters, but this doesn't seem to come with Perl, nor is it provided by any module I could find.

If you know how to sort CJK in other languages, it would be helpful to mention it in an answer to this question.

Neil asked Oct 08 '10 14:10


3 Answers

See TR38 (Unicode Standard Annex #38, which describes the Unihan database) for the dirty details and corner cases. It's not as easy as you might think, or as this code sample makes it look.

use 5.010;
use utf8;
use Encode;
use Unicode::Unihan;
my $u = Unicode::Unihan->new;

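# RSUnicode returns "radical.residual strokes", e.g. "86.8" for 然.
for my $char (qw(工 然 一 人 三 古 二)) {
    my ($radical, $residual) = split /[.]/, $u->RSUnicode($char);
    say encode_utf8 sprintf "Character %s has the radical #%s and %d residual strokes.",
        $char, $radical, $residual;
}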
__END__
Character 工 has the radical #48 and 0 residual strokes.
Character 然 has the radical #86 and 8 residual strokes.
Character 一 has the radical #1 and 0 residual strokes.
Character 人 has the radical #9 and 0 residual strokes.
Character 三 has the radical #1 and 2 residual strokes.
Character 古 has the radical #30 and 2 residual strokes.
Character 二 has the radical #7 and 0 residual strokes.

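See http://en.wikipedia.org/wiki/List_of_Kangxi_radicals for a mapping from radical ordinal number to stroke count. In the simple cases, adding the radical's stroke count to the residual strokes gives the character's total stroke count; characters written with a variant radical form (such as 氵 for 水) can come out differently.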
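
To show how the radical and residual-stroke values could feed a sort, here is a minimal Python sketch. It assumes the "radical.residual" values printed above and a hand-copied subset of the Kangxi radical stroke counts; the names RS_UNICODE, RADICAL_STROKES and sort_by_strokes are illustrative and do not come from any library. It ignores the corner cases TR38 describes (variant radical forms, multiple values per character), for which the Unihan kTotalStrokes field would be a more direct source.

```python
# Sketch: sort by total strokes, then by radical number.
# "radical.residual" values copied from the Perl output above.
RS_UNICODE = {
    "工": "48.0", "然": "86.8", "一": "1.0", "人": "9.0",
    "三": "1.2", "古": "30.2", "二": "7.0",
}

# Stroke counts of the Kangxi radicals used here (hand-copied subset,
# not a complete table).
RADICAL_STROKES = {1: 1, 7: 2, 9: 2, 30: 3, 48: 3, 86: 4}


def sort_key(char):
    """Return (total strokes, radical number) for one character."""
    radical_text, residual_text = RS_UNICODE[char].split(".")
    radical = int(radical_text)
    total = RADICAL_STROKES[radical] + int(residual_text)
    return (total, radical)


def sort_by_strokes(chars):
    """Sort characters by stroke count, breaking ties by radical."""
    return sorted(chars, key=sort_key)


print(" ".join(sort_by_strokes(["工", "然", "一", "人", "三", "古", "二"])))
# Prints: 一 二 人 三 工 古 然
```

The tie-break by radical number follows the ordering the question describes; other conventions break ties differently.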

daxim answered Sep 22 '22 05:09


A Japanese phonebook is sorted on a phonetic basis (gojûon collation). However, the order of kanji characters is not based on phonetics, whether in Unicode, JIS, S-JIS or EUC. Only kana are arranged in phonetic order. This means you cannot collate meaningfully without phonetic conversion!

For example:

a) kanji:           東京駅
b) kana converted:  とうきょうえき
c) romanisation:    tôkyô eki

With b) or c), you can sort meaningfully, but not with a) alone. Of course, you can still run the plain sort function on a), but the result is not meaningful for Japanese.
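
As a rough illustration of the difference, the following Python sketch sorts a few station names once by their kanji and once by their kana reading. The READINGS table is hand-written sample data (only 東京駅 comes from the example above); real code would need a dictionary or a morphological analyser to obtain readings. Sorting kana by codepoint also only approximates gojûon order, because voiced and small kana are interleaved with the plain ones, so a proper collation library would be needed for exact results.

```python
# Sketch: sort Japanese names by kana reading instead of by kanji.
# Hand-written sample readings; not produced by any library.
READINGS = {
    "東京駅": "とうきょうえき",
    "大阪駅": "おおさかえき",
    "京都駅": "きょうとえき",
}

names = list(READINGS)

# Plain sort: compares the kanji codepoints, which carry no phonetic order.
by_kanji = sorted(names)

# Sort on the kana reading: kana codepoints roughly follow gojûon order.
by_reading = sorted(names, key=READINGS.get)

print(by_kanji)    # ['京都駅', '大阪駅', '東京駅']
print(by_reading)  # ['大阪駅', '京都駅', '東京駅']
```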

kmugitani answered Sep 24 '22 05:09


Check out my Ruby gem toPinyin, which converts UTF-8 encoded Chinese characters to their Pinyin (pronunciation). A sort can then easily be done on the Pinyin.

Install it with gem install toPinyin, then:

require 'toPinyin'

words = "
人
没有
理想
跟
咸鱼
有
什么
区别
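".split  # split on whitespace so the leading newline yields no empty entry

# Compare words by their concatenated Pinyin syllables.
words.sort! { |a, b| a.pinyin.join <=> b.pinyin.join }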

https://github.com/pierrchen/toPinyin
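
The same idea can be sketched without the gem. The Python example below uses a hand-written PINYIN lookup table (toneless romanisations typed in by hand for the word list above, not the output of toPinyin or any other library) and sorts on those strings. The names PINYIN and sort_by_pinyin are illustrative only.

```python
# Sketch: sort Chinese words by Pinyin using a hand-written lookup table.
# Tones are ignored; a real converter would be needed for arbitrary text.
PINYIN = {
    "人": "ren", "没有": "meiyou", "理想": "lixiang", "跟": "gen",
    "咸鱼": "xianyu", "有": "you", "什么": "shenme", "区别": "qubie",
}


def sort_by_pinyin(words):
    """Return the words ordered alphabetically by their romanisation."""
    return sorted(words, key=lambda word: PINYIN[word])


words = ["人", "没有", "理想", "跟", "咸鱼", "有", "什么", "区别"]
print(" ".join(sort_by_pinyin(words)))
# Prints: 跟 理想 没有 区别 人 什么 咸鱼 有
```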

pierrotlefou answered Sep 26 '22 05:09