How do you sort Chinese, Japanese and Korean (CJK) characters in Perl?
As far as I can tell, CJK characters are usually sorted by stroke count, then by radical. There are also methods that sort by sound, but these seem less common.
I've tried using:
perl -e 'print join(" ", sort qw(工 然 一 人 三 古 二 )), "\n";'
# Prints: 一 三 二 人 古 工 然 which is incorrect
And I've tried using Unicode::Collate from CPAN, but it says:
By default, CJK Unified Ideographs are ordered in Unicode codepoint order...
If I could get a database of stroke counts per character, I could easily sort all of the characters, but this data doesn't seem to ship with Perl, nor is it packaged in any module I could find.
If you know how to sort CJK in other languages, it would be helpful to mention it in an answer to this question.
See TR38 (Unicode Standard Annex #38, which describes the Unihan database) for the dirty details and corner cases. It is not as easy as you might think, or as this code sample makes it look.
use 5.010;
use utf8;
use Encode;
use Unicode::Unihan;
my $u = Unicode::Unihan->new;
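# RSUnicode returns "radical.residual-strokes", e.g. "86.8" for 然.
for my $char (qw(工 然 一 人 三 古 二)) {
    my ($radical, $residual) = split /[.]/, $u->RSUnicode($char);
    say encode_utf8 sprintf 'Character %s has the radical #%s and %d residual strokes.',
        $char, $radical, $residual;
}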
__END__
Character 工 has the radical #48 and 0 residual strokes.
Character 然 has the radical #86 and 8 residual strokes.
Character 一 has the radical #1 and 0 residual strokes.
Character 人 has the radical #9 and 0 residual strokes.
Character 三 has the radical #1 and 2 residual strokes.
Character 古 has the radical #30 and 2 residual strokes.
Character 二 has the radical #7 and 0 residual strokes.
See http://en.wikipedia.org/wiki/List_of_Kangxi_radicals for a mapping from radical ordinal number to stroke count.
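To get from there to the stroke-then-radical order the question asks for, something like the following might work. It is only a sketch: the radical/residual values are copied by hand from the output above instead of being looked up through Unicode::Unihan, the radical stroke counts cover just the six radicals in this sample (taken from the Kangxi list), and the names %radical_strokes, %rs and sort_key are made up for illustration.

```perl
use 5.010;
use strict;
use warnings;
use utf8;                              # the source contains literal CJK characters
binmode STDOUT, ':encoding(UTF-8)';    # encode output as UTF-8

# Assumption: stroke counts of the Kangxi radicals that occur in this sample only.
my %radical_strokes = (1 => 1, 7 => 2, 9 => 2, 30 => 3, 48 => 3, 86 => 4);

# Assumption: "radical.residual" values copied from the RSUnicode output above.
my %rs = (
    '工' => '48.0', '然' => '86.8', '一' => '1.0', '人' => '9.0',
    '三' => '1.2',  '古' => '30.2', '二' => '7.0',
);

# Build a [total strokes, radical number] key for one character.
sub sort_key {
    my ($char) = @_;
    my ($radical, $residual) = split /[.]/, $rs{$char};
    return [ $radical_strokes{$radical} + $residual, $radical ];
}

# Compute each key once, then compare total strokes first and radical second.
my %key = map { $_ => sort_key($_) } keys %rs;
my @sorted = sort {
    $key{$a}[0] <=> $key{$b}[0] or $key{$a}[1] <=> $key{$b}[1]
} keys %rs;

say "@sorted";    # 一 二 人 三 工 古 然
```

Total strokes are the radical's own strokes plus the residual strokes, so 然 comes out as 4 + 8 = 12. Characters with more than one RSUnicode value, or with simplified radical variants, would need extra handling that this sketch ignores.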
A Japanese phone book is sorted phonetically (gojûon collation). However, the order of kanji within a character set is not based on phonetics, whether in Unicode, JIS, S-JIS or EUC. Only kana are in phonetic order. This means you cannot collate Japanese meaningfully without phonetic conversion!
For example:
a) kanji: 東京駅
b) kana converted: とうきょうえき
c) romanisation: tôkyô eki
With b) or c) you can sort meaningfully, but not with a) alone. Of course, you can run the plain sort function on the kanji, but the result is not meaningful for Japanese.
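A rough Perl sketch of the difference, assuming you already have a kana reading for every entry. The %reading table here is hypothetical and hand-written; in practice the readings have to come from a dictionary or from the user, because they cannot be derived from the kanji alone.

```perl
use 5.010;
use strict;
use warnings;
use utf8;                              # the source contains literal Japanese text
binmode STDOUT, ':encoding(UTF-8)';    # encode output as UTF-8

# Assumption: a hand-made table mapping each kanji string to its kana reading.
my %reading = (
    '東京駅' => 'とうきょうえき',
    '大阪駅' => 'おおさかえき',
    '京都駅' => 'きょうとえき',
);

# Plain sort compares the kanji by codepoint, which says nothing about sound.
my @by_codepoint = sort keys %reading;

# Sorting on the reading compares the kana instead.
my @by_reading = sort { $reading{$a} cmp $reading{$b} } keys %reading;

say "@by_codepoint";    # 京都駅 大阪駅 東京駅
say "@by_reading";      # 大阪駅 京都駅 東京駅
```

Comparing hiragana with cmp only approximates gojûon order. Voiced and small kana, katakana and the long vowel mark need a real collation algorithm, so treat this as an illustration of the idea, not a complete solution.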
Check out my Ruby gem toPinyin, which converts UTF-8 encoded Chinese characters to their Pinyin (pronunciation). You can then sort on the Pinyin easily.
To install it, simply run: gem install toPinyin
require 'toPinyin'
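# %w avoids the empty first element that splitting a string with a leading newline would produce
words = %w(人 没有 理想 跟 咸鱼 有 什么 区别)
# Compare the words by their joined Pinyin syllables
words.sort! { |a, b| a.pinyin.join <=> b.pinyin.join }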
https://github.com/pierrchen/toPinyin