
Is there something better than the kakasi library for gojûon collation?

"Better" primarily means accuracy, but I am also interested in any other criteria in which other systems excel. I sampled the Perl binding Text::Kakasi for correctness in an admittedly limited fashion and it works just fine for our needs.

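use strict;
use warnings;
use utf8;                 # the source file is UTF-8
use Encode;               # provides encode_utf8
use Text::Kakasi;
use Unicode::Collate;

# -JH converts kanji to hiragana; existing kana are passed through.
my $k = Text::Kakasi->new(qw(-iutf8 -outf8 -JH));
my $c = Unicode::Collate->new;

# Schwartzian transform: pair each line with its kana reading,
# sort on the reading, then print the original line.
print encode_utf8 $_ for
    map  { $_->[0] }
    sort { $c->cmp($a->[1], $b->[1]) }
    map  { [$_, $k->get($_)] }
    <DATA>;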

__DATA__
アメリカ合衆国
アラブ首長国連邦
ロシア連邦
中国
南アフリカ共和国
日本
北京(ペキン)
大阪
東京
daxim asked Oct 09 '10 16:10



2 Answers

The only other serious open-source conversion tool I know of is N-gram (not the most descriptive name). It has huge dictionaries and might be better than Kakasi, but I haven't seen any comparisons out there.

EDIT:

I gave some thought to what makes one library "better" than another in this context. One thing that could be done is to take N-gram's dictionaries and run them through kakasi. If kakasi fails to convert some of N-gram's entries, N-gram could be said to be better because its lexicon is richer, which improves the accuracy of the collation.

However, since the corpus of Kanji-based words (which need to be converted into kana to be collated properly) is not finite - family names among others are a big problem, as they can be read almost any way you can imagine - there can't be a solution that provides 100% coverage. But the OP asked for a "better" solution, not a perfect one...
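
The comparison described above could be sketched roughly as follows. This is only an illustration in Python, under the assumption that a failed conversion leaves kanji in the output; `toy_to_kana` is a hypothetical stand-in for a real converter such as kakasi, and `unconverted` and `lexicon` are names made up for the example.

```python
import re
from typing import Callable, Iterable, List

# CJK Unified Ideographs (basic block) plus the iteration mark 々.
KANJI = re.compile(r"[\u4e00-\u9fff\u3005]")


def unconverted(words: Iterable[str], to_kana: Callable[[str], str]) -> List[str]:
    """Return the words whose converted form still contains kanji."""
    return [w for w in words if KANJI.search(to_kana(w))]


def toy_to_kana(word: str) -> str:
    # Hypothetical stand-in for a real converter: it only knows the
    # readings listed here and passes every other word through unchanged.
    readings = {"日本": "にほん", "東京": "とうきょう"}
    return readings.get(word, word)


# The third entry is a rare family name the toy converter does not know.
lexicon = ["日本", "東京", "小鳥遊"]
missed = unconverted(lexicon, toy_to_kana)
print(f"{len(missed)} of {len(lexicon)} entries left unconverted: {missed}")
```

Swapping the other tool's dictionary in for `lexicon` and a real conversion call in for `toy_to_kana` would give a rough coverage figure. It would not catch entries that are converted to the wrong reading.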

dda answered Sep 20 '22 13:09



I am not sure what 'authoritative' means here.

But I can say that Kakasi is a well-known freeware library and is still not obsolete today.

If you can convert kanji strings to hiragana (or katakana) strings with Kakasi, the resulting sort order should be fine.

http://www.utf8-chartable.de/unicode-utf8-table.pl
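
As a rough illustration of sorting on kana readings, here is a small Python sketch. The readings are hand-written for the example and mimic kanji-to-hiragana output that leaves katakana untouched; they were not produced by Kakasi. `fold_to_hiragana` is a helper name invented here.

```python
def fold_to_hiragana(text: str) -> str:
    """Map katakana U+30A1..U+30F6 onto hiragana so both scripts compare equal."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


# Hand-written readings (an assumption for this example, not library output).
readings = {
    "アメリカ合衆国": "アメリカがっしゅうこく",
    "アラブ首長国連邦": "アラブしゅちょうこくれんぽう",
    "ロシア連邦": "ロシアれんぽう",
    "中国": "ちゅうごく",
    "南アフリカ共和国": "みなみアフリカきょうわこく",
    "日本": "にほん",
    "北京": "ペキン",
    "大阪": "おおさか",
    "東京": "とうきょう",
}

# Sort the original names by the code points of their folded readings.
ordered = sorted(readings, key=lambda name: fold_to_hiragana(readings[name]))
for name in ordered:
    print(name)
```

Plain code point order only approximates gojûon order (the prolonged sound mark ー, for instance, is not handled here), so a proper collator such as the Unicode::Collate module used in the question is the safer choice for real data.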

kmugitani answered Sep 24 '22 13:09
