
Using boost::locale/ICU boundary analysis with Chinese

Using the sample code from the boost::locale documentation, I can't get the following to correctly tokenize Chinese text:

#include <boost/locale.hpp>
#include <iostream>
#include <string>

using namespace boost::locale::boundary;

boost::locale::generator gen;
std::string text = "中華人民共和國";
// Analyze word boundaries under a Chinese locale
ssegment_index map(word, text.begin(), text.end(), gen("zh_CN.UTF-8"));
for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
    std::cout << "\"" << *it << "\", ";
std::cout << std::endl;

This splits 中華人民共和國 into seven distinct characters 中/華/人/民/共/和/國, rather than 中華/人民/共和國 as expected. The documentation of ICU, which Boost is compiled against, claims that Chinese should work out of the box, using a dictionary-based tokenizer to split phrases correctly. The Japanese test phrase "生きるか死ぬか、それが問題だ。" does work in the code above with the "ja_JP.UTF-8" locale, but that tokenization does not depend on a dictionary, only on kanji/kana boundaries.

I've tried the same code directly in ICU as suggested here, but the results are the same.

#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <cstdio>

using namespace icu;

// Build the string from UTF-8 explicitly rather than relying on the default codepage
UnicodeString text = UnicodeString::fromUTF8("中華人民共和國");
UErrorCode status = U_ZERO_ERROR;
BreakIterator* bi = BreakIterator::createWordInstance(Locale::getChinese(), status);
bi->setText(text);
int32_t p = bi->first();
while (p != BreakIterator::DONE) {
    printf("Boundary at position %d\n", p);
    p = bi->next();
}
delete bi;
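
To make the output easier to compare, I also print the text between adjacent boundaries rather than just the offsets (a minimal variant, assuming ICU ≥ 4.4 for UnicodeString::tempSubStringBetween):

#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <memory>
#include <string>
#include <cstdio>

using namespace icu;

UnicodeString text = UnicodeString::fromUTF8("中華人民共和國");
UErrorCode status = U_ZERO_ERROR;
std::unique_ptr<BreakIterator> bi(
    BreakIterator::createWordInstance(Locale::getChinese(), status));
bi->setText(text);
// Walk adjacent boundary pairs and print the token between them
int32_t start = bi->first();
for (int32_t end = bi->next(); end != BreakIterator::DONE;
     start = end, end = bi->next()) {
    std::string token;
    text.tempSubStringBetween(start, end).toUTF8String(token);
    printf("\"%s\", ", token.c_str());
}
printf("\n");

This still prints each character as its own token.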

Any idea what I'm doing wrong?

asked Mar 13 '15 by Uri Granta

1 Answer

You are most likely using an ICU version prior to ICU 50, the first release to support dictionary-based Chinese word segmentation (note that ICU's version numbering jumped from 4.8 directly to 49, so there was never a 5.0).
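
You can confirm which version your binary actually links against by comparing the compile-time and runtime ICU versions; a quick check:

#include <unicode/uversion.h>
#include <cstdio>

// Compile-time version of the ICU headers
printf("Compiled against ICU %s\n", U_ICU_VERSION);

// Runtime version of the ICU library actually loaded
UVersionInfo ver;
char buf[U_MAX_VERSION_STRING_LENGTH];
u_getVersion(ver);
u_versionToString(ver, buf);
printf("Running against ICU %s\n", buf);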

Also, note that Boost.Locale uses ICU as its default localization backend, which is why the two tests mirror each other's results.
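
If you want to rule out a non-ICU backend, you can list the available backends and select ICU explicitly before creating any locales; a minimal sketch using boost::locale::localization_backend_manager:

#include <boost/locale.hpp>
#include <boost/locale/localization_backend.hpp>
#include <iostream>
#include <string>

namespace bl = boost::locale;

// List the backends this Boost.Locale build was compiled with
bl::localization_backend_manager mgr = bl::localization_backend_manager::global();
for (const std::string& name : mgr.get_all_backends())
    std::cout << "backend: " << name << "\n";

// Force ICU for all categories and reinstall the manager globally
mgr.select("icu");
bl::localization_backend_manager::global(mgr);

bl::generator gen;  // generators created after this point use ICU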

answered Sep 22 '22 by andrew231