
Using boost::locale/ICU boundary analysis with Chinese

Using the sample code from the boost::locale documentation, I can't get the following to correctly tokenize Chinese text:

#include <boost/locale.hpp>
#include <iostream>
#include <string>

using namespace boost::locale::boundary;

boost::locale::generator gen;
std::string text = "中華人民共和國";
// Analyze word boundaries under a Chinese locale
ssegment_index map(word, text.begin(), text.end(), gen("zh_CN.UTF-8"));
for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
    std::cout << "\"" << *it << "\", ";
std::cout << std::endl;

This splits 中華人民共和國 into seven distinct characters 中/華/人/民/共/和/國, rather than 中華/人民/共和國 as expected. The documentation of ICU, which Boost is compiled against, claims that Chinese should work out of the box, using a dictionary-based tokenizer to split phrases correctly. The Japanese test phrase "生きるか死ぬか、それが問題だ。" does work in the code above with the "ja_JP.UTF-8" locale, but that tokenization does not depend on a dictionary, only on kanji/kana boundaries.

I've tried the same code directly in ICU as suggested here, but the results are the same.

#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <cstdio>

using namespace icu;

// Build the string from UTF-8 explicitly rather than relying on the default codepage
UnicodeString text = UnicodeString::fromUTF8("中華人民共和國");
UErrorCode status = U_ZERO_ERROR;
BreakIterator* bi = BreakIterator::createWordInstance(Locale::getChinese(), status);
bi->setText(text);
int32_t p = bi->first();
while (p != BreakIterator::DONE) {
    printf("Boundary at position %d\n", p);
    p = bi->next();
}
delete bi;
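
To make the output easier to compare, I also print the text between adjacent boundaries rather than just the offsets (a minimal variant, assuming ICU ≥ 4.4 for UnicodeString::tempSubStringBetween):

#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <memory>
#include <string>
#include <cstdio>

using namespace icu;

UnicodeString text = UnicodeString::fromUTF8("中華人民共和國");
UErrorCode status = U_ZERO_ERROR;
std::unique_ptr<BreakIterator> bi(
    BreakIterator::createWordInstance(Locale::getChinese(), status));
bi->setText(text);
// Walk adjacent boundary pairs and print the token between them
int32_t start = bi->first();
for (int32_t end = bi->next(); end != BreakIterator::DONE;
     start = end, end = bi->next()) {
    std::string token;
    text.tempSubStringBetween(start, end).toUTF8String(token);
    printf("\"%s\", ", token.c_str());
}
printf("\n");

This still prints each character as its own token.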

Any idea what I'm doing wrong?

asked Mar 13 '15 by Uri Granta

1 Answer

You are most likely using an ICU version prior to ICU 50, the first release to support dictionary-based Chinese word segmentation (note that ICU's version numbering jumped from 4.8 directly to 49, so there was never a 5.0).
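
You can confirm which version your binary actually links against by comparing the compile-time and runtime ICU versions; a quick check:

#include <unicode/uversion.h>
#include <cstdio>

// Compile-time version of the ICU headers
printf("Compiled against ICU %s\n", U_ICU_VERSION);

// Runtime version of the ICU library actually loaded
UVersionInfo ver;
char buf[U_MAX_VERSION_STRING_LENGTH];
u_getVersion(ver);
u_versionToString(ver, buf);
printf("Running against ICU %s\n", buf);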

Also, note that Boost.Locale uses ICU as its default localization backend, which is why the two tests mirror each other's results.
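
If you want to rule out a non-ICU backend, you can list the available backends and select ICU explicitly before creating any locales; a minimal sketch using boost::locale::localization_backend_manager:

#include <boost/locale.hpp>
#include <boost/locale/localization_backend.hpp>
#include <iostream>
#include <string>

namespace bl = boost::locale;

// List the backends this Boost.Locale build was compiled with
bl::localization_backend_manager mgr = bl::localization_backend_manager::global();
for (const std::string& name : mgr.get_all_backends())
    std::cout << "backend: " << name << "\n";

// Force ICU for all categories and reinstall the manager globally
mgr.select("icu");
bl::localization_backend_manager::global(mgr);

bl::generator gen;  // generators created after this point use ICU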

answered Sep 22 '22 by andrew231