
Get word count from a string in Unicode (in any language)

I want to get word count from a String. It's as simple as that. The catch is that the string can be in an unpredictable language.

So I need a function with the signature int getWordCount(String) that produces the following sample output:

getWordCount("供应商代发发货") => 7
getWordCount("This is a sentence") => 4

Any help on how to proceed would be appreciated :)

asked May 19 '13 17:05 by jaibatrik


2 Answers

The standard API provides java.text.BreakIterator for this sort of boundary analysis, but with the Oracle Java 7 locale support it doesn't break the sample string into words.

When I used the ICU4J v51.1 BreakIterator instead, it broke the sample into [供应, 商代, 发, 发, 货].

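import com.ibm.icu.text.BreakIterator; // ICU4J class, not java.text.BreakIterator
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// "供应商代发发货" written with Unicode escapes
String sentence = "\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27";
BreakIterator iterator = BreakIterator.getWordInstance(Locale.CHINESE);
iterator.setText(sentence);

// Collect the text between each pair of consecutive word boundaries
List<String> words = new ArrayList<>();
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
  words.add(sentence.substring(start, end));
  start = end;
  end = iterator.next();
}
System.out.println(words);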

Note: I used Google Translate to guess that "供应商代发发货" was Chinese. Obviously, I don't speak the language so can't comment on the correctness of the output.
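To get from a list of segments to the int getWordCount(String) the question asks for, the same boundary loop can simply count segments. The following is a minimal sketch, not a definitive implementation: it uses java.text.BreakIterator so that it runs without extra dependencies, and the class name, the extra Locale parameter and the containsLetterOrDigit helper are my own assumptions. Segments without any letter or digit (spaces, punctuation) are skipped, since the iterator reports those as segments too. Swapping the import for ICU4J's com.ibm.icu.text.BreakIterator should give the dictionary-based segmentation described above.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BreakIteratorWordCount {

    // Counts the boundary-delimited segments that contain at least one letter or digit.
    static int getWordCount(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        int count = 0;
        int start = iterator.first();
        int end = iterator.next();
        while (end != BreakIterator.DONE) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
            start = end;
            end = iterator.next();
        }
        return count;
    }

    // Hypothetical helper: true if text[start, end) has a letter or digit code point.
    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    public static void main(String[] args) {
        // prints 4
        System.out.println(getWordCount("This is a sentence", Locale.ENGLISH));
    }
}
```

For the Chinese sample, the count depends entirely on which BreakIterator implementation and version is in use, so no expected value is given for it here.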

answered Oct 22 '22 12:10 by McDowell


The concept of a "word" may be trivial or complex. Here is how the Apache Stanbol toolkit documentation puts it:

Word Tokenization: The detection of single words is required by the Stanbol Enhancer to process text. While this is trivial for most languages it is a rather complex task for some eastern languages, e.g. Chinese, Japanese, Korean. If not otherwise configured, Stanbol will use whitespaces to tokenize words.

So if your concept of a word is linguistic rather than syntactic, you should use an NLP toolkit.

My preferred Java solution is Apache OpenNLP.
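To illustrate why whitespace tokenization is not enough here, this is a minimal sketch of the whitespace default mentioned in the quotation, using only the JDK. It is not Stanbol's or OpenNLP's code, and the class and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

final class WhitespaceWordCount {

    // \s with UNICODE_CHARACTER_CLASS matches Unicode white space, not just ASCII.
    private static final Pattern WHITESPACE =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    // Counts the non-empty runs of characters separated by white space.
    static int getWordCount(String text) {
        int count = 0;
        for (String token : WHITESPACE.split(text)) {
            if (!token.isEmpty()) { // leading white space yields one empty token
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("This is a sentence")); // prints 4
        System.out.println(getWordCount("供应商代发发货"));        // prints 1
    }
}
```

The Chinese sample contains no white space at all, so this approach sees it as a single token, whatever the right linguistic answer is.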

NOTE: I used http://www.mdbg.net/chindict/chindict.php?page=worddict to tokenize your example. It suggests there are 4 words, not 7. Here are the dictionary entries it returned for each word:

供应商 (traditional 供應商), gōngyìngshāng
    supplier

代, dài
    to substitute / to act on behalf of others / to replace / generation / dynasty / age / period / (historical) era / (geological) eon

发, with two dictionary entries:
    fā (traditional 發, HSK 4): to send out / to show (one's feeling) / to issue / to develop / classifier for gunshots (rounds)
    fà (traditional 髮): hair / Taiwan pr. [fa3]

发货 (traditional 發貨), fāhuò
    to dispatch / to send out goods

The first three characters (供应商, "supplier") appear to form a single word.
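For comparison, the 7 expected in the question is what you get if every Han character counts as a word of its own. A rough sketch of that purely syntactic rule, assuming that is what the question intends, is below. The class name is made up for the example, and it relies only on Character.UnicodeScript from the JDK (Java 7 and later). It is a counting convention, not linguistic tokenization.

```java
final class PerIdeographWordCount {

    // Each Han ideograph counts as one word; any other run of letters or digits counts once.
    static int getWordCount(String text) {
        int count = 0;
        boolean inWord = false; // true while inside a run of non-Han letters/digits
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN) {
                count++;
                inWord = false;
            } else if (Character.isLetterOrDigit(codePoint)) {
                if (!inWord) {
                    count++;
                }
                inWord = true;
            } else {
                inWord = false; // white space or punctuation ends the current run
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("供应商代发发货"));        // prints 7
        System.out.println(getWordCount("This is a sentence")); // prints 4
    }
}
```

This reproduces both sample outputs from the question, but it will disagree with a dictionary-based tokenizer whenever a word spans several characters, as 供应商 does.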

answered Oct 22 '22 14:10 by peter.murray.rust