I want to get the word count of a String. It's as simple as that. The catch is that the string can be in any language, and I can't predict which one.
So I need a function with the signature int getWordCount(String)
that produces output like this:
getWordCount("供应商代发发货") => 7
getWordCount("This is a sentence") => 4
Any help on how to proceed would be appreciated :)
The standard API provides BreakIterator for this sort of boundary analysis, but with Oracle Java 7's locale support it doesn't split the sample string.
When I used the ICU4J v51.1 BreakIterator, it broke the sample into [供应, 商代, 发, 发, 货].
import com.ibm.icu.text.BreakIterator; // ICU4J, not java.text.BreakIterator
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

String sentence = "\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27"; // 供应商代发发货
BreakIterator iterator = BreakIterator.getWordInstance(Locale.CHINESE);
iterator.setText(sentence);
List<String> words = new ArrayList<>();
// Walk consecutive boundaries; each [start, end) pair is one segment.
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
    words.add(sentence.substring(start, end));
    start = end;
    end = iterator.next();
}
System.out.println(words);
Note: I used Google Translate to guess that "供应商代发发货" is Chinese. I don't speak the language, so I can't comment on the correctness of the output.
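To get from the loop above to the int getWordCount(String) the question asks for, one option is to count only the segments that contain a letter or digit, since a word BreakIterator also returns whitespace and punctuation as segments. The following is a minimal sketch, not a definitive implementation. It uses java.text.BreakIterator so that it runs without extra dependencies; the class name, the extra Locale parameter, and the containsLetterOrDigit helper are my own choices for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorWordCount {

    // Counts the segments between word boundaries that contain at least one
    // letter or digit, so whitespace and punctuation segments are skipped.
    static int getWordCount(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);
        int count = 0;
        int start = iterator.first();
        int end = iterator.next();
        while (end != BreakIterator.DONE) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
            start = end;
            end = iterator.next();
        }
        return count;
    }

    // Hypothetical helper: true if any code point in [start, end) is a letter or digit.
    static boolean containsLetterOrDigit(String text, int start, int end) {
        int i = start;
        while (i < end) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("This is a sentence", Locale.ENGLISH)); // prints 4
        // The Chinese result depends on the BreakIterator implementation and
        // its version, so no expected value is given here.
        System.out.println(getWordCount("\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27", Locale.CHINESE));
    }
}
```

Swapping the import for com.ibm.icu.text.BreakIterator should be enough to get ICU4J's segmentation instead; with the ICU4J v51.1 output shown above, that would give 5 for the sample string.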
The concept of a "word" may be trivial or complex. Here is how the Apache Stanbol documentation puts it:
Word Tokenization: The detection of single words is required by the Stanbol Enhancer to process text. While this is trivial for most languages it is a rather complex task for some eastern languages, e.g. Chinese, Japanese, Korean. If not otherwise configured, Stanbol will use whitespaces to tokenize words.
So if your concept of a word is linguistic rather than syntactic, you should use an NLP toolkit.
My preferred Java solution is Apache OpenNLP.
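To see why the whitespace default mentioned in the Stanbol quote is not enough here, this is a minimal sketch of whitespace tokenization in plain Java. The class name is made up for illustration, and this is not Stanbol's or OpenNLP's actual code.

```java
class WhitespaceWordCount {

    // Splits on runs of whitespace; an empty or blank string has zero words.
    static int getWordCount(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("This is a sentence")); // prints 4
        // The Chinese sample contains no whitespace, so it counts as one word.
        System.out.println(getWordCount("\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27")); // prints 1
    }
}
```

The English sentence comes out right, but the Chinese sample collapses into a single token, which is the case where a linguistic tokenizer is needed.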
NOTE: I used http://www.mdbg.net/chindict/chindict.php?page=worddict to tokenize your example. It implies there are 4 words, not 7. I have pasted its (rather fragmented) output:
Original text (Pīnyīn; Traditional; HSK): English definition
供应商 (gōngyìngshāng; 供應商): supplier
代 (dài): to substitute / to act on behalf of others / to replace / generation / dynasty / age / period / (historical) era / (geological) eon
发 (fā; 發; HSK 4): to send out / to show (one's feeling) / to issue / to develop / classifier for gunshots (rounds)
    发 (fà; 髮): hair / Taiwan pr. [fa3]
发货 (fāhuò; 發貨): to dispatch / to send out goods
The first three characters (供应商) appear to form a single word.
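For comparison, the expected value of 7 in the question matches a much simpler rule: treat every Chinese character as its own word and every run of other letters or digits as one word. Below is a minimal sketch of that rule, assuming it is acceptable for your use case. It is not what the dictionary lookup above does (that gives 4), and the class name is my own.

```java
class PerCharacterWordCount {

    // Each Han code point counts as one word; each run of other letters or
    // digits counts as one word; everything else acts as a separator.
    static int getWordCount(String text) {
        int count = 0;
        boolean inWord = false;
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN) {
                count++;
                inWord = false;
            } else if (Character.isLetterOrDigit(codePoint)) {
                if (!inWord) {
                    count++;
                }
                inWord = true;
            } else {
                inWord = false;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27")); // prints 7
        System.out.println(getWordCount("This is a sentence")); // prints 4
    }
}
```

This reproduces both sample outputs from the question, but it only checks for the Han script, so Japanese kana, Korean, Thai and other scripts without spaces between words would still need a proper tokenizer.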