I am writing a java application; but stuck on this point.
Basically I have a string of Chinese characters with ALSO some possible Latin chars or numbers, lets say:
查詢促進民間參與公共建設法(210BOT法).
I want to split those Chinese chars except the Latin or numbers as "BOT" above. So, at the end I will have this kind of list:
[ 查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, (, 210, BOT, 法, ), ., ]
How can I resolve this problem (for java)?
Chinese characters lies within certain Unicode ranges:
So all you basically need to do is to check if the character's codepoint lies within the known ranges. This example is a good starting point to write a stackbased parser/splitter, you only need to extend it to separate digits from latin letters, which should be obvious enough (hint: Character#isDigit()
):
Set<UnicodeBlock> chineseUnicodeBlocks = new HashSet<UnicodeBlock>() {{
add(UnicodeBlock.CJK_COMPATIBILITY);
add(UnicodeBlock.CJK_COMPATIBILITY_FORMS);
add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS);
add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT);
add(UnicodeBlock.CJK_RADICALS_SUPPLEMENT);
add(UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A);
add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B);
add(UnicodeBlock.KANGXI_RADICALS);
add(UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS);
}};
String mixedChinese = "查詢促進民間參與公共建設法(210BOT法)";
for (char c : mixedChinese.toCharArray()) {
if (chineseUnicodeBlocks.contains(UnicodeBlock.of(c))) {
System.out.println(c + " is chinese");
} else {
System.out.println(c + " is not chinese");
}
}
Good luck.
Diclaimer: I'm a complete Lucene newbie.
Using the latest version of Lucene (3.6.0 at the time of writing) I manage to get close to the result you require.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36, Collections.emptySet());
List<String> words = new ArrayList<String>();
TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(original));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
try {
tokenStream.reset(); // Resets this stream to the beginning. (Required)
while (tokenStream.incrementToken()) {
words.add(termAttribute.toString());
}
tokenStream.end(); // Perform end-of-stream operations, e.g. set the final offset.
}
finally {
tokenStream.close(); // Release resources associated with this stream.
}
The result I get is:
[查, 詢, 促, 進, 民, 間, 參, 與, 公, 共, 建, 設, 法, 210bot, 法]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With