I am using Lucene 4.3.0 and want to tokenize documents containing both English and Japanese characters.
An example is "LEICA S2 カタログ (新品)".
The StandardAnalyzer produces: [leica] [s2] [カタログ] [新] [品]
The JapaneseAnalyzer produces: [leica] [s] [2] [カタログ] [新品]
For my project, the StandardAnalyzer is better on the English parts, e.g. [s2] is better than [s] [2], while the JapaneseAnalyzer is better on the Japanese parts, e.g. [新品] is better than [新] [品]. In addition, the JapaneseAnalyzer has a useful feature that converts fullwidth characters to halfwidth, e.g. "２" to "2".
If I want the tokens to be [leica] [s2] [カタログ] [新品], that means:
1) English words and numbers are tokenized as by the StandardAnalyzer: [leica] [s2]
2) Japanese is tokenized as by the JapaneseAnalyzer: [カタログ] [新品]
3) Fullwidth characters are converted to halfwidth by a filter: [ｓ２] => [s2]
How can I implement this custom analyzer?
The first thing I would try is playing with the arguments passed to the JapaneseAnalyzer, particularly the JapaneseTokenizer.Mode (I know precisely nothing about the structure of the Japanese language, so no help from me on the intent of those options).
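For instance, a minimal sketch of constructing the JapaneseAnalyzer with an explicit mode (assuming the Lucene 4.3 constructor signature; null means no user dictionary):

// Swap in Mode.NORMAL or Mode.EXTENDED and compare the tokens
// each mode produces for "LEICA S2 カタログ (新品)".
JapaneseAnalyzer analyzer = new JapaneseAnalyzer(
        Version.LUCENE_43,
        null,
        JapaneseTokenizer.Mode.SEARCH,
        JapaneseAnalyzer.getDefaultStopSet(),
        JapaneseAnalyzer.getDefaultStopTags());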
Barring that, you'll need to create your own Analyzer. Unless you are willing to write your own Tokenizer, the end result may be a best effort. Creating an Analyzer is pretty simple; creating a Tokenizer means defining your own grammar, which will not be so simple.
Take a look at the code for JapaneseAnalyzer and StandardAnalyzer, particularly the call to createComponents, which is all you need to implement to create a custom Analyzer.
Say you come to the conclusion that the StandardTokenizer is right for you, but otherwise you mostly want the Japanese filter set. It might look something like this:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // For your Tokenizer, you might consider StandardTokenizer, JapaneseTokenizer, or CharTokenizer.
    Tokenizer tokenizer = new StandardTokenizer(matchVersion, reader);
    TokenStream stream = new StandardFilter(matchVersion, tokenizer);
    stream = new JapaneseBaseFormFilter(stream);
    stream = new LowerCaseFilter(matchVersion, stream); // In JapaneseAnalyzer, a LowerCaseFilter comes at the end, further proving I don't know Japanese.
    stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
    stream = new CJKWidthFilter(stream); // Note this width filter! I believe this does the char-width transform you are looking for.
    stream = new StopFilter(matchVersion, stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new PorterStemFilter(stream); // Nothing stopping you from using a second stemmer, really.
    return new TokenStreamComponents(tokenizer, stream);
}
That's a completely random implementation, from someone who doesn't understand the concerns, but hopefully it points the way toward implementing a more meaningful Analyzer. The order in which you apply filters in that chain is important, so be careful there (e.g., in English, a LowerCaseFilter is usually applied early, so that things like stemmers don't have to worry about case).
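To try the chain out, here is a rough usage sketch. CustomJapaneseAnalyzer is a hypothetical Analyzer subclass holding the createComponents above, with matchVersion, stoptags, and stopwords as fields you supply (the JapaneseAnalyzer defaults would be a reasonable start):

Analyzer analyzer = new CustomJapaneseAnalyzer(); // hypothetical wrapper class
TokenStream stream = analyzer.tokenStream("content", new StringReader("LEICA S2 カタログ (新品)"));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset(); // mandatory in Lucene 4.x before calling incrementToken()
while (stream.incrementToken()) {
    System.out.println("[" + term.toString() + "]");
}
stream.end();
stream.close();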