I would like to build my own - I am not sure which one yet - tokenizer (from the Lucene point of view) or my own analyzer. I have already written code that tokenizes my documents into words (as a List<String> or a List<Word>, where Word is just a container class with 3 public Strings: word, pos, lemma - pos stands for the part-of-speech tag).
I'm not sure what I am going to index; maybe only "Word.lemma", or something like "Word.lemma + '#' + Word.pos". I will probably also do some filtering against a stop word list based on the part-of-speech.
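For reference, here is a minimal sketch of that Word container (nothing but the three public Strings described above; the constructor is just for convenience):

class Word {
    public String word;   // surface form as it appears in the text
    public String pos;    // part-of-speech tag from the tagger
    public String lemma;  // dictionary (lemmatized) form of the word

    public Word(String word, String pos, String lemma) {
        this.word = word;
        this.pos = pos;
        this.lemma = lemma;
    }
}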
BTW, here is where I am confused: I am not sure where I should plug into the Lucene API.
Should I wrap my own tokenizer inside a new Tokenizer? Should I rewrite TokenStream? Should I consider this the job of the analyzer rather than the tokenizer? Or should I bypass everything and build my index directly, adding my words to the index myself using IndexWriter, Fieldable and so on? (If so, do you know of any documentation on how to create one's own index from scratch while bypassing the analysis process?)
best regards
EDIT: maybe the simplest way would be to org.apache.commons.lang.StringUtils.join my Word-s with a space at the exit of my own tokenizer/analyzer, and rely on the WhitespaceTokenizer to feed Lucene (and the other classical filters)?
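To make that concrete, here is a minimal sketch of that join-and-delegate idea, using the Word class above; the lemma + '#' + pos term shape is just one of the options mentioned earlier, and the resulting string would then be indexed through the stock WhitespaceAnalyzer:

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;

public class PreTokenizedText {
    // Turn the tagger output into a single space-separated string that a
    // WhitespaceTokenizer will split back into exactly these terms.
    public static String toIndexableString(List<Word> words) {
        List<String> terms = new ArrayList<String>();
        for (Word w : words) {
            terms.add(w.lemma + "#" + w.pos);
        }
        return StringUtils.join(terms, " ");
    }
}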
EDIT: so, I have read the EnglishLemmaTokenizer pointed to by Larsmans... but what still confuses me is that my own analysis/tokenization process ends with a complete List<Word> (the Word class wrapping .form/.pos/.lemma). This process relies on an external binary that I have wrapped in Java (this is a must / cannot be done otherwise - it is not consumed in a streaming fashion, I get the full list as a result), and I still do not see how I should wrap that again to get back into the normal Lucene analysis process.
Also, I will be using the TermVector feature with TF.IDF-like scoring (maybe redefining my own), and I may also be interested in proximity searching; thus, discarding some words based on their part-of-speech before handing them to a Lucene built-in tokenizer or analyzer seems like a bad idea. And I have difficulty thinking of a "proper" way to map Word.form / Word.pos / Word.lemma (or any other Word.anyOtherInterestingAttribute) onto the Lucene way of doing things.
EDIT: BTW, here is a piece of code that I wrote, inspired by the one from @Larsmans:
import java.io.IOException;
import java.io.Reader;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import com.google.common.io.CharStreams;

class MyLuceneTokenizer extends TokenStream {
    private final PositionIncrementAttribute posIncrement;
    private final CharTermAttribute termAttribute; // TermAttribute is deprecated, use CharTermAttribute
    private final List<TaggedWord> tagged;
    private int position;

    public MyLuceneTokenizer(Reader input, String language, String pathToExternalBinary) {
        super();
        posIncrement = addAttribute(PositionIncrementAttribute.class);
        termAttribute = addAttribute(CharTermAttribute.class);
        String text;
        try {
            // The external tagger needs the whole text, so read the Reader fully.
            // See http://stackoverflow.com/questions/309424/in-java-how-do-i-read-convert-an-inputstream-to-a-string
            text = CharStreams.toString(input);
        } catch (IOException e) {
            throw new RuntimeException("could not read input", e);
        }
        tagged = MyTaggerWrapper.doTagging(text, language, pathToExternalBinary);
        position = 0;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // Loop so that a word discarded by the filtering below is simply skipped,
        // instead of leaving the stream stuck on the same position forever.
        while (position < tagged.size()) {
            TaggedWord current = tagged.get(position);
            position++;

            String form = current.word;
            String pos = current.pos;
            String lemma = current.lemma;

            // Filtering logic should go here (e.g. dropping words by part-of-speech)...
            // BTW this breaks the idea behind Lucene's nested filters and analyzers!
            String kept = lemma;
            if (kept == null) {
                continue; // discarded: try the next tagged word
            }

            clearAttributes();
            // The increment will probably change later, depending on POS filtering
            // or on inserting several tokens at the same position.
            posIncrement.setPositionIncrement(1);
            termAttribute.copyBuffer(kept.toCharArray(), 0, kept.length());
            // termAttribute.setTermBuffer(kept); // old, deprecated API
            return true;
        }
        return false;
    }
}
class MyLuceneAnalyzer extends Analyzer {
    private final String language;
    private final String pathToExternalBinary;

    public MyLuceneAnalyzer(String language, String pathToExternalBinary) {
        this.language = language;
        this.pathToExternalBinary = pathToExternalBinary;
    }

    @Override
    public TokenStream tokenStream(String fieldname, Reader input) {
        // Lucene 3.x-style Analyzer: just hand the Reader over to the custom tokenizer.
        return new MyLuceneTokenizer(input, language, pathToExternalBinary);
    }
}
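Regarding the earlier worry about mapping Word.form / Word.pos / Word.lemma onto "the Lucene way": one option (only a sketch, and only one of several possible conventions) is to emit the POS tag as an extra token at the same position as the lemma, with a position increment of 0, so term vectors and proximity queries still line up on the lemma positions. Inside MyLuceneTokenizer, reusing the fields declared above, that could look roughly like this (the "POS#" prefix is an arbitrary convention I made up to keep tag terms apart from lemma terms):

    private String pendingPos; // POS of the last emitted lemma, not yet output as its own token

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        if (pendingPos != null) {
            // Second token for the previous word: same position as its lemma.
            posIncrement.setPositionIncrement(0);
            termAttribute.append("POS#" + pendingPos); // e.g. "POS#NN"
            pendingPos = null;
            return true;
        }
        while (position < tagged.size()) {
            TaggedWord current = tagged.get(position);
            position++;
            if (current.lemma == null) {
                continue; // filtered out
            }
            posIncrement.setPositionIncrement(1);
            termAttribute.append(current.lemma);
            pendingPos = current.pos; // remember it for the next call
            return true;
        }
        return false;
    }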
There are various options here, but when I tried to wrap a POS tagger in Lucene, I found that implementing a new TokenStream and wrapping that inside a new Analyzer was the easiest option. In any case, mucking with IndexWriter directly seems like a bad idea. You can find my code on my GitHub.
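For completeness, here is a rough sketch of how such an Analyzer would then be plugged in on the indexing side, with term vectors enabled for the TF.IDF/proximity use case. It assumes the Lucene 3.x API that matches the code above (Version.LUCENE_36 is a guess at the exact version), and the paths and field name are made up:

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingExample {
    public static void main(String[] args) throws Exception {
        MyLuceneAnalyzer analyzer = new MyLuceneAnalyzer("en", "/path/to/external/tagger");
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")), config);

        Document doc = new Document();
        // Term vectors with positions and offsets, for TF.IDF-style scoring
        // and proximity-oriented post-processing on the indexed lemmas.
        doc.add(new Field("content", "The cats were sleeping on the mat.",
                Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        writer.close();
    }
}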