Tokenization and indexing with Lucene: how to handle an external tokenizer and part-of-speech tags?

I would like to build my own tokenizer (from the Lucene point of view) or my own analyzer - I am not sure which one. I have already written code that tokenizes my documents into words, as a List<String> or a List<Word>, where Word is just a container class with three public String fields: word, pos and lemma (pos stands for the part-of-speech tag).
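
For reference, a minimal sketch of that container class (illustration only, nothing Lucene-specific about it):

// A plain container for one token produced by the external tagger.
public class Word {
    public String word;   // surface form
    public String pos;    // part-of-speech tag
    public String lemma;  // lemmatized form

    public Word(String word, String pos, String lemma) {
        this.word = word;
        this.pos = pos;
        this.lemma = lemma;
    }
}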

I am not sure yet what I will index - maybe only Word.lemma, or something like Word.lemma + '#' + Word.pos - and I will probably do some filtering against a stop-word list based on part-of-speech.
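
Just to illustrate what one index term could look like under that scheme (the stop list of POS tags below is a made-up example):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: turn one Word into an index term, or null if filtered out.
class TermBuilder {
    // Example stop list based on POS tags (determiners, punctuation, ...); purely illustrative.
    private static final Set<String> STOP_POS = new HashSet<String>(Arrays.asList("DT", "PUNCT"));

    static String toTerm(Word w) {
        if (STOP_POS.contains(w.pos)) {
            return null;                // drop stop words by part-of-speech
        }
        return w.lemma + "#" + w.pos;   // e.g. "be#VBZ" for the surface form "is"
    }
}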

By the way, here is what I do not understand: I am not sure where I should plug into the Lucene API.

Should I wrap my own tokenizer inside a new Tokenizer? Should I rewrite TokenStream? Should I consider that this is the job of the analyzer rather than the tokenizer? Or should I bypass everything and build my index directly by adding my words to the index myself, using IndexWriter, Fieldable and so on? (If so, do you know of any documentation on how to create one's own index from scratch while bypassing the analysis process?)
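
If I understand the 3.x API correctly, the closest thing to a bypass is that a Field can be built directly from a pre-built TokenStream, so the analyzer is skipped for that field - a rough sketch, where the field name is just an example:

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch of the "half bypass": hand Lucene a pre-built TokenStream instead of raw text.
// The analyzer is skipped for this field, but the TokenStream contract is still needed.
class PreTokenizedIndexing {
    static void addDocument(IndexWriter writer, TokenStream myTokens) throws IOException {
        Document doc = new Document();
        // Field built directly from a TokenStream (Lucene 3.x constructor)
        doc.add(new Field("contents", myTokens, Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
    }
}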

best regards

EDIT: maybe the simplest way would be to org.apache.commons.lang.StringUtils.join my Words with a space at the output of my personal tokenizer/analyzer and rely on the WhitespaceTokenizer to feed Lucene (plus the other classical filters)?
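
A rough sketch of that fallback (assuming the tagger output is a List<Word> as above; the WhitespaceTokenizer would then simply re-split on the spaces I inserted):

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;

// Fallback sketch: flatten the tagger output to "lemma#pos" tokens joined by spaces,
// then let a plain WhitespaceAnalyzer/WhitespaceTokenizer re-tokenize that string.
class FlattenForWhitespace {
    static String flatten(List<Word> words) {
        List<String> terms = new ArrayList<String>();
        for (Word w : words) {
            terms.add(w.lemma + "#" + w.pos);
        }
        return StringUtils.join(terms, " "); // e.g. "the#DT cat#NN sit#VBZ"
    }
}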

EDIT: so, I have read the EnglishLemmaTokenizer pointed to by Larsmans... but where I am still confused is that I end my own analysis/tokenization process with a complete List<Word> (the Word class wrapping .form/.pos/.lemma). This process relies on an external binary that I have wrapped in Java (this is a must, I cannot do otherwise - it does not work in a consumer/streaming fashion, I get the full list as a result), and I still do not see how I should wrap it again to get back into the normal Lucene analysis process.

Also, I will be using the TermVector feature with TF-IDF-like scoring (maybe redefining my own), and I may also be interested in proximity searching, so discarding some words based on their part-of-speech before handing them to a Lucene built-in tokenizer or analyzer seems like a bad idea. And I have difficulty thinking of a "proper" way to wrap a Word.form / Word.pos / Word.lemma (or any other Word.anyOtherInterestingAttribute) into the Lucene way of doing things.
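
To make the proximity concern concrete, here is the kind of sloppy phrase query I have in mind (Lucene 3.x style; field name and terms are only examples) - it only stays meaningful if the position increments are kept consistent when words are dropped:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

class ProximityExample {
    // Proximity search sketch: match "cat#NN" and "sit#VBZ" within two positions of each other.
    static PhraseQuery catNearSit() {
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("contents", "cat#NN"));
        query.add(new Term("contents", "sit#VBZ"));
        query.setSlop(2); // dropped words must still advance positions for this to behave as expected
        return query;
    }
}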

EDIT: BTW, here is a piece of code that I wrote, inspired by the one from @Larsmans:

import java.io.IOException;
import java.io.Reader;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import com.google.common.io.CharStreams;

class MyLuceneTokenizer extends TokenStream {

    private final PositionIncrementAttribute posIncrement;
    private final CharTermAttribute termAttribute;

    private final List<TaggedWord> tagged;
    private int position;

    public MyLuceneTokenizer(Reader input, String language, String pathToExternalBinary)
            throws IOException {
        super();

        posIncrement = addAttribute(PositionIncrementAttribute.class);
        termAttribute = addAttribute(CharTermAttribute.class); // TermAttribute is deprecated!

        // read the whole Reader into a String,
        // see http://stackoverflow.com/questions/309424/in-java-how-do-i-read-convert-an-inputstream-to-a-string
        String text = CharStreams.toString(input);
        tagged = MyTaggerWrapper.doTagging(text, language, pathToExternalBinary);
        position = 0;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        int increment = 1; // grows when a word is filtered out, so positions stay meaningful

        while (position < tagged.size()) {
            String form = tagged.get(position).word;
            String pos = tagged.get(position).pos;
            String lemma = tagged.get(position).lemma;
            position++;

            // POS-based filtering logic should go here (form and pos are available for it)...
            // BTW we have broken the idea behind the Lucene nested filters or analyzers!
            String kept = lemma;

            if (kept != null) {
                posIncrement.setPositionIncrement(increment);
                char[] asCharArray = kept.toCharArray();
                termAttribute.copyBuffer(asCharArray, 0, asCharArray.length);
                //termAttribute.setTermBuffer(kept); // old TermAttribute API
                return true;
            }

            increment++; // word filtered out: skip it but keep counting positions
        }

        return false; // end of the tagged word list
    }
}

class MyLuceneAnalyzer extends Analyzer {
    private final String language;
    private final String pathToExternalBinary;

    public MyLuceneAnalyzer(String language, String pathToExternalBinary) {
        this.language = language;
        this.pathToExternalBinary = pathToExternalBinary;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader input) {
        try {
            return new MyLuceneTokenizer(input, language, pathToExternalBinary);
        } catch (IOException e) {
            throw new RuntimeException("external tagging failed", e);
        }
    }
}
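
For completeness, a rough usage sketch of the analyzer above with an IndexWriter and a term-vector-enabled field (Lucene 3.x style; directory, version and paths are just placeholders):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class IndexingExample {
    static void index(String text) throws IOException {
        Directory dir = new RAMDirectory(); // placeholder; use FSDirectory for a real index
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
                new MyLuceneAnalyzer("en", "/path/to/external/binary")); // hypothetical paths
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS)); // term vectors for TF-IDF style scoring
        writer.addDocument(doc);
        writer.close();
    }
}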
asked May 18 '12 by user1340802

1 Answer

There are various options here, but when I tried to wrap a POS tagger in Lucene, I found that implementing a new TokenStream and wrapping that inside a new Analyzer was the easiest option. In any case, mucking with IndexWriter directly seems like a bad idea. You can find my code on my GitHub.

answered Oct 15 '22 by Fred Foo