 

How to classify documents indexed with Lucene

I have indexed a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there an easy way to classify these documents in Java?

orezvani asked Feb 27 '12 05:02




3 Answers

Classification is a broad problem in machine learning and statistics. From your question, it sounds like what you have done so far is closer to a SQL GROUP BY clause (though in Lucene). If you want the machine to classify the documents, then you need machine learning algorithms such as neural networks, Bayesian classifiers, SVMs, etc. There are excellent Java libraries for these tasks. For this to work you will need features (a set of attributes extracted from the data) on which you can train your algorithm so that it can predict the classification label.
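To make "features" and "training" concrete, here is a minimal sketch of a multinomial naive Bayes classifier in plain Java. It is only an illustration, not the Weka or Lucene implementation: the class name TinyNaiveBayes, the tokenizer, and the sample texts are all made up for this example, and the features are simply lower-cased word counts.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only (not part of Lucene or Weka).
class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Feature extraction: lower-case the text and split on non-letter characters.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z]+");
    }

    // Training: count documents per label and word occurrences per label.
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            counts.merge(w, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Prediction: pick the label with the highest log-probability,
    // using add-one (Laplace) smoothing for unseen words.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = wordsPerLabel.getOrDefault(label, 0);
            for (String w : tokenize(text)) {
                if (w.isEmpty()) continue;
                int c = counts.getOrDefault(w, 0);
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best; // null if nothing has been trained yet
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("sports", "the team won the football match");
        nb.train("tech", "the new laptop has a fast processor");
        System.out.println(nb.classify("football match"));
    }
}
```

In your case the training calls would be fed from the documents that already have a category, and classify would be run on the content of the uncategorized ones.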

There are some good APIs in Java that let you concentrate on code without going too deep into the mathematical theory behind those algorithms (though knowing it would be very advantageous). Weka is good. I also came across a couple of books from Manning that handle these tasks well:

Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/

Chapter 5 (Classification) of Algorithms of the Intelligent Web: http://www.manning.com/marmanis/

These are excellent material (for Java people) on classification, particularly suited to people who don't want to dive into the theory (though it is essential :)) and just want working code quickly.

Collective Intelligence in Action solves the classification problem using JDM and Weka, so have a look at those two for your task.

Yavar answered Oct 29 '22 11:10


Yes, you can use similarity queries such as the one implemented by the MoreLikeThisQuery class for this kind of thing (assuming the documents in your Lucene index have some large text field). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.

To turn your Lucene index into a text classifier you have two options:

  1. For any new text to classify, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors", and pick the 3 most frequent categories among those similar documents (for instance).

  2. Alternatively, you can index a new set of aggregate documents, one for each category, by concatenating (all or a sample of) the text of the documents in that category. Then run the similarity query with your input text directly on those "fake" documents.

The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
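The voting step of the first strategy can be sketched in plain Java as follows. This assumes you have already run the similarity query and collected the category of each returned neighbor into a list, in rank order; the class and method names (NeighborVote, topCategories) are hypothetical, and the Lucene query itself is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only (not a Lucene class).
class NeighborVote {
    // neighborCategories: the category of each similar document, best match first.
    // Returns up to n categories ordered by how often they occur among the neighbors.
    static List<String> topCategories(List<String> neighborCategories, int n) {
        // LinkedHashMap keeps first-seen order, so ties go to the better-ranked neighbor.
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (String category : neighborCategories) {
            votes.merge(category, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(votes.entrySet());
        // List.sort is stable, which preserves the first-seen order among equal counts.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (result.size() == n) break;
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> neighbors =
                Arrays.asList("sports", "tech", "sports", "politics", "tech", "sports");
        System.out.println(topCategories(neighbors, 2)); // prints [sports, tech]
    }
}
```

A plain count treats every neighbor equally; weighting each vote by the neighbor's similarity score is a common variation, but it is not part of this sketch.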

If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
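To illustrate the second option without a Lucene index, here is a rough sketch that builds one aggregate text per category and compares the input against each of them. Note the substitution: it uses plain term-frequency cosine similarity as a stand-in for Lucene's similarity query, so scores will differ from what MoreLikeThis would produce. The class name AggregateCategoryIndex and the sample texts are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; cosine similarity stands in for Lucene scoring.
class AggregateCategoryIndex {
    // One concatenated "fake" document per category.
    private final Map<String, StringBuilder> aggregateText = new HashMap<>();

    void add(String category, String content) {
        aggregateText.computeIfAbsent(category, k -> new StringBuilder())
                .append(content).append(' ');
    }

    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) tf.merge(w, 1, Integer::sum);
        }
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the category whose aggregate document is most similar to the input,
    // or null if no category has been added.
    String classify(String input) {
        Map<String, Integer> query = termFrequencies(input);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, StringBuilder> e : aggregateText.entrySet()) {
            double score = cosine(query, termFrequencies(e.getValue().toString()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        AggregateCategoryIndex index = new AggregateCategoryIndex();
        index.add("sports", "football match goal");
        index.add("tech", "laptop processor software");
        System.out.println(index.classify("new software for my laptop"));
    }
}
```

With this layout each classification compares the input against one document per category instead of tallying votes over many neighbors, which is where the expected speed-up for a large number of categories would come from.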

You might also find this blog post interesting.

If you want to use Solr, you need to enable the MoreLikeThisHandler and set termVectors=true on the content field.

The sunburnt Solr client for Python is able to perform MLT queries. Here is a prototype Python classifier that uses Solr for classification, using an index of Wikipedia categories:

https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py

ogrisel answered Oct 29 '22 12:10


As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-Nearest Neighbor classifier (based on the MoreLikeThis class) and a perceptron-based classifier.

The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.

approxiblue answered Oct 29 '22 12:10