How to classify documents indexed with Lucene

I have indexed a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there an easy way to classify these documents in Java?

asked Feb 27 '12 05:02

orezvani

3 Answers

Classification is a broad problem in machine learning and statistics. From your question, it sounds as if what you have done so far is something like an SQL GROUP BY clause (though in Lucene). If you want the machine to classify the documents, then you need machine learning algorithms such as neural networks, Bayesian classifiers, or SVMs. There are excellent Java libraries for these tasks. For this to work you need features (a set of attributes extracted from the data) on which you can train your algorithm so that it can predict the classification label.
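
As a rough illustration of what "features" can mean for text, here is a minimal sketch that turns a document into term counts (a bag of words). The class and method names are hypothetical and belong to no library; real toolkits such as Weka ship their own, more complete text-to-feature filters.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene or Weka: turns raw text into
// term-frequency features that a classifier could be trained on.
class BagOfWords {

    // Lowercases the text, splits on anything that is not a letter or digit,
    // and counts how often each remaining token occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the term counts of a small sample sentence.
        System.out.println(termFrequencies("Lucene indexes text; Lucene searches text."));
    }
}
```

Each distinct term becomes one attribute and its count becomes the value; the known category of the document is the label the algorithm learns to predict.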

There are some good Java APIs that let you concentrate on the code without going too deep into the mathematical theory behind these algorithms (though knowing the theory is a real advantage). Weka is good. I also came across a couple of books from Manning that handle these tasks well:

Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/

Chapter 5 (Classification) of Algorithms of the Intelligent Web: http://www.manning.com/marmanis/

Both are excellent material (for Java people) on classification, particularly suited to readers who don't want to dive into the theory (though it is essential :)) and just want working code quickly.

Collective Intelligence in Action solves the classification problem using JDM and Weka, so have a look at these two for your task.

answered Oct 29 '22 11:10

Yavar

Yes, you can use similarity queries, such as those implemented by the MoreLikeThisQuery class, for this kind of thing (assuming the documents in your Lucene index have a large text field). Have a look at the Javadoc of the underlying MoreLikeThis class for details on how it works.

To turn your Lucene index into a text classifier you have two options:

For any new text to classify, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors", and pick the 3 most frequent categories among those similar documents (for instance).
Alternatively, you can index a new set of aggregate documents, one for each category, by concatenating (all or a sample of) the text of the documents in that category. Then run the similarity query with your input text directly on those "fake" documents.

The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
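
The voting step of the first strategy can be sketched as follows. This is only an illustration under assumptions: retrieving the neighbors (for example with a MoreLikeThis query) is taken as already done, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the voting step of a k-nearest-neighbors classifier.
// Finding the neighbors is assumed to have happened already; each inner list
// holds the categories of one similar document.
class NeighborVote {

    // Sums category occurrences over all neighbors and returns the topN
    // most frequent categories, most frequent first.
    static List<String> topCategories(List<List<String>> neighborCategories, int topN) {
        Map<String, Integer> votes = new HashMap<>();
        for (List<String> categories : neighborCategories) {
            for (String category : categories) {
                votes.merge(category, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(votes.entrySet());
        // Most votes first; ties are broken alphabetically so the result is deterministic.
        ranked.sort((a, b) -> {
            int byVotes = Integer.compare(b.getValue(), a.getValue());
            return byVotes != 0 ? byVotes : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, ranked.size()); i++) {
            result.add(ranked.get(i).getKey());
        }
        return result;
    }
}
```

A plain count treats every neighbor equally. Weighting each vote by the neighbor's similarity score is a common variation, but it is not shown here.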

If you have many categories (say, more than 1000) the second option might be better (faster to classify). I have not run a proper performance evaluation, though.
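
To make the second option more concrete, here is a dependency-free sketch of the idea. It is an assumption-laden stand-in, not Lucene code: each category's "fake" document is kept as an in-memory term-count map, and plain cosine similarity over term counts replaces Lucene's own scoring. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the "one aggregate document per category" strategy.
// Term-frequency cosine similarity stands in for Lucene's scoring, which is
// an assumption made to keep the example free of dependencies.
class AggregateCategoryClassifier {

    private final Map<String, Map<String, Integer>> categoryTerms = new HashMap<>();

    // Adds the terms of one labeled document to its category's aggregate.
    void add(String category, String text) {
        Map<String, Integer> aggregate = categoryTerms.computeIfAbsent(category, c -> new HashMap<>());
        for (Map.Entry<String, Integer> e : termFrequencies(text).entrySet()) {
            aggregate.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    // Returns the category whose aggregate is most similar to the text,
    // or null if no category shares any term with it.
    String classify(String text) {
        Map<String, Integer> query = termFrequencies(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> entry : categoryTerms.entrySet()) {
            double score = cosine(query, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    private static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    private static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int left = e.getValue();
            int right = b.getOrDefault(e.getKey(), 0);
            dot += (double) left * right;
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```

Classification compares the input against one entry per category rather than collecting and tallying neighbors, which is the intuition behind the speed argument above.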

You might also find this blog post interesting.

If you want to use Solr, you need to enable the MoreLikeThisHandler and set termVectors=true on the content field.

The sunburnt Solr client for Python can perform MLT queries. Here is a prototype Python classifier that uses Solr for classification, using an index of Wikipedia categories:

https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py

answered Oct 29 '22 12:10

ogrisel

As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-nearest-neighbor classifier (based on the MoreLikeThis class), and a perceptron-based classifier.

The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.
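
For readers who would rather not start from the Wikipedia links, here is a small sketch of what a naive Bayes text classifier does internally. It is a hypothetical, dependency-free illustration (multinomial model with add-one smoothing), not Lucene's implementation, and none of the names below come from the Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical multinomial naive Bayes with add-one smoothing.
// It illustrates the idea only and is not Lucene's implementation.
class TinyNaiveBayes {

    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Records one labeled training document.
    void train(String category, String text) {
        totalDocs++;
        docsPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            totalTerms.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Returns the category with the highest log posterior, or null if untrained.
    String classify(String text) {
        List<String> tokens = tokenize(text);
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            // Log prior: share of training documents that belong to this category.
            double logProb = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(category);
            int total = totalTerms.getOrDefault(category, 0);
            for (String token : tokens) {
                if (!vocabulary.contains(token)) {
                    continue; // terms never seen in training are skipped
                }
                // Add-one smoothing keeps a term unseen in this category from zeroing the score.
                int count = counts.getOrDefault(token, 0);
                logProb += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = category;
            }
        }
        return best;
    }

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow on long documents.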

answered Oct 29 '22 12:10

approxiblue

Tags:

java

machine-learning

lucene

classification