text categorization classifiers

Does anybody know of good open-source text-categorization models? I know about Stanford Classifier, Weka, Mallet, etc. but all of them require training.

I need to classify news articles into Sports/Politics/Health/Gaming/etc. Are there any pre-trained models out there?

Alchemy, OpenCalais, etc. are not options. I need open-source tools (preferably in Java).

MFARID asked Mar 07 '13 15:03


3 Answers

A pre-trained model only works well if the corpus it was trained on comes from the same domain as the documents you are trying to classify. Generally this will not give you the results you want, because you don't have the original corpus and cannot retrain on it. Machine learning is not static: once you have trained a classifier, you need to update the model whenever new features or information become available.

Take for example classifying news articles like you want in the domain of Sports/Politics/Health/Gaming/etc.

First, what language? Are we talking about English only? How was the original corpus labeled? And the biggest unknown is the "etc." category.

Training your own classifier is really easy. If you are classifying text, MALLET is the best choice. You can be up and running in less than 10 minutes, and you can add MALLET to your own application in under an hour.

If you want to classify news articles, there are a lot of open-source corpora that you can use as a base to start training. I would start with Reuters-21578 or RCV1.
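
To give a feel for how little code training takes, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. It is not MALLET and does not use MALLET's API; the function names (`tokenize`, `train`, `classify`) and the six-sentence toy corpus are made up for illustration, and a real model would be trained on a labeled corpus such as Reuters-21578.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability under the model."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus, far too small for real use.
training_data = [
    ("The team won the match with a late goal", "Sports"),
    ("The striker scored twice in the final game", "Sports"),
    ("Parliament passed the new election law", "Politics"),
    ("The senator announced her election campaign", "Politics"),
    ("Doctors recommend a vaccine against the flu", "Health"),
    ("The hospital reported fewer flu patients", "Health"),
]

model = train(training_data)
print(classify(model, "The striker scored a goal in the match"))
print(classify(model, "A new flu vaccine is available"))
```

With this toy data the two calls print Sports and Health. The point is only that the training step is a few counting loops; the hard part is getting a labeled corpus that matches your domain.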

Shane answered Oct 18 '22 19:10


There are a lot of classifiers out there, depending on your needs. First, I think you should narrow down what you want to do with the classifier.

Training is one of the steps of classification, so I don't think you will find many pre-trained classifiers out there. Besides, training is almost always the easy part of classification.

That being said, there are a lot of resources you can look at. I can't take credit for this list, but here are some examples:

Weka - A collection of machine learning algorithms for data mining. It is one of the most popular text classification frameworks. It contains implementations of a wide variety of algorithms, including Naive Bayes and Support Vector Machines (SVM, listed under SMO). [Note: Other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch.] A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.

Apache Lucene Mahout - An incubator project to create highly scalable distributed implementations of common machine learning algorithms on top of the Hadoop map-reduce framework.

Source: http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html

Hearty answered Oct 18 '22 18:10


What you mean by classification is very important.

Classification is a supervised task: it requires a corpus that has already been labeled. Starting from that labeled corpus, you train a model, and then you use the model to classify an unlabeled test corpus. If this is your case, you can use a multi-class classifier, which is generally built by combining several binary classifiers (for example, arranged as a binary tree). A state-of-the-art approach for this kind of task is the SVM, a branch of machine learning. Two of the best SVM implementations are LibSVM and SVMlight. They are open source, easy to use, and include multi-class classification tools. Finally, you should survey the literature to understand what else is needed to obtain good results, because using those classifiers is not enough by itself. You have to pre-process your corpus in order to extract the information-bearing parts (e.g. unigrams) and exclude the noisy parts. In general, you most probably have a long way to go, but NLP is a very interesting topic and worthwhile to work on.
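
As a rough illustration of the pre-processing step, the sketch below extracts unigrams, drops stop words, and turns each document into a bag-of-words count vector. It also shows a one-vs-rest relabeling, which is one common way to reduce a multi-class problem to binary ones; that choice is an assumption for the example and not necessarily what LibSVM or SVMlight do internally. The stop-word list and all function names are hypothetical.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real projects use a much longer one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}

def extract_unigrams(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(documents):
    """Map every unigram seen in the corpus to a fixed column index."""
    unigrams = sorted({t for doc in documents for t in extract_unigrams(doc)})
    return {token: index for index, token in enumerate(unigrams)}

def vectorize(text, vocabulary):
    """Turn a document into a list of unigram counts (bag of words)."""
    counts = Counter(extract_unigrams(text))
    # Dicts keep insertion order, so columns follow the sorted vocabulary.
    return [counts[token] for token in vocabulary]

def one_vs_rest_labels(labels, positive_label):
    """Relabel a multi-class problem as +1/-1 for one binary classifier."""
    return [1 if label == positive_label else -1 for label in labels]

documents = [
    "The team won the match",
    "Parliament passed the law",
    "The team lost the match",
]
labels = ["Sports", "Politics", "Sports"]

vocabulary = build_vocabulary(documents)
print(list(vocabulary))
print(vectorize("The team won the match and the team celebrated", vocabulary))
print(one_vs_rest_labels(labels, "Sports"))
```

The resulting vectors and +1/-1 labels are the kind of input a binary SVM expects; words that never occurred in the training corpus (here "celebrated") simply get no column.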

However, if what you mean by classification is clustering, then the problem is more complicated. Clustering is an unsupervised task, which means you give the program no information about which example belongs to which group/topic/class. There is also academic work on hybrid semi-supervised approaches, but it diverges somewhat from the real purpose of the clustering problem. The pre-processing you need is similar to what you have to do for classification, so I will not repeat it. For clustering itself, there are several approaches you can follow. First, you can use LDA (Latent Dirichlet Allocation) to reduce the dimensionality (the number of dimensions of the feature space) of your corpus, which improves efficiency and the information gained from the features. Alongside or after LDA, you can use hierarchical clustering or similar methods such as K-Means to cluster your unlabeled corpus. You can use Gensim or Scikit-Learn as open-source tools for clustering. Both are powerful, well documented, and easy to use.
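
To make the clustering idea concrete, here is a minimal K-Means sketch in plain Python. It skips the LDA step entirely and is not the Gensim or Scikit-Learn implementation; the initialisation (first k points as centroids), the fixed iteration count, and the four toy count vectors are simplifying assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, k, iterations=10):
    """Cluster points into k groups; returns (assignments, centroids)."""
    # Assumption: deterministic start, the first k points act as centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignments, centroids

# Hypothetical unigram counts; columns: goal, match, election, law.
points = [
    [2, 1, 0, 0],
    [0, 0, 2, 1],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
]

assignments, centroids = k_means(points, k=2)
print(assignments)
print(centroids)
```

No labels are passed in anywhere: the two sports-like vectors and the two politics-like vectors end up in separate clusters purely because of their word counts. Which cluster means "sports" is something you still have to decide yourself afterwards.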

In all cases, do a lot of academic reading and try to understand the theory behind those tasks and problems. That way, you can come up with innovative and efficient solutions for what you are specifically dealing with, because problems in NLP are generally corpus dependent and you are generally on your own with your specific problem. It is very difficult to find generic, ready-to-use solutions, and I do not recommend relying on such an option.

I may have over-answered your question; sorry for the irrelevant parts.

Good luck =)

clancularius answered Oct 18 '22 20:10