Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NLTK/NLP buliding a many-to-many/multi-label subject classifier

I have a human tagged corpus of over 5000 subject indexed documents in XML. They vary in size from a few hundred kilobytes to a few hundred megabytes. Being short articles to manuscripts. They have all been subjected indexed as deep as the paragraph level. I am lucky to have such a corpus available, and I am trying to teach myself some NLP concepts. Admittedly, I've only begun. Thus far reading only the freely available NLTK book, streamhacker, and skimming jacobs(?) NLTK cookbook. I like to experiment with some ideas.

It was suggested to me, that perhaps, I could take bi-grams and use naive Bayes classification to tag new documents. I feel as if this is the wrong approach. a Naive Bayes is proficient at a true/false sort of relationship, but to use it on my hierarchical tag set I would need to build a new classifier for each tag. Nearly a 1000 of them. I have the memory and processor power to undertake such a task, but am skeptical of the results. However, I will be trying this approach first, to appease someones request. I should likely have this accomplished in the next day or two, but I predict the accuracy to be low.

So my question is a bit open ended. Laregly becuase of the nature of the discipline and the general unfamilirity with my data it will likely be hard to give an exact answer.

  1. What sort of classifier would be appropriate for this task. Was I wrong can a Bayes be used for more than a true/false sort of operation.

  2. what feature extraction should I pursue for such a task. I am not expecting much with the bigrams.

Each document also contains some citational information including, author/s, an authors gender of m,f,mix(m&f),and other (Gov't inst et al.), document type, published date(16th cent. to current), human analyst, and a few other general elements. I'd also appreciate some useful descriptive tasks to help investigate this data better for gender bias, analyst bias, etc. But realize that is a bit beyond the scope of this question.

like image 507
matchew Avatar asked Oct 12 '11 15:10

matchew


1 Answers

What sort of classifier would be appropriate for this task. Was I wrong can a Bayes be used for more than a true/false sort of operation.

You can easily build a multilabel classifier by building a separate binary classifier for each class, that can distinguish between that class and all others. The classes for which the corresponding classifier yields a positive value are the combined classifier's output. You can use Naïve Bayes for this or any other algorithm. (You could also play tricks with NB's probability output and a threshold value, but NB's probability estimates are notoriously bad; only its ranking among them is what makes it valuable.)

what feature extraction should I pursue for such a task

For text classification, tf-idf vectors are known to work well, but you haven't specified what the exact task is. Any metadata on the documents might work as well; try doing some simple statistical analysis. If any feature of the data is more frequently present in some classes than in others, it may be a useful feature.

like image 81
Fred Foo Avatar answered Sep 24 '22 06:09

Fred Foo