NLTK - Multi-labeled Classification

Tags: python, nlp, nltk, document-classification

I am using NLTK to classify documents. Each document has one label, and there are 10 types of document.

For text extraction, I clean the text (punctuation removal, HTML tag removal, lowercasing) and remove nltk.corpus.stopwords as well as my own collection of stopwords.

For the document features, I look across all 50k documents and gather the top 2k words by frequency (frequency_words). Then, for each document, I identify which of its words are also in the global frequency_words.

I then pass each document as a hashmap of {word: boolean} into nltk.NaiveBayesClassifier(...). I use a 20:80 test-to-training split of the total number of documents.
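
A minimal sketch of the feature extraction described above, using toy data. The function names top_words and document_features are illustrative assumptions, and collections.Counter stands in for whatever frequency counting is actually used (nltk.FreqDist behaves similarly).

```python
from collections import Counter


def top_words(tokenized_docs, n=2000):
    """Return the set of the n most frequent words across all documents."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, _ in counts.most_common(n)}


def document_features(tokens, frequency_words):
    """Map each vocabulary word to True/False: does it occur in this document?"""
    present = set(tokens)
    return {word: (word in present) for word in frequency_words}


# Toy, already-cleaned and tokenized documents (hypothetical data).
docs = [
    ["python", "django", "sql"],
    ["python", "pandas", "sql"],
    ["welding", "forklift"],
]

frequency_words = top_words(docs, n=3)
features = document_features(docs[0], frequency_words)
```

Each resulting dictionary has one boolean entry per vocabulary word, which is the shape nltk.NaiveBayesClassifier expects for a feature set.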

The issues I am having:

1. Is this NLTK classifier suitable for multi-labelled data? All the examples I have seen are about 2-class classification, such as whether something is positive or negative.
2. The documents should each contain a set of key skills, but unfortunately I don't have a corpus listing these skills. So I have taken this approach on the understanding that a word count per document would not be a good feature extractor. Is this correct? Each document was written by a different individual, so I need to allow for individual variation between documents. I am aware of scikit-learn's multinomial Naive Bayes (MultinomialNB), which deals with word counts.
3. Is there an alternative library I should be using, or a variation of this algorithm?

Thanks!


asked May 09 '14 18:05

redrubia

1 Answer

Terminology: the documents are to be classified into 10 different classes, which makes this a multi-class classification problem. If, in addition, a single document can carry several labels at once, it becomes a multi-class, multi-label classification problem.

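For the issues you are facing:

1. nltk.NaiveBayesClassifier is an out-of-the-box multi-class classifier, so yes, you can use it to solve this problem. With one label per document, the label can simply be the class name (for example 'b'); no special encoding is needed. For genuinely multi-labelled data with labels a,b,c,d,e,f,g,h,i,j, the labels of a document are usually written as an indicator vector, so a document carrying only label 'b' becomes '0,1,0,0,0,0,0,0,0,0'. Note that nltk.NaiveBayesClassifier predicts a single label per document, so multi-label data is typically handled by training one binary classifier per label.
2. Feature extraction is the hardest part of classification (and of machine learning in general). I recommend looking into different algorithms to understand them and select the one that best suits your data. Without looking at your data, it is tough to recommend which algorithm or implementation to use.
3. There are many different libraries out there for classification. I have personally used scikit-learn, and I can say its classifiers worked well out of the box.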
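
To illustrate the first point, here is a minimal sketch of nltk.NaiveBayesClassifier trained on more than two classes. The feature names, class names, and training examples are made-up assumptions for illustration only; the labels are plain strings.

```python
import nltk

# Hypothetical training data: ({word: boolean}, label) pairs with three classes.
train_set = [
    ({"python": True, "sql": True, "forklift": False}, "developer"),
    ({"python": True, "sql": False, "forklift": False}, "developer"),
    ({"python": False, "sql": True, "forklift": False}, "analyst"),
    ({"python": False, "sql": True, "forklift": False}, "analyst"),
    ({"python": False, "sql": False, "forklift": True}, "warehouse"),
    ({"python": False, "sql": False, "forklift": True}, "warehouse"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# The classifier learns one class per distinct label seen in training.
labels = sorted(classifier.labels())

# Classify an unseen document represented with the same feature names.
new_doc = {"python": False, "sql": False, "forklift": True}
prediction = classifier.classify(new_doc)

# Per-class probabilities are available as well.
dist = classifier.prob_classify(new_doc)
probabilities = {label: dist.prob(label) for label in labels}
```

The same call works unchanged with 10 classes; only the set of label strings in the training data grows.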

Note: using scikit-learn, I was able to get results within a week, even though the data set was huge and there were other setbacks.
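
As a rough sketch of what the scikit-learn route could look like, the example below trains two Naive Bayes variants on toy data: MultinomialNB on word counts and BernoulliNB on word presence/absence (closer to the {word: boolean} features in the question). The texts and labels are invented for illustration, and max_features=2000 is an assumption mirroring the top-2k vocabulary mentioned in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned documents and their single labels.
texts = [
    "python django sql python",
    "python flask rest api",
    "sql excel reporting sql",
    "excel reporting dashboards",
    "forklift warehouse loading",
    "warehouse inventory forklift forklift",
]
labels = ["developer", "developer", "analyst", "analyst", "warehouse", "warehouse"]

# Word counts + multinomial Naive Bayes.
count_model = make_pipeline(CountVectorizer(max_features=2000), MultinomialNB())

# Binary word presence + Bernoulli Naive Bayes.
presence_model = make_pipeline(
    CountVectorizer(max_features=2000, binary=True), BernoulliNB()
)

count_model.fit(texts, labels)
presence_model.fit(texts, labels)

new_doc = ["forklift operator warehouse"]
count_prediction = count_model.predict(new_doc)[0]
presence_prediction = presence_model.predict(new_doc)[0]
```

Which variant works better depends on the data, so it is worth comparing both on a held-out test set rather than assuming one is superior.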

answered Sep 27 '22 16:09

shyam
