
Multi-label classification for large dataset

I am solving a multi-label classification problem. I have about 6 million rows to process, each a huge chunk of text, tagged with multiple tags in a separate column.

Any advice on which scikit-learn tools can help me scale up my code? I am using One-vs-Rest with an SVM inside it, but it doesn't scale beyond 90-100k rows.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
asked Nov 29 '13 by user3048524

1 Answer

SVMs scale well as the number of columns increases, but poorly with the number of rows, since they essentially learn which rows constitute the support vectors. This is a common complaint about SVMs, but most people don't understand why, because they typically scale well on most reasonably sized datasets.

  1. You will want one-vs-rest, as you are already using. One-vs-one will not scale well here (n(n-1)/2 classifiers vs n).
  2. Set a minimum document frequency for the terms you consider to at least 5, maybe higher, which will drastically shrink the size of each row (the number of columns). You will find that a lot of words occur only once or twice, and at that frequency they add no value to your classification because an algorithm cannot possibly generalize from them. Stemming may help there. (Sketched after this list.)
  3. Also remove stop words (the, a, an, prepositions, etc.; look on Google). That will further cut down the number of columns. (Shown in the same sketch.)
  4. Once you have reduced your column count as described, try to eliminate some rows. If there are documents that are very noisy, very short after steps 2 and 3, or perhaps very long, look to eliminate them. Look at the standard deviation and the mean document length, and plot document length (in word count) against the frequency at that length to decide. (A filtering sketch follows the list.)
  5. If the dataset is still too large, I would suggest a decision tree or Naive Bayes, both present in sklearn. DTs scale very well. Set a depth threshold to limit the depth of the tree, as otherwise it will try to grow a humongous tree to memorize the dataset. NB, on the other hand, is very fast to train and handles large numbers of columns quite well. If the DT works well, you can try RF with a small number of trees and leverage the IPython parallelization to multi-thread. (See the classifier sketch below.)
  6. Alternatively, segment your data into smaller datasets, train a classifier on each, persist them to disk, and then build an ensemble classifier from those classifiers. (A rough sketch closes this answer.)
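
A minimal sketch of points 2 and 3, assuming scikit-learn's CountVectorizer; the min_df value of 5 and the built-in English stop-word list are illustrative choices, and docs stands in for your list of raw text strings.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Drop terms that appear in fewer than 5 documents (point 2) and
# strip English stop words (point 3) before tf-idf weighting.
vectorizer = CountVectorizer(min_df=5, stop_words='english')
counts = vectorizer.fit_transform(docs)            # docs: list of raw text strings
tfidf = TfidfTransformer().fit_transform(counts)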
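
For point 4, one possible way to filter by document length, assuming docs and labels are parallel lists; the two-standard-deviation cutoff is an arbitrary example, not a rule from the answer, and you would still want to plot the length distribution first.

from statistics import mean, stdev

# Compute word counts, then keep documents whose length lies within
# two standard deviations of the mean (illustrative cutoff).
lengths = [len(doc.split()) for doc in docs]
mu, sd = mean(lengths), stdev(lengths)
kept = [(doc, tag) for doc, tag, n in zip(docs, labels, lengths)
        if mu - 2 * sd <= n <= mu + 2 * sd]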
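
Possible drop-in replacements for the LinearSVC step from point 5, using scikit-learn's decision tree, Naive Bayes, and random forest classifiers; the depth limit and tree count here are illustrative, not recommendations from the answer.

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Depth-limited decision tree (max_depth=20 is an arbitrary threshold).
dt = OneVsRestClassifier(DecisionTreeClassifier(max_depth=20))
# Naive Bayes: very fast to train, copes with many columns.
nb = OneVsRestClassifier(MultinomialNB())
# Small random forest; n_jobs=-1 uses all available cores.
rf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, n_jobs=-1))

Any of these would be swapped in for the 'clf' step of the pipeline shown in the question.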
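
A rough sketch of the chunk-and-ensemble idea from point 6, assuming the standalone joblib package for persistence, a hypothetical make_classifier() factory that builds a fresh pipeline like the one in the question, and chunks as an iterable of (texts, label-matrix) subsets; the per-label majority vote is just one simple way to combine the classifiers.

import joblib
import numpy as np

paths = []
for i, (X_chunk, y_chunk) in enumerate(chunks):   # chunks: iterable of (texts, label-matrix) subsets
    clf = make_classifier()                       # hypothetical factory returning a fresh pipeline
    clf.fit(X_chunk, y_chunk)
    path = 'clf_%d.joblib' % i
    joblib.dump(clf, path)                        # persist each trained classifier to disk
    paths.append(path)

def predict_ensemble(X):
    # Per-label majority vote over the binary indicator predictions.
    votes = np.array([joblib.load(p).predict(X) for p in paths])
    return (votes.mean(axis=0) >= 0.5).astype(int)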
answered Oct 16 '22 by Simon