
Multi-label classification for large dataset

I am solving a multi-label classification problem. I have about 6 million rows to process, each a huge chunk of text, tagged with multiple tags in a separate column.

Any advice on which scikit-learn tools can help me scale up my code? I am using One-vs-Rest with an SVM inside it, but it doesn't scale beyond 90-100k rows.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
asked Nov 29 '13 by user3048524

1 Answer

SVMs scale well as the number of columns increases, but poorly with the number of rows, since they essentially learn which rows constitute the support vectors. This is a common complaint about SVMs, but most people don't understand why, because they typically scale well on most reasonably sized datasets.

  1. You will want one-vs-rest, as you are already using. One-vs-one will not scale well here (n(n-1)/2 classifiers vs n).
  2. Set a minimum document frequency for the terms you consider to at least 5, maybe higher, which will drastically shrink the size of each row (the number of columns). You will find that a lot of words occur only once or twice, and at that frequency they add no value to your classification because an algorithm cannot possibly generalize from them. Stemming may help there. (Sketched after this list.)
  3. Also remove stop words (the, a, an, prepositions, etc.; look on Google). That will further cut down the number of columns. (Shown in the same sketch.)
  4. Once you have reduced your column count as described, try to eliminate some rows. If there are documents that are very noisy, very short after steps 2 and 3, or perhaps very long, look to eliminate them. Look at the standard deviation and the mean document length, and plot document length (in word count) against the frequency at that length to decide. (A filtering sketch follows the list.)
  5. If the dataset is still too large, I would suggest a decision tree or Naive Bayes, both present in sklearn. DTs scale very well. Set a depth threshold to limit the depth of the tree, as otherwise it will try to grow a humongous tree to memorize the dataset. NB, on the other hand, is very fast to train and handles large numbers of columns quite well. If the DT works well, you can try RF with a small number of trees and leverage the IPython parallelization to multi-thread. (See the classifier sketch below.)
  6. Alternatively, segment your data into smaller datasets, train a classifier on each, persist them to disk, and then build an ensemble classifier from those classifiers. (A rough sketch closes this answer.)
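
A minimal sketch of points 2 and 3, assuming scikit-learn's CountVectorizer; the min_df value of 5 and the built-in English stop-word list are illustrative choices, and docs stands in for your list of raw text strings.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Drop terms that appear in fewer than 5 documents (point 2) and
# strip English stop words (point 3) before tf-idf weighting.
vectorizer = CountVectorizer(min_df=5, stop_words='english')
counts = vectorizer.fit_transform(docs)            # docs: list of raw text strings
tfidf = TfidfTransformer().fit_transform(counts)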
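
For point 4, one possible way to filter by document length, assuming docs and labels are parallel lists; the two-standard-deviation cutoff is an arbitrary example, not a rule from the answer, and you would still want to plot the length distribution first.

from statistics import mean, stdev

# Compute word counts, then keep documents whose length lies within
# two standard deviations of the mean (illustrative cutoff).
lengths = [len(doc.split()) for doc in docs]
mu, sd = mean(lengths), stdev(lengths)
kept = [(doc, tag) for doc, tag, n in zip(docs, labels, lengths)
        if mu - 2 * sd <= n <= mu + 2 * sd]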
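
Possible drop-in replacements for the LinearSVC step from point 5, using scikit-learn's decision tree, Naive Bayes, and random forest classifiers; the depth limit and tree count here are illustrative, not recommendations from the answer.

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Depth-limited decision tree (max_depth=20 is an arbitrary threshold).
dt = OneVsRestClassifier(DecisionTreeClassifier(max_depth=20))
# Naive Bayes: very fast to train, copes with many columns.
nb = OneVsRestClassifier(MultinomialNB())
# Small random forest; n_jobs=-1 uses all available cores.
rf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, n_jobs=-1))

Any of these would be swapped in for the 'clf' step of the pipeline shown in the question.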
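
A rough sketch of the chunk-and-ensemble idea from point 6, assuming the standalone joblib package for persistence, a hypothetical make_classifier() factory that builds a fresh pipeline like the one in the question, and chunks as an iterable of (texts, label-matrix) subsets; the per-label majority vote is just one simple way to combine the classifiers.

import joblib
import numpy as np

paths = []
for i, (X_chunk, y_chunk) in enumerate(chunks):   # chunks: iterable of (texts, label-matrix) subsets
    clf = make_classifier()                       # hypothetical factory returning a fresh pipeline
    clf.fit(X_chunk, y_chunk)
    path = 'clf_%d.joblib' % i
    joblib.dump(clf, path)                        # persist each trained classifier to disk
    paths.append(path)

def predict_ensemble(X):
    # Per-label majority vote over the binary indicator predictions.
    votes = np.array([joblib.load(p).predict(X) for p in paths])
    return (votes.mean(axis=0) >= 0.5).astype(int)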
answered Oct 16 '22 by Simon