Scalable or online out-of-core multi-label classifiers

I have been racking my brains over this problem for the past 2-3 weeks. I have a multi-label (not multi-class) problem where each sample can belong to several labels.

I have around 4.5 million text documents as training data and around 1 million as test data. There are around 35K labels.

I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not scalable enough given the number of documents I have.

from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))

scikit-learn provides a OneVsRestClassifier into which I can feed any estimator. For multi-label, I found only LinearSVC and SGDClassifier to work correctly. According to my benchmarks, SGD outperforms LinearSVC in both memory and time. So I have something like this:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)

But this suffers from some serious issues:

  1. OneVsRestClassifier does not have a partial_fit method, which makes out-of-core learning impossible with it. Are there any alternatives?
  2. HashingVectorizer/TfidfVectorizer both work on a single core and have no n_jobs parameter, so hashing the documents takes too much time. Any alternatives/suggestions? Also, is the value of n_features correct?
  3. I tested on 1 million documents. Hashing takes 15 minutes, and when it comes to clf.fit(X, y) I receive a MemoryError, because OvR internally uses LabelBinarizer and tries to allocate a matrix of dimensions (n_samples x n_classes), which is practically impossible to allocate. What should I do?
  4. Are there any other libraries with reliable and scalable multi-label algorithms? I know of gensim and Mahout, but neither of them has anything for multi-label problems.
asked Sep 08 '13 by Gaurav Kumar

People also ask

How do you handle multi-label classification?

There are two main methods for tackling a multi-label classification problem: problem transformation methods and algorithm adaptation methods. Problem transformation methods transform the multi-label problem into a set of binary classification problems, which can then be handled with standard binary classifiers.

What is multi-label learning?

Multi-label learning is an extension of the standard supervised learning setting. In contrast to standard supervised learning, where one training example is associated with a single class label, in multi-label learning one training example is associated with multiple class labels simultaneously.

What is multi-label text classification?

Multi-class text classification means a classification task with more than two classes, where each sample is assigned to exactly one label; the classes are mutually exclusive. Multi-label classification, on the other hand, assigns each sample a set of target labels.


2 Answers

I would do the multi-label part by hand; the OneVsRestClassifier treats the labels as independent problems anyhow. You can just create n_labels binary classifiers and call partial_fit on each of them, as in the sketch below. You can't use a pipeline if you only want to hash once (which I would advise), though. Not sure about speeding up the hashing vectorizer; you'd have to ask @larsmans and @ogrisel about that ;)

Having partial_fit on OneVsRestClassifier would be a nice addition, and I don't see a particular problem with it, actually. You could also try to implement that yourself and send a PR.
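For illustration, here is a minimal sketch of this hand-rolled approach. It assumes labels are numbered 0..n_labels-1 and that a hypothetical iter_minibatches() generator yields batches of raw texts together with their label sets; each mini-batch is hashed once and reused for all the binary classifiers:

# Hand-rolled one-vs-rest with out-of-core learning. iter_minibatches() is a
# hypothetical generator yielding (list_of_texts, list_of_label_sets).
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

n_labels = 35000  # one independent binary classifier per label
vect = HashingVectorizer(strip_accents='ascii', analyzer='word',
                         stop_words='english', n_features=2 ** 18)
clfs = [SGDClassifier(loss='log', penalty='l2') for _ in range(n_labels)]

for texts, label_sets in iter_minibatches():
    X = vect.transform(texts)  # hash each mini-batch only once
    for k, clf in enumerate(clfs):
        # Binary target for label k: does each sample carry label k?
        y_k = np.fromiter((k in labels for labels in label_sets),
                          dtype=np.int8, count=len(label_sets))
        clf.partial_fit(X, y_k, classes=[0, 1])

Note that with 35K labels the per-label weight vectors will dominate memory, so a smaller n_features, or training the labels in groups and persisting each group to disk, may be needed.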

answered Nov 16 '22 by Andreas Mueller


  1. The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier, and you can do it on up to K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel (see the sketch after this list).
  2. Multi-core support in scikit-learn is a work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't gotten around to them yet. A manual chunking workaround is sketched below.
  3. If you follow my (and Andreas's) advice and roll your own one-vs-rest, this shouldn't be a problem anymore.
  4. The trick in (1.) applies to any classification algorithm.
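To make point 1 concrete, here is a minimal sketch of fitting the K binary problems in parallel processes with joblib (which ships with scikit-learn). It assumes the hashed feature matrix X and a sparse binary indicator matrix Y (n_samples x n_classes) already exist; joblib schedules the fits across a fixed pool of workers much as GNU parallel would schedule K separate processes:

# Parallel one-vs-rest by hand: K independent binary fits, scheduled over
# a pool of worker processes. X (sparse features) and Y (sparse binary
# indicator matrix) are assumed to exist already.
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import SGDClassifier

def fit_one(X, Y, k):
    # Materialize only one binary column at a time, avoiding the dense
    # (n_samples x n_classes) matrix that causes the MemoryError.
    y_k = np.asarray(Y[:, k].todense()).ravel()
    return SGDClassifier(loss='log', penalty='l2').fit(X, y_k)

clfs = Parallel(n_jobs=8)(
    delayed(fit_one)(X, Y, k) for k in range(Y.shape[1]))

And since HashingVectorizer is stateless, the hashing step itself can be parallelized by hand even without n_jobs support: transform chunks of documents in separate workers and stack the results (the chunk size here is illustrative):

# Chunked parallel hashing: each worker transforms its slice independently.
import scipy.sparse as sp
from joblib import Parallel, delayed
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(strip_accents='ascii', analyzer='word',
                         stop_words='english', n_features=2 ** 18)

def parallel_transform(documents, n_jobs=4, chunk_size=10000):
    chunks = [documents[i:i + chunk_size]
              for i in range(0, len(documents), chunk_size)]
    parts = Parallel(n_jobs=n_jobs)(
        delayed(vect.transform)(chunk) for chunk in chunks)
    return sp.vstack(parts)  # one big sparse matrix, rows in original order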

As for the number of features, it depends on the problem, but for large-scale text classification 2^10 = 1024 seems very small; I'd try something in the range 2^18 to 2^22. If you train a model with an L1 penalty, you can call sparsify() on the trained model to convert its weight matrix to a more space-efficient format, as sketched below.
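For example (a sketch, assuming X_train and y_train for one binary subproblem already exist):

# L1 regularization drives many weights to exactly zero; sparsify() then
# stores coef_ as a scipy.sparse matrix instead of a dense array.
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log', penalty='l1')
clf.fit(X_train, y_train)
clf.sparsify()  # weight matrix becomes sparse, saving memory at predict time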

answered Nov 16 '22 by Fred Foo