I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels. I have around 4.5 million text documents as training data and around 1 million as test data. The labels are around 35K. I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer which didn't scale at all, now I am using HashVectorizer which is better but not that scalable given the number of documents that I have. <pre class="prettyprint"><code>vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10)) </code></pre> SKlearn provides a OneVsRestClassifier into which I can feed any estimator. For multi-label I found LinearSVC & SGDClassifier only to be working correctly. Acc to my benchmarks SGD outperforms LinearSVC both in memory & time. So, I have something like this <pre class="prettyprint"><code>clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1) </code></pre> But this suffers from some serious issues: <ol> <li>OneVsRest does not have a partial_fit method which makes it impossible for out-of-core learning. Are there any alternatives for that?</li> <li>HashingVectorizer/Tfidf both work on a single core and don't have any n_jobs parameter. It's taking too much time to hash the documents. Any alternatives/suggestions? Also is the value of n_features correct?</li> <li>I tested on 1 million documents. The Hashing takes 15 minutes and when it comes to clf.fit(X, y), I receive a MemoryError because OvR internally uses LabelBinarizer and it tries to allocate a matrix of dimensions (y x classes) which is fairly impossible to allocate. What should I do?</li> <li>Any other libraries out there which have reliable & scalable multi-label algorithms? I know of genism & mahout but both of them don't have anything for multi-label situations?</li> </ol>

<ol> <li>The algorithm that <code>OneVsRestClassifier</code> implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on <code>OneVsRestClassifier</code>. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.</li> <li>Multi-core support in scikit-learn is work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for <code>HashingVectorizer</code>, but I (one of the hashing code's authors) haven't come round to it yet.</li> <li>If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.</li> <li>The trick in (1.) applies to any classification algorithm.</li> </ol> As for the number of features, it depends on the problem, but for large scale text classification 2^10 = 1024 seems very small. I'd try something around 2^18 - 2^22. If you train a model with L1 penalty, you can call <code>sparsify</code> on the trained model to convert its weight matrix to a more space-efficient format.

Scalable or online out-of-core multi-label classifiers

Tags:

machine-learning

classification

scikit-learn

text-classification

document-classification

I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.

I have around 4.5 million text documents as training data and around 1 million as test data. The labels are around 35K.

I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer which didn't scale at all, now I am using HashVectorizer which is better but not that scalable given the number of documents that I have.

vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))

SKlearn provides a OneVsRestClassifier into which I can feed any estimator. For multi-label I found LinearSVC & SGDClassifier only to be working correctly. Acc to my benchmarks SGD outperforms LinearSVC both in memory & time. So, I have something like this

clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)

But this suffers from some serious issues:

OneVsRest does not have a partial_fit method which makes it impossible for out-of-core learning. Are there any alternatives for that?
HashingVectorizer/Tfidf both work on a single core and don't have any n_jobs parameter. It's taking too much time to hash the documents. Any alternatives/suggestions? Also is the value of n_features correct?
I tested on 1 million documents. The Hashing takes 15 minutes and when it comes to clf.fit(X, y), I receive a MemoryError because OvR internally uses LabelBinarizer and it tries to allocate a matrix of dimensions (y x classes) which is fairly impossible to allocate. What should I do?
Any other libraries out there which have reliable & scalable multi-label algorithms? I know of genism & mahout but both of them don't have anything for multi-label situations?

436

asked Sep 08 '13 14:09

Gaurav Kumar

2 Answers

I would do the multi-label part by hand. The OneVsRestClassifier treats them as independent problems anyhow. You can just create the n_labels many classifiers and then call partial_fit on them. You can't use a pipeline if you only want to hash once (which I would advise), though. Not sure about speeding up hashing vectorizer. You gotta ask @Larsmans and @ogrisel for that ;)

Having partial_fit on OneVsRestClassifier would be a nice addition, and I don't see a particular problem with it, actually. You could also try to implement that yourself and send a PR.

118

answered Nov 16 '22 03:11

Andreas Mueller

The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.
Multi-core support in scikit-learn is work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't come round to it yet.
If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.
The trick in (1.) applies to any classification algorithm.

As for the number of features, it depends on the problem, but for large scale text classification 2^10 = 1024 seems very small. I'd try something around 2^18 - 2^22. If you train a model with L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.

answered Nov 16 '22 04:11

Fred Foo

Related questions
                            
                                Perceptron learning algorithm doesn't work
                            
                                Representing Natural Language as RDF
                            
                                Scikit-learn χ² (chi-squared) statistic and corresponding contingency table
                            
                                sklearn: How to reset a Regressor or classifier object in sknn
                            
                                Input/output error while using google colab with google drive
                            
                                AssertionError: Could not compute output Tensor
                            
                                How to plot a learning curve for a keras experiment?
                            
                                how to use weight when training a weak learner for adaboost
                            
                                How to use Tensorflow Optimizer without recomputing activations in reinforcement learning program that returns control after each iteration?
                            
                                Python sklearn show loss values during training
                            
                                "UserWarning: An input could not be retrieved. It could be because a worker has died. We do not have any information on the lost sample."
                            
                                Multiple pipelines that merge within a sklearn Pipeline?
                            
                                Tied weights in Autoencoder
                            
                                Convert dataframe columns of object type to float
                            
                                How to Merge Numerical and Embedding Sequential Models to treat categories in RNN
                            
                                Improving k-means clustering
                            
                                What to do first: Feature Selection or Model Parameters Setting?
                            
                                How to prune a tree in R?
                            
                                What does sklearn "RidgeClassifier" do?
                            
                                Java Open Source Text Mining Frameworks [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With