Multiprocessing scikit-learn

Question

I have LinearSVC working against a training set and a test set loaded with the load_files method, and I am now trying to get it working in a multiprocessor environment.

How can I get multiprocessing to work with LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with scikit-learn's data types yet.

I am also thinking about splitting the samples into multiple arrays, but I am not familiar with NumPy arrays and scikit-learn data structures.

That would make it easier to feed the work into multiprocessing.Pool(): split the samples into chunks, train on them, and combine the trained results later. Would that work?

EDIT: Here is my scenario:

Let's say we have 1 million files in the training sample set. To distribute the TfidfVectorizer processing across several processors, we have to split those samples (in my case there are only two categories, so let's say 500,000 samples each to train on). My server has 24 cores and 48 GB of memory, so I want to split each topic into chunks of 1000000 / 24 samples and run TfidfVectorizer on them. I would do the same for the test sample set, as well as for SVC.fit() and decide(). Does that make sense?

Thanks.

PS: Please do not close this.

Andreas Mueller · Accepted Answer

I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.

For the multiprocessing: you can distribute the data sets across cores, run partial_fit on each, collect the weight vectors, average them, distribute the averaged weights back to the estimators, and run partial_fit again.
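A minimal sketch of that loop, assuming dense NumPy arrays and joblib for the worker processes. The helper names (fit_chunk, averaged_sgd) and the chunk and round counts are made up for illustration, and plain parameter averaging is only one of several possible schemes:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import SGDClassifier


def fit_chunk(model, X, y, classes):
    """Run one partial_fit pass (a single epoch) over one chunk."""
    model.partial_fit(X, y, classes=classes)
    return model


def averaged_sgd(X, y, n_chunks=4, n_rounds=5, n_jobs=2, seed=0):
    """Hypothetical helper: alternate parallel partial_fit and weight averaging."""
    classes = np.unique(y)
    # Assumes dense arrays; np.array_split cuts along the sample axis.
    X_chunks = np.array_split(X, n_chunks)
    y_chunks = np.array_split(y, n_chunks)
    models = [SGDClassifier(random_state=seed + i) for i in range(n_chunks)]

    for _ in range(n_rounds):
        # Workers receive pickled copies, so the fitted models must be
        # collected from the return values rather than updated in place.
        models = Parallel(n_jobs=n_jobs)(
            delayed(fit_chunk)(m, Xc, yc, classes)
            for m, Xc, yc in zip(models, X_chunks, y_chunks)
        )
        # Average the weight vectors and intercepts across chunks ...
        coef = np.mean([m.coef_ for m in models], axis=0)
        intercept = np.mean([m.intercept_ for m in models], axis=0)
        # ... and hand the average back to every estimator for the next round.
        for m in models:
            m.coef_ = coef.copy()
            m.intercept_ = intercept.copy()

    # After the last averaging step all models hold the same weights.
    return models[0]
```

Calling averaged_sgd(X, y, n_chunks=24, n_jobs=24) would map this onto a 24-core machine. Whether the averaged model matches a single sequential SGD run is not guaranteed, so it is worth checking on held-out data.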

Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.

How many classes does your data have, by the way? For each class, a separate binary classifier will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
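As a rough illustration of the n_jobs route, here is a sketch on synthetic data with four classes. The dataset and its sizes are invented for the example. With only two classes a single binary classifier is trained, so this option would not be expected to help in the two-category case from the question:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic four-class problem, used only to make the example runnable.
X, y = make_classification(
    n_samples=600,
    n_features=20,
    n_informative=10,
    n_classes=4,
    random_state=0,
)

# n_jobs controls how many one-vs-all binary problems are fitted concurrently.
clf = SGDClassifier(n_jobs=4, random_state=0)
clf.fit(X, y)

# One weight vector and one intercept per class.
print(clf.coef_.shape)
print(clf.intercept_.shape)
```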

ogrisel · Answer

For linear models (LinearSVC, SGDClassifier, Perceptron...) you can chunk your data, train independent models on each chunk, and build an aggregate linear model (e.g. SGDClassifier) by setting its coef_ and intercept_ attributes to the averaged values. The predict methods of LinearSVC, SGDClassifier, and Perceptron all compute the same function (a linear prediction using a dot product with an intercept_ threshold, with One vs All multiclass support), so the specific model class you use to hold the averaged coefficients is not important.
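A sketch of that idea, assuming dense arrays, joblib for the worker processes, and chunks that each contain every class (the data is shuffled before splitting to make that likely). The function names are hypothetical, and setting fitted attributes by hand on an unfitted SGDClassifier relies on its predict method only needing coef_, intercept_ and classes_:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC


def fit_linear_svc(X, y):
    """Train one independent LinearSVC on a single chunk."""
    return LinearSVC().fit(X, y)


def average_linear_models(models):
    """Store the averaged coefficients in an SGDClassifier used as a holder."""
    holder = SGDClassifier()
    holder.coef_ = np.mean([m.coef_ for m in models], axis=0)
    holder.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    # Assumes every chunk saw the same set of classes.
    holder.classes_ = models[0].classes_
    return holder


def train_on_chunks(X, y, n_chunks=4, n_jobs=2, seed=0):
    """Hypothetical helper: shuffle, split, fit in parallel, then average."""
    order = np.random.RandomState(seed).permutation(len(y))
    index_chunks = np.array_split(order, n_chunks)
    models = Parallel(n_jobs=n_jobs)(
        delayed(fit_linear_svc)(X[idx], y[idx]) for idx in index_chunks
    )
    return average_linear_models(models)
```

The returned object can then be used like any fitted linear classifier, for example train_on_chunks(X_train, y_train, n_chunks=24, n_jobs=24).predict(X_test).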

However, as previously said, the tricky point is parallelizing the feature extraction, and the current scikit-learn (version 0.12) does not provide an easy way to do this.

Edit: scikit-learn 0.13+ now has a hashing vectorizer that is stateless.
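Because HashingVectorizer keeps no fitted state, each worker can build its own instance and transform its share of the documents independently. A minimal sketch, assuming the documents are already loaded as a list of strings and using joblib for the workers. The helper names and the chunking scheme are made up for illustration:

```python
from joblib import Parallel, delayed
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer


def hash_chunk(docs, n_features):
    """Vectorize one chunk; no fit step or shared vocabulary is needed."""
    return HashingVectorizer(n_features=n_features).transform(docs)


def parallel_hashing(docs, n_chunks=4, n_features=2 ** 18, n_jobs=2):
    """Hypothetical helper: hash contiguous chunks in parallel, keep row order."""
    size = max(1, -(-len(docs) // n_chunks))  # ceiling division
    chunks = [docs[i:i + size] for i in range(0, len(docs), size)]
    parts = Parallel(n_jobs=n_jobs)(
        delayed(hash_chunk)(chunk, n_features) for chunk in chunks
    )
    # Stack the per-chunk sparse matrices back into one matrix.
    return vstack(parts, format="csr")
```

The result should be identical to running a single HashingVectorizer over all documents, since every worker uses the same hash function and the same n_features.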


Tags: python, multithreading, machine-learning, numpy, scikit-learn

Phyo Arkar Lwin
