Performing gridsearch with python scikit-learn library on Amazon EC2 cluster

Question

Apologies if this question is too specific to the Python scikit-learn library.

I am trying to perform a grid search to find the optimal parameters for scikit-learn's GradientBoostingRegressor. The problem is that I don't know where to start. In the past I used an R and RStudio setup, but I am currently migrating to Python for data mining, and scikit-learn seems very promising.

Can anyone share some simple setup code they have used to run computations on an Amazon EC2 cluster, or point to a useful example of doing this with the library for another machine learning algorithm?

Thank you.

ogrisel · Accepted Answer

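As far as I know, GBRT (gradient boosted regression trees) is a largely sequential algorithm: each tree is fit to the errors of the trees before it, so there is no trivial way to train a single model in parallel.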

Random forests and ExtraTrees models are embarrassingly parallel, and hence would be better candidates for training on a cluster.
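To illustrate the point about forests, here is a minimal sketch of training a random forest with its trees built in parallel on a single machine. The synthetic dataset and the specific parameter values are assumptions chosen for illustration, and the `n_jobs` behaviour shown is that of recent scikit-learn releases.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Each tree is independent of the others, so n_jobs=-1 asks joblib to
# build them using all available cores on this one machine.
forest = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
forest.fit(X, y)

n_trees = len(forest.estimators_)
predictions = forest.predict(X)
```

This only parallelizes across the cores of one machine; spreading the work over several EC2 nodes still needs something like the cluster setup discussed below.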

scikit-learn has some built-in support for single-machine multiprocessing using joblib (check the docstring of models that accept an n_jobs argument). We plan to implement a task dispatch framework in joblib at some point, which would let us use, for instance, IPython parallel as a backend to run on a cluster. However, nothing is ready out of the box for this currently.

If you are ready to invest some time in doing it yourself, I would advise you to have a look at StarCluster and its IPython plugin:
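Although a single GBRT model trains sequentially, the grid points and cross-validation folds of a grid search are independent, so the `n_jobs` mechanism can still help on one machine. The following is a rough sketch assuming a recent scikit-learn release (where `GridSearchCV` lives in `sklearn.model_selection`); the dataset and the parameter grid are made up for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Illustrative grid only; real searches usually cover more values.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# n_jobs=-1 lets joblib evaluate (grid point, fold) pairs on all local cores.
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)

best_params = search.best_params_
n_grid_points = len(search.cv_results_["params"])
```

On a single large EC2 instance this may already be enough; it does not distribute work across machines.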

http://star.mit.edu/cluster/
http://star.mit.edu/cluster/docs/latest/plugins/ipython.html

Peter Prettenhofer · Answer

I totally agree with ogrisel: StarCluster is really handy, as it lets you set up an IPython cluster in no time and supports spot instances, which is great because they are much cheaper than regular ones.

You can find some code in this gist that shows you how to do distributed grid search for sklearn's Gradient Boosting estimators on an IPython cluster.

It does grid search combined with cross-validation and stores the evaluated grid points in a MongoDB database.
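The gist itself is not reproduced here. As a rough sketch of the general idea only, each grid point can be treated as an independent task that returns a record of its cross-validation score. In this sketch the built-in `map` is a local stand-in for a cluster-side map call, a plain list stands in for the database, and the helper name `evaluate_grid_point` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)


def evaluate_grid_point(params):
    """Cross-validate one parameter combination and return a record."""
    model = GradientBoostingRegressor(n_estimators=30, random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=3)  # default scoring is R^2
    return {"params": params, "mean_score": float(scores.mean())}


# Enumerate every combination in an illustrative grid.
grid = list(ParameterGrid({"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}))

# Built-in map runs the tasks locally, one after another. On a cluster,
# the same function could be dispatched to workers instead (not shown).
records = list(map(evaluate_grid_point, grid))

# The list of dicts stands in for documents stored in a database.
best = max(records, key=lambda record: record["mean_score"])
```

Because every task is self-contained, storing each record as it completes also makes it possible to resume an interrupted search.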

The code automatically picks the best number of trees based on the averaged cross-validation score.
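One way to pick the number of trees from an averaged cross-validation score, sketched here under my own assumptions rather than taken from the gist, is to fit the maximum number of trees once per fold and use `staged_predict` to score every intermediate ensemble size. The dataset, fold count, and tree limit are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

max_trees = 100
n_folds = 3
fold_errors = np.zeros((n_folds, max_trees))

cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    model = GradientBoostingRegressor(n_estimators=max_trees, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # staged_predict yields predictions after 1, 2, ..., max_trees trees,
    # so one fit per fold covers every candidate ensemble size.
    for stage, y_pred in enumerate(model.staged_predict(X[test_idx])):
        fold_errors[fold, stage] = mean_squared_error(y[test_idx], y_pred)

# Average the held-out error across folds, then take the best stage.
mean_errors = fold_errors.mean(axis=0)
best_n_trees = int(np.argmin(mean_errors)) + 1
```

This avoids treating `n_estimators` as a separate grid dimension, which would refit the same trees many times over.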

Happy tuning!

Tags: python, amazon-ec2, scikit-learn

ak3nat0n
