
Performing grid search with the Python scikit-learn library on an Amazon EC2 cluster

Sorry if this question is somewhat specific to the Python scikit-learn library.

I am trying to perform a grid search to find the optimal parameters for scikit-learn's GradientBoostingRegressor. The problem is, I don't know where to start. In the past I have used an R and RStudio setup, but I am currently trying to migrate to Python for data mining, and scikit-learn seems very promising.

Can anyone share some simple setup code they have used to run computations on an Amazon EC2 cluster, or point to a useful example for this library, even one for another machine learning algorithm?

Thank you.

ak3nat0n asked Oct 30 '12 18:10


2 Answers

As far as I know, GBRT is a largely sequential algorithm, hence there is no trivial way to run it in parallel.

Random forests / ExtraTrees models are embarrassingly parallel, hence they would be better candidates for training models on a cluster.

scikit-learn has some built-in support for single-machine multiprocessing using joblib (check the docstring of models that accept an n_jobs argument). We plan to implement a task dispatch framework in joblib at some point, so that we could, for instance, use IPython parallel as a backend to run on a cluster. However, there is nothing ready out of the box for this currently.
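To illustrate the single-machine n_jobs support mentioned above, here is a minimal sketch, not cluster code. It assumes a recent scikit-learn release where GridSearchCV lives in sklearn.model_selection; the synthetic data and the parameter grid are made up for illustration. Each boosted model is still fitted sequentially, but the independent grid points and cross-validation folds are fitted in parallel worker processes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Hypothetical grid; a real search would likely cover more values.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    n_jobs=2,  # fit grid points / folds in 2 worker processes via joblib
)
search.fit(X, y)
print(search.best_params_)
```

Setting n_jobs=-1 would use all available cores on the machine.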

If you are ready to invest some time doing it yourself, I would advise you to have a look at StarCluster and its IPython plugin:

  • http://star.mit.edu/cluster/

  • http://star.mit.edu/cluster/docs/latest/plugins/ipython.html

ogrisel answered Nov 04 '22 11:11


I totally agree with ogrisel. StarCluster is really handy: it lets you set up an IPython cluster in no time, and it supports spot instances, which is great because they are much cheaper than regular ones.

You can find some code in this gist that shows you how to do distributed grid search for sklearn's Gradient Boosting estimators on an IPython cluster.

It does grid search combined with cross-validation and stores the evaluated grid points in a MongoDB database.
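As a rough idea of what evaluating grid points with cross-validation can look like, here is a minimal local sketch. It is not the code from the gist: the evaluate helper and the grid are hypothetical, and a plain Python list of dicts stands in for the MongoDB collection. It assumes a recent scikit-learn release.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Hypothetical grid of parameter combinations.
grid = ParameterGrid({"learning_rate": [0.05, 0.1], "max_depth": [2, 3]})


def evaluate(params):
    """Cross-validate one grid point and return a document-like record."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=3)  # default scoring is R^2
    return {"params": params, "mean_score": float(scores.mean())}


# Stand-in for a database collection: one record per evaluated grid point.
results = [evaluate(params) for params in grid]
best = max(results, key=lambda record: record["mean_score"])
print(best)
```

Because each call to evaluate is independent of the others, the list comprehension is the part that could be dispatched to cluster workers instead of running locally.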

The code automatically picks the best number of trees based on the averaged cross-validation score.
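One way to pick the number of trees from an averaged cross-validation score is sketched below. Again, this is an assumption about the general technique, not the gist's code: it fits the maximum number of trees once per fold and uses staged_predict to score every intermediate ensemble size, then averages the error curves across folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

max_trees = 100  # upper bound on the ensemble size (illustrative value)
fold_errors = []
folds = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X):
    model = GradientBoostingRegressor(n_estimators=max_trees, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # staged_predict yields predictions after 1, 2, ..., max_trees trees.
    errors = [
        mean_squared_error(y[test_idx], pred)
        for pred in model.staged_predict(X[test_idx])
    ]
    fold_errors.append(errors)

# Average the per-stage error over folds and take the best stage.
mean_errors = np.mean(fold_errors, axis=0)
best_n_trees = int(np.argmin(mean_errors)) + 1  # stages are 1-indexed
print(best_n_trees)
```

This avoids refitting a separate model for every candidate value of n_estimators.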

Happy tuning!

Peter Prettenhofer answered Nov 04 '22 12:11