Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Randomforest in amazon aws sagemaker?

I am looking to recreate a randomforest model built locally, and deploy it through sagemaker. The model is very basic, but for comparison I would like to use the same in sagemaker. I don't see randomforest among sagemaker's built in algorithms (which seems weird) - is my only option to go the route of deploying my own custom model? Still learning about containers, and it seems like a lot of work for something that is just a simple randomforestclassifier() call locally. I just want to baseline against the out of the box randomforest model, and show that it works the same when deployed through AWS sagemaker.

like image 735
L Xandor Avatar asked Jan 01 '23 21:01

L Xandor


1 Answers

edit 03/30/2020: adding a link to the the SageMaker Sklearn random forest demo


in SageMaker you have 3 options to write scientific code:

  • Built-in algorithms
  • Open-source pre-written containers (available for sklearn, tensorflow, pytorch, mxnet, chainer. Keras can be written in the tensorflow and mxnet containers)
  • Bring your own container (for R for example)

At the time of writing this post there is no random forest classifier nor regressor in the built-in library. There is an algorithm called Random Cut Forest in the built-in library but it is an unsupervised algorithm for anomaly detection, a different use-case than the scikit-learn random forest used in a supervised fashion (also answered in StackOverflow here). But it is easy to use the open-source pre-written scikit-learn container to implement your own. There is a demo showing how to use Sklearn's random forest in SageMaker, with training orchestration bother from the high-level SDK and boto3. You can also use this other public sklearn-on-sagemaker demo and change the model. A benefit of the pre-written containers over the "Bring your own" option is that the dockerfile is already written, and web serving stack too.

Regarding your surprise that Random Forest is not featured in the built-in algos, the library and its 18 algos already cover a rich set of use-cases. For example for supervised learning over structured data (the usual use-case for the random forest), if you want to stick to the built-ins, depending on your priorities (accuracy, inference latency, training scale, costs...) you can use SageMaker XGBoost (XGBoost has been winning tons of datamining competitions - every winning team in the top10 of the KDDcup 2015 used XGBoost according to the XGBoost paper - and scales well) and linear learner, which is extremely fast at inference and can be trained at scale, in mini-batch fashion over GPU(s). Factorization Machines (linear + 2nd degree interaction with weights being column embedding dot-products) and SageMaker kNN are other options. Also, things are not frozen in stone, and the list of built-in algorithms is being improved fast.

like image 184
Olivier Cruchant Avatar answered Jan 05 '23 16:01

Olivier Cruchant