Can sample weight be used in Spark MLlib Random Forest training?

I am using the Spark 1.5.0 MLlib Random Forest algorithm (Scala code) for two-class classification. The dataset I am using is highly imbalanced, so the majority class is down-sampled at a 10% sampling rate.

Is it possible to use the corresponding sampling weight (10 in this case) in Spark Random Forest training? I don't see a weight among the input parameters for trainClassifier() in RandomForest.
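For reference, a minimal sketch of the down-sampling setup described above, using the RDD-based MLlib API; the label encoding (0.0 as the majority class) and the hyperparameter values are illustrative placeholders, not my actual job settings:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Down-sample the majority class (assumed here to be label 0.0) at 10%,
// then train with the standard trainClassifier() call, whose argument
// list has no per-sample weight parameter.
def trainDownSampled(data: RDD[LabeledPoint]): RandomForestModel = {
  val minority = data.filter(_.label == 1.0)
  val majority = data.filter(_.label == 0.0)
    .sample(withReplacement = false, fraction = 0.1, seed = 42L)

  RandomForest.trainClassifier(
    minority.union(majority),
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](),
    numTrees = 100,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 10,
    maxBins = 32)
}
```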

asked Mar 11 '16 by machine_learner


1 Answer

Not at all in Spark 1.5, and only partially (LogisticRegression/LinearRegression) in Spark 1.6:

https://issues.apache.org/jira/browse/SPARK-7685

Here is the umbrella JIRA tracking all the subtasks:

https://issues.apache.org/jira/browse/SPARK-9610
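To illustrate the partial support mentioned above, spark.ml's LogisticRegression accepts a per-row weight column from Spark 1.6 onward, so the 10% down-sampling could be compensated by weighting the retained majority rows. This is only a sketch; the column names and the weight values are assumptions, not part of the question:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

// `train` is assumed to be a DataFrame with "features", "label" and a
// "weight" column (e.g. 10.0 for the kept majority rows, 1.0 for the
// minority rows). The weight column is passed via setWeightCol, which
// LogisticRegression supports since Spark 1.6.
def fitWeighted(train: DataFrame) = {
  new LogisticRegression()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setWeightCol("weight")
    .fit(train)
}
```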

answered Sep 25 '22 by Edi Bice