I am using the Spark 1.5.0 MLlib Random Forest algorithm (Scala code) for two-class classification. Because the dataset I am using is highly imbalanced, the majority class is down-sampled at a 10% sampling rate.

Is it possible to use the corresponding sampling weight (10 in this case) when training the Spark Random Forest? I don't see a weight parameter among the inputs of trainClassifier() in RandomForest.
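For context, here is a minimal sketch of my setup (the labels, sampling fraction, and tree parameters are illustrative); trainClassifier() is called with exactly the arguments the 1.5.0 API exposes, i.e. nothing weight-related:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Down-sample the majority class (label 0.0 here) at 10% and train.
def downsampleAndTrain(data: RDD[LabeledPoint]): RandomForestModel = {
  val minority = data.filter(_.label == 1.0)
  val majority = data.filter(_.label == 0.0)
    .sample(withReplacement = false, fraction = 0.1, seed = 42L)
  val training = minority.union(majority)

  // trainClassifier in 1.5.0 offers no parameter to compensate for the
  // 10x down-sampling of the majority class.
  RandomForest.trainClassifier(
    training,
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](),
    numTrees = 100,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 5,
    maxBins = 32)
}
```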
Not at all in Spark 1.5, and only partially (LogisticRegression/LinearRegression) in Spark 1.6:

https://issues.apache.org/jira/browse/SPARK-7685

Here is the umbrella JIRA tracking all the subtasks:

https://issues.apache.org/jira/browse/SPARK-9610
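To illustrate what that partial support looks like, here is a rough sketch against the spark.ml LogisticRegression instance weights added in 1.6; the toy DataFrame, the column names, and the shell SparkContext `sc` are assumptions, with the surviving majority-class rows carrying weight 10:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Toy data: (label, features, weight); 10.0 for down-sampled majority rows, 1.0 otherwise.
val df = sc.parallelize(Seq(
  (0.0, Vectors.dense(1.0, 0.2), 10.0),
  (0.0, Vectors.dense(0.9, 0.1), 10.0),
  (1.0, Vectors.dense(0.1, 0.9), 1.0),
  (1.0, Vectors.dense(0.2, 0.8), 1.0)
)).toDF("label", "features", "weight")

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setWeightCol("weight") // instance-weight support added to spark.ml LogisticRegression in 1.6

val model = lr.fit(df)
```

RandomForest itself does not accept instance weights in these releases; the JIRAs above track when each algorithm gains that support.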