I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems using Spark's MLlib.
I'm using MLlib's Random Forest implementation and have already tried the simplest approach of randomly undersampling the larger class, but it didn't work as well as I expected.
I would appreciate any feedback regarding your experience with similar issues.
Thanks,
A widely adopted and perhaps the most straightforward method for dealing with highly imbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).
The simplest way to fix an imbalanced dataset is to balance it by oversampling instances of the minority class or undersampling instances of the majority class. More advanced techniques such as SMOTE (Synthetic Minority Over-sampling Technique) create new synthetic instances of the minority class.
In supervised learning, a common strategy to overcome the class imbalance problem is to resample the original training dataset to decrease the overall level of class imbalance.
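As a rough illustration of the under-sampling idea, here is a minimal sketch in plain Scala that computes, for each class, the fraction of records to keep so that every class ends up roughly the size of the smallest one. The object name, the function name, and the class counts are hypothetical and chosen only for this example.

```scala
object UndersampleFractions {
  // For each class, the fraction of rows to keep so that every class ends up
  // with roughly as many rows as the smallest class.
  def fractions(classCounts: Map[Double, Long]): Map[Double, Double] = {
    require(classCounts.nonEmpty && classCounts.values.forall(_ > 0),
      "every class needs at least one record")
    val smallest = classCounts.values.min
    classCounts.map { case (label, count) => label -> smallest.toDouble / count }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical counts: 9,500 negatives (label 0.0) and 500 positives (label 1.0)
    val keep = fractions(Map(0.0 -> 9500L, 1.0 -> 500L))
    keep.toSeq.sortBy(_._1).foreach { case (label, f) =>
      println(s"label $label: keep fraction $f")
    }
  }
}
```

In Spark, a map like this could be passed to `df.stat.sampleBy("label", keep, seed)`, which draws a stratified sample without replacement, assuming the label column holds Double values. The sample sizes are approximate, so the resulting classes will be close to balanced rather than exactly equal.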
As of this writing, class weighting for the Random Forest algorithm is still under development (see here).
But if you're willing to try other classifiers, this functionality has already been added to Logistic Regression.
Consider a case where 80% of the records in the dataset are positive (label == 1), so in effect we want to "under-sample" the positive class. Instead of dropping records, we can have the logistic loss objective function give the negative class (label == 0) a higher weight.
Here is an example in Scala of generating this weight. We add a new column to the dataframe that holds the weight of each record in the dataset:
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf

    def balanceDataset(dataset: DataFrame): DataFrame = {
      // Re-balancing (weighting) of records to be used in the logistic loss objective function
      val numNegatives = dataset.filter(dataset("label") === 0).count
      val datasetSize = dataset.count
      // Fraction of positive records in the dataset
      val balancingRatio = (datasetSize - numNegatives).toDouble / datasetSize

      // Negatives are weighted by the positive fraction, positives by the negative fraction
      val calculateWeights = udf { d: Double =>
        if (d == 0.0) {
          1 * balancingRatio
        } else {
          1 * (1.0 - balancingRatio)
        }
      }

      val weightedDataset = dataset.withColumn("classWeightCol", calculateWeights(dataset("label")))
      weightedDataset
    }
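To see why this weighting balances the two classes, the same rule can be checked on plain numbers without Spark. This is only a sketch: the object name, the helper name, and the counts (800 positives, 200 negatives, matching the 80% case above) are assumptions made for the illustration.

```scala
object ClassWeightCheck {
  // Same rule as balanceDataset: negatives get the positive fraction as
  // their weight, positives get the negative fraction.
  def classWeights(numPositives: Long, numNegatives: Long): (Double, Double) = {
    val datasetSize = numPositives + numNegatives
    val balancingRatio = numPositives.toDouble / datasetSize
    val negativeWeight = balancingRatio
    val positiveWeight = 1.0 - balancingRatio
    (positiveWeight, negativeWeight)
  }

  def main(args: Array[String]): Unit = {
    val (numPositives, numNegatives) = (800L, 200L) // hypothetical counts
    val (wPos, wNeg) = classWeights(numPositives, numNegatives)
    // Total weight each class contributes to the loss
    println(f"positive: weight=$wPos%.2f total=${numPositives * wPos}%.1f")
    println(f"negative: weight=$wNeg%.2f total=${numNegatives * wNeg}%.1f")
  }
}
```

With these counts each positive record gets a weight of about 0.2 and each negative record a weight of 0.8, so both classes contribute the same total weight to the objective even though positives outnumber negatives four to one.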
Then, we create a classifier as follows:
new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")
For more details, see: https://issues.apache.org/jira/browse/SPARK-9610
Another thing you should check is whether your features have any "predictive power" for the label you're trying to predict. If you still have low precision after under-sampling, the problem may have nothing to do with the fact that your dataset is imbalanced by nature.
I would do an exploratory data analysis: if the classifier does no better than a random choice, there is a risk that there simply is no connection between the features and the class.
Overfitting - a low error on your training set and a high error on your test set might be an indication that you overfit using an overly flexible feature set.
Bias-variance - check whether your classifier suffers from a high-bias or a high-variance problem.
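One way to make these checks concrete is to compute the same metrics on the training and test predictions and compare them with a trivial baseline. The following is a minimal sketch in plain Scala, not tied to any MLlib API; the names are hypothetical, and it assumes predictions and labels are encoded as 0.0 and 1.0 with 1.0 as the positive class.

```scala
object ImbalanceDiagnostics {
  final case class Metrics(precision: Double, recall: Double, accuracy: Double)

  // pairs are (prediction, label), with 1.0 as the positive class
  def metrics(pairs: Seq[(Double, Double)]): Metrics = {
    val tp = pairs.count { case (p, l) => p == 1.0 && l == 1.0 }
    val fp = pairs.count { case (p, l) => p == 1.0 && l == 0.0 }
    val fn = pairs.count { case (p, l) => p == 0.0 && l == 1.0 }
    val correct = pairs.count { case (p, l) => p == l }
    // Report 0.0 when a ratio is undefined (empty denominator)
    def ratio(num: Int, den: Int): Double = if (den == 0) 0.0 else num.toDouble / den
    Metrics(ratio(tp, tp + fp), ratio(tp, tp + fn), ratio(correct, pairs.size))
  }

  // Accuracy of always predicting the most frequent label
  def majorityBaselineAccuracy(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else labels.groupBy(identity).values.map(_.size).max.toDouble / labels.size
}
```

Calling `metrics` on the training pairs and on the test pairs gives two sets of numbers to compare. A large gap between them points towards overfitting (high variance), poor values on both point towards high bias, and a test accuracy no better than `majorityBaselineAccuracy` suggests the features carry little signal for the label.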