Dealing with unbalanced datasets in Spark MLlib

I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems using Spark's MLlib.

I'm using MLlib's Random Forest implementation and have already tried the simplest approach, randomly undersampling the larger class, but it didn't work as well as I expected.

I would appreciate any feedback regarding your experience with similar issues.

Thanks,

526

asked Oct 27 '15 16:10

dbakr

1 Answer

Class weight with Spark ML

As of this writing, class weighting for the Random Forest algorithm is still under development (see here).

But if you're willing to try other classifiers, this functionality has already been added to Logistic Regression.

Consider a case where 80% of the dataset is positive (label == 1), so in theory we want to "under-sample" the positive class. Instead of dropping records, the logistic loss objective function can give the negative class (label == 0) a higher weight.

Here is an example in Scala of generating this weight. We add a new column to the DataFrame that holds the weight for each record in the dataset:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def balanceDataset(dataset: DataFrame): DataFrame = {

  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val numNegatives = dataset.filter(dataset("label") === 0).count
  val datasetSize = dataset.count
  // Fraction of positive records in the dataset
  val balancingRatio = (datasetSize - numNegatives).toDouble / datasetSize

  // Negatives get the positive fraction as weight, positives get its complement
  val calculateWeights = udf { d: Double =>
    if (d == 0.0) {
      1 * balancingRatio
    }
    else {
      (1 * (1.0 - balancingRatio))
    }
  }

  val weightedDataset = dataset.withColumn("classWeightCol", calculateWeights(dataset("label")))
  weightedDataset
}
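To see why this weighting evens out the two classes, here is a small plain-Scala check of the arithmetic. The counts (800 positives, 200 negatives) and the names `WeightBalanceCheck`, `balancingRatio` and `weightFor` are made up for illustration; only the weighting rule is taken from the function above.

```scala
// Worked arithmetic for the 80/20 case. The counts are hypothetical, not real data.
object WeightBalanceCheck {
  // Fraction of positive records; this plays the role of balancingRatio above.
  def balancingRatio(numPositives: Long, numNegatives: Long): Double =
    numPositives.toDouble / (numPositives + numNegatives)

  // Negatives get the ratio as weight, positives get its complement.
  def weightFor(label: Double, ratio: Double): Double =
    if (label == 0.0) ratio else 1.0 - ratio

  def main(args: Array[String]): Unit = {
    val (pos, neg) = (800L, 200L)
    val ratio = balancingRatio(pos, neg)
    // Total weight each class contributes to the loss.
    val totalPositiveWeight = pos * weightFor(1.0, ratio)
    val totalNegativeWeight = neg * weightFor(0.0, ratio)
    println(f"ratio=$ratio%.2f positives=$totalPositiveWeight%.1f negatives=$totalNegativeWeight%.1f")
  }
}
```

With these counts each positive record weighs about 0.2 and each negative record 0.8, so both classes end up contributing a total weight of about 160 to the loss.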

Then, we create a classifier as follows:

new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")

For more details, see: https://issues.apache.org/jira/browse/SPARK-9610
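Putting the pieces together, an end-to-end run might look roughly like the sketch below. It is only an illustration under a few assumptions: Spark 2.x or later (for `org.apache.spark.ml.linalg`), a local master, and a tiny made-up dataset. The helper `addClassWeights` is a hypothetical variant of `balanceDataset` that uses the built-in `when`/`otherwise` column functions instead of a UDF; the weights it produces follow the same rule.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object WeightedLogisticRegressionSketch {
  // Same rule as balanceDataset: negatives get the positive fraction as weight,
  // positives get its complement. Uses column functions rather than a UDF.
  def addClassWeights(df: DataFrame): DataFrame = {
    val total = df.count
    val numNegatives = df.filter(col("label") === 0.0).count
    val balancingRatio = (total - numNegatives).toDouble / total
    df.withColumn(
      "classWeightCol",
      when(col("label") === 0.0, lit(balancingRatio)).otherwise(lit(1.0 - balancingRatio)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("weighted-lr-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Toy data (made up): four positives and one negative.
    val data = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(1.0, 0.5)),
      (1.0, Vectors.dense(0.9, 0.7)),
      (1.0, Vectors.dense(1.2, 0.4)),
      (1.0, Vectors.dense(0.8, 0.6)),
      (0.0, Vectors.dense(-1.0, -0.5))
    )).toDF("label", "features")

    val weighted = addClassWeights(data)

    // The weight column is picked up by the logistic loss during fitting.
    val model = new LogisticRegression()
      .setWeightCol("classWeightCol")
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(10)
      .fit(weighted)

    model.transform(weighted).select("label", "classWeightCol", "prediction").show()
    spark.stop()
  }
}
```

Avoiding the UDF is a design choice rather than a requirement: built-in column expressions are visible to Spark's optimizer, while a UDF is opaque to it.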

Predictive Power

A different issue you should check is whether your features have any "predictive power" for the label you're trying to predict. If you still have low precision after under-sampling, the cause may have nothing to do with the fact that your dataset is imbalanced by nature.

I would do an exploratory data analysis. If the classifier doesn't do better than a random choice, there is a risk that there simply is no connection between the features and the class.

- Perform a correlation analysis of every feature with the label.
- Generate class-specific histograms for the features (i.e. plot histograms of the data for each class, for a given feature, on the same axis). This can also be a good way to show whether a feature discriminates well between the two classes.
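Both checks can be sketched with Spark's built-in statistics. The snippet below assumes the raw features are still available as separate numeric columns (not yet assembled into a vector) and that the label column is called `label`; the object and function names are hypothetical. It uses `df.stat.corr` for the Pearson correlation and the RDD `histogram` method for per-class counts over shared bucket boundaries, so the counts can be plotted on the same axis.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object FeatureDiagnostics {
  // Pearson correlation of each numeric feature column with the label column.
  def labelCorrelations(df: DataFrame,
                        featureCols: Seq[String],
                        labelCol: String = "label"): Map[String, Double] =
    featureCols.map(name => name -> df.stat.corr(name, labelCol)).toMap

  // Per-class histogram counts for one feature, using bucket boundaries
  // computed once over all rows so that the classes are comparable.
  // Returns (bucket boundaries, label -> counts per bucket).
  def classHistograms(df: DataFrame,
                      featureCol: String,
                      labelCol: String = "label",
                      numBuckets: Int = 10): (Array[Double], Map[Double, Array[Long]]) = {
    val values = df
      .select(col(labelCol).cast("double"), col(featureCol).cast("double"))
      .na.drop() // rows with a missing label or feature value are skipped
      .rdd
      .map(row => (row.getDouble(0), row.getDouble(1)))
      .cache()

    // Evenly spaced buckets between the overall min and max of the feature.
    val (buckets, _) = values.map(_._2).histogram(numBuckets)

    val perClass = values.map(_._1).distinct().collect().map { label =>
      label -> values.filter(_._1 == label).map(_._2).histogram(buckets)
    }.toMap

    values.unpersist()
    (buckets, perClass)
  }
}
```

A correlation close to zero for every feature, or per-class histograms that overlap almost completely, would suggest the features carry little linear signal for the label. Pearson correlation only captures linear relationships, so a low value alone does not rule out a useful feature.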

Overfitting - a low error on your training set combined with a high error on your test set may indicate that you are overfitting with an overly flexible feature set.

Bias variance - check whether your classifier suffers from a high bias or a high variance problem.

Training error vs. validation error - plot the validation error and the training set error as a function of the number of training examples (do incremental learning).
- If the lines converge to the same value and are close at the end, your classifier has high bias. In that case, adding more data won't help. Switch to a classifier with higher variance, or simply lower the regularization parameter of your current one.
- If, on the other hand, the lines stay far apart, with a low training set error but a high validation error, your classifier has too high a variance. In this case, getting more data is very likely to help. If the variance is still too high after getting more data, you can increase the regularization parameter.
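The learning curve described above could be computed along these lines. This is a rough sketch, not a definitive recipe: it assumes Spark 2.x or later, `label` and `features` columns, and a train/validation split you have already made. It refits a plain `LogisticRegression` from scratch on growing random subsets (rather than true incremental learning) and reports error as 1 minus accuracy. The names `LearningCurveSketch`, `CurvePoint` and `learningCurve` are hypothetical.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

object LearningCurveSketch {
  // One point on the curve for a given fraction of the training data.
  final case class CurvePoint(fraction: Double, trainError: Double, validationError: Double)

  def learningCurve(train: DataFrame,
                    validation: DataFrame,
                    fractions: Seq[Double] = Seq(0.1, 0.25, 0.5, 0.75, 1.0),
                    seed: Long = 42L): Seq[CurvePoint] = {
    // Accuracy-based error is an assumption; swap in another metric if needed.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    fractions.map { fraction =>
      // Random subset of the training data (sizes are approximate).
      val subset =
        if (fraction >= 1.0) train
        else train.sample(false, fraction, seed)

      val model = new LogisticRegression()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .fit(subset)

      val trainError = 1.0 - evaluator.evaluate(model.transform(subset))
      val validationError = 1.0 - evaluator.evaluate(model.transform(validation))
      CurvePoint(fraction, trainError, validationError)
    }
  }
}
```

Plotting `trainError` and `validationError` against `fraction` gives the two lines to compare. On a heavily imbalanced dataset accuracy can look good for a trivial classifier, so a metric such as F1 may be more informative. The regularization strength mentioned above is set on the classifier with `setRegParam`.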

182

answered Oct 01 '22 04:10

Serendipity

Tags:

machine-learning

classification

apache-spark

apache-spark-mllib