
Imbalance in scikit-learn

I'm using scikit-learn in my Python program to perform some machine-learning operations. The problem is that my dataset is severely imbalanced.

Is anyone familiar with a solution for imbalance in scikit-learn, or in Python in general? In Java there's the SMOTE mechanism. Is there an equivalent in Python?

Maoritzio asked Feb 25 '13 11:02




2 Answers

A newer option is the imbalanced-learn package:

https://github.com/scikit-learn-contrib/imbalanced-learn

It contains many algorithms, including SMOTE, in the following categories:

  • Under-sampling the majority class(es).
  • Over-sampling the minority class.
  • Combining over- and under-sampling.
  • Creating balanced ensemble sets.
nos answered Oct 01 '22 06:10



scikit-learn has some imbalance-correction techniques, which vary according to the learning algorithm you are using.

Some of them, such as SVM or logistic regression, have the class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it will weight each class's examples in inverse proportion to that class's frequency.
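
A minimal sketch of the class_weight approach, assuming a reasonably recent scikit-learn. The synthetic dataset and variable names are made up for the example. The compute_class_weight helper is used here only to show which weights 'balanced' corresponds to.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic two-class data with roughly a 9:1 imbalance (illustrative only).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Both estimators accept class_weight; 'balanced' derives the weights from y.
svc = SVC(class_weight="balanced").fit(X, y)
logreg = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# The 'balanced' weights are n_samples / (n_classes * count_per_class),
# so the rarer class receives the larger weight.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
```

This leaves the data untouched and changes only how much each class's errors count during fitting.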

Unfortunately, there isn't a preprocessing tool for this purpose.

Lucas Ribeiro answered Oct 01 '22 06:10
