sklearn logistic regression with unbalanced classes


I'm solving a classification problem with sklearn's logistic regression in Python.

My problem is a general/generic one. I have a dataset with two classes/outcomes (positive/negative or 1/0), but the set is highly unbalanced: roughly 5% positives and 95% negatives.

I know there are a number of ways to deal with an unbalanced problem like this, but I have not found a good explanation of how to implement them properly using the sklearn package.

What I've done thus far is to build a balanced training set by selecting all entries with a positive outcome plus an equal number of randomly selected negative entries (a sketch of this step is below). I can then train the model on this set, but I'm stuck on how to modify the model so it works on the original unbalanced population/set.
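
Concretely, here is roughly what I'm doing (illustrative only; assume X is a NumPy feature matrix and y a 0/1 label array):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]

    # Keep all positives and an equal number of randomly chosen negatives
    neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    balanced_idx = np.concatenate([pos_idx, neg_sample])

    clf = LogisticRegression()
    clf.fit(X[balanced_idx], y[balanced_idx])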

What are the specific steps to do this? I've pored over the sklearn documentation and examples and haven't found a good explanation.

asked Feb 13 '13 by agentscully


People also ask

Can logistic regression be used for an imbalanced classification problem?

Logistic regression does not support imbalanced classification directly. Instead, the training algorithm used to fit the logistic regression model must be modified to take the skewed distribution into account.
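
For example, a minimal sketch of this modification in scikit-learn, using class_weight (a real LogisticRegression parameter) on toy data that mirrors the question's ~5%/95% split:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Toy data with a ~95%/5% class split, mirroring the question
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                               random_state=0)

    # class_weight='balanced' reweights classes inversely to their
    # frequency, so errors on the rare class are penalized more heavily
    clf = LogisticRegression(class_weight='balanced').fit(X, y)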

How do you deal with unbalanced datasets in logistic regression?

In logistic regression, another handy technique for working with an imbalanced distribution is to use class weights set in accordance with the class distribution. A class weight determines how heavily the algorithm is penalized for a wrong prediction on that class.
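
A minimal sketch with explicit class weights, assuming the ~5%/95% split from the question (the 19x weight is just 95/5 and purely illustrative):

    from sklearn.linear_model import LogisticRegression

    # Weight the rare positive class ~19x (95/5) so the total penalty
    # contributed by each class is roughly equal
    clf = LogisticRegression(class_weight={0: 1.0, 1: 19.0})
    clf.fit(X, y)  # X, y as in the sketch above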

Do you need balanced data for logistic regression?

Logistic regression requires a dependent variable in binary form, i.e., 0 and 1. A balanced sample would mean that if you have thirty 0s you also need thirty 1s, but logistic regression imposes no such condition.


1 Answer

Have you tried passing class_weight="auto" to your classifier? Not all classifiers in sklearn support this, but some do. Check the docstrings.
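
(Note: in later scikit-learn releases, class_weight="auto" was deprecated in favor of class_weight="balanced", which computes weights as n_samples / (n_classes * np.bincount(y)).)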

You can also rebalance your dataset by randomly dropping negative examples and/or over-sampling positive examples (potentially adding some slight Gaussian feature noise), as in the sketch below.
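
A rough sketch of the over-sampling variant (my own illustration, assuming NumPy arrays X and y with 0/1 labels; the noise scale is arbitrary):

    import numpy as np

    rng = np.random.RandomState(0)
    pos_idx = np.where(y == 1)[0]
    n_extra = int((y == 0).sum() - len(pos_idx))  # copies needed to balance

    # Resample positives with replacement and add slight Gaussian noise
    picks = rng.choice(pos_idx, size=n_extra, replace=True)
    noise = rng.normal(scale=0.01 * X.std(axis=0),
                       size=(n_extra, X.shape[1]))
    X_balanced = np.vstack([X, X[picks] + noise])
    y_balanced = np.concatenate([y, np.ones(n_extra, dtype=y.dtype)])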

answered Sep 25 '22 by ogrisel