How does the class_weight parameter in scikit-learn work?

People also ask

What does Class_weight balanced do?

Balanced class weights can be automatically calculated within the sample weight function. Set class_weight = 'balanced' to automatically adjust weights inversely proportional to class frequencies in the input data (as shown in the above table).

What is Class_weight in logistic regression?

The LogisticRegression class provides the class_weight argument that can be specified as a model hyperparameter. The class_weight is a dictionary that defines each class label (e.g. 0 and 1) and the weighting to apply in the calculation of the negative log likelihood when fitting the model.

How does class weighting work?

With class weighting enabled, the sum is replaced by a weighted sum instead so that each sample contributes to the loss proportionally to the sample's class weight. The Peltarion Platform assigns class weights, which are inversely proportional to the class frequencies in the training data.

How does class weight work in random forest?

Random Forest With Class Weighting Impurity measures how mixed the groups of samples are for a given split in the training dataset and is typically measured with Gini or entropy.

First off, it might not be good to just go by recall alone. You can simply achieve a recall of 100% by classifying everything as the positive class. I usually suggest using AUC for selecting parameters, and then finding a threshold for the operating point (say a given precision level) that you are interested in.

For how class_weight works: It penalizes mistakes in samples of class[i] with class_weight[i] instead of 1. So higher class-weight means you want to put more emphasis on a class. From what you say it seems class 0 is 19 times more frequent than class 1. So you should increase the class_weight of class 1 relative to class 0, say {0:.1, 1:.9}. If the class_weight doesn't sum to 1, it will basically change the regularization parameter.

For how class_weight="auto" works, you can have a look at this discussion. In the dev version you can use class_weight="balanced", which is easier to understand: it basically means replicating the smaller class until you have as many samples as in the larger one, but in an implicit way.

The first answer is good for understanding how it works. But I wanted to understand how I should be using it in practice.

SUMMARY

for moderately imbalanced data WITHOUT noise, there is not much of a difference in applying class weights
for moderately imbalanced data WITH noise and strongly imbalanced, it is better to apply class weights
param class_weight="balanced" works decent in the absence of you wanting to optimize manually
with class_weight="balanced" you capture more true events (higher TRUE recall) but also you are more likely to get false alerts (lower TRUE precision)
- as a result, the total % TRUE might be higher than actual because of all the false positives
- AUC might misguide you here if the false alarms are an issue
no need to change decision threshold to the imbalance %, even for strong imbalance, ok to keep 0.5 (or somewhere around that depending on what you need)

The result might differ when using RF or GBM. sklearn does not have class_weight="balanced" for GBM but lightgbm has LGBMClassifier(is_unbalance=False)

CODE

# scikit-learn==0.21.3
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np
import pandas as pd

# case: moderate imbalance
X, y = datasets.make_classification(n_samples=50*15, n_features=5, n_informative=2, n_redundant=0, random_state=1, weights=[0.8]) #,flip_y=0.1,class_sep=0.5)
np.mean(y) # 0.2

LogisticRegression(C=1e9).fit(X,y).predict(X).mean() # 0.184
(LogisticRegression(C=1e9).fit(X,y).predict_proba(X)[:,1]>0.5).mean() # 0.184 => same as first
LogisticRegression(C=1e9,class_weight={0:0.5,1:0.5}).fit(X,y).predict(X).mean() # 0.184 => same as first
LogisticRegression(C=1e9,class_weight={0:2,1:8}).fit(X,y).predict(X).mean() # 0.296 => seems to make things worse?
LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X).mean() # 0.292 => seems to make things worse?

roc_auc_score(y,LogisticRegression(C=1e9).fit(X,y).predict(X)) # 0.83
roc_auc_score(y,LogisticRegression(C=1e9,class_weight={0:2,1:8}).fit(X,y).predict(X)) # 0.86 => about the same
roc_auc_score(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X)) # 0.86 => about the same

# case: strong imbalance
X, y = datasets.make_classification(n_samples=50*15, n_features=5, n_informative=2, n_redundant=0, random_state=1, weights=[0.95])
np.mean(y) # 0.06

LogisticRegression(C=1e9).fit(X,y).predict(X).mean() # 0.02
(LogisticRegression(C=1e9).fit(X,y).predict_proba(X)[:,1]>0.5).mean() # 0.02 => same as first
LogisticRegression(C=1e9,class_weight={0:0.5,1:0.5}).fit(X,y).predict(X).mean() # 0.02 => same as first
LogisticRegression(C=1e9,class_weight={0:1,1:20}).fit(X,y).predict(X).mean() # 0.25 => huh??
LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X).mean() # 0.22 => huh??
(LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict_proba(X)[:,1]>0.5).mean() # same as last

roc_auc_score(y,LogisticRegression(C=1e9).fit(X,y).predict(X)) # 0.64
roc_auc_score(y,LogisticRegression(C=1e9,class_weight={0:1,1:20}).fit(X,y).predict(X)) # 0.84 => much better
roc_auc_score(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X)) # 0.85 => similar to manual
roc_auc_score(y,(LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict_proba(X)[:,1]>0.5).astype(int)) # same as last

print(classification_report(y,LogisticRegression(C=1e9).fit(X,y).predict(X)))
pd.crosstab(y,LogisticRegression(C=1e9).fit(X,y).predict(X),margins=True)
pd.crosstab(y,LogisticRegression(C=1e9).fit(X,y).predict(X),margins=True,normalize='index') # few prediced TRUE with only 28% TRUE recall and 86% TRUE precision so 6%*28%~=2%

print(classification_report(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X)))
pd.crosstab(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X),margins=True)
pd.crosstab(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X),margins=True,normalize='index') # 88% TRUE recall but also lot of false positives with only 23% TRUE precision, making total predicted % TRUE > actual % TRUE

Related questions
                            
                                How to correctly sort a string with a number inside? [duplicate]
                            
                                How to customize a requirements.txt for multiple environments?
                            
                                How can I open the interactive matplotlib window in IPython notebook?
                            
                                Get the first element of each tuple in a list in Python [duplicate]
                            
                                How to handle exceptions in a list comprehensions?
                            
                                How to write to an existing excel file without overwriting data (using pandas)?
                            
                                How to read a file in reverse order?
                            
                                How do I write a Python dictionary to a csv file? [duplicate]
                            
                                Properties file in python (similar to Java Properties)
                            
                                Remove all line breaks from a long string of text
                            
                                How can I use Homebrew to install both Python 2 and 3 on Mac?
                            
                                Check list of words in another string [duplicate]
                            
                                best way to preserve numpy arrays on disk
                            
                                How to enable CORS in flask
                            
                                How can I create a simple message box in Python?
                            
                                What exactly does += do in python?
                            
                                python multithreading wait till all threads finished
                            
                                How do I design a class in Python?
                            
                                SQLAlchemy: cascade delete
                            
                                Can't connect to local MySQL server through socket '/tmp/mysql.sock

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How does the class_weight parameter in scikit-learn work?

Tags:

python

scikit-learn

People also ask

Recent Activity

Donate For Us