
binning continuous variables by IV value in R

Tags:

r

I am building a logistic regression model in R. I want to bin continuous predictors optimally with respect to the target variable. I know of two criteria:

  1. the continuous variables are binned so that their IV (information value) is maximized

  2. the chi-square statistic of the two-way contingency table is maximized, where the target takes the values 0 and 1 and the binned continuous variable supplies the buckets

Does anyone know of any functions in R that can perform such binning?

Your help will be greatly appreciated.

Michael asked Aug 10 '11 23:08



1 Answer

For the first point, you could bin using the weight of evidence (WOE) with the woeBinning package, which optimizes the number of bins with respect to the IV:

library(woeBinning)

# get the bin cut points from your dataframe
cutpoints <- woe.binning(dataset, "target_name", "Variable_name")
woe.binning.plot(cutpoints)

# apply the cutpoints to your dataframe
dataset_woe <- woe.binning.deploy(dataset, cutpoints, add.woe.or.dum.var = "woe")

It returns your dataset with two extra columns:

  • Variable_name.binned, which holds the bin labels
  • Variable_name.woe.binned, which holds the WOE values that you can then pass to your regression instead of Variable_name (check names(dataset_woe) for the exact column names)
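
To make the quantity behind this binning concrete, here is a minimal base-R sketch of how WOE and IV can be computed for a given set of cut points. The simulated data, the quartile cut points, and the helper name iv_for_bins are illustrative assumptions and not part of woeBinning; the sign convention for WOE also differs between packages.

```r
# Simulated toy data (illustrative assumption, not from the question)
set.seed(1)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(-1 + 1.2 * x))  # binary target coded 0/1

# Hypothetical helper: WOE per bin and total IV for a given set of cut points
iv_for_bins <- function(x, y, breaks) {
  bins <- cut(x, breaks = breaks, include.lowest = TRUE)
  tab  <- table(bins, y)                  # rows = bins, columns = "0" and "1"
  pct0 <- tab[, "0"] / sum(tab[, "0"])    # share of all non-events in each bin
  pct1 <- tab[, "1"] / sum(tab[, "1"])    # share of all events in each bin
  woe  <- log(pct1 / pct0)                # one common sign convention
  list(woe = woe, iv = sum((pct1 - pct0) * woe))
}

# Quartile bins as an arbitrary starting point
res <- iv_for_bins(x, y, quantile(x, probs = seq(0, 1, 0.25)))
print(res$woe)
print(res$iv)
```

A bin that contains no events or no non-events makes the logarithm infinite, so a real implementation has to merge or smooth such bins first.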

For the second point (chi-square), the discretization package appears to handle it, but I haven't tested it.
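
As a rough illustration of the chi-square criterion, the sketch below uses only base R to compare the statistic for two candidate binnings. The simulated data and the helper name chisq_for_bins are assumptions made for this example; the commented-out chiM() call shows how the discretization package might be invoked and is not run here.

```r
# Simulated toy data (illustrative assumption, not from the question)
set.seed(2)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(0.8 * x))  # binary target coded 0/1

# Hypothetical helper: Pearson chi-square statistic of the bin-by-target table
chisq_for_bins <- function(x, y, breaks) {
  bins <- cut(x, breaks = breaks, include.lowest = TRUE)
  unname(chisq.test(table(bins, y))$statistic)
}

# Compare two candidate binnings of the same variable
chi2_two  <- chisq_for_bins(x, y, quantile(x, probs = seq(0, 1, 0.5)))   # median split
chi2_four <- chisq_for_bins(x, y, quantile(x, probs = seq(0, 1, 0.25)))  # quartiles
print(c(two_bins = chi2_two, four_bins = chi2_four))

# The discretization package offers ChiMerge-style functions such as chiM(),
# which expects the class label in the last column. Not run here:
# res <- discretization::chiM(data.frame(x = x, target = y), alpha = 0.05)
# res$cutp
```

Because the raw statistic tends to grow with the number of bins, comparing p-values or merging adjacent bins against a significance threshold, as ChiMerge does, is usually a fairer basis than picking the largest statistic.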

R. Prost answered Sep 29 '22 06:09
