I am trying to discretize a continuous variable, cutting it into three levels. I want to do the same thing for the log of the positive continuous variable (in this case, income).
require(dplyr)
set.seed(3)
mydata = data.frame(realinc = rexp(10000))
summary(mydata)
new = mydata %>%
  select(realinc) %>%
  mutate(logrealinc = log(realinc),
         realincTercile = cut(realinc, 3),
         logrealincTercile = cut(logrealinc, 3),
         realincTercileNum = as.numeric(realincTercile),
         logrealincTercileNum = as.numeric(logrealincTercile))
new[sample(1:nrow(new), 10),]
I would have thought that using cut() would produce identical levels for the discretized factors of each of these variables (income and log income), because log is a monotone function. So the two columns on the right here should be equal, but that doesn't seem to happen. What's going on?
> new[sample(1:nrow(new), 10),]
       realinc  logrealinc  realincTercile logrealincTercile realincTercileNum logrealincTercileNum
7931 0.2967813 -1.21475972 (-0.00805,2.83]     (-4.43,-1.15]                 1                    2
9036 0.9511824 -0.05004944 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
8204 4.5365676  1.51217069     (2.83,5.66]      (-1.15,2.15]                 2                    3
3136 2.0610693  0.72322490 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
9708 0.9655805 -0.03502581 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
5942 0.9149351 -0.08890215 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
4631 0.6987581 -0.35845064 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7309 1.9532566  0.66949804 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7708 0.4220254 -0.86268973 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
2965 1.3690976  0.31415186 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
Edit: @nicola's comment explains the source of the problem. It seems that in cut's documentation, "equal-length intervals" refers to the length of the interval in the space of the continuous argument. I had originally interpreted "equal-length intervals" as meaning the number of elements assigned to each cut (on the output) would be equal (instead of the input).
Is there a function that does what I'm describing, where the number of elements in each output level is equal? Equivalently, where the levels of newfunc(realinc) and newfunc(logrealinc) are equal?
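The point made in the edit can be checked directly. The sketch below is an illustration only, with variable names that are not part of the original post. It regenerates the same simulated data and computes where equal-width breakpoints fall on the raw scale and on the log scale. It ignores the small outward adjustment that cut() applies to the outer limits, which does not affect the interior breakpoints.

```r
set.seed(3)
realinc <- rexp(10000)

# cut(x, 3) splits the *range* of x into three equal-width intervals,
# so for skewed data the bins hold very different numbers of observations.
width_counts <- table(cut(realinc, 3))
print(width_counts)

# Interior breakpoints at 1/3 and 2/3 of the range, on the raw scale.
rng <- range(realinc)
raw_breaks <- rng[1] + diff(rng) * c(1, 2) / 3

# The same construction on the log scale, mapped back with exp().
lrng <- range(log(realinc))
log_breaks_raw_scale <- exp(lrng[1] + diff(lrng) * c(1, 2) / 3)

print(raw_breaks)
print(log_breaks_raw_scale)
```

Because thirds of the range are not preserved by a nonlinear transform such as log, the two sets of breakpoints land at different incomes, which is why the two tercile columns disagree.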
If you want your levels to be equally populated, take a look at the quantile function. Try for instance:
x <- cut(new$realinc, quantile(new$realinc, 0:3/3))
y <- cut(new$logrealinc, quantile(new$logrealinc, 0:3/3))
all(as.integer(x) == as.integer(y), na.rm = TRUE)
#[1] TRUE
table(x)
#x
#(0.000444,0.396]     (0.396,1.12]      (1.12,8.49]
#            3333             3333             3333
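Note that the counts above sum to 9999 rather than 10000: with cut()'s defaults the intervals are open on the left, so the minimum value, which equals the lowest breakpoint, is assigned NA (hence the na.rm=TRUE). A minimal sketch of a wrapper that avoids this follows; quantile_bin is a hypothetical name chosen for illustration, not an existing function, and it relies only on the include.lowest and labels arguments of base R's cut().

```r
# Hypothetical helper (not from the original answer): bin a numeric vector
# into n roughly equally populated groups, using empirical quantiles as breaks.
quantile_bin <- function(x, n = 3) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  # include.lowest = TRUE puts the minimum in the first bin instead of NA;
  # labels = FALSE returns integer bin codes rather than a factor.
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

set.seed(3)
realinc <- rexp(10000)

inc_bin <- quantile_bin(realinc)
log_bin <- quantile_bin(log(realinc))

# Quantile-based bins depend only on the ordering of the data, so a
# monotone transform such as log() should leave the bin codes unchanged.
print(identical(inc_bin, log_bin))
print(table(inc_bin))
```

If the breakpoints themselves are not needed, dplyr's ntile() is a rank-based alternative that fits the pipeline in the question, for example mutate(realincTercileNum = ntile(realinc, 3)).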