Discretizing the log of a continuous variable

I am trying to discretize a continuous variable, cutting it into three levels. I want to do the same thing for the log of the positive continuous variable (in this case, income).

library(dplyr)
set.seed(3)
# Simulated positive, right-skewed "income" variable
mydata <- data.frame(realinc = rexp(10000))

summary(mydata)

new <- mydata %>%
  select(realinc) %>%
  mutate(logrealinc = log(realinc),
         # cut(x, 3) requests three intervals on each scale
         realincTercile = cut(realinc, 3),
         logrealincTercile = cut(logrealinc, 3),
         # integer codes of the factor levels, for easy comparison
         realincTercileNum = as.numeric(realincTercile),
         logrealincTercileNum = as.numeric(logrealincTercile))

new[sample(1:nrow(new), 10),]

I would have thought that using cut() would produce identical levels for the discretized factors of each of these variables (income and log income), because log is a monotone function. So the two columns on the right here should be equal, but that doesn't seem to happen. What's going on?

> new[sample(1:nrow(new), 10),]
       realinc  logrealinc  realincTercile logrealincTercile realincTercileNum logrealincTercileNum
7931 0.2967813 -1.21475972 (-0.00805,2.83]     (-4.43,-1.15]                 1                    2
9036 0.9511824 -0.05004944 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
8204 4.5365676  1.51217069     (2.83,5.66]      (-1.15,2.15]                 2                    3
3136 2.0610693  0.72322490 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
9708 0.9655805 -0.03502581 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
5942 0.9149351 -0.08890215 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
4631 0.6987581 -0.35845064 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7309 1.9532566  0.66949804 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
7708 0.4220254 -0.86268973 (-0.00805,2.83]      (-1.15,2.15]                 1                    3
2965 1.3690976  0.31415186 (-0.00805,2.83]      (-1.15,2.15]                 1                    3

Edit: @nicola's comment explains the source of the problem. It seems that in cut's documentation, "equal-length intervals" refers to the width of each interval on the scale of the continuous argument (the input). I had originally interpreted "equal-length intervals" as meaning that each output level would be assigned an equal number of elements.
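To make the equal-width behaviour concrete, here is a minimal sketch that rebuilds the break points cut() works from and maps the log-scale breaks back to the income scale. The variable names (breaks_raw, breaks_log) are illustrative, and the seq() call only approximates cut()'s internal breaks, which also nudge the outer limits slightly outward.

```r
set.seed(3)
realinc <- rexp(10000)

# cut(x, 3) divides range(x) into three intervals of equal width,
# so for right-skewed data most observations land in the first level.
equal_width <- cut(realinc, 3)
table(equal_width)

# Approximate the break points on each scale (assumption: this ignores the
# small outward adjustment cut() applies to the two outer limits).
breaks_raw <- seq(min(realinc), max(realinc), length.out = 4)
breaks_log <- seq(min(log(realinc)), max(log(realinc)), length.out = 4)

# Equal widths on the log scale are not equal widths on the income scale,
# so the interior cut points differ and the levels cannot line up.
rbind(raw = breaks_raw, log_back_transformed = exp(breaks_log))
```

The first and last columns agree (they are just the minimum and maximum), while the two interior break points do not, which is why the two factor columns disagree.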

Is there a function that does what I'm describing, where the number of elements in each output level is equal? Equivalently, one where the levels of newfunc(realinc) and newfunc(logrealinc) are equal?

asked Apr 13 '16 at 04:04 by Hatshepsut


1 Answer

If you want your levels to be equally populated, take a look at the quantile function. Try for instance:

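# Use the tercile quantiles as break points so each level is equally populated
x <- cut(new$realinc, quantile(new$realinc, 0:3/3))
y <- cut(new$logrealinc, quantile(new$logrealinc, 0:3/3))
# The minimum falls outside the first (left-open) interval and becomes NA, hence na.rm = TRUE
all(as.integer(x) == as.integer(y), na.rm = TRUE)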
#[1] TRUE
table(x)
#x
#(0.000444,0.396]     (0.396,1.12]      (1.12,8.49] 
#            3333             3333             3333
answered Oct 07 '22 at 03:10 by nicola
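As a possible follow-up to the quantile approach, the sketch below wraps it in a small helper so the same binning can be applied to both variables. The name quantile_cut is hypothetical (it is not a base R or dplyr function), and include.lowest = TRUE is an addition that keeps the minimum in the first bin rather than turning it into NA.

```r
# Hypothetical helper (name is illustrative): bin x into n equally populated groups.
quantile_cut <- function(x, n = 3) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  # include.lowest = TRUE closes the first interval on the left, so min(x) is kept;
  # labels = FALSE returns integer bin codes instead of a factor.
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

set.seed(3)
realinc <- rexp(10000)

raw_bins <- quantile_cut(realinc)
log_bins <- quantile_cut(log(realinc))

table(raw_bins)                 # bin sizes differ by at most one
identical(raw_bins, log_bins)   # same bins on both scales, since log is monotone
```

This assumes the data have no heavy ties; with many repeated values the quantile breaks can coincide and cut() will complain about non-unique breaks. dplyr's ntile(), which assigns bins by rank, may be a convenient alternative in that situation.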