
Equal frequency discretization in R

Tags:

r

I'm having trouble finding a function in R that performs equal-frequency discretization. I stumbled on the 'infotheo' package, but after some testing I found that the algorithm is broken. 'dprep' seems to no longer be supported on CRAN.

EDIT :

For clarity, I do not need to separate the values between the bins. I really want equal frequency; it doesn't matter if one value ends up in two bins. E.g.:

c(1,3,2,1,2,2) 

should give a bin c(1,1,2) and one c(2,2,3)

SFun28 asked Apr 20 '11 13:04


3 Answers

EDIT : given your real goal, why don't you just do (corrected) :

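EqualFreq2 <- function(x, n) {
    nx <- length(x)
    # Base bin size; the remainder is spread over randomly chosen bins
    nrepl <- floor(nx / n)
    nplus <- sample(1:n, nx - nrepl * n)
    nrep <- rep(nrepl, n)
    nrep[nplus] <- nrepl + 1
    # Walk through x in sorted order and overwrite each value with its bin index
    x[order(x)] <- rep(seq.int(n), nrep)
    x
}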

This returns a vector giving the bin index of each element of x. Because the same value can end up in two adjacent bins, you cannot define bin limits, but you can do:

x <- rpois(50,5)
y <- EqualFreq2(x,15)
table(y)
split(x,y)
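
If the random choice made by sample() is not wanted, a deterministic variant is possible. This is only a sketch and not part of the function above: it assumes that ranking with ties.method = "first" is an acceptable way to split tied values, and when the length is not divisible by n it puts the larger bins wherever the arithmetic lands them. On the example from the question:

```r
x <- c(1, 3, 2, 1, 2, 2)
n <- 2
# Rank with ties broken by position, then scale ranks 1..length(x) onto bins 1..n
bin <- ceiling(rank(x, ties.method = "first") * n / length(x))
bins <- split(x, bin)
```

Here bins holds 1, 2, 1 in the first bin and 3, 2, 2 in the second, which is the split asked for in the question.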

Original answer:

You can easily just use cut() for this :

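EqualFreq <- function(x, n, include.lowest = TRUE, ...) {
    nx <- length(x)
    # Positions in the sorted data that mark the bin boundaries
    id <- round(c(1, (1:(n - 1)) * (nx / n), nx))

    breaks <- sort(x)[id]
    # cut() needs unique breaks; ties mean n is too large for the data
    if (any(duplicated(breaks))) stop("n is too large.")

    cut(x, breaks, include.lowest = include.lowest, ...)
}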

Which gives :

set.seed(12345)
x <- rnorm(50)
table(EqualFreq(x,5))

 [-2.38,-0.886] (-0.886,-0.116]  (-0.116,0.586]   (0.586,0.937]     (0.937,2.2] 
             10              10              10              10              10 

x <- rpois(50,5)
table(EqualFreq(x,5))

 [1,3]  (3,5]  (5,6]  (6,7] (7,11] 
    10     13     11      6     10 

As you can see, with discrete data a perfectly equal binning is usually impossible, because tied values all fall into the same bin; this method gives you about the best binning available under that constraint.
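
A common shortcut for the continuous case is to take the breaks from quantile() instead of indexing the sorted data. This is a sketch of that alternative, not the function above: quantile() interpolates between observations by default, so the breaks differ slightly, and with heavily tied data the breaks can coincide, in which case cut() refuses them.

```r
set.seed(1)
x <- rnorm(50)
n <- 5
# n + 1 equally spaced probabilities give n bins
breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1))
counts <- table(cut(x, breaks, include.lowest = TRUE))
```

With continuous data such as this, each of the five bins receives 10 observations.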

Joris Meys answered Nov 15 '22 03:11


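This sort of thing is also quite easily solved by using (abusing?) the conditioning plot infrastructure, in particular the function co.intervals(). It lives in the graphics package next to coplot(), so lattice does not need to be loaded:

cutEqual <- function(x, n, include.lowest = TRUE, ...) {
    # co.intervals() returns an n x 2 matrix of (lower, upper) interval ends;
    # keep the first lower end plus all n upper ends as breaks
    cut(x, co.intervals(x, n, 0)[c(1, (n+1):(n*2))], 
        include.lowest = include.lowest, ...)
}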

Which reproduces @Joris' excellent answer:

> set.seed(12345)
> x <- rnorm(50)
> table(cutEqual(x, 5))

 [-2.38,-0.885] (-0.885,-0.115]  (-0.115,0.587]   (0.587,0.938]     (0.938,2.2] 
             10              10              10              10              10
> y <- rpois(50, 5)
> table(cutEqual(y, 5))

 [0.5,3.5]  (3.5,5.5]  (5.5,6.5]  (6.5,7.5] (7.5,11.5] 
        10         13         11          6         10

In the latter, discrete, case the breaks are different although they have the same effect; the same observations are in the same bins.
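
To see why the two sets of breaks are equivalent, note that integer data never falls strictly between, say, 3 and 3.5. A small check on made-up integer data (the vector y below is illustrative, not the rpois() sample above):

```r
y <- c(1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 9, 11)
# Breaks placed on the data values versus breaks placed halfway between them
a <- cut(y, c(1, 3, 5, 6, 7, 11), include.lowest = TRUE)
b <- cut(y, c(0.5, 3.5, 5.5, 6.5, 7.5, 11.5), include.lowest = TRUE)
same <- identical(as.integer(a), as.integer(b))
```

same is TRUE: the labels differ, but every observation gets the same bin number.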

Gavin Simpson answered Nov 15 '22 05:11


How about Hmisc::cut2()? Its m argument sets the desired minimum number of observations per group:

> a <- rnorm(50)
> table(Hmisc::cut2(a, m = 10))

[-2.2020,-0.7710) [-0.7710,-0.2352) [-0.2352, 0.0997) [ 0.0997, 0.9775) 
               10                10                10                10 
[ 0.9775, 2.5677] 
               10 
Roman Luštrik answered Nov 15 '22 04:11